It is rare to find one dataset that provides you with all of the information you require to answer a problem. Often, we need to join datasets together to get the full picture. This article covers what joins are and how you might implement them, as well as a full walk through video tutorial on preparing and joining data.
<aside> ℹ️
Contents
</aside>
There are four main types of data join:
Diagram showing the four main join types
Sometimes, datasets will have identifiable columns in common:
Sometimes, you need to employ probabilistic matching to find the most likely match:
<aside> ℹ️
Address Matching is a complex task with a range of methodologies. For more information, see this article by GeoPlace:
Address matching methodologies | blog
</aside>
‘Fuzzy Matching’ refers to methods that assign a similarity score between two pieces of text. There are a range of fuzzy matching methodologies that produce a similarity score between 0 and 1, where 1 means the texts are identical and 0 means there is no similarity.
<aside> ℹ️
See below for a more detailed overview of some fuzzy matching algorithms.
</aside>
In the below walkthrough, we use fuzzy matching to match addresses. This is not as sophisticated a methodology as those mentioned in the articles above, and has it’s faults.
For example: “123 - Data House, Information Rd, SW1 1AA” will have a very high matching score to “12 - Data House, Information Rd, SW1 1AA”, despite being clearly different addresses to humans