It is rare to find one dataset that provides you with all of the information you require to answer a problem. Often, we need to join datasets together to get the full picture. This article covers what joins are and how you might implement them, as well as a full walk through video tutorial on preparing and joining data.

<aside> ℹ️

Contents

</aside>

How do data joins work?

There are four main types of data join:

Diagram showing the four main join types

Diagram showing the four main join types

What do I use to match the datasets?

Sometimes, datasets will have identifiable columns in common:

Sometimes, you need to employ probabilistic matching to find the most likely match:

<aside> ℹ️

Address Matching is a complex task with a range of methodologies. For more information, see this article by GeoPlace:

Address matching methodologies | blog

</aside>

Fuzzy Matching

‘Fuzzy Matching’ refers to methods that assign a similarity score between two pieces of text. There are a range of fuzzy matching methodologies that produce a similarity score between 0 and 1, where 1 means the texts are identical and 0 means there is no similarity.

<aside> ℹ️

See below for a more detailed overview of some fuzzy matching algorithms.

Fuzzy matching algorithms

</aside>

In the below walkthrough, we use fuzzy matching to match addresses. This is not as sophisticated a methodology as those mentioned in the articles above, and has it’s faults.

For example: “123 - Data House, Information Rd, SW1 1AA” will have a very high matching score to “12 - Data House, Information Rd, SW1 1AA”, despite being clearly different addresses to humans

Data Joining video walkthrough

Data Joining with titles.mp4