It is rare to find one dataset that provides you with all of the information you require to answer a problem. Often, we need to join datasets together to get the full picture. This article covers what joins are and how you might implement them, as well as a full walk through video tutorial on preparing and joining data.

<aside> ℹ️

Contents

</aside>

How do data joins work?

There are four main types of data join:

Inner Join: Keeps only the matching data from both datasets. If there’s no match, that row is left out.
Outer Join: Keeps all data from both datasets. If there’s no match, missing values are filled in.
Left Join: Keeps all data from the first (left) dataset and only matching data from the second. If there’s no match, missing values appear on the right.
Right Join: Keeps all data from the second (right) dataset and only matching data from the first. If there’s no match, missing values appear on the left.

Diagram showing the four main join types

What do I use to match the datasets?

Sometimes, datasets will have identifiable columns in common:

For example, if two datasets of people include email addresses, they could be joined on that.

Sometimes, you need to employ probabilistic matching to find the most likely match:

This occurs when two fields could mean the same thing, but are recorded in different ways
For example these are two representations of one address:
- Flat 123, Data House, Information Road, Westminster, London, SW1 1AA
- 123 - Data House, Information Rd, SW1 1AA

<aside> ℹ️

Address Matching is a complex task with a range of methodologies. For more information, see this article by GeoPlace:

Address matching methodologies | blog

</aside>

Fuzzy Matching

‘Fuzzy Matching’ refers to methods that assign a similarity score between two pieces of text. There are a range of fuzzy matching methodologies that produce a similarity score between 0 and 1, where 1 means the texts are identical and 0 means there is no similarity.

<aside> ℹ️

See below for a more detailed overview of some fuzzy matching algorithms.

Fuzzy matching algorithms

</aside>

In the below walkthrough, we use fuzzy matching to match addresses. This is not as sophisticated a methodology as those mentioned in the articles above, and has it’s faults.

For example: “123 - Data House, Information Rd, SW1 1AA” will have a very high matching score to “12 - Data House, Information Rd, SW1 1AA”, despite being clearly different addresses to humans

Data Joining video walkthrough

Data Joining with titles.mp4