What is it?

Inaccurate data is data that does not correctly describe what it is meant to represent. It shows up as errors, discrepancies, or inconsistencies within a dataset, such as addresses with the wrong postcodes, incorrect phone numbers, or misspelled names. These errors are often difficult to spot until an email bounces or a phone call reaches a wrong number.

Why is it a concern?

According to Gartner, inaccurate data costs organisations millions a year. Whatever the cause, inaccurate data cannot be relied on, because it can lead to faulty conclusions and misguided decisions.

What causes inaccurate data?

Inaccuracies in your dataset can originate from various sources, including system malfunctions or issues with data integration. However, the main cause is human error, which contributes to a substantial 75% of data loss. Because data pipelines often depend on human input, any misstep can render the data unusable.

For example, if customers are required to provide their contact details via a website form, there is a chance that they may input this information incorrectly.

Equally, the organisation itself may be responsible. Suppose the records have been entered by hand by many employees over time. Errors can arise through typos, misunderstood instructions, or required fields left incomplete, for reasons such as fatigue, carelessness, or inadequate training.

How can you reduce the number of errors?

It is significantly more cost-effective and time-efficient to prevent bad data from entering the system than to fix it repeatedly once it has been ingested and stored. Try to ‘shift left’ and move your safeguards as close as possible to the beginning of the process.
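One way to picture a safeguard at the start of the process is a check that runs before a website form submission is stored. The following is a minimal sketch only: the field names (`name`, `email`, `phone`, `postcode`), the function name `validate_contact`, and the deliberately loose patterns (including the UK-style postcode) are assumptions for illustration, not a complete validation standard.

```python
import re

# Deliberately loose, illustrative patterns (assumptions, not full standards).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
UK_POSTCODE_RE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}$")


def validate_contact(form: dict) -> list[str]:
    """Return a list of problems; an empty list means the record may be stored."""
    errors = []
    name = form.get("name", "").strip()
    email = form.get("email", "").strip()
    # Ignore common formatting characters before counting digits.
    phone_digits = re.sub(r"[\s\-()+]", "", form.get("phone", ""))
    postcode = form.get("postcode", "").strip().upper()

    if not name:
        errors.append("name is required")
    if not EMAIL_RE.match(email):
        errors.append("email does not look valid")
    if not (phone_digits.isdigit() and 7 <= len(phone_digits) <= 15):
        errors.append("phone should contain 7 to 15 digits")
    if not UK_POSTCODE_RE.match(postcode):
        errors.append("postcode does not match the expected format")
    return errors
```

Rejecting the submission and showing these messages to the customer lets them correct the mistake while they are still on the form, rather than leaving it to be discovered later. A format check cannot confirm that a postcode or number really belongs to that person; it only catches entries that cannot be right.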

How can you identify and fix inaccurate data?

Resolving this issue often requires rigorous data validation and cleansing procedures.

Take categorical data (data that can be divided into groups), such as gender, sex, age group, and educational level, alongside numeric fields such as age. You can scan each field for obvious anomalies:

For numeric fields, look for entries outside what you consider a reasonable range of values (impossible negative or zero values, an age over 100, and so on).

For categorical fields, where valid values are limited to a small set of acceptable answers, look for any responses outside that set.
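The two checks above can be sketched in a few lines. This is an illustration under stated assumptions: the field names `age` and `education`, the accepted range of 1 to 100, and the set of allowed education levels are all hypothetical and would need to reflect your own data.

```python
# Hypothetical limits and categories; adjust to your own dataset.
AGE_RANGE = (1, 100)
ALLOWED_EDUCATION = {"primary", "secondary", "degree", "postgraduate"}


def find_anomalies(records):
    """Return (row index, field, value) for every suspicious entry."""
    anomalies = []
    for i, rec in enumerate(records):
        age = rec.get("age")
        # Range check: missing, non-integer, or out-of-range ages are flagged.
        if not isinstance(age, int) or not (AGE_RANGE[0] <= age <= AGE_RANGE[1]):
            anomalies.append((i, "age", age))
        edu = rec.get("education")
        # Category check: compare after trimming spaces and lowering case.
        if str(edu).strip().lower() not in ALLOWED_EDUCATION:
            anomalies.append((i, "education", edu))
    return anomalies


sample = [
    {"age": 34, "education": "degree"},
    {"age": -2, "education": "secondary"},
    {"age": 130, "education": "Degree "},
    {"age": 51, "education": "degre"},
]
print(find_anomalies(sample))
# [(1, 'age', -2), (2, 'age', 130), (3, 'education', 'degre')]
```

Note that "Degree " is accepted here because the comparison ignores case and surrounding spaces; whether such variants count as errors is a choice you would make for your own data.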

As the amount of data you hold grows, you may need to invest in a robust data quality monitoring solution to identify and isolate inaccurate data. You can then try to fix the flawed fields by comparing them with a known accurate dataset. If a record still cannot be corrected, you will have to delete it to keep it from contaminating your data analysis.
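As a rough sketch of that last step, the function below takes records already flagged as inaccurate and repairs them from a trusted reference, setting aside any that have no match. The names `repair_flagged`, `customer_id`, and `postcode` are hypothetical, and the sketch assumes the reference is a dictionary keyed by a shared identifier.

```python
def repair_flagged(flagged, reference, key="customer_id", fields=("postcode",)):
    """Repair flagged records from a trusted reference.

    Returns (fixed, unresolved): records with no reference match are
    set aside so they can be excluded from analysis.
    """
    fixed, unresolved = [], []
    for rec in flagged:
        trusted = reference.get(rec.get(key))
        if trusted is None:
            unresolved.append(rec)
            continue
        repaired = dict(rec)  # copy, so the original record is left untouched
        for field in fields:
            repaired[field] = trusted[field]
        fixed.append(repaired)
    return fixed, unresolved
```

Keeping the unresolved records in a separate list, rather than deleting them immediately, is a design choice in this sketch: it keeps them out of the analysis while leaving a record of what was removed.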