What is a Data Pipeline?

A data pipeline is a means of moving data from one or more sources to a destination (such as a data warehouse or an Excel spreadsheet), optimising and transforming the data along the way.

Typically, this process includes loading raw data into an interim staging area, where it is cleaned and transformed, before uploading it to the destination tables.

The pipeline constitutes the series of steps involved in aggregating, organising, cleaning and moving data. This can, of course, be done manually, but modern data pipelines aim to automate many of the manual steps involved in transforming and loading data.
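As an illustration, here is a minimal sketch of the staging pattern described above, written in Python with pandas and SQLite. The file, table and column names (raw_orders.csv, warehouse.db, stg_orders, order_date and so on) are hypothetical placeholders, not taken from any real system.

```python
import sqlite3
import pandas as pd

# Load raw data from the source.
raw = pd.read_csv("raw_orders.csv")

with sqlite3.connect("warehouse.db") as conn:
    # 1. Land the raw data in an interim staging table, untouched.
    raw.to_sql("stg_orders", conn, if_exists="replace", index=False)

    # 2. Clean and transform within the staging area.
    staged = pd.read_sql("SELECT * FROM stg_orders", conn)
    staged = staged.drop_duplicates()
    staged["order_date"] = pd.to_datetime(staged["order_date"], errors="coerce")

    # 3. Upload the cleaned result to the destination table.
    staged.to_sql("orders", conn, if_exists="append", index=False)
```

In practice, each of these stages would be scheduled and monitored by an orchestration tool rather than run as a single script, but the raw-to-staging-to-destination flow is the same.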

Common Steps Involved in a Data Pipeline

  1. Data Ingestion: The process of gathering data from various sources, such as databases, business applications, APIs and IoT devices.
  2. Data Integration: The process of bringing together data from multiple sources to provide a complete and accurate dataset for business intelligence (BI), data analysis and use by other applications and business processes. The data is transformed and processed to achieve the desired structure.
  3. Data Quality: The process of cleaning the data and applying data quality rules, such as deduplication, validation and handling of missing values.
  4. Data Storage: The process of saving the transformed data to your preferred target storage solution, as shown in the sketch below.
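The following sketch maps these four steps onto Python functions, again using pandas and SQLite. The sources (sales.csv, customers.json), the join key (customer_id) and the target table (sales_enriched) are assumptions for illustration only.

```python
import sqlite3
import pandas as pd

def ingest() -> tuple[pd.DataFrame, pd.DataFrame]:
    """Step 1 - gather data from multiple sources."""
    sales = pd.read_csv("sales.csv")
    customers = pd.read_json("customers.json")
    return sales, customers

def integrate(sales: pd.DataFrame, customers: pd.DataFrame) -> pd.DataFrame:
    """Step 2 - combine the sources into one dataset with the desired structure."""
    return sales.merge(customers, on="customer_id", how="left")

def apply_quality_rules(df: pd.DataFrame) -> pd.DataFrame:
    """Step 3 - enforce data quality rules (dedupe, drop rows missing the key)."""
    df = df.drop_duplicates()
    return df.dropna(subset=["customer_id"])

def store(df: pd.DataFrame) -> None:
    """Step 4 - persist the result to the target storage (SQLite here)."""
    with sqlite3.connect("analytics.db") as conn:
        df.to_sql("sales_enriched", conn, if_exists="replace", index=False)

sales, customers = ingest()
store(apply_quality_rules(integrate(sales, customers)))
```

Keeping each step as its own function makes the pipeline easier to test and to automate: an orchestrator can run, retry or monitor each stage independently.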

Benefits of Setting Up an Automated Data Pipeline

Data Pipeline Disadvantages

That said, there are also a number of potential disadvantages to consider.