<aside> <img src="/icons/command-line_purple.svg" alt="/icons/command-line_purple.svg" width="40px" /> This article is the second step in the data transformation pipeline of Energy Performance Certificate data from the EPC open data API. The first step is data ingestion.

The input of this transformation process is a JSON dictionary object, and the output is a pandas DataFrame (which can be exported to CSV, JSON, etc.).

The complete Jupyter notebook for this article can be found on the Data Guidance GitHub Repository. This article covers PART 2: Data Transformation.

</aside>
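As a minimal sketch of that input/output shape: suppose the ingestion step left us with a JSON dictionary whose `rows` key holds the list of certificate records. The key name and the field names below are illustrative assumptions, not the full EPC schema — adjust them to match your saved data.

```python
import pandas as pd

# Assumed shape of the ingested data: a dict with a "rows" list of records.
# Field names here are placeholders in the EPC glossary style, not the full schema.
raw = {
    "rows": [
        {"property-type": "House", "tenure": "owner-occupied",
         "current-energy-efficiency": "60",
         "potential-energy-efficiency": "80"},
        {"property-type": "Flat", "tenure": "rental (private)",
         "current-energy-efficiency": "70",
         "potential-energy-efficiency": "84"},
    ]
}

# Flatten the list of records into a DataFrame
df = pd.DataFrame(raw["rows"])

# The DataFrame can then be exported to CSV, JSON, etc.
csv_text = df.to_csv(index=False)
```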

About This Article

Once you have accessed data, you need to be able to transform it to suit your needs. This process converts your raw data into a format that is useful to you, containing the information that you need.

This article continues on from Data Ingestion, where we accessed and saved raw Energy Performance Certificate data from the EPC open data API. Here we describe the data transformation steps needed to extract the information we require from the data. These steps are:

<aside> <img src="/icons/video-camera_purple.svg" alt="/icons/video-camera_purple.svg" width="40px" /> Video Walkthrough

Here is a video walkthrough of the Data Transformation section of the Jupyter Notebook containing the example code.

TransformingEPCData.mp4

Timestamps

</aside>


1. Defining the Data Need

Defining a data need means identifying and clearly outlining the specific information we require from the data. This could involve specifying the variables of interest, the target population, the specific questions that the data should help answer, and any other specific requirements or constraints.

The question we want to answer here is:

For each property type (house, flat, etc), for each tenure, what is the average potential energy efficiency increase as a percentage of current energy efficiency?

Therefore our needs from the data are:

- The property type of each certificate (house, flat, etc.)
- The tenure of each certificate
- The current energy efficiency rating
- The potential energy efficiency rating

Defining your data needs upfront saves time by ensuring you only work with the data that is relevant to your question.
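The calculation the question implies can be sketched as a grouped average of per-record percentage increases. The column names below (`property-type`, `tenure`, `current-energy-efficiency`, `potential-energy-efficiency`) are assumptions in the EPC glossary naming style and may differ in your extract, as may the sample values.

```python
import pandas as pd

# Toy data standing in for the transformed EPC records;
# column names are assumed, not confirmed against the live schema.
df = pd.DataFrame({
    "property-type": ["House", "House", "Flat"],
    "tenure": ["owner-occupied", "owner-occupied", "rental (private)"],
    "current-energy-efficiency": [60, 50, 70],
    "potential-energy-efficiency": [75, 65, 84],
})

# Potential increase as a percentage of current efficiency, per record
df["potential-increase-pct"] = (
    (df["potential-energy-efficiency"] - df["current-energy-efficiency"])
    / df["current-energy-efficiency"] * 100
)

# Average that percentage for each property type and tenure combination
result = df.groupby(["property-type", "tenure"])["potential-increase-pct"].mean()
```

Computing the percentage per record before averaging (rather than averaging the raw scores first) keeps each property's increase weighted relative to its own starting point.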

We can then use the API documentation to identify where this information will be in the dataset. In this case we used the glossary of terms to discover the important properties of our dataset: