Data preprocessing describes the preparation of data for analysis. This preparation consists of four core activities:
- Data Cleaning: Identify and correct incomplete, incorrect, or irrelevant data, for example, by filling in missing values.
- Data Transformation: Modify and adapt the data, for example, by normalizing or aggregating it.
- Data Integration: Integrate different data sets.
- Data Reduction: Reduce data volume, for example, by reducing the dimensions or compressing data.
These operations take place after data extraction, in which data is retrieved from a source system. Data transformation is also the "transform" step of the ETL (extract, transform, load) process.
How does data preprocessing work and why is it so important?
Data cleaning is where you identify incomplete, incorrect, and irrelevant data. You then replace, modify, or delete the unusable data in the data set. If nonsensical or missing values remain in the data set, they can distort the overall picture and produce implausible or misleading results.
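As a minimal sketch of this cleaning step, the following Python example replaces missing and implausible values with the median of the valid ones. The sensor readings, the `-999.0` error value, and the plausible range of -50 to 60 are assumptions made up for illustration. Which replacement strategy is appropriate depends on the data and the analysis target.

```python
from statistics import median

# Hypothetical sensor readings: None marks a missing value, and -999.0 is an
# assumed error value written by a faulty sensor.
readings = [
    {"sensor": "a", "temp_c": 21.5},
    {"sensor": "b", "temp_c": None},
    {"sensor": "c", "temp_c": -999.0},
    {"sensor": "d", "temp_c": 22.5},
    {"sensor": "e", "temp_c": 23.5},
]


def is_valid(value, low=-50.0, high=60.0):
    """Return True if the value is present and inside the assumed plausible range."""
    return value is not None and low <= value <= high


def clean(rows):
    """Return new rows in which unusable temperatures are replaced by the median."""
    valid = [r["temp_c"] for r in rows if is_valid(r["temp_c"])]
    fill = median(valid)  # raises StatisticsError if no valid value exists
    return [
        {**r, "temp_c": r["temp_c"] if is_valid(r["temp_c"]) else fill}
        for r in rows
    ]


cleaned = clean(readings)
```

The function returns new rows instead of changing `readings` in place, so the original data stays available for comparison.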
Data transformation includes normalizations and aggregations that make data sets more meaningful for a given analysis target. Aggregations, for example, can make subsequent visualizations of the data more useful. This step also harmonizes data from different sources by unifying units and data schemata.
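The two transformations named above can be sketched as follows. The example uses min-max scaling as one common form of normalization and a per-group sum as one form of aggregation. The sales records and the function names are hypothetical.

```python
from collections import defaultdict


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal; map them to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def aggregate_sum(rows, key, field):
    """Sum `field` for each distinct value of `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[field]
    return dict(totals)


# Hypothetical sales records.
sales = [
    {"region": "north", "revenue": 100.0},
    {"region": "south", "revenue": 250.0},
    {"region": "north", "revenue": 50.0},
    {"region": "south", "revenue": 400.0},
]

normalized = min_max_normalize([row["revenue"] for row in sales])
totals = aggregate_sum(sales, key="region", field="revenue")
```

Here `totals` holds one revenue figure per region, which is usually easier to chart than the individual records.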
After the data transformation, data integration links different data sets to create a uniform picture for analysis. Because the transformation has already unified the data, the data sets can be joined via a common attribute.
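Linking via a common attribute can be illustrated with a simple inner join. This is a sketch with made-up customer and order data that share the hypothetical attribute `customer_id`. Rows without a match in the other data set are dropped.

```python
# Hypothetical data sets that share the attribute "customer_id".
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 30.0},
    {"order_id": 11, "customer_id": 2, "total": 12.5},
    {"order_id": 12, "customer_id": 3, "total": 99.0},  # no matching customer
]


def inner_join(left, right, key):
    """Combine rows from both lists whose `key` values match."""
    # Index the right-hand rows by key so each lookup is a dictionary access.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in index.get(l_row[key], [])
    ]


joined = inner_join(orders, customers, key="customer_id")
```

Whether unmatched rows such as order 12 should be dropped or kept with empty fields is a design decision that depends on the analysis.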
Data reduction removes unnecessary and unimportant data from the data set. This makes calculations more efficient and eliminates factors that could distort the result.
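Two simple reduction techniques are sketched below: removing exact duplicate rows and removing columns whose value never changes and therefore carries no information for the analysis. The records are hypothetical, and the sketch assumes that all rows have the same keys and hashable values.

```python
def drop_duplicates(rows):
    """Remove exact duplicate rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


def drop_constant_columns(rows):
    """Remove columns whose value is identical in every row."""
    if not rows:
        return []
    keep = [k for k in rows[0] if len({r[k] for r in rows}) > 1]
    return [{k: r[k] for k in keep} for r in rows]


# Hypothetical records: the third row repeats the second, and "country"
# has the same value everywhere.
records = [
    {"id": 1, "country": "DE", "score": 0.4},
    {"id": 2, "country": "DE", "score": 0.9},
    {"id": 2, "country": "DE", "score": 0.9},
]

reduced = drop_constant_columns(drop_duplicates(records))
```

After both steps, `reduced` contains two rows with only the `id` and `score` columns.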
Tip: Coordinate any steps involving the removal or modification of data with the relevant departments in advance. Arbitrary modification or removal can lead to distortions, too.