A Crucial Step in Machine Learning Projects
Data preparation lays the groundwork for successful machine learning initiatives, enabling models to extract meaningful insights and predictions from the data. This phase is often estimated to consume as much as 80% of overall project effort. It ensures that data is complete and ready for use by machine learning models.
The Two Major Steps of Data Preparation
Data Wrangling
Data wrangling is the initial step: collecting, cleansing, and refining raw data. The primary objective is to curate a comprehensive dataset ready for analysis. It is a collaborative effort that engages data scientists, domain experts, stakeholders, and occasionally machine learning engineers.
This phase encompasses tasks like:
- Labeling data
- Selecting pertinent features based on domain expertise
- Eliminating duplicate entries
- Discarding incomplete records (rows with missing values)
- Removing incomplete features (columns with too many missing values)
- Addressing outliers
- Filling in missing data points through supplementary research or alternate data sources
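Three of the tasks above (eliminating duplicates, discarding incomplete records, and addressing outliers) can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a prescribed method: the function name `wrangle`, its parameters, and the choice of clipping to the 1.5 × IQR fences are all assumptions made for the example, and other outlier strategies may suit a given dataset better.

```python
from statistics import quantiles


def wrangle(rows, required, numeric_field):
    """Deduplicate rows, drop incomplete ones, and clip outliers.

    Hypothetical helper for illustration. `rows` is a list of dicts,
    `required` lists the fields that must be present, and
    `numeric_field` (assumed to be in `required`) is clipped.
    """
    # 1. Eliminate exact duplicate entries, keeping the first occurrence.
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)

    # 2. Discard rows that are missing any required field.
    complete = [
        r for r in unique
        if all(r.get(field) is not None for field in required)
    ]

    # 3. Address outliers by clipping to the 1.5 * IQR fences
    #    (an assumed policy; dropping or flagging are alternatives).
    values = [r[numeric_field] for r in complete]
    if len(values) < 2:
        return complete
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [
        {**r, numeric_field: min(max(r[numeric_field], low), high)}
        for r in complete
    ]
```

Calling `wrangle(rows, required=["id", "price"], numeric_field="price")` returns new dictionaries and leaves the input list untouched, which makes it easier to compare the data before and after cleaning.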
Feature Engineering
This phase focuses on transforming the data into a format suitable for machine learning algorithms. Data scientists and machine learning engineers typically spearhead this stage.
This phase includes techniques like:
- Normalization and standardization of values
- Dimensionality reduction to eliminate less relevant features
- Decomposing features into multiple components (e.g., breaking down date/time into day, month, year)
- Aggregating multiple features into a single metric (e.g., calculating area from width and length)
- Crafting synthetic features using mathematical functions (e.g., logarithm, sine, cosine)
- Encoding categorical features and binning continuous values into discrete ranges
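Several of the techniques above can be sketched with the Python standard library alone. The helper names, the field names (`sold_at`, `width`, `length`, `price`, `color`), and the category list below are hypothetical and chosen only for illustration. Dimensionality reduction is left out because it usually relies on a dedicated library.

```python
import bisect
import math
from datetime import datetime
from statistics import mean, pstdev


def min_max_normalize(values):
    """Normalization: rescale values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: rescale to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def decompose_timestamp(ts):
    """Decompose an ISO 8601 timestamp into day, month, and year."""
    dt = datetime.fromisoformat(ts)
    return {"day": dt.day, "month": dt.month, "year": dt.year}


def one_hot(value, categories):
    """Encode a categorical value as a one-hot list."""
    return [1 if value == c else 0 for c in categories]


def bin_value(value, edges):
    """Return the index of the bin that `value` falls into.

    `edges` must be sorted ascending; values below the first edge
    map to bin 0, values at or above the last edge to the last bin.
    """
    return bisect.bisect_right(edges, value)


def engineer(row):
    """Build a feature dict from one raw record (hypothetical schema)."""
    features = decompose_timestamp(row["sold_at"])
    # Aggregate two features into a single metric.
    features["area"] = row["width"] * row["length"]
    # Synthetic feature: log(1 + x) compresses a long-tailed value.
    features["log_price"] = math.log1p(row["price"])
    features["color"] = one_hot(row["color"], ["red", "green", "blue"])
    return features
```

Normalization and standardization operate on a whole column, so they take a list of values, while the other helpers work on one record at a time. In practice the scaling statistics should be computed on the training data only and then reused for validation and test data.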