Machine Learning Part 2: Understanding Data Preprocessing For Machine Learning
Data preprocessing is a critical step in machine learning that involves preparing raw data for training models. In this article, we will explore various data preprocessing techniques, including data cleaning, handling missing values, feature scaling, normalization, and dealing with categorical variables. Each technique will be accompanied by Python code examples to demonstrate their implementation.
Data Cleaning and Handling Missing Values
Data cleaning ensures the dataset is free from inconsistencies and errors. Handling missing values is an essential part of this process.
Here’s an example that generates a sample dataset and demonstrates data cleaning and handling missing values using Python and pandas.
First, install the pandas library by running the command below:
pip install pandas
Code
import pandas as pd
import numpy as np
import random
# Generate sample data
np.random.seed(123) # Seed NumPy's generator for reproducibility
random.seed(123) # Also seed Python's random module, since random.choice is used below
names = ["John", "Michael", "William", "James", "Benjamin"]
data = {
    "ID": np.arange(1, 11),
    "Name": [random.choice(names) for _ in range(10)],
    "Age": np.random.randint(18, 65, size=10).astype(float), # Cast to float so the column can hold NaN
    "Salary": [50000, np.nan, 60000, np.nan, 70000, np.nan, 80000, np.nan, 90000, np.nan]
}
# Create DataFrame
df = pd.DataFrame(data)
# Save DataFrame as CSV
df.to_csv("dataset.csv", index=False)
This code snippet generates a sample dataset with four columns: “ID,” “Name,” “Age,” and “Salary.” It contains missing values represented as np.nan from the NumPy library, so the generated CSV file will contain gaps in the data.
Code Explanation
The code generates sample data, creates a DataFrame, and saves it as a CSV file. The following steps are executed:
Step 1: Importing the necessary libraries
- The pandas library is imported as pd for data manipulation.
- The numpy library is imported as np for numerical operations.
- The random module is imported to generate random names.
Step 2: Generate sample data
- Sample data is generated using the np.random and random.choice functions.
- The np.random.seed(123) and random.seed(123) statements set random seeds for reproducibility.
- A list of names is created.
- A dictionary named ‘data’ is created to hold the sample data.
- The ‘ID’ column is generated using np.arange to create a sequence of numbers from 1 to 10.
- The ‘Name’ column is generated by randomly choosing a name from the list of names.
- The ‘Age’ column is generated using np.random.randint to generate random integers from 18 to 64 (the upper bound of 65 is exclusive).
- The ‘Salary’ column is created with numeric values (50000 through 90000) alternating with NaN (missing values).
Step 3: Create DataFrame
- A DataFrame named ‘df’ is created using the pd.DataFrame function.
- The ‘data’ dictionary is passed as the data argument to create the DataFrame.
Step 4: Save DataFrame as CSV
- The DataFrame is saved as a CSV file named ‘dataset.csv’ using the to_csv method of the ‘df’ DataFrame.
- The index is set to False to exclude the index column from the CSV file.
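To sanity-check the generated file before cleaning it, you can print the DataFrame; the Name values depend on the seeded random draws, so no fixed output is shown here.
# Quick inspection of the generated data: every other Salary entry is NaN
print(df)
print(df.dtypes)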
Now, let’s move on to the data cleaning and handling missing values part:
Here are two common options for handling missing values:
1. Dropping rows with missing values
This approach removes entire rows that contain missing values. It is suitable when only a small fraction of rows have missing values, so dropping them does not significantly shrink the dataset.
import pandas as pd
# Load the dataset
df = pd.read_csv("dataset.csv")
# Check for missing values
print(df.isnull().sum())
# Drop rows with missing values
df.dropna(inplace=True)
# Save the cleaned dataset as CSV
df.to_csv("cleaned_dataset.csv", index=False)
Code Explanation
The code loads a dataset from a CSV file, performs missing value handling, and saves the cleaned dataset as a new CSV file. The following steps are executed:
Step 1: Importing the necessary libraries
- The pandas library is imported as pd for data manipulation.
Step 2: Load the dataset
- The dataset is loaded from a CSV file named ‘dataset.csv’ using the pd.read_csv function.
- The DataFrame is assigned to a variable named ‘df’.
Step 3: Check for missing values
- Missing values in the dataset are checked using the isnull method of the DataFrame, chained with the sum method.
- The result is printed to the console using the print function.
- The sum of missing values for each column is displayed.
Step 4: Drop rows with missing values
- Rows with missing values in the dataset are dropped using the dropna method of the DataFrame.
- The inplace parameter is set to True to modify the DataFrame in place.
Step 5: Save the cleaned dataset as CSV
- The cleaned DataFrame is saved as a new CSV file named ‘cleaned_dataset.csv’ using the to_csv method of the DataFrame.
- The index is set to False to exclude the index column from the CSV file.
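A more selective variant: if only certain columns matter, dropna accepts a subset argument, and thresh keeps rows that have at least a given number of non-missing values. Here is a minimal sketch, reusing the same dataset:
import pandas as pd
df = pd.read_csv("dataset.csv")
# Drop only rows where 'Salary' is missing
df.dropna(subset=["Salary"], inplace=True)
# Alternatively, keep rows that have at least 3 non-missing values:
# df.dropna(thresh=3, inplace=True)
df.to_csv("cleaned_dataset.csv", index=False)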
2. Filling missing values with mean
This approach replaces missing values with a statistical measure such as the mean, median, or mode. It is suitable when the missing values are relatively small in number and can be reasonably estimated.
import pandas as pd
# Load the dataset
df = pd.read_csv("dataset.csv")
# Check for missing values
print(df.isnull().sum())
# Fill missing values with mean
df["Age"].fillna(df["Age"].mean(), inplace=True)
df["Salary"].fillna(df["Salary"].mean(), inplace=True)
# Save the cleaned dataset as CSV
df.to_csv("cleaned_dataset.csv", index=False)
Code Explanation
The code loads a dataset from a CSV file, performs missing value handling by filling missing values with the mean, and saves the cleaned dataset as a new CSV file. The following steps are executed:
Step 1: Importing the necessary libraries
- The pandas library is imported as pd for data manipulation.
Step 2: Load the dataset
- The dataset is loaded from a CSV file named ‘dataset.csv’ using the pd.read_csv function.
- The DataFrame is assigned to a variable named ‘df’.
Step 3: Check for missing values
- Missing values in the dataset are checked using the isnull method of the DataFrame, chained with the sum method.
- The result is printed to the console using the print function.
- The sum of missing values for each column is displayed.
Step 4: Fill missing values with mean
- Missing values in the ‘Age’ and ‘Salary’ columns are filled with the mean of their respective columns.
- The fillna method is called on each column, and the mean is computed using the mean method.
- The filled column is assigned back to the DataFrame; this avoids calling fillna with inplace=True on a column selection, a pattern that is unreliable in recent versions of pandas.
Step 5: Save the cleaned dataset as CSV
- The cleaned DataFrame is saved as a new CSV file named ‘cleaned_dataset.csv’ using the to_csv method of the DataFrame.
- The index is set to False to exclude the index column from the CSV file.
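The same pattern works with the median or mode mentioned above. Here is a minimal sketch, again assuming the ‘dataset.csv’ generated earlier (note that ‘Name’ has no missing values in this sample; that line merely illustrates the usual pattern for categorical columns):
import pandas as pd
df = pd.read_csv("dataset.csv")
# Median imputation: more robust to outliers than the mean
df["Salary"] = df["Salary"].fillna(df["Salary"].median())
# Mode imputation: typical for categorical columns; mode() may return
# several values, so the first one is taken
df["Name"] = df["Name"].fillna(df["Name"].mode()[0])
df.to_csv("cleaned_dataset.csv", index=False)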
Please choose the appropriate approach based on your specific data and requirements.
Feature Scaling and Normalization
Feature scaling ensures that all input features are on a similar scale, preventing certain features from dominating others. Normalization transforms the data to a common scale, often between 0 and 1.
First, install the scikit-learn library (imported as sklearn) by running the command below:
pip install scikit-learn
Here’s an example that demonstrates feature scaling and normalization using the MinMaxScaler and Normalizer classes from the sklearn.preprocessing module:
from sklearn.preprocessing import MinMaxScaler, Normalizer
import numpy as np
# Create a sample dataset
data = np.array([[10, 20, 30],
                 [5, 15, 25],
                 [2, 8, 12]])
# Perform feature scaling
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
# Perform normalization
normalizer = Normalizer()
normalized_data = normalizer.fit_transform(data)
# Print the original, scaled, and normalized data
print("Original data:\n", data)
print("\nScaled data:\n", scaled_data)
print("\nNormalized data:\n", normalized_data)
Output
Original data:
 [[10 20 30]
 [ 5 15 25]
 [ 2  8 12]]

Scaled data:
 [[1.         1.         1.        ]
 [0.375      0.58333333 0.72222222]
 [0.         0.         0.        ]]

Normalized data:
 [[0.26726124 0.53452248 0.80178373]
 [0.16903085 0.50709255 0.84515425]
 [0.13736056 0.54944226 0.82416338]]
In this example, we create a sample dataset with three features represented by a NumPy array. The MinMaxScaler is used for feature scaling, which scales each feature between 0 and 1. The Normalizer is used for normalization, which scales each sample (row) to have a unit norm.
The output shows the original data, the scaled data (feature scaling), and the normalized data. Note that in the scaled data, each feature is scaled independently between 0 and 1. In the normalized data, each row is scaled to have a unit norm (Euclidean norm).
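To verify the unit-norm claim, you can compute the Euclidean norm of each row of normalized_data (assuming the variables from the example above are still in scope):
# Each row of normalized_data should have Euclidean norm 1
print(np.linalg.norm(normalized_data, axis=1))  # -> [1. 1. 1.]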
Code Explanation
The code performs feature scaling and normalization on a sample dataset using the MinMaxScaler and Normalizer classes from scikit-learn. The following steps are executed:
Step 1: Importing the necessary libraries
- The MinMaxScaler and Normalizer classes are imported from the sklearn.preprocessing module.
- The numpy library is imported as np for numerical operations.
Step 2: Create a sample dataset
- A sample dataset is created using the np.array function.
- The dataset is a 3×3 array with integer values.
Step 3: Perform feature scaling
- Feature scaling is performed on the sample dataset using the MinMaxScaler class.
- An instance of the MinMaxScaler class is created as scaler.
- The fit_transform method of the scaler instance is called on the data to perform scaling.
- The scaled data is assigned to the scaled_data variable.
Step 4: Perform normalization
- Normalization is performed on the sample dataset using the Normalizer class.
- An instance of the Normalizer class is created as normalizer.
- The fit_transform method of the normalizer instance is called on the data to perform normalization.
- The normalized data is assigned to the normalized_data variable.
Step 5: Print the original, scaled, and normalized data
- The original data, scaled data, and normalized data are printed to the console using the print function.
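Standardization, discussed again in the conclusion, is a third common technique: each feature is shifted to mean 0 and scaled to unit variance. Here is a minimal sketch using scikit-learn’s StandardScaler on the same sample data:
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[10, 20, 30],
                 [5, 15, 25],
                 [2, 8, 12]])
# Standardization: (x - mean) / std, computed per feature (column)
standard_scaler = StandardScaler()
standardized_data = standard_scaler.fit_transform(data)
print(standardized_data)  # each column now has mean 0 and unit variance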
Dealing with Categorical Variables
Categorical variables represent discrete values and need to be converted into numerical representations for machine learning models. Two common techniques are one-hot encoding and label encoding.
Here is an example that demonstrates one-hot encoding and label encoding using the OneHotEncoder and LabelEncoder classes:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
import numpy as np
# Example data
data = np.array(['apple', 'banana', 'cherry', 'banana', 'apple'])
# Label encoding
label_encoder = LabelEncoder()
label_encoded_data = label_encoder.fit_transform(data)
print("Label encoded data:", label_encoded_data)
# One-hot encoding
onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_encoded_data = onehot_encoder.fit_transform(label_encoded_data.reshape(-1, 1))
print("One-hot encoded data:")
print(onehot_encoded_data)
Output
Label encoded data: [0 1 2 1 0]
One-hot encoded data:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
In the above example, we have an array data containing categorical values representing fruits. We first apply label encoding using the LabelEncoder class, which assigns a unique numeric label to each category. The resulting label encoded data is [0 1 2 1 0], where ‘apple’ is assigned label 0, ‘banana’ is assigned label 1, and ‘cherry’ is assigned label 2.
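The fitted encoder also records the learned mapping, which is useful for decoding model outputs later; assuming the label_encoder from the example above:
print(label_encoder.classes_)                      # ['apple' 'banana' 'cherry']
print(label_encoder.inverse_transform([0, 1, 2]))  # ['apple' 'banana' 'cherry']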
Next, we apply one-hot encoding using the OneHotEncoder class. It takes the label encoded data as input and converts it into a binary array representation, where each column represents a unique category and a value of 1 indicates the presence of that category. The resulting one-hot encoded data is a 2D array with three columns, where each row corresponds to the original data points. For example, the first row [1. 0. 0.] represents the one-hot encoded representation of ‘apple’, where the first column is 1 indicating the presence of ‘apple’, and the remaining columns are 0.
Code Explanation
The code performs label encoding and one-hot encoding on an example dataset using the LabelEncoder and OneHotEncoder classes from scikit-learn. The following steps are executed:
Step 1: Importing the necessary libraries
- The OneHotEncoder and LabelEncoder classes are imported from the sklearn.preprocessing module.
- The numpy library is imported as np for numerical operations.
Step 2: Example data
- An example dataset is created using the np.array function.
- The dataset is an array of string values.
Step 3: Label encoding
- Label encoding is performed on the example dataset using the LabelEncoder class.
- An instance of the LabelEncoder class is created as label_encoder.
- The fit_transform method of the label_encoder instance is called on the data to perform label encoding.
- The label encoded data is assigned to the label_encoded_data variable.
Step 4: Print the label encoded data
- The label encoded data is printed to the console using the print function.
Step 5: One-hot encoding
- One-hot encoding is performed on the label encoded data using the OneHotEncoder class.
- An instance of the OneHotEncoder class is created as onehot_encoder.
- The fit_transform method of the onehot_encoder instance is called on the label_encoded_data reshaped to a column vector to perform one-hot encoding.
- The one-hot encoded data is assigned to the onehot_encoded_data variable.
Step 6: Print the one-hot encoded data
- The one-hot encoded data is printed to the console using the print function.
Both label encoding and one-hot encoding are commonly used techniques in machine learning to represent categorical data numerically, enabling algorithms to process them effectively.
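As a side note, recent versions of scikit-learn let OneHotEncoder consume string data directly, making the intermediate label-encoding step optional. Here is a minimal sketch, reusing the data array from the example:
# One-hot encode the string array directly, without LabelEncoder
direct_encoder = OneHotEncoder(sparse_output=False)
direct_encoded = direct_encoder.fit_transform(data.reshape(-1, 1))
print(direct_encoder.categories_)  # learned categories per feature
print(direct_encoded)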
Conclusion
Data preprocessing plays a crucial role in machine learning as it helps prepare the data for effective model training and accurate predictions. Here are the key aspects of data preprocessing that we discussed:
- Data cleaning and handling missing values: This step involves identifying and addressing any errors, inconsistencies, or missing values in the dataset. Missing values can be filled using techniques like mean imputation or interpolation (see the sketch after this list), or they can be removed entirely depending on the extent of missingness and the nature of the data.
- Feature scaling and normalization: It is important to bring the features onto a similar scale to prevent certain variables from dominating others. Feature scaling techniques, such as standardization (subtracting mean and dividing by standard deviation) or normalization (scaling values to a specific range), help achieve this. Scaling also helps algorithms converge faster during training.
- Dealing with categorical variables: Categorical variables, such as gender, color, or type, cannot be directly used in most machine learning algorithms that expect numerical inputs. Two common techniques for handling categorical variables are one-hot encoding and label encoding. One-hot encoding creates binary columns for each category, while label encoding assigns a unique numeric label to each category.
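As a brief illustration of the interpolation option mentioned in the first point above, pandas can fill gaps linearly from neighboring values; here is a minimal sketch on a small hypothetical series:
import pandas as pd
import numpy as np

# A hypothetical series with one gap; interpolate() fills it linearly
s = pd.Series([10.0, np.nan, 30.0])
print(s.interpolate())  # the NaN becomes 20.0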
Machine Learning In Python Beginner Tutorial Series
- Machine Learning Part 1: Introduction
- Machine Learning Part 2: Understanding Data Preprocessing For Machine Learning
- Machine Learning Part 3: Exploratory Data Analysis for Machine Learning
- Machine Learning Part 4: Introduction to Supervised Learning Algorithms
- Machine Learning Part 5: Introduction to Unsupervised Learning Algorithms
- Machine Learning Part 6: Evaluating Machine Learning Models
- Machine Learning Part 7: Deep Learning and Neural Networks
- Machine Learning Part 8: Natural Language Processing (NLP)
- Machine Learning Part 9: Recommender Systems
- Machine Learning Part 10: Model Deployment and Productionization