Introduction
ai_data_preparation.py serves as the backbone for preparing raw datasets for ingestion into machine learning (ML) or AI pipelines. It encompasses preprocessing tasks such as handling missing values, scaling features, encoding categorical variables, and splitting datasets for training and testing.
Purpose
- Data Cleaning: Ensure datasets are free from inconsistencies, missing values, or irrelevant features.
- Feature Transformation: Scale, normalize, or encode data for compatibility with machine learning models.
- Dataset Splitting: Divide data into training, validation, and testing subsets.
- Pipeline Compatibility: Output prepared data in formats suitable for downstream pipelines.
Key Features
- Missing Data Handling: Options to impute missing values or drop incomplete records.
- Feature Encoding: Encode categorical data using one-hot encoding, label encoding, or binary encoding.
- Data Scaling: Normalize numerical features using standardization or min-max scaling.
- Automated Splitting: Generate training, validation, and test datasets.
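Each of these features is driven by a configuration dictionary. The sketch below lists the option values recognized by the implementation further down; the values shown are just one possible combination.

# Configuration options consumed by DataPreparator (see the implementation below)
config = {
    'missing_data_strategy': 'median',  # 'drop', 'mean', or 'median'
    'encoding': 'label',                # 'onehot' or 'label'
    'scaling': 'minmax'                 # 'standard' for StandardScaler; anything else falls back to min-max scaling
}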
Logic and Implementation
The script leverages widely used data preprocessing techniques, relying on libraries such as pandas and scikit-learn for efficient operations. Below is an example of its core components:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder


class DataPreparator:
    def __init__(self, config):
        """
        Initialize the data preparator with specific preprocessing configurations.

        :param config: Dictionary of preprocessing options (e.g., scaling, encoding types).
        """
        self.config = config

    def handle_missing_data(self, dataframe):
        """
        Handle missing data based on the configured strategy (drop, mean, or median).
        """
        strategy = self.config.get('missing_data_strategy', 'drop')
        if strategy == 'drop':
            dataframe = dataframe.dropna()
        elif strategy == 'mean':
            # Impute numeric columns only; non-numeric missing values are left untouched.
            dataframe = dataframe.fillna(dataframe.mean(numeric_only=True))
        elif strategy == 'median':
            dataframe = dataframe.fillna(dataframe.median(numeric_only=True))
        return dataframe

    def encode_features(self, dataframe):
        """
        Encode categorical features based on the configuration.
        """
        encoding = self.config.get('encoding', 'onehot')
        if encoding == 'onehot':
            return pd.get_dummies(dataframe)
        elif encoding == 'label':
            for col in dataframe.select_dtypes(include=['object']).columns:
                # A fresh encoder per column keeps each column's label mapping independent.
                dataframe[col] = LabelEncoder().fit_transform(dataframe[col])
        return dataframe

    def scale_features(self, dataframe):
        """
        Scale numerical features using standardization or min-max scaling.
        """
        scaler = StandardScaler() if self.config.get('scaling') == 'standard' else MinMaxScaler()
        numerical_cols = dataframe.select_dtypes(include=['float64', 'int64']).columns
        dataframe[numerical_cols] = scaler.fit_transform(dataframe[numerical_cols])
        return dataframe

    def split_data(self, dataframe):
        """
        Split data into training, validation, and testing sets.
        """
        # Hold out 20% for testing, then take 25% of the remaining 80% (i.e., 20% overall)
        # for validation, yielding a 60/20/20 split.
        train, test = train_test_split(dataframe, test_size=0.2, random_state=42)
        train, val = train_test_split(train, test_size=0.25, random_state=42)
        return train, val, test

    def prepare(self, dataframe):
        """
        Pipeline to prepare the dataset by executing all preprocessing steps.
        """
        dataframe = self.handle_missing_data(dataframe)
        dataframe = self.encode_features(dataframe)
        dataframe = self.scale_features(dataframe)
        train, val, test = self.split_data(dataframe)
        return train, val, test


if __name__ == "__main__":
    # Example usage for preparing a dataset
    config = {
        'missing_data_strategy': 'mean',
        'encoding': 'onehot',
        'scaling': 'standard'
    }
    data = pd.read_csv('dataset.csv')
    preparator = DataPreparator(config)
    train_set, val_set, test_set = preparator.prepare(data)
    print("Training Set:", train_set.head())
    print("Validation Set:", val_set.head())
    print("Testing Set:", test_set.head())
Dependencies
The following libraries are required for ai_data_preparation.py:
- pandas: For data manipulation and cleaning.
- scikit-learn: For data scaling, feature transformation, and dataset splitting.
How to Use This Script
Follow these steps to use ai_data_preparation.py:
- Prepare a raw dataset in a structured format (e.g., CSV).
- Configure preprocessing settings (e.g., encoding type, missing data strategy).
- Run the script to produce clean and ready-to-use datasets.
# Example Usage
import pandas as pd
from ai_data_preparation import DataPreparator

raw_data = pd.read_csv('raw_data_file.csv')

config = {
    'missing_data_strategy': 'drop',
    'encoding': 'label',
    'scaling': 'standard'
}

preparator = DataPreparator(config)
train, validation, test = preparator.prepare(raw_data)
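Since downstream pipelines typically consume the prepared splits from disk (see "Pipeline Compatibility" above), a short follow-up step can persist each split as CSV; the file names below are hypothetical and should match your pipeline's conventions.

# Hypothetical output paths for downstream pipeline stages
train.to_csv('train.csv', index=False)
validation.to_csv('validation.csv', index=False)
test.to_csv('test.csv', index=False)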
Role in the G.O.D. Framework
- Automated Data Pipelines: Works within ai_automated_data_pipeline.py to process raw data for pipeline tasks.
- Compatibility: Ensures datasets adhere to the requirements of downstream scripts like ai_training_model.py.
- Error Prevention: Helps avoid runtime errors in model training caused by inconsistent or invalid data (see the sketch after this list).
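As a concrete illustration of the error-prevention point, a lightweight sanity check can be run on a prepared split before handing it to a training step. This helper is not part of the script; the function name and the expected-columns argument are hypothetical.

# Hypothetical sanity check before passing a prepared split downstream.
def validate_prepared(df, expected_columns=None):
    assert not df.isnull().any().any(), "Prepared data still contains missing values"
    if expected_columns is not None:
        missing = set(expected_columns) - set(df.columns)
        assert not missing, f"Prepared data is missing expected columns: {missing}"
    return True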
Future Enhancements
- Feature Selection: Automatically select the most relevant features based on correlation or feature importance (a rough sketch follows after this list).
- Advanced Imputation: Implement machine learning-based methods for imputing missing values.
- Real-Time Support: Support streaming data preprocessing for real-time AI systems.
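As a rough illustration of the feature-selection idea, a correlation-based filter against a target column might look like the following. This is not implemented in the current script, and the target column and threshold are placeholders.

# Hypothetical sketch of correlation-based feature selection (future enhancement,
# not part of ai_data_preparation.py today).
def select_by_correlation(dataframe, target_col, threshold=0.1):
    correlations = dataframe.corr(numeric_only=True)[target_col].abs()
    keep = correlations[correlations >= threshold].index.tolist()
    return dataframe[keep]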