G.O.D. Framework

Script: ai_data_preparation.py - Preparing Data for Machine Learning Pipelines

Introduction

ai_data_preparation.py serves as the backbone for preparing raw datasets for ingestion into machine learning (ML) or AI pipelines. It covers common preprocessing tasks such as handling missing values, scaling features, encoding categorical variables, and splitting datasets into training, validation, and test sets.

Purpose

The script centralizes the preprocessing that must happen before model training: it takes a raw tabular dataset and, driven by a single options dictionary, returns clean training, validation, and test sets ready for downstream pipeline components.

Key Features

  - Configurable handling of missing values (drop rows, or impute with the mean or median)
  - Categorical feature encoding (one-hot or label encoding)
  - Numerical feature scaling (standardization or min-max scaling)
  - Reproducible splitting into training, validation, and test sets

Logic and Implementation

The script applies widely used data preprocessing techniques and relies on pandas and scikit-learn for efficient operations. Below is an example of its core components:


            import pandas as pd
            from sklearn.model_selection import train_test_split
            from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder

            class DataPreparator:
                def __init__(self, config):
                    """
                    Initialize the data preparator with specific preprocessing configurations.
                    :param config: Dictionary of preprocessing options (e.g., scaling, encoding types).
                    """
                    self.config = config

                def handle_missing_data(self, dataframe):
                    """
                    Handle missing data based on strategy (impute or drop).
                    """
                    strategy = self.config.get('missing_data_strategy', 'drop')
                    if strategy == 'drop':
                        dataframe = dataframe.dropna()
                    elif strategy == 'mean':
                        # Impute missing numeric values with the column mean.
                        dataframe = dataframe.fillna(dataframe.mean(numeric_only=True))
                    elif strategy == 'median':
                        # Impute missing numeric values with the column median.
                        dataframe = dataframe.fillna(dataframe.median(numeric_only=True))
                    return dataframe

                def encode_features(self, dataframe):
                    """
                    Encode categorical features based on the configuration.
                    """
                    encoding = self.config.get('encoding', 'onehot')
                    if encoding == 'onehot':
                        return pd.get_dummies(dataframe)
                    elif encoding == 'label':
                        label_enc = LabelEncoder()
                        for col in dataframe.select_dtypes(include=['object']).columns:
                            dataframe[col] = label_enc.fit_transform(dataframe[col])
                    return dataframe

                def scale_features(self, dataframe):
                    """
                    Scale numerical features using standardization or min-max scaling.
                    """
                    # Standardize when configured; any other setting falls back to min-max scaling.
                    scaler = StandardScaler() if self.config.get('scaling') == 'standard' else MinMaxScaler()
                    numerical_cols = dataframe.select_dtypes(include=['float64', 'int64']).columns
                    dataframe[numerical_cols] = scaler.fit_transform(dataframe[numerical_cols])
                    return dataframe

                def split_data(self, dataframe):
                    """
                    Split data into training, validation, and testing sets.
                    """
                    # Hold out 20% for testing, then 25% of the remainder for validation.
                    train, test = train_test_split(dataframe, test_size=0.2, random_state=42)
                    train, val = train_test_split(train, test_size=0.25, random_state=42)
                    return train, val, test

                def prepare(self, dataframe):
                    """
                    Pipeline to prepare the dataset by executing all preprocessing steps.
                    """
                    dataframe = self.handle_missing_data(dataframe)
                    dataframe = self.encode_features(dataframe)
                    dataframe = self.scale_features(dataframe)
                    train, val, test = self.split_data(dataframe)
                    return train, val, test

            if __name__ == "__main__":
                # Example usage for preparing a dataset
                config = {
                    'missing_data_strategy': 'mean',
                    'encoding': 'onehot',
                    'scaling': 'standard'
                }
                data = pd.read_csv('dataset.csv')
                preparator = DataPreparator(config)
                train_set, val_set, test_set = preparator.prepare(data)
                print("Training Set:", train_set.head())
                print("Validation Set:", val_set.head())
                print("Testing Set:", test_set.head())
            
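
The prepare method chains all of the steps above and, because split_data calls train_test_split twice, returns roughly 60% training, 20% validation, and 20% test data. The following sanity check illustrates this on a small synthetic dataset; it assumes the class is saved as ai_data_preparation.py (as in the usage example further below), and the toy data and column names are purely illustrative:

            import pandas as pd
            from ai_data_preparation import DataPreparator

            # Toy dataset: one numeric column (with a missing value) and one categorical column.
            toy = pd.DataFrame({
                'age': [23, 45, 31, None, 52, 37, 29, 41, 60, 48],
                'city': ['NY', 'LA', 'NY', 'SF', 'LA', 'SF', 'NY', 'LA', 'SF', 'NY']
            })

            preparator = DataPreparator({
                'missing_data_strategy': 'median',
                'encoding': 'onehot',
                'scaling': 'standard'
            })
            train, val, test = preparator.prepare(toy)

            # With 10 rows this yields 6 / 2 / 2 rows, i.e. roughly a 60/20/20 split.
            print(len(train), len(val), len(test))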

Dependencies

The following libraries are required for ai_data_preparation.py:

  - pandas (dataframe handling and one-hot encoding via get_dummies)
  - scikit-learn (train/validation/test splitting, scaling, and label encoding)

Both are available from PyPI and can be installed with pip, e.g. pip install pandas scikit-learn.

How to Use This Script

Follow these steps to use ai_data_preparation.py:

  1. Prepare a raw dataset in a structured format (e.g., CSV).
  2. Configure preprocessing settings (e.g., encoding type, missing data strategy); the recognized keys are listed below.
  3. Run the script to produce clean and ready-to-use datasets.
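
The configuration dictionary passed to DataPreparator recognizes the following keys (all taken from the class above):

  - missing_data_strategy: 'drop' (default), 'mean', or 'median'
  - encoding: 'onehot' (default) or 'label'
  - scaling: 'standard' for standardization; any other value falls back to min-max scaling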

            # Example Usage
            import pandas as pd
            from ai_data_preparation import DataPreparator

            raw_data = pd.read_csv('raw_data_file.csv')
            config = {
                'missing_data_strategy': 'drop',
                'encoding': 'label',
                'scaling': 'standard'
            }
            preparator = DataPreparator(config)
            train, validation, test = preparator.prepare(raw_data)
            
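The returned splits are ordinary pandas DataFrames, so persisting them for downstream pipeline stages is straightforward. A minimal sketch using the variables from the example above (the output file names are illustrative, not prescribed by the script):

            # Persist the prepared splits so later training jobs can consume them.
            train.to_csv('train.csv', index=False)
            validation.to_csv('validation.csv', index=False)
            test.to_csv('test.csv', index=False)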

Role in the G.O.D. Framework

Future Enhancements