G.O.D Framework

Script: ai_training_data.py

Purpose-built for the preparation and management of training data across AI pipelines.

Introduction

The ai_training_data.py script is the G.O.D Framework component that manages and prepares data for training AI models. It is designed to handle diverse datasets, perform preprocessing, ensure data integrity, and build optimized pipelines that feed data into training algorithms.

Purpose

The primary objectives of this script include:

Key Features

Logic and Implementation

The script uses pandas, scikit-learn, and TensorFlow (through its Keras API) to streamline training data preparation. Below is an implementation example:


import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator

class TrainingDataManager:
    """
    Handles training data preparation, including loading, cleaning, augmentation, and splitting.
    """

    def load_data(self, file_path, file_format="csv"):
        """
        Load dataset into a pandas DataFrame.

        Args:
            file_path (str): Path to the data file.
            file_format (str): Format of the file; "csv" and "json" are supported.

        Returns:
            DataFrame: Pandas DataFrame containing the dataset.

        Raises:
            ValueError: If file_format is neither "csv" nor "json".
        """
        if file_format == 'csv':
            data = pd.read_csv(file_path)
        elif file_format == 'json':
            data = pd.read_json(file_path)
        else:
            raise ValueError(f"Unsupported file format: {file_format!r}")
        return data

    def clean_data(self, df):
        """
        Clean raw data by dropping rows with missing values and duplicate rows.

        Args:
            df (DataFrame): Input data.

        Returns:
            DataFrame: Cleaned data.
        """
        df = df.dropna()  # Drop every row that contains at least one missing value
        df = df.drop_duplicates()  # Drop rows that exactly repeat an earlier row
        return df

    def split_data(self, df, target_column, test_size=0.2, val_size=0.1, seed=42):
        """
        Split the dataset into training, validation, and testing subsets.

        Args:
            df (DataFrame): Input dataset.
            target_column (str): Target column name for ML training.
            test_size (float): Proportion of the dataset for testing.
            val_size (float): Proportion of the remaining (non-test) data held out
                for validation. Its share of the full dataset is
                val_size * (1 - test_size).
            seed (int): Random state for reproducibility.

        Returns:
            dict: A dictionary with train, validation, and test sets.
        """
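        # Stage 1: hold out the test set, preserving class balance in the target.
        train, test = train_test_split(
            df, test_size=test_size, random_state=seed, stratify=df[target_column]
        )
        # Stage 2: carve the validation set out of what is left after stage 1.
        train, val = train_test_split(
            train, test_size=val_size, random_state=seed, stratify=train[target_column]
        )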
        return {"train": train, "validation": val, "test": test}

    def augment_images(self, image_dir, save_dir, target_size=(224, 224), batch_size=32):
        """
        Perform image data augmentation using Keras's ImageDataGenerator.

        Args:
            image_dir (str): Directory of raw images.
            save_dir (str): Directory to save augmented images.
            target_size (tuple): Image dimensions for resizing.
            batch_size (int): Batch size for data generator.

        Returns:
            DirectoryIterator: Iterator that yields batches of (images, labels).
            Augmented images are written to save_dir only as batches are
            drawn from the iterator.
        """
        datagen = ImageDataGenerator(
            rescale=1./255,           # Normalize images
            rotation_range=30,       # Random rotation
            width_shift_range=0.2,   # Horizontal shift
            height_shift_range=0.2,  # Vertical shift
            shear_range=0.2,         # Shear transformation
            zoom_range=0.2,          # Zoom
            horizontal_flip=True,    # Horizontal flip
            fill_mode='nearest'      # Filling strategy
        )
        # flow_from_directory is lazy: return its iterator so the caller can
        # consume batches (the original discarded it, so nothing was generated).
        return datagen.flow_from_directory(
            image_dir,
            target_size=target_size,
            batch_size=batch_size,
            save_to_dir=save_dir,
            class_mode='categorical'
        )

# Example Usage
if __name__ == "__main__":
    manager = TrainingDataManager()
    data = manager.load_data("data/dataset.csv")
    cleaned_data = manager.clean_data(data)
    splits = manager.split_data(cleaned_data, target_column="label")
    print("Training Data:", splits['train'].shape)
        
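Because split_data applies val_size to what remains after the test split, each subset's share of the full dataset differs from the raw arguments. The sketch below works through that arithmetic in plain Python. The helper names effective_split_fractions and val_size_for_overall_fraction are hypothetical and are not part of ai_training_data.py.

```python
def effective_split_fractions(test_size=0.2, val_size=0.1):
    """Return each subset's share of the full dataset for a two-stage split.

    Mirrors the order used by split_data: the test set is carved out first,
    then val_size is applied to the remainder.
    """
    if not (0 < test_size < 1 and 0 < val_size < 1):
        raise ValueError("test_size and val_size must be between 0 and 1")
    remainder = 1.0 - test_size          # share left after the test split
    validation = val_size * remainder    # validation share of the full dataset
    train = remainder - validation       # what is finally left for training
    return {"train": train, "validation": validation, "test": test_size}


def val_size_for_overall_fraction(overall_val, test_size=0.2):
    """Return the val_size that yields overall_val of the full dataset."""
    remainder = 1.0 - test_size
    if not 0 < overall_val < remainder:
        raise ValueError("overall_val must be positive and smaller than 1 - test_size")
    return overall_val / remainder


if __name__ == "__main__":
    for name, share in effective_split_fractions().items():
        print(f"{name}: {share:.2%}")
    print("val_size for a 10% overall validation set:",
          val_size_for_overall_fraction(0.10))
```

With the default arguments this gives roughly 72% training, 8% validation, and 20% test data. To get a validation set of 10% of the full dataset, val_size would need to be about 0.125. Actual row counts can differ slightly, since train_test_split rounds to whole rows.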

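clean_data drops every row that has at least one missing value, which can remove a large share of a sparse dataset. Measuring the expected loss before cleaning may help. The following is a minimal sketch of such a check. cleaning_report is a hypothetical helper, not part of the script, and the sample frame is made-up illustration data.

```python
import pandas as pd


def cleaning_report(df):
    """Summarize what dropna() followed by drop_duplicates() would remove."""
    # Rows that contain at least one missing value in any column.
    rows_with_nulls = int(df.isna().any(axis=1).sum())
    # Duplicates are counted after null rows are gone, matching clean_data's order.
    without_nulls = df.dropna()
    duplicate_rows = int(without_nulls.duplicated().sum())
    return {
        "rows_total": len(df),
        "rows_with_nulls": rows_with_nulls,
        "duplicate_rows": duplicate_rows,
        "rows_remaining": len(df) - rows_with_nulls - duplicate_rows,
    }


if __name__ == "__main__":
    # Illustrative data only: two rows with a null, one exact duplicate.
    sample = pd.DataFrame(
        {
            "feature": [1.0, 2.0, 2.0, None, 5.0],
            "label": ["a", "b", "b", "a", None],
        }
    )
    print(cleaning_report(sample))
```

If rows_with_nulls is a large fraction of rows_total, imputing values or dropping only selected columns may be a better fit than the blanket dropna() used in clean_data.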
Dependencies

The implementation above imports three third-party packages: pandas, scikit-learn (sklearn.model_selection), and TensorFlow (tensorflow.keras.preprocessing.image).

Integration with the G.O.D Framework

Future Enhancements