Introduction
The ai_training_data.py script is an essential component of the G.O.D Framework. It focuses on managing and preparing data for training AI models. The script is designed to handle diverse datasets, perform preprocessing tasks, ensure data integrity, and create optimized pipelines to feed data into training algorithms.
Purpose
The primary objectives of this script include:
- Loading and parsing structured and unstructured training datasets.
- Performing data cleansing, normalization, and augmentation for training workflows.
- Splitting data into training, validation, and testing subsets in an optimized manner.
- Creating reusable data pipelines for scalable AI/ML model training processes.
- Ensuring data formats remain compatible with downstream ML libraries and frameworks.
Key Features
- Data Loading: Supports multiple input formats such as CSV, JSON, SQL databases, and parquet files.
- Augmentation: Provides augmentation techniques to generate diversified training data.
- Splitting: Supports automated data splitting into training, validation, and testing sets.
- Validation: Performs data validation checks to flag null values, duplicates, or inconsistencies.
- Streaming Pipelines: Builds streaming pipelines that preprocess large datasets in memory, chunk by chunk, rather than loading them all at once. Both features are illustrated in the sketch after this list.
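The validation and streaming behaviours are not shown in the implementation example below, so the following is a minimal sketch of how such checks and a chunked pipeline could look using pandas. The helper names validate_frame and stream_preprocess, and the chunk size, are illustrative assumptions rather than the script's documented API.

import pandas as pd

def validate_frame(df):
    """Illustrative validation report: flags nulls, duplicate rows, and mixed-type columns."""
    return {
        "null_counts": df.isna().sum().to_dict(),      # nulls per column
        "duplicate_rows": int(df.duplicated().sum()),  # exact duplicate rows
        "mixed_type_columns": [
            col for col in df.columns
            if df[col].map(type).nunique() > 1         # more than one Python type in one column
        ],
    }

def stream_preprocess(file_path, chunk_size=50_000):
    """Illustrative streaming pipeline: cleans a large CSV chunk by chunk instead of loading it whole."""
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        yield chunk.dropna().drop_duplicates()         # same cleaning steps as clean_data, per chunk

Each yielded chunk can then be written out incrementally or fed to a model that supports partial fitting.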
Logic and Implementation
The script integrates libraries like pandas, sklearn, and tensorflow to streamline training data preparation. Below is an implementation example:
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
class TrainingDataManager:
"""
Handles training data preparation, including loading, cleaning, augmentation, and splitting.
"""
    def load_data(self, file_path, file_format="csv"):
        """
        Load dataset into a pandas DataFrame.

        Args:
            file_path (str): Path to the data file.
            file_format (str): Format of the file (csv, json, etc.).

        Returns:
            DataFrame: Pandas DataFrame containing the dataset.
        """
        if file_format == 'csv':
            data = pd.read_csv(file_path)
        elif file_format == 'json':
            data = pd.read_json(file_path)
        else:
            raise ValueError("Unsupported file format")
        return data
    def clean_data(self, df):
        """
        Cleans raw data by handling null values and duplicates.

        Args:
            df (DataFrame): Input data.

        Returns:
            DataFrame: Cleaned data.
        """
        df = df.dropna()           # Remove missing values
        df = df.drop_duplicates()  # Remove duplicates
        return df
    def split_data(self, df, target_column, test_size=0.2, val_size=0.1, seed=42):
        """
        Split the dataset into training, validation, and testing subsets.

        Args:
            df (DataFrame): Input dataset.
            target_column (str): Target column name for ML training.
            test_size (float): Proportion of the dataset for testing.
            val_size (float): Proportion of the training set for validation.
            seed (int): Random state for reproducibility.

        Returns:
            dict: A dictionary with train, validation, and test sets.
        """
        train, test = train_test_split(df, test_size=test_size, random_state=seed, stratify=df[target_column])
        train, val = train_test_split(train, test_size=val_size, random_state=seed, stratify=train[target_column])
        return {"train": train, "validation": val, "test": test}
    def augment_images(self, image_dir, save_dir, target_size=(224, 224), batch_size=32):
        """
        Perform image data augmentation using Keras's ImageDataGenerator.

        Args:
            image_dir (str): Directory of raw images (one subdirectory per class).
            save_dir (str): Directory to save augmented images.
            target_size (tuple): Image dimensions for resizing.
            batch_size (int): Batch size for the data generator.

        Returns:
            DirectoryIterator: Generator that yields augmented batches and writes
            copies to save_dir as it is iterated.
        """
        datagen = ImageDataGenerator(
            rescale=1. / 255,        # Normalize pixel values
            rotation_range=30,       # Random rotation
            width_shift_range=0.2,   # Horizontal shift
            height_shift_range=0.2,  # Vertical shift
            shear_range=0.2,         # Shear transformation
            zoom_range=0.2,          # Zoom
            horizontal_flip=True,    # Horizontal flip
            fill_mode='nearest'      # Filling strategy for new pixels
        )
        generator = datagen.flow_from_directory(
            image_dir,
            target_size=target_size,
            batch_size=batch_size,
            save_to_dir=save_dir,
            class_mode='categorical'
        )
        return generator
# Example Usage
if __name__ == "__main__":
    manager = TrainingDataManager()
    data = manager.load_data("data/dataset.csv")
    cleaned_data = manager.clean_data(data)
    splits = manager.split_data(cleaned_data, target_column="label")
    print("Training Data:", splits['train'].shape)
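Because flow_from_directory only writes augmented copies to save_dir as batches are actually drawn, a caller would typically iterate the returned generator. The snippet below is a usage sketch assuming an images/raw directory with one subfolder per class and an images/augmented output directory; both paths are illustrative.

import os

manager = TrainingDataManager()
os.makedirs("images/augmented", exist_ok=True)       # save_to_dir must already exist
generator = manager.augment_images("images/raw", "images/augmented")
for _ in range(5):                                   # draw a few batches; each writes augmented copies
    batch_images, batch_labels = next(generator)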
Dependencies
- pandas: For flexible DataFrame operations on tabular data.
- sklearn: For automated dataset splitting and preprocessing.
- tensorflow: For handling and augmenting image data.
Integration with the G.O.D Framework
- ai_training_model.py: Feeds prepared data directly into training pipelines (a hypothetical hand-off sketch follows this list).
- ai_data_preparation.py: Acts as a preprocessing engine for raw datasets.
- ai_model_validation.py: Provides comprehensively prepared datasets for validation workflows.
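The training module's interface is not documented here, so the following hand-off is only a sketch: the assumed import of train_model from ai_training_model and its (features, labels) signature are hypothetical, shown purely to illustrate how the prepared splits could flow downstream.

manager = TrainingDataManager()
data = manager.clean_data(manager.load_data("data/dataset.csv"))
splits = manager.split_data(data, target_column="label")

X_train = splits["train"].drop(columns=["label"])    # feature columns
y_train = splits["train"]["label"]                   # target column

# from ai_training_model import train_model          # hypothetical import, for illustration only
# model = train_model(X_train, y_train)              # assumed (features, labels) entry point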
Future Enhancements
- Upgrade data augmentation with Generative Adversarial Networks (GANs) for synthetic data generation.
- Implement automated schema inference for raw, unstructured datasets (a rough starting point is sketched after this list).
- Extend support for streaming data and real-time preprocessing pipelines.
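Automated schema inference is planned rather than implemented, so the sketch below only illustrates what a pandas-based starting point could look like; the infer_schema helper and its category thresholds are assumptions, not part of the framework.

import pandas as pd

def infer_schema(df):
    """Illustrative schema inference: maps each column to an inferred logical type."""
    schema = {}
    for col in df.columns:
        series = df[col]
        if pd.api.types.is_numeric_dtype(series):
            schema[col] = "numeric"
        elif pd.api.types.is_datetime64_any_dtype(series):
            schema[col] = "datetime"
        elif series.nunique(dropna=True) <= max(10, int(0.05 * len(series))):
            schema[col] = "categorical"   # low-cardinality text treated as categorical
        else:
            schema[col] = "text"
    return schema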