G.O.D. Framework

Script: ai_data_validation.py - Ensuring Data Integrity and Cleanup

Introduction

The ai_data_validation.py script focuses on ensuring data integrity across the G.O.D. framework. It implements comprehensive validation mechanisms to sanitize incoming data streams, ensuring they adhere to the expected schema and quality requirements before they are fed into the system's core processes.

Purpose

Key Features

Logic and Implementation

ai_data_validation.py is built from modular validation functions covering tasks such as schema validation, domain constraint enforcement, and missing-value handling. Below is a sample implementation:


            import pandas as pd
            import logging

            class DataValidator:
                def __init__(self, schema):
                    """
                    Initialize the DataValidator with a predefined schema.
                    :param schema: Dictionary mapping columns to their expected types.
                    """
                    self.schema = schema

                def validate_schema(self, dataframe):
                    """
                    Validate the dataframe's schema.
                    :param dataframe: Input pandas DataFrame.
                    :return: Boolean, indicating success or failure of validation.
                    """
                    for column, dtype in self.schema.items():
                        if column not in dataframe:
                            logging.warning(f"Missing column: {column}")
                            return False
                        if not self._check_column_type(dataframe[column], dtype):
                            logging.error(f"Column '{column}' failed type check. Expected type: {dtype}.")
                            return False
                    return True

                def _check_column_type(self, column, expected_type):
                    """
                    Internal function to check the data type of a column.
                    :param column: DataFrame column.
                    :param expected_type: Expected data type as string.
                    :return: Boolean, indicating if the type matches.
                    """
                    try:
                        if expected_type == 'numeric':
                            return pd.api.types.is_numeric_dtype(column)
                        elif expected_type == 'string':
                            return pd.api.types.is_string_dtype(column)
                        elif expected_type == 'datetime':
                            return pd.api.types.is_datetime64_any_dtype(column)
                    except Exception as e:
                        logging.error(f"Error when checking column type: {e}")
                        return False
                    # Unrecognized type strings fail validation rather than passing silently.
                    logging.warning(f"Unsupported expected type: {expected_type}")
                    return False

                def drop_missing(self, dataframe):
                    """
                    Remove rows with missing values.
                    :param dataframe: Input pandas DataFrame.
                    :return: Cleaned DataFrame with rows containing missing values dropped.
                    """
                    return dataframe.dropna()

                def handle_nulls(self, dataframe, method='mean'):
                    """
                    Handle missing values by imputing with a specified method.
                    :param dataframe: Input pandas DataFrame.
                    :param method: String. Options: 'mean', 'median', or 'mode'.
                    :return: Modified DataFrame with imputations applied.
                    """
                    for column in dataframe.columns:
                        if dataframe[column].isnull().sum() > 0:
                            # Assign the filled column back; calling fillna with
                            # inplace=True on a column slice is unreliable under
                            # pandas copy-on-write.
                            if method == 'mean' and pd.api.types.is_numeric_dtype(dataframe[column]):
                                dataframe[column] = dataframe[column].fillna(dataframe[column].mean())
                            elif method == 'median' and pd.api.types.is_numeric_dtype(dataframe[column]):
                                dataframe[column] = dataframe[column].fillna(dataframe[column].median())
                            elif method == 'mode':
                                dataframe[column] = dataframe[column].fillna(dataframe[column].mode()[0])
                    return dataframe

            if __name__ == "__main__":
                schema = {'id': 'numeric', 'name': 'string', 'age': 'numeric'}
                validator = DataValidator(schema)

                # Example dataframe
                df = pd.DataFrame({
                    'id': [1, 2, None],
                    'name': ['Alice', 'Bob', None],
                    'age': [25, None, 30]
                })
                print("Original DataFrame:")
                print(df)

                # Validate schema
                print("\nSchema validation:")
                print(validator.validate_schema(df))

                # Handle nulls
                print("\nHandling null values with mean:")
                clean_df = validator.handle_nulls(df, method='mean')
                print(clean_df)
            
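The listing above covers schema validation and missing-value handling, but not the domain constraint enforcement mentioned earlier. A minimal sketch of what such a check could look like, assuming a simple per-column (min, max) range format (the validate_constraints function and its constraint format are illustrative assumptions, not part of the script):

```python
import logging

import pandas as pd

def validate_constraints(dataframe, constraints):
    """
    Enforce simple domain constraints on numeric columns.
    :param dataframe: Input pandas DataFrame.
    :param constraints: Dict mapping column names to inclusive (min, max) tuples.
    :return: Boolean, True if every constrained column is within range.
    """
    for column, (low, high) in constraints.items():
        if column not in dataframe:
            logging.warning(f"Missing column: {column}")
            return False
        # NaN values are skipped here; missing-value handling is a separate step.
        violations = ~dataframe[column].dropna().between(low, high)
        if violations.any():
            logging.error(f"Column '{column}' has values outside [{low}, {high}].")
            return False
    return True
```

A schema-valid frame can still fail such a check (for example, a negative age), which is why constraint enforcement is listed as a separate feature from schema validation.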

Dependencies

The script primarily relies on the following libraries:

  - pandas: DataFrame handling, imputation, and dtype checks via pd.api.types.
  - logging: standard-library module used to report validation warnings and errors.

How to Use This Script

The script can be used as follows:

  1. Define a schema specifying the expected column names and types.
  2. Use validation functions to ensure the schema is met.
  3. Apply null handling techniques to clean the dataset.

            from ai_data_validation import DataValidator

            schema = {'id': 'numeric', 'score': 'numeric', 'timestamp': 'datetime'}
            validator = DataValidator(schema)

            # Example data validation. load_some_dataframe() is a placeholder
            # for your own data-loading routine.
            df = load_some_dataframe()
            if validator.validate_schema(df):
                cleaned_df = validator.drop_missing(df)
            
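Note that timestamp columns loaded from CSV or JSON usually arrive as plain strings, which would fail the 'datetime' check. A common preparatory step (sketched here with column names mirroring the schema above) is to coerce such columns with pd.to_datetime before validating:

```python
import pandas as pd

# A raw frame whose 'timestamp' column arrives as strings.
df = pd.DataFrame({
    'id': [1, 2],
    'score': [0.9, 0.7],
    'timestamp': ['2024-01-01', '2024-01-02'],
})

# Coerce to datetime; with errors='coerce', unparseable values
# become NaT instead of raising an exception.
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')

print(pd.api.types.is_datetime64_any_dtype(df['timestamp']))  # True
```

Any resulting NaT values can then be dropped or imputed with the missing-value helpers described above.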

Role in the G.O.D. Framework

Future Enhancements