Introduction
The ai_data_validation.py script focuses on ensuring data integrity across the G.O.D. framework. It implements comprehensive validation mechanisms to sanitize incoming data streams, ensuring they adhere to the expected schema and quality requirements before they feed into the system's core processes.
Purpose
- Data Schema Validation: Ensure datasets comply with predefined column types, formats, or shapes.
- Data Sanitization: Cleanse data by removing or correcting invalid or incomplete entries.
- Error Reporting: Provide detailed logs or feedback for datasets that fail validation.
- Integration Ready: Confirm that data has been pre-processed and is ready for integration with other modules such as training or monitoring.
Key Features
- Schema Enforcement: Verify that received data complies with the schema (e.g., column types).
- Null Handling: Identify and handle missing data via imputation or removal.
- Type Checking: Validate expected data types such as numeric, categorical, or string.
- Domain Constraints: Enforce rules such as value ranges, specific enumerations, or unique row attributes (a rough sketch of such a check follows this list).
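The sample implementation below covers schema and null handling but not domain constraints, so the following sketch is purely illustrative: the rules dictionary and the check_constraints helper are assumptions made for this page, not part of ai_data_validation.py.

import pandas as pd

# Hypothetical rule set: a value range for 'age' and an enumeration for 'status'.
rules = {
    'age': {'min': 0, 'max': 120},
    'status': {'allowed': {'active', 'inactive'}},
}

def check_constraints(dataframe, rules):
    """Return True only if every column satisfies its range/enumeration rule."""
    for column, rule in rules.items():
        if column not in dataframe.columns:
            continue  # absent columns are a schema problem, not a constraint problem
        series = dataframe[column].dropna()
        if 'min' in rule and (series < rule['min']).any():
            return False
        if 'max' in rule and (series > rule['max']).any():
            return False
        if 'allowed' in rule and not series.isin(rule['allowed']).all():
            return False
    return True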
Logic and Implementation
ai_data_validation.py is organized around modular validation functions that cover schema validation, domain-constraint enforcement, and missing-value handling.
Below is a sample implementation:
import pandas as pd
import logging


class DataValidator:
    def __init__(self, schema):
        """
        Initialize the DataValidator with a predefined schema.

        :param schema: Dictionary mapping columns to their expected types.
        """
        self.schema = schema

    def validate_schema(self, dataframe):
        """
        Validate the dataframe's schema.

        :param dataframe: Input pandas DataFrame.
        :return: Boolean indicating success or failure of validation.
        """
        for column, dtype in self.schema.items():
            if column not in dataframe.columns:
                logging.warning(f"Missing column: {column}")
                return False
            if not self._check_column_type(dataframe[column], dtype):
                logging.error(f"Column '{column}' failed type check. Expected type: {dtype}.")
                return False
        return True

    def _check_column_type(self, column, expected_type):
        """
        Internal helper to check the data type of a column.

        :param column: DataFrame column (pandas Series).
        :param expected_type: Expected data type as a string ('numeric', 'string', or 'datetime').
        :return: Boolean indicating whether the type matches.
        """
        try:
            if expected_type == 'numeric':
                return pd.api.types.is_numeric_dtype(column)
            elif expected_type == 'string':
                return pd.api.types.is_string_dtype(column)
            elif expected_type == 'datetime':
                return pd.api.types.is_datetime64_any_dtype(column)
        except Exception as e:
            logging.error(f"Error when checking column type: {e}")
            return False
        # Unknown expected_type strings fail validation.
        return False

    def drop_missing(self, dataframe):
        """
        Remove rows with missing values.

        :param dataframe: Input pandas DataFrame.
        :return: Cleaned DataFrame with rows containing missing values dropped.
        """
        return dataframe.dropna()

    def handle_nulls(self, dataframe, method='mean'):
        """
        Handle missing values by imputing with a specified method.

        :param dataframe: Input pandas DataFrame.
        :param method: String. Options: 'mean', 'median', or 'mode'.
        :return: Modified DataFrame with imputations applied.
        """
        for column in dataframe.columns:
            if dataframe[column].isnull().sum() > 0:
                if method == 'mean' and pd.api.types.is_numeric_dtype(dataframe[column]):
                    dataframe[column] = dataframe[column].fillna(dataframe[column].mean())
                elif method == 'median' and pd.api.types.is_numeric_dtype(dataframe[column]):
                    dataframe[column] = dataframe[column].fillna(dataframe[column].median())
                elif method == 'mode':
                    # 'mode' works for both numeric and non-numeric columns.
                    dataframe[column] = dataframe[column].fillna(dataframe[column].mode()[0])
        return dataframe


if __name__ == "__main__":
    schema = {'id': 'numeric', 'name': 'string', 'age': 'numeric'}
    validator = DataValidator(schema)

    # Example dataframe with a missing value in each column
    df = pd.DataFrame({
        'id': [1, 2, None],
        'name': ['Alice', 'Bob', None],
        'age': [25, None, 30]
    })

    print("Original DataFrame:")
    print(df)

    # Validate schema
    print("\nSchema validation:")
    print(validator.validate_schema(df))

    # Handle nulls (non-numeric columns are only imputed when method='mode')
    print("\nHandling null values with mean:")
    clean_df = validator.handle_nulls(df, method='mean')
    print(clean_df)
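With a current pandas release, this example should report the schema as valid and then impute the missing id and age values with their column means (1.5 and 27.5), while name stays missing, because mean imputation only applies to numeric columns; choose method='mode' when string columns also need filling.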
Dependencies
The script primarily relies on the following libraries:
- pandas: Essential library for tabular data manipulation.
- logging: Standard-library module for recording validation warnings and errors.
How to Use This Script
The script can be used as follows:
- Define a schema specifying the expected column names and types.
- Use validation functions to ensure the schema is met.
- Apply null handling techniques to clean the dataset.
from ai_data_validation import DataValidator

schema = {'id': 'numeric', 'score': 'numeric', 'timestamp': 'datetime'}
validator = DataValidator(schema)

# Example data validation
df = load_some_dataframe()  # placeholder for your own data-loading routine
if validator.validate_schema(df):
    cleaned_df = validator.drop_missing(df)
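If dropping whole rows loses too much data, the same validator's imputation helper can be used instead. A minimal variant of the snippet above, reusing the same hypothetical schema and placeholder loader:

from ai_data_validation import DataValidator

schema = {'id': 'numeric', 'score': 'numeric', 'timestamp': 'datetime'}
validator = DataValidator(schema)

df = load_some_dataframe()  # placeholder, as above
if validator.validate_schema(df):
    # Fill numeric gaps with the column median rather than discarding rows
    cleaned_df = validator.handle_nulls(df, method='median')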
Role in the G.O.D. Framework
- Data Input Integrity: Ensures data entering modules like ai_training_model.py or ai_data_preparation.py meets quality standards.
- Error Isolation: Works with ai_error_tracker.py to identify and fix problematic datasets.
- Real-Time Cleanup: Prepares real-time incoming data for ai_real_time_learner.py.
Future Enhancements
- Streaming Data Validation: Incorporate real-time schema checks for streaming datasets.
- Advanced Imputation Techniques: Use machine learning models for imputing missing values.
- Custom Rules: Allow users to define and apply custom validation rules (a possible shape is sketched below).
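One way the custom-rules enhancement could look, shown purely as a sketch: register_rule, run_custom_rules, and the lambda-based rules are assumptions, not an existing API.

import logging

# Sketch only: a registry of user-defined validation callables.
custom_rules = []

def register_rule(name, check):
    """Store a (name, callable) pair; each callable takes a DataFrame and returns a bool."""
    custom_rules.append((name, check))

def run_custom_rules(dataframe):
    """Run every registered rule and log the ones that fail."""
    ok = True
    for name, check in custom_rules:
        if not check(dataframe):
            logging.error(f"Custom rule failed: {name}")
            ok = False
    return ok

# Example rules: ids must be unique, ages must be non-negative.
register_rule("unique_id", lambda df: df['id'].is_unique)
register_rule("non_negative_age", lambda df: (df['age'].dropna() >= 0).all())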