====== AI Data Preparation ======
**[[https://
===== Overview =====
The **AI Data Preparation** module provides a robust framework for preparing raw datasets for further analysis, feature engineering, and machine learning.
  * **Data Cleaning:**
    * Removes invalid entries, such as **None** values, from datasets.
    * Provides extensibility to define custom cleaning logic.
  * **Data Normalization:**
    * Scales numerical data to a standard range (e.g., 0 to 1) with Min-Max normalization.
  * **Error Handling and Logging:**
    * Logs preprocessing activities and reports errors for easier debugging.
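As a quick illustration of the Min-Max step above, here is a minimal standalone sketch (it assumes a plain list of numbers and is not the module's actual implementation):
<code python>
def min_max_normalize(values):
    """Scale a list of numbers into the [0, 1] range (Min-Max)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([1, 2, 3, 5]))  # [0.0, 0.25, 0.5, 1.0]
</code>
Each value is shifted by the minimum and divided by the overall range, so the smallest input maps to 0 and the largest to 1.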
==== Required Libraries ====
  * **logging:** For tracking preprocessing activities via logs.
  * **pandas (optional):** For preparing tabular datasets.
==== Installation ====
To install optional dependencies, run:
<code bash>
pip install pandas
</code>
| ---- | ---- | ||
Cleansing and normalizing a simple dataset:
<code python>
from ai_data_preparation import DataPreparation

# Example dataset
data = [1, None, 2, 3, None, 5]

# Step 1: Clean the dataset
cleaned_data = DataPreparation.clean_data(data)
print("Cleaned Data:", cleaned_data)

# Step 2: Normalize the cleaned dataset
normalized_data = DataPreparation.normalize_data(cleaned_data)
print("Normalized Data:", normalized_data)
</code>
**Output:**
<code plaintext>
Cleaned Data: [1, 2, 3, 5]
Normalized Data: [0.0, 0.25, 0.5, 1.0]
</code>
----
| === 1. Min-Max Normalization Extension === | === 1. Min-Max Normalization Extension === | ||
| - | Extend the `normalize_data` method to specify custom normalization ranges. | + | Extend the **normalize_data** method to specify custom normalization ranges. |
| - | ```python | + | < |
| + | python | ||
| class ExtendedNormalization(DataPreparation): | class ExtendedNormalization(DataPreparation): | ||
| @staticmethod | @staticmethod | ||
| Line 176: | Line 183: | ||
| custom_range_data = ExtendedNormalization.normalize_data([10, | custom_range_data = ExtendedNormalization.normalize_data([10, | ||
| print(" | print(" | ||
| - | ``` | + | </ |
**Output:**
<code plaintext>
Normalized Data with Custom Range: [-1.0, 0.0, 1.0]
</code>
----
=== 2. Custom Cleaning Rules ===
Modify the **clean_data** method to remove additional invalid entries, such as negative values.
<code python>
class ExtendedCleaning(DataPreparation):
    @staticmethod
    def clean_data(data):
        # Drop None values and negative numbers
        # (method body reconstructed; the original was elided)
        return [value for value in data if value is not None and value >= 0]

# Hypothetical input chosen to match the documented output
cleaned_data = ExtendedCleaning.clean_data([1, None, -3, 4, 5])
print("Cleaned Data with Custom Rules:", cleaned_data)
</code>
**Output:**
<code plaintext>
Cleaned Data with Custom Rules: [1, 4, 5]
</code>
----
=== 3. Integration with Scikit-learn Pipelines ===
Integrate the **DataPreparation** module into a Scikit-learn pipeline for end-to-end preprocessing.
<code python>
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

from ai_data_preparation import DataPreparation

# Wrap the module's static methods as pipeline steps
# (FunctionTransformer usage is a reconstruction; the original steps were elided)
pipeline = Pipeline([
    ("clean", FunctionTransformer(DataPreparation.clean_data)),
    ("normalize", FunctionTransformer(DataPreparation.normalize_data)),
])

# Hypothetical input chosen to match the documented output
data = [1, None, 3, 4]
processed_data = pipeline.fit_transform(data)
print("Processed Data via Pipeline:", processed_data)
</code>
**Output:**
<code plaintext>
Processed Data via Pipeline: [0.0, 0.6666666666666666, 1.0]
</code>
----
===== Best Practices =====
1. **Analyze Data Before Preparation:**
  - Inspect datasets for unique issues (e.g., outliers) before applying generalized cleaning rules.
2. **Normalize for ML Algorithms:**
  - Normalize numerical features so that scale-sensitive algorithms behave consistently.
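To put the first practice into action, a small inspection helper can surface missing values and value ranges before any cleaning is applied (''inspect'' is a hypothetical helper, not part of the module):
<code python>
def inspect(values):
    """Summarize a raw dataset before cleaning it (hypothetical helper)."""
    present = [v for v in values if v is not None]
    return {
        "total": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
    }

print(inspect([1, None, 2, 3, None, 5]))
# {'total': 6, 'missing': 2, 'min': 1, 'max': 5}
</code>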
The **DataPreparation** module can be extended for advanced preprocessing tasks:
  * **Custom Outlier Removal:**
    - Add logic to discard outliers based on statistical bounds (e.g., Z-scores, IQR).
  * **Feature Engineering:**
    - Extract derived metrics from datasets, such as mean, variance, or ratios.
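The outlier-removal idea above can be sketched with the IQR rule using only the standard library (''remove_outliers_iqr'' is a hypothetical helper, not an existing module method):
<code python>
import statistics

def remove_outliers_iqr(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

print(remove_outliers_iqr([1, 2, 3, 4, 5, 100]))  # [1, 2, 3, 4, 5]
</code>
The same filtering shape could be dropped into a **clean_data** override, as shown for negative values in the Custom Cleaning Rules example.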
===== Future Enhancements =====

The following additions would extend the functionality of the module:

  - **Support for Tabular Data**
    Add preprocessing for structured/tabular datasets (e.g., pandas DataFrames).

  - **Advanced Scaling Options**
    Support additional scaling techniques, such as Z-score standardization.

  - **Distributed Processing**
    Enable parallelized data preparation for large datasets using frameworks like Dask.
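The Z-score standardization mentioned under advanced scaling could look like this (a sketch of the proposed enhancement, assuming a plain list of numbers):
<code python>
import statistics

def z_score_standardize(values):
    """Transform values to zero mean and unit (population) variance."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / stdev for v in values]

print(z_score_standardize([10, 20, 30]))
</code>
Unlike Min-Max scaling, Z-scores are unbounded, which makes them less distorted by a single extreme value stretching the range.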
| ---- | ---- | ||
===== Conclusion =====
The **AI Data Preparation** module provides easy-to-use and extensible tools for preparing datasets for machine learning pipelines and data workflows. With its robust cleaning, normalization, and logging capabilities, it offers a dependable foundation for building custom preprocessing workflows.
ai_data_preparation.1748195190.txt.gz · Last modified: 2025/05/25 17:46 by eagleeyenebula
