====== AI Insert Training Data ======
The TrainingDataInsert class facilitates adding new data into existing training datasets seamlessly. It serves as a foundational tool for managing, updating, and extending datasets in machine learning pipelines. The class ensures logging and modularity for integration into larger AI systems.
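The examples on this page all call `TrainingDataInsert.add_data`, whose implementation is not reproduced here. For orientation, below is a minimal sketch of what the base class could look like, assuming a static method that logs the operation, appends the new points, and returns the updated dataset; the actual class in the module may differ.
<code python>
import logging

class TrainingDataInsert:
    """Minimal sketch of the base class assumed by the examples below."""

    @staticmethod
    def add_data(new_data, existing_data):
        # Hypothetical behavior: log, append, and return the updated
        # dataset (the real implementation may differ).
        logging.info("Inserting new data into the training dataset.")
        updated = list(existing_data) + list(new_data)
        logging.info("Data insertion complete.")
        return updated
</code>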
==== Example 2: Logging During Data Insertion ====
This example highlights how logging ensures transparency in data insertion.
<code python>
import logging
from ai_insert_training_data import TrainingDataInsert
</code>
**Enable logging**
<code python>
logging.basicConfig(level=logging.INFO)
</code>
**Datasets**
<code python>
existing_data = [1, 2, 3]
new_data = [4, 5, 6]
</code>
**Add new data while reviewing logging information in real-time**
<code python>
TrainingDataInsert.add_data(new_data, existing_data)

# Output:
# INFO: ...
# INFO: ...
</code>
**Explanation**:
  * Logs are automatically generated to indicate when data insertion starts and successfully completes.
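For a persistent audit trail rather than console output, the standard logging module can route the same messages to a file. A minimal sketch; the file name and format string here are illustrative choices, not part of the module:
<code python>
import logging

# Write INFO-level insertion logs to a file for later auditing.
logging.basicConfig(
    level=logging.INFO,
    filename="insertions.log",
    format="%(asctime)s %(levelname)s %(message)s",
)
</code>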
==== Example 3: Extension - Validation of Data ====
This example expands the functionality by adding validation to ensure data integrity.
<code python>
class ValidatingTrainingDataInsert(TrainingDataInsert):
    """
    Extends TrainingDataInsert with validation to preserve data integrity.
    """

    @staticmethod
    def add_validated_data(new_data, existing_data, validator):
        # Reject the whole batch if any data point fails validation.
        invalid = [point for point in new_data if not validator(point)]
        if invalid:
            raise ValueError(f"Invalid data points detected: {invalid}")
        logging.info("All data points passed validation.")
        return TrainingDataInsert.add_data(new_data, existing_data)
</code>
**Example validation function**
<code python>
def validate_data(data_point):
    return isinstance(data_point, int) and data_point > 0
</code>
**Example Usage**
<code python>
existing_set = [10, 20, 30]
new_set = [40, 50, -10]  # Invalid data included

try:
    ValidatingTrainingDataInsert.add_validated_data(new_set, existing_set, validate_data)
except ValueError as e:
    print(e)
</code>
**Explanation**:
  * The validation logic ensures only positive integer data points are added.
  * Invalid data triggers exceptions, preserving dataset integrity.
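Raising an exception rejects the entire batch. In some pipelines it may be preferable to drop the offending points and insert the rest; a possible variant, building on the same pattern (the class and method names here are illustrative):
<code python>
class FilteringTrainingDataInsert(TrainingDataInsert):
    """Illustrative variant: drop invalid points instead of rejecting the batch."""

    @staticmethod
    def add_filtered_data(new_data, existing_data, validator):
        # Keep only points that pass validation; warn about the rest.
        valid = [point for point in new_data if validator(point)]
        dropped = len(new_data) - len(valid)
        if dropped:
            logging.warning("Dropped %d invalid data points.", dropped)
        return TrainingDataInsert.add_data(valid, existing_data)
</code>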
==== Example 4: Extension - Avoiding Duplicate Data ====
This example prevents duplication in the updated dataset.
<code python>
class UniqueTrainingDataInsert(TrainingDataInsert):
    """
    Extends TrainingDataInsert so that already-present data points are skipped.
    """

    @staticmethod
    def add_unique_data(new_data, existing_data):
        # Keep only data points not already in the dataset.
        unique_new_data = [point for point in new_data if point not in existing_data]
        return TrainingDataInsert.add_data(unique_new_data, existing_data)
</code>
**Example**
<code python>
existing_dataset = ["sample_a", "sample_b"]
new_dataset = ["sample_b", "sample_c"]  # "sample_b" is a duplicate
</code>
**Add unique data only**
<code python>
updated_dataset = UniqueTrainingDataInsert.add_unique_data(new_dataset, existing_dataset)
print("Unique Updated Dataset:", updated_dataset)
</code>
**Output:**
<code>
# Unique Updated Dataset: ['sample_a', 'sample_b', 'sample_c']
</code>
**Explanation**:
  * Ensures no duplicate data points are added to the dataset.
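A note on scale: checking membership in a list is O(n) per point, so deduplicating a large batch this way is quadratic. If the data points are hashable, a set makes each check constant-time; a minimal sketch under that assumption (the helper name is illustrative):
<code python>
def dedupe_against(new_data, existing_data):
    # Hashable data points only; a set gives O(1) membership checks.
    # Also removes duplicates within new_data itself.
    seen = set(existing_data)
    unique = []
    for point in new_data:
        if point not in seen:
            seen.add(point)
            unique.append(point)
    return unique
</code>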
==== Example 5: Persistent Dataset Updates ====
This example saves the updated dataset for future use or offline storage.
<code python>
import json
import logging

class PersistentDataInsert(TrainingDataInsert):
    """
    Extends TrainingDataInsert with JSON-based saving and loading of datasets.
    """

    @staticmethod
    def save_dataset(dataset, file_path):
        with open(file_path, "w") as file:
            json.dump(dataset, file)
        logging.info("Dataset saved to %s", file_path)

    @staticmethod
    def load_dataset(file_path):
        with open(file_path) as file:
            return json.load(file)
</code>
**Example Usage**
<code python>
dataset = ["sample_a", "sample_b"]
PersistentDataInsert.save_dataset(dataset, "dataset.json")
</code>
**Load and verify**
<code python>
loaded_data = PersistentDataInsert.load_dataset("dataset.json")
print("Loaded Dataset:", loaded_data)

# Output:
# INFO: ...
# Loaded Dataset: ['sample_a', 'sample_b']
</code>
**Explanation**:
  * Allows datasets to be saved and retrieved for persistent storage and long-term use.
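One practical caveat: a crash in the middle of a save can leave a truncated JSON file behind. Writing to a temporary file and renaming it into place makes the save atomic on most platforms; a hedged sketch (the helper name is illustrative):
<code python>
import json
import os
import tempfile

def save_dataset_atomic(dataset, file_path):
    # Write to a temporary file in the same directory, then atomically
    # replace the target so readers never see a half-written file.
    directory = os.path.dirname(os.path.abspath(file_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            json.dump(dataset, tmp_file)
        os.replace(tmp_path, file_path)
    except BaseException:
        os.remove(tmp_path)
        raise
</code>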
===== Use Cases =====
1. **Incremental Data Updates for ML Training**:
  * Append data during active training to improve accuracy and adaptability.
2. **Dynamic Data Pipelines**:
  * Use logging and insertion to build real-time data pipelines that grow dynamically based on user input or live feedback (see the sketch after this list).
3. **Data Validation and Cleanup**:
  * Integrate validation or deduplication logic to maintain high-quality datasets while scaling.
4. **Persistent Dataset Management**:
  * Enable training workflows to store and retrieve datasets across sessions.
5. **Integration with Pre-Processing Frameworks**:
  * Combine with tools for data formatting or augmentation prior to ML workflows.
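To make the dynamic-pipeline use case concrete, here is a minimal sketch of a feedback loop that validates and appends live input. It assumes the `TrainingDataInsert` sketch above and the `validate_data` function from Example 3; the sample inputs are stand-ins for a real feed:
<code python>
import logging

logging.basicConfig(level=logging.INFO)

dataset = []
for raw in ["12", "7", "-3", "not a number"]:  # stand-in for live user input
    try:
        point = int(raw)
    except ValueError:
        logging.warning("Skipping non-numeric input: %r", raw)
        continue
    if validate_data(point):  # reuse the validator from Example 3
        dataset = TrainingDataInsert.add_data([point], dataset)
    else:
        logging.warning("Rejected invalid data point: %r", point)
</code>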
===== Best Practices =====
1. **Validate New Data**:
  * Always validate and sanitize input data before appending it to your datasets.
2. **Monitor Logs**:
  * Enable logging to debug and audit data injection processes effectively.
3. **Avoid Duplicates**:
  * Ensure no redundant data is added to the training set.
4. **Persist Critical Datasets**:
  * Save updates to datasets regularly to prevent loss during crashes or interruptions.
5. **Scalable Design**:
  * Extend or combine `TrainingDataInsert` with larger ML pipeline components for end-to-end coverage (a combined sketch follows this list).
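As one illustration of the last practice, the extensions above can be layered into a single helper. A hedged sketch that assumes the classes and functions from Examples 3-5 (the helper name is illustrative):
<code python>
def insert_clean_and_persist(new_data, existing_data, file_path):
    # Validate first, then drop duplicates, then persist the result.
    validated = [point for point in new_data if validate_data(point)]
    updated = UniqueTrainingDataInsert.add_unique_data(validated, existing_data)
    PersistentDataInsert.save_dataset(updated, file_path)
    return updated
</code>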
===== Conclusion =====
The **TrainingDataInsert** class offers a lightweight and modular solution for managing and updating training datasets. With extensibility options such as validation, deduplication, and persistence, it adapts readily to the needs of larger AI systems.
Built to accommodate both batch and incremental data updates, the class simplifies the process of maintaining dynamic datasets in production environments. Developers can define pre-processing hooks, enforce schema consistency, and extend insertion behavior without modifying the core class.
Furthermore, its integration-ready structure supports embedding into automated MLOps pipelines and active learning workflows.