====== ai_data_detection ======
  * **Missing Value Detection:** Identifies any **NaN** or **Null** values present in the dataset.
  * **Duplicate Row Detection:** Identifies duplicate rows in the dataset.
==== Installation ====
To install the required libraries, use the following:
<code bash>
pip install pandas
</code>
----
==== Basic Example ====
Using the **has_issues** method to detect data issues.
<code python>
import pandas as pd
from ai_data_detection import DataDetection

# Create a sample dataset (values are representative: one missing
# entry and one duplicated row trigger the checks; the original
# values are truncated in this revision of the page)
data = pd.DataFrame({
    'A': [1, 2, 2, None],
    'B': ['x', 'y', 'y', 'z']
})

# Create an instance of DataDetection
detector = DataDetection()

# Check for data issues
if detector.has_issues(data):
    print("The dataset has quality issues.")
else:
    print("The dataset has no quality issues.")
</code>
**Output:**
<code plaintext>
WARNING: Data issues detected in the dataset.
The dataset has quality issues.
</code>
----
=== 1. Customizing Data Checks ===
You can extend the **DataDetection** class to add checks for other data quality metrics, such as outliers or invalid values.

**Example: Adding Outlier Detection**
<code python>
import numpy as np
import pandas as pd
from ai_data_detection import DataDetection

class ExtendedDataDetection(DataDetection):
    # Z-score based sketch; the original class body is elided in this
    # revision of the page, so names and threshold are representative.
    def has_outliers(self, data, z_threshold=3.0):
        numeric = data.select_dtypes(include=[np.number])
        z_scores = (numeric - numeric.mean()) / numeric.std()
        return bool((z_scores.abs() > z_threshold).any().any())

data = pd.DataFrame({'A': [10] * 15 + [100]})
extended_detector = ExtendedDataDetection()
if extended_detector.has_outliers(data):
    print("WARNING: Outliers detected in the dataset.")
</code>
**Output:**
<code plaintext>
WARNING: Outliers detected in the dataset.
</code>
----
=== 2. Integrating DataDetection into a Pipeline ===
This module can be integrated as part of a Scikit-learn pipeline for preprocessing.
<code python>
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from ai_data_detection import DataDetection

# Representative wrapper step; the original step definitions are
# elided in this revision of the page.
class DataDetectionStep(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        if DataDetection().has_issues(X):
            print("WARNING: Data issues detected before preprocessing.")
        return self

    def transform(self, X):
        return X

pipeline = Pipeline([
    ('detection', DataDetectionStep())
])
</code>
----
=== 3. Handling Large Datasets ===
For large datasets, optimize checks using chunk-based processing in Pandas:
<code python>
import pandas as pd
from ai_data_detection import DataDetection

# The second parameter name and default are representative; the
# original signature is truncated in this revision of the page.
def has_issues_in_chunks(file_path, chunk_size=100_000):
    detector = DataDetection()
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        if detector.has_issues(chunk):
            # Stop at the first chunk with issues
            return True
    return False
</code>
----
===== Best Practices =====
1. **Use Incremental Checks:** Perform quality checks at different stages of the pipeline (e.g., after loading raw data and after preprocessing steps).

2. **Automate Logging:** Set up centralized logging for tracking data issues across multiple datasets.

3. **Adapt Custom Methods:** Extend the module for domain-specific checks, such as outlier detection, range checks, or invalid category detection.

4. **Handle Issues Early:** Address identified data issues before training machine learning models.
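Practices 1 and 2 can be combined in a small sketch: run a check after each pipeline stage and route findings through Python's standard **logging** module. The **check_stage** helper and its inline missing-value/duplicate tests are illustrative stand-ins, not part of the **ai_data_detection** API:
<code python>
import logging

import pandas as pd

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("data_quality")

def check_stage(data, stage):
    # Stand-in checks; a real pipeline would call DataDetection().has_issues(data)
    issues = []
    if data.isnull().any().any():
        issues.append("missing values")
    if data.duplicated().any():
        issues.append("duplicate rows")
    for issue in issues:
        logger.warning("%s: %s detected", stage, issue)
    return bool(issues)

raw = pd.DataFrame({'A': [1, 1, None], 'B': [2, 2, 3]})
check_stage(raw, "after_load")             # logs missing values + duplicates
clean = raw.dropna().drop_duplicates()
check_stage(clean, "after_preprocessing")  # logs nothing
</code>
Tagging each warning with the stage name makes the centralized log searchable when many datasets flow through the same pipeline.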
**Example: Adding Invalid Category Detection**
<code python>
import pandas as pd

# The second parameter name is representative (the original signature
# is truncated); it maps column name -> set of allowed values.
def has_invalid_categories(data, valid_categories):
    for col in data.select_dtypes(include=['object', 'category']).columns:
        if col in valid_categories:
            if (~data[col].isin(valid_categories[col])).any():
                return True
    return False
</code>
----
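The same pattern covers the range checks mentioned under Best Practices. The helper below is a hedged sketch under assumed semantics (inclusive bounds, a mapping of column name to (min, max)); it is not part of the **ai_data_detection** module:
<code python>
import pandas as pd

def has_out_of_range(data, valid_ranges):
    # valid_ranges: {column name: (min, max)}, bounds inclusive
    for col, (low, high) in valid_ranges.items():
        if col in data.columns:
            if ((data[col] < low) | (data[col] > high)).any():
                return True
    return False

ages = pd.DataFrame({'age': [25, 40, 130]})
print(has_out_of_range(ages, {'age': (0, 120)}))  # 130 is out of range -> True
</code>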
ai_data_detection.1748185277.txt.gz · Last modified: 2025/05/25 15:01 by eagleeyenebula
