====== ai_data_detection ======

===== Overview =====
The **ai_data_detection.py** module provides tools for identifying data quality issues in datasets. It includes functionality to detect common problems such as missing values, duplicate rows, and other potential anomalies. Designed to integrate seamlessly with data pipelines, this module ensures that machine learning models are trained and evaluated on high-quality data.
  
{{youtube>Und3EEBK_gM?large}}

----
The associated **ai_data_detection.html** complements the module by offering interactive examples, use-case scenarios, and tutorials for understanding the importance of data quality validation in AI workflows.
  
  
===== Purpose =====
The **ai_data_detection.py** module addresses crucial challenges in AI and data workflows, including:
1. Automating data validation processes, ensuring the integrity of datasets before analysis.
2. Identifying and reporting common issues like **missing values**, **duplicated rows**, or any anomalies.
3. Reducing manual effort required for data quality checks and preprocessing pipeline development.
4. Logging detailed information about any detected issues for easy debugging and resolution.
  
Use this module when working with tabular datasets such as Pandas DataFrames to enforce strict preprocessing and avoid costly errors in model pipeline creation.
  
  * **Missing Value Detection:**
    - Identifies any **NaN** or **Null** values present in the dataset.
  
  * **Duplicate Row Detection:**
    - Detects fully duplicated rows in the dataset.
===== How It Works =====
  
The **DataDetection** class provides an easy-to-use method for detecting data quality issues: **has_issues(data)**. Below is a breakdown of the logic and workflow:
  
==== 1. Checks Performed by the Module ====
The **has_issues** method runs a series of checks on the provided dataset and logs appropriate warnings for each issue:
  * **Check 1: Missing Values**
    - Scans the dataset for null values (**NaN**) using **data.isnull().values.any()**.
    - Logs a warning when missing values are detected.
  
  * **Check 2: Duplicate Rows**
    - Detects duplicated rows using **data.duplicated().any()**.
    - Logs a warning when duplicates are found.
  
  * **Return Values:**
    - Returns **True** if any issues are detected, **False** otherwise.
  
==== 2. Error Handling and Logging ====
The **has_issues** method includes:
  * **Exception Handling:**
    - Catches errors (e.g., if an invalid input is passed) and logs an appropriate error message.
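
Taken together, the checks above can be sketched as a minimal class. This is an illustrative reconstruction based on the behavior and log messages described on this page, not the module's actual source:
<code python>
import logging

class DataDetection:
    # Illustrative sketch of has_issues(data); the real module may differ.
    def has_issues(self, data):
        try:
            issues = False
            if data.isnull().values.any():
                logging.warning("Data contains missing values.")
                issues = True
            if data.duplicated().any():
                logging.warning("Data contains duplicate rows.")
                issues = True
            if not issues:
                logging.info("No data quality issues detected.")
            return issues
        except Exception as exc:
            # Assumed convention: report the error and treat the data as suspect.
            logging.error(f"Error during data quality checks: {exc}")
            return True
</code>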
  
Example Log Output:
<code>
WARNING:root:Data contains missing values.
WARNING:root:Data contains duplicate rows.
INFO:root:No data quality issues detected.
ERROR:root:Error during data quality checks: Invalid input type
</code>
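
INFO-level messages like the one above only appear if the root logger is configured to emit them; for example:
<code python>
import logging

# Warnings and errors print by default; INFO requires explicit configuration.
logging.basicConfig(level=logging.INFO)
</code>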
  
----
  
==== Required Libraries ====
  * **pandas:** For handling tabular data (DataFrames) and performing checks like missing values and duplicates.
  * **logging:** For capturing warnings, errors, and other information during the execution of data checks.
  
==== Installation ====
To install the required third-party library (**logging** is part of the Python standard library), use the following:
<code bash>
pip install pandas
</code>
  
----
  
==== Basic Example ====
Using the **has_issues** method to detect data issues.
  
<code python>
import pandas as pd
from ai_data_detection import DataDetection
</code>
**Create a sample dataset**
<code python>
data = pd.DataFrame({
    'A': [1, 2, None, 4],
    'C': [1.1, 2.2, 2.2, None]
})
</code>
**Create an instance of DataDetection**
<code python>
detector = DataDetection()
</code>
**Check for data issues**
<code python>
if detector.has_issues(data):
    print("The dataset has quality issues.")
else:
    print("The dataset is clean.")
</code>
**Output:**
<code>
WARNING:root:Data contains missing values.
The dataset has quality issues.
</code>
  
----
  
=== 1. Customizing Data Checks ===
You can extend the **DataDetection** class to add checks for other data quality metrics, such as outliers or invalid values.
  
**Example: Adding Outlier Detection**

Below is a minimal sketch, assuming an IQR-based rule and illustrative sample values; both the rule and the data are assumptions to adapt to your own use case.
<code python>
import logging
import numpy as np
import pandas as pd

class ExtendedDataDetection(DataDetection):
    # Hypothetical subclass: flags values outside 1.5 * IQR in numeric columns.
    def has_outliers(self, data):
        numeric = data.select_dtypes(include=[np.number])
        q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
        iqr = q3 - q1
        outliers = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
        if outliers.any().any():
            logging.warning("Outliers detected in dataset.")
            return True
        return False

# Illustrative sample data containing an obvious outlier
data = pd.DataFrame({'A': [1, 2, 3, 1000]})

extended_detector = ExtendedDataDetection()
if extended_detector.has_outliers(data):
    print("Outliers detected in the dataset.")
</code>
  
**Output:**
<code>
WARNING:root:Outliers detected in dataset.
Outliers detected in the dataset.
</code>
  
----
=== 2. Integrating DataDetection into a Pipeline ===
This module can be integrated as part of a Scikit-learn pipeline for preprocessing.
The wrapper step below is a hypothetical sketch (the **DataQualityChecker** name and its fail-fast behavior are assumptions, not part of the module):
<code python>
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class DataQualityChecker(BaseEstimator, TransformerMixin):
    # Hypothetical transformer that halts the pipeline on dirty data.
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if DataDetection().has_issues(X):
            raise ValueError("Data quality issues detected; clean the data first.")
        return X

pipeline = Pipeline([
    ('quality_check', DataQualityChecker()),
    ('model', LogisticRegression())
])
</code>
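Raising an exception inside **transform** makes the pipeline fail fast, so dirty data never reaches the model. An alternative design is to have the step repair the data instead, e.g., drop duplicates and impute missing values, and return the cleaned frame.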
----
  
=== 3. Handling Large Datasets ===
For large datasets, optimize checks using chunk-based processing in Pandas:
<code python>
import pandas as pd

def has_issues_in_chunks(file_path, chunk_size=1000):
    detector = DataDetection()
    # Stream the file in chunks so the full dataset never has to fit in memory
    # (assumes a CSV source; swap the reader for other formats).
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        if detector.has_issues(chunk):
            return True
    return False
</code>
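
For example, assuming a hypothetical **large_dataset.csv**:
<code python>
if has_issues_in_chunks("large_dataset.csv", chunk_size=5000):
    print("Issues found; inspect the file before training.")
</code>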
  
----
===== Best Practices =====
1. **Use Incremental Checks:** Perform quality checks at different stages of the pipeline (e.g., after loading raw data and after preprocessing steps); see the sketch after this list.

2. **Automate Logging:** Set up centralized logging for tracking data issues across multiple datasets.

3. **Adapt Custom Methods:** Extend the module for domain-specific checks, such as outlier detection, range checks, or invalid category detection.

4. **Handle Issues Early:** Address identified data issues before training machine learning models.
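
A minimal sketch of practices 1 and 2, assuming a hypothetical **raw_data.csv** and a shared **quality.log** file:
<code python>
import logging
import pandas as pd
from ai_data_detection import DataDetection

# Centralized logging: checks from every dataset land in one shared log file.
logging.basicConfig(filename="quality.log", level=logging.INFO)

detector = DataDetection()

# Stage 1: check the raw data right after loading.
raw = pd.read_csv("raw_data.csv")
if detector.has_issues(raw):
    print("Raw data needs cleaning.")

# Stage 2: re-check after preprocessing steps.
processed = raw.drop_duplicates().dropna()
if detector.has_issues(processed):
    print("Issues remain after preprocessing.")
</code>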
  
  
**Example: Adding Invalid Category Detection**

The body below is a minimal sketch; it assumes **valid_categories** maps each text column to its allowed values.
<code python>
import logging

def has_invalid_categories(data, valid_categories):
    for col in data.select_dtypes(include=['object']):
        # Flag any value that is not in the column's allowed set.
        if not data[col].dropna().isin(valid_categories.get(col, [])).all():
            logging.warning(f"Invalid categories found in column '{col}'.")
            return True
    return False
</code>
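
A quick usage sketch with hypothetical values:
<code python>
import pandas as pd

frame = pd.DataFrame({'color': ['red', 'green', 'purple']})
allowed = {'color': ['red', 'green', 'blue']}

if has_invalid_categories(frame, allowed):
    print("Invalid categories present.")  # 'purple' is not an allowed value
</code>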
  
----