====== ai_data_detection ======

===== Overview =====
The **ai_data_detection.py** module provides tools for identifying data quality issues in datasets. It includes functionality to detect common problems such as missing values, duplicate rows, and other potential anomalies. Designed to integrate seamlessly with data pipelines, this module ensures that machine learning models are trained and evaluated on high-quality data.
  
{{youtube>Und3EEBK_gM?large}}

----
The associated **ai_data_detection.html** complements the module by offering interactive examples, use-case scenarios, and tutorials for understanding the importance of data quality validation in AI workflows.
  
  
===== Purpose =====
The **ai_data_detection.py** module addresses crucial challenges in AI and data workflows, including:
1. Automating data validation processes, ensuring the integrity of datasets before analysis.
2. Identifying and reporting common issues like **missing values**, **duplicated rows**, or any anomalies.
3. Reducing manual effort required for data quality checks and preprocessing pipeline development.
4. Logging detailed information about any detected issues for easy debugging and resolution.
  
Use this module when working with tabular datasets such as Pandas DataFrames to enforce strict preprocessing and avoid costly errors in model pipeline creation.
  
  * **Missing Value Detection:**
    - Identifies any **NaN** or **Null** values present in the dataset.
  
  * **Duplicate Row Detection:**
    - Detects fully duplicated rows in the dataset.
===== How It Works =====
  
The **DataDetection** class provides an easy-to-use method for detecting data quality issues: **has_issues(data)**. Below is a breakdown of the logic and workflow:
  
==== 1. Checks Performed by the Module ====
The **has_issues** method runs a series of checks on the provided dataset and logs appropriate warnings for each issue:
  * **Check 1: Missing Values**
    - Scans the dataset for null values (**NaN**) using **data.isnull().values.any()**.
    - Logs a warning when missing values are detected.
  
  * **Check 2: Duplicate Rows**
    - Detects duplicated rows using **data.duplicated().any()**.
    - Logs a warning when duplicates are found.
  
  * **Return Values:**
    - Returns **True** if any issues are detected, **False** otherwise.
  
==== 2. Error Handling and Logging ====
The **has_issues** method includes:
  * **Exception Handling:**
    - Catches errors (e.g., if an invalid input is passed) and logs an appropriate error message.
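
Taken together, the checks above can be sketched as a minimal class. This is an illustrative reconstruction based on the behavior and log messages described on this page, not the module's actual source:
<code python>
import logging

class DataDetection:
    # Illustrative sketch of has_issues(data); the real module may differ.
    def has_issues(self, data):
        try:
            issues = False
            if data.isnull().values.any():
                logging.warning("Data contains missing values.")
                issues = True
            if data.duplicated().any():
                logging.warning("Data contains duplicate rows.")
                issues = True
            if not issues:
                logging.info("No data quality issues detected.")
            return issues
        except Exception as exc:
            # Assumed convention: report the error and treat the data as suspect.
            logging.error(f"Error during data quality checks: {exc}")
            return True
</code>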
  
Example Log Output:
<code>
WARNING:root:Data contains missing values.
WARNING:root:Data contains duplicate rows.
INFO:root:No data quality issues detected.
ERROR:root:Error during data quality checks: Invalid input type
</code>
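
INFO-level messages like the one above only appear if the root logger is configured to emit them; for example:
<code python>
import logging

# Warnings and errors print by default; INFO requires explicit configuration.
logging.basicConfig(level=logging.INFO)
</code>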
  
----
  
==== Required Libraries ====
  * **pandas:** For handling tabular data (DataFrames) and performing checks like missing values and duplicates.
  * **logging:** For capturing warnings, errors, and other information during the execution of data checks.
  
==== Installation ====
To install the required third-party library (**logging** is part of the Python standard library), use the following:
<code bash>
pip install pandas
</code>
  
----
  
==== Basic Example ====
Using the **has_issues** method to detect data issues.
  
<code python>
import pandas as pd
from ai_data_detection import DataDetection
</code>
**Create a sample dataset**
<code python>
data = pd.DataFrame({
    'A': [1, 2, None, 4],
    'C': [1.1, 2.2, 2.2, None]
})
</code>
**Create an instance of DataDetection**
<code python>
detector = DataDetection()
</code>
**Check for data issues**
<code python>
if detector.has_issues(data):
    print("The dataset has quality issues.")
else:
    print("The dataset is clean.")
</code>
**Output:**
<code>
WARNING:root:Data contains missing values.
The dataset has quality issues.
</code>
  
----
  
=== 1. Customizing Data Checks ===
You can extend the **DataDetection** class to add checks for other data quality metrics, such as outliers or invalid values.
  
**Example: Adding Outlier Detection**

Below is a minimal sketch, assuming an IQR-based rule and illustrative sample values; both the rule and the data are assumptions to adapt to your own use case.
<code python>
import logging
import numpy as np
import pandas as pd

class ExtendedDataDetection(DataDetection):
    # Hypothetical subclass: flags values outside 1.5 * IQR in numeric columns.
    def has_outliers(self, data):
        numeric = data.select_dtypes(include=[np.number])
        q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
        iqr = q3 - q1
        outliers = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
        if outliers.any().any():
            logging.warning("Outliers detected in dataset.")
            return True
        return False

# Illustrative sample data containing an obvious outlier
data = pd.DataFrame({'A': [1, 2, 3, 1000]})

extended_detector = ExtendedDataDetection()
if extended_detector.has_outliers(data):
    print("Outliers detected in the dataset.")
</code>
  
**Output:**
<code>
WARNING:root:Outliers detected in dataset.
Outliers detected in the dataset.
</code>
  
----
=== 2. Integrating DataDetection into a Pipeline ===
This module can be integrated as part of a Scikit-learn pipeline for preprocessing.
The wrapper step below is a hypothetical sketch (the **DataQualityChecker** name and its fail-fast behavior are assumptions, not part of the module):
<code python>
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class DataQualityChecker(BaseEstimator, TransformerMixin):
    # Hypothetical transformer that halts the pipeline on dirty data.
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if DataDetection().has_issues(X):
            raise ValueError("Data quality issues detected; clean the data first.")
        return X

pipeline = Pipeline([
    ('quality_check', DataQualityChecker()),
    ('model', LogisticRegression())
])
</code>
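Raising an exception inside **transform** makes the pipeline fail fast, so dirty data never reaches the model. An alternative design is to have the step repair the data instead, e.g., drop duplicates and impute missing values, and return the cleaned frame.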
----
  
=== 3. Handling Large Datasets ===
For large datasets, optimize checks using chunk-based processing in Pandas:
<code python>
import pandas as pd

def has_issues_in_chunks(file_path, chunk_size=1000):
    detector = DataDetection()
    # Stream the file in chunks so the full dataset never has to fit in memory
    # (assumes a CSV source; swap the reader for other formats).
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        if detector.has_issues(chunk):
            return True
    return False
</code>
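
For example, assuming a hypothetical **large_dataset.csv**:
<code python>
if has_issues_in_chunks("large_dataset.csv", chunk_size=5000):
    print("Issues found; inspect the file before training.")
</code>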
  
----
===== Best Practices =====
1. **Use Incremental Checks:** Perform quality checks at different stages of the pipeline (e.g., after loading raw data and after preprocessing steps); see the sketch after this list.

2. **Automate Logging:** Set up centralized logging for tracking data issues across multiple datasets.

3. **Adapt Custom Methods:** Extend the module for domain-specific checks, such as outlier detection, range checks, or invalid category detection.

4. **Handle Issues Early:** Address identified data issues before training machine learning models.
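
A minimal sketch of practices 1 and 2, assuming a hypothetical **raw_data.csv** and a shared **quality.log** file:
<code python>
import logging
import pandas as pd
from ai_data_detection import DataDetection

# Centralized logging: checks from every dataset land in one shared log file.
logging.basicConfig(filename="quality.log", level=logging.INFO)

detector = DataDetection()

# Stage 1: check the raw data right after loading.
raw = pd.read_csv("raw_data.csv")
if detector.has_issues(raw):
    print("Raw data needs cleaning.")

# Stage 2: re-check after preprocessing steps.
processed = raw.drop_duplicates().dropna()
if detector.has_issues(processed):
    print("Issues remain after preprocessing.")
</code>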
  
  
**Example: Adding Invalid Category Detection**

The body below is a minimal sketch; it assumes **valid_categories** maps each text column to its allowed values.
<code python>
import logging

def has_invalid_categories(data, valid_categories):
    for col in data.select_dtypes(include=['object']):
        # Flag any value that is not in the column's allowed set.
        if not data[col].dropna().isin(valid_categories.get(col, [])).all():
            logging.warning(f"Invalid categories found in column '{col}'.")
            return True
    return False
</code>
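
A quick usage sketch with hypothetical values:
<code python>
import pandas as pd

frame = pd.DataFrame({'color': ['red', 'green', 'purple']})
allowed = {'color': ['red', 'green', 'blue']}

if has_invalid_categories(frame, allowed):
    print("Invalid categories present.")  # 'purple' is not an allowed value
</code>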
  
----