====== ai_data_detection ======
  
  * **Missing Value Detection:**
    - Identifies any **NaN** or **Null** values present in the dataset.

  * **Duplicate Row Detection:**
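These two checks map directly onto standard pandas operations; a minimal sketch with hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample: one missing value and one repeated row
df = pd.DataFrame({'A': [1, 2, 2], 'B': [3.0, None, None]})

has_missing = df.isnull().values.any()    # True if any NaN/Null appears anywhere
has_duplicates = df.duplicated().any()    # True if any fully identical rows exist
```

Note that `duplicated()` treats NaN values as equal to each other, so rows that repeat a missing value still count as duplicates.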
===== How It Works =====
  
The **DataDetection** class provides an easy-to-use method for detecting data quality issues: **has_issues(data)**. Below is a breakdown of the logic and workflow:
  
==== 1. Checks Performed by the Module ====
  
==== 2. Error Handling and Logging ====
The **has_issues** method includes:
  * **Exception Handling:**
    - Catches errors (e.g., if an invalid input is passed) and logs an appropriate error message.
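Putting the checks and the error handling together, a guarded check might look like the following. This is an illustrative sketch consistent with the documented log messages, not the module's actual source; the name **checked_has_issues** is hypothetical.

```python
import logging
import pandas as pd

def checked_has_issues(data):
    # Sketch of a guarded quality check; mirrors the documented log messages
    try:
        if not isinstance(data, pd.DataFrame):
            raise TypeError("Invalid input type")
        issues = False
        if data.isnull().values.any():
            logging.warning("Data contains missing values.")
            issues = True
        if data.duplicated().any():
            logging.warning("Data contains duplicate rows.")
            issues = True
        if not issues:
            logging.info("No data quality issues detected.")
        return issues
    except Exception as exc:
        # Any failure is logged and treated as an issue rather than crashing
        logging.error(f"Error during data quality checks: {exc}")
        return True
```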
  
Example Log Output:
<code>
WARNING:root:Data contains missing values.
WARNING:root:Data contains duplicate rows.
INFO:root:No data quality issues detected.
ERROR:root:Error during data quality checks: Invalid input type
</code>
  
----
  
==== Required Libraries ====
  * **pandas:** For handling tabular data (DataFrames) and performing checks like missing values and duplicates.
  * **logging:** For capturing warnings, errors, and other information during the execution of data checks.
  
==== Installation ====
To install the required libraries, use the following:
<code bash>
pip install pandas
</code>
  
----
  
==== Basic Example ====
Using the **has_issues** method to detect data issues.
  
<code python>
import pandas as pd
from ai_data_detection import DataDetection
</code>
**Create a sample dataset**
<code python>
data = pd.DataFrame({
    'A': [1, 2, None, 4],
    'C': [1.1, 2.2, 2.2, None]
})
</code>
**Create an instance of DataDetection**
<code python>
detector = DataDetection()
</code>
**Check for data issues**
<code python>
if detector.has_issues(data):
    print("The dataset has quality issues.")
else:
    print("The dataset is clean.")
</code>
**Output:**
<code>
WARNING:root:Data contains missing values.
The dataset has quality issues.
</code>
  
----
  
=== 1. Customizing Data Checks ===
You can extend the **DataDetection** class to add checks for other data quality metrics, such as outliers or invalid values.
  
**Example: Adding Outlier Detection**
<code python>
import logging
import numpy as np

# Minimal sketch of an outlier check using the IQR rule; the class and
# method bodies here are illustrative, as is the sample dataset below
class ExtendedDataDetection(DataDetection):
    def has_outliers(self, data):
        numeric = data.select_dtypes(include=[np.number])
        q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
        iqr = q3 - q1
        mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
        if mask.any().any():
            logging.warning("Outliers detected in dataset.")
            return True
        return False

# Hypothetical sample containing an extreme value
data = pd.DataFrame({'A': [1, 2, 3, 1000]})

extended_detector = ExtendedDataDetection()
if extended_detector.has_outliers(data):
    print("Outliers detected in the dataset.")
</code>
  
**Output:**
<code>
WARNING:root:Outliers detected in dataset.
Outliers detected in the dataset.
</code>
  
----
=== 2. Integrating DataDetection into a Pipeline ===
This module can be integrated as part of a Scikit-learn pipeline for preprocessing.
<code python>
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import FunctionTransformer

# Illustrative quality gate (sketch): abort the pipeline if issues are found
def quality_gate(X):
    if DataDetection().has_issues(X):
        raise ValueError("Data quality issues detected.")
    return X

pipeline = Pipeline([
    ('quality_check', FunctionTransformer(quality_gate)),
    ('model', LogisticRegression())
])
</code>
----
  
=== 3. Handling Large Datasets ===
For large datasets, optimize checks using chunk-based processing in Pandas:
<code python>
def has_issues_in_chunks(file_path, chunk_size=1000):
    detector = DataDetection()
    # Stream the file in chunks so it never has to fit in memory at once
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        if detector.has_issues(chunk):
            return True
    return False
</code>
  
----
===== Best Practices =====
1. **Use Incremental Checks:** Perform quality checks at different stages of the pipeline (e.g., after loading raw data and after preprocessing steps).

2. **Automate Logging:** Set up centralized logging for tracking data issues across multiple datasets.

3. **Adapt Custom Methods:** Extend the module for domain-specific checks, such as outlier detection, range checks, or invalid category detection.

4. **Handle Issues Early:** Address identified data issues before training machine learning models.
  
  
**Example: Adding Invalid Category Detection**
<code python>
def has_invalid_categories(data, valid_categories):
    # Sketch: valid_categories is assumed to map column name -> allowed values
    for col in data.select_dtypes(include=['object']):
        if not data[col].dropna().isin(valid_categories.get(col, [])).all():
            logging.warning(f"Invalid categories found in column '{col}'.")
            return True
    return False
</code>
  
----
ai_data_detection.1748185183.txt.gz · Last modified: 2025/05/25 14:59 by eagleeyenebula