====== ai_data_detection ======
  
  * **Missing Value Detection:**
    - Identifies any **NaN** or **Null** values present in the dataset.

  * **Duplicate Row Detection:**
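These two checks map directly onto standard pandas operations; a minimal sketch with hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample: one missing value and one repeated row
df = pd.DataFrame({'A': [1, 2, 2], 'B': [3.0, None, None]})

has_missing = df.isnull().values.any()    # True if any NaN/Null appears anywhere
has_duplicates = df.duplicated().any()    # True if any fully identical rows exist
```

Note that `duplicated()` treats NaN values as equal to each other, so rows that repeat a missing value still count as duplicates.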
===== How It Works =====
  
The **DataDetection** class provides an easy-to-use method for detecting data quality issues: **has_issues(data)**. Below is a breakdown of the logic and workflow:
  
==== 1. Checks Performed by the Module ====
  
==== 2. Error Handling and Logging ====
The **has_issues** method includes:
  * **Exception Handling:**
    - Catches errors (e.g., if an invalid input is passed) and logs an appropriate error message.
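Putting the checks and the error handling together, a guarded check might look like the following. This is an illustrative sketch consistent with the documented log messages, not the module's actual source; the name **checked_has_issues** is hypothetical.

```python
import logging
import pandas as pd

def checked_has_issues(data):
    # Sketch of a guarded quality check; mirrors the documented log messages
    try:
        if not isinstance(data, pd.DataFrame):
            raise TypeError("Invalid input type")
        issues = False
        if data.isnull().values.any():
            logging.warning("Data contains missing values.")
            issues = True
        if data.duplicated().any():
            logging.warning("Data contains duplicate rows.")
            issues = True
        if not issues:
            logging.info("No data quality issues detected.")
        return issues
    except Exception as exc:
        # Any failure is logged and treated as an issue rather than crashing
        logging.error(f"Error during data quality checks: {exc}")
        return True
```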
  
Example Log Output:
<code>
WARNING:root:Data contains missing values.
WARNING:root:Data contains duplicate rows.
INFO:root:No data quality issues detected.
ERROR:root:Error during data quality checks: Invalid input type
</code>
  
----
  
==== Required Libraries ====
  * **pandas:** For handling tabular data (DataFrames) and performing checks like missing values and duplicates.
  * **logging:** For capturing warnings, errors, and other information during the execution of data checks.
  
==== Installation ====
To install the required libraries, use the following:
<code bash>
pip install pandas
</code>
  
----
  
==== Basic Example ====
Using the **has_issues** method to detect data issues.
  
<code python>
import pandas as pd
from ai_data_detection import DataDetection
</code>
**Create a sample dataset**
<code python>
data = pd.DataFrame({
    'A': [1, 2, None, 4],
    'C': [1.1, 2.2, 2.2, None]
})
</code>
**Create an instance of DataDetection**
<code python>
detector = DataDetection()
</code>
**Check for data issues**
<code python>
if detector.has_issues(data):
    print("The dataset has quality issues.")
else:
    print("The dataset is clean.")
</code>
**Output:**
<code>
WARNING:root:Data contains missing values.
The dataset has quality issues.
</code>
  
----
  
=== 1. Customizing Data Checks ===
You can extend the **DataDetection** class to add checks for other data quality metrics, such as outliers or invalid values.
  
**Example: Adding Outlier Detection**
<code python>
import logging
import numpy as np

# Minimal sketch of an outlier check using the IQR rule; the class and
# method bodies here are illustrative, as is the sample dataset below
class ExtendedDataDetection(DataDetection):
    def has_outliers(self, data):
        numeric = data.select_dtypes(include=[np.number])
        q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
        iqr = q3 - q1
        mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
        if mask.any().any():
            logging.warning("Outliers detected in dataset.")
            return True
        return False

# Hypothetical sample containing an extreme value
data = pd.DataFrame({'A': [1, 2, 3, 1000]})

extended_detector = ExtendedDataDetection()
if extended_detector.has_outliers(data):
    print("Outliers detected in the dataset.")
</code>
  
**Output:**
<code>
WARNING:root:Outliers detected in dataset.
Outliers detected in the dataset.
</code>
  
----
=== 2. Integrating DataDetection into a Pipeline ===
This module can be integrated as part of a Scikit-learn pipeline for preprocessing.
<code python>
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import FunctionTransformer

# Illustrative quality gate (sketch): abort the pipeline if issues are found
def quality_gate(X):
    if DataDetection().has_issues(X):
        raise ValueError("Data quality issues detected.")
    return X

pipeline = Pipeline([
    ('quality_check', FunctionTransformer(quality_gate)),
    ('model', LogisticRegression())
])
</code>
----
  
=== 3. Handling Large Datasets ===
For large datasets, optimize checks using chunk-based processing in Pandas:
<code python>
def has_issues_in_chunks(file_path, chunk_size=1000):
    detector = DataDetection()
    # Stream the file in chunks so it never has to fit in memory at once
    for chunk in pd.read_csv(file_path, chunksize=chunk_size):
        if detector.has_issues(chunk):
            return True
    return False
</code>
  
----
===== Best Practices =====
1. **Use Incremental Checks:** Perform quality checks at different stages of the pipeline (e.g., after loading raw data and after preprocessing steps).

2. **Automate Logging:** Set up centralized logging for tracking data issues across multiple datasets.

3. **Adapt Custom Methods:** Extend the module for domain-specific checks, such as outlier detection, range checks, or invalid category detection.

4. **Handle Issues Early:** Address identified data issues before training machine learning models.
  
  
**Example: Adding Invalid Category Detection**
<code python>
def has_invalid_categories(data, valid_categories):
    # Sketch: valid_categories is assumed to map column name -> allowed values
    for col in data.select_dtypes(include=['object']):
        if not data[col].dropna().isin(valid_categories.get(col, [])).all():
            logging.warning(f"Invalid categories found in column '{col}'.")
            return True
    return False
</code>
  
----
ai_data_detection.1748185183.txt.gz · Last modified: 2025/05/25 14:59 by eagleeyenebula