User Tools

Site Tools


ai_data_detection

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
ai_data_detection [2025/05/25 15:04] – [Advanced Examples] eagleeyenebulaai_data_detection [2025/05/25 15:09] (current) – [Best Practices] eagleeyenebula
Line 42: Line 42:
  
   * **Missing Value Detection:**   * **Missing Value Detection:**
-    - Identifies any `NaNor `Nullvalues present in the dataset.+    - Identifies any **NaN** or **Null** values present in the dataset.
  
   * **Duplicate Row Detection:**   * **Duplicate Row Detection:**
Line 233: Line 233:
 ===== Best Practices ===== ===== Best Practices =====
 1. **Use Incremental Checks:** Perform quality checks at different stages of the pipeline (e.g., after loading raw data and after preprocessing steps). 1. **Use Incremental Checks:** Perform quality checks at different stages of the pipeline (e.g., after loading raw data and after preprocessing steps).
 +
 2. **Automate Logging:** Set up centralized logging for tracking data issues across multiple datasets. 2. **Automate Logging:** Set up centralized logging for tracking data issues across multiple datasets.
 +
 3. **Adapt Custom Methods:** Extend the module for domain-specific checks, such as outlier detection, range checks, or invalid category detection. 3. **Adapt Custom Methods:** Extend the module for domain-specific checks, such as outlier detection, range checks, or invalid category detection.
 +
 4. **Handle Issues Early:** Address identified data issues before training machine learning models. 4. **Handle Issues Early:** Address identified data issues before training machine learning models.
  
Line 246: Line 249:
  
 **Example: Adding Invalid Category Detection** **Example: Adding Invalid Category Detection**
-```python+<code> 
 +python
 def has_invalid_categories(data, valid_categories): def has_invalid_categories(data, valid_categories):
     for col in data.select_dtypes(include=['object']):     for col in data.select_dtypes(include=['object']):
Line 254: Line 258:
             return True             return True
     return False     return False
-```+</code>
  
 ---- ----
ai_data_detection.1748185473.txt.gz · Last modified: 2025/05/25 15:04 by eagleeyenebula