ai_data_masking

Differences

This shows you the differences between two versions of the page.

ai_data_masking [2025/05/25 15:43] – [1. Masking Sensitive Columns] eagleeyenebula
ai_data_masking [2025/05/25 16:03] (current) – [Basic Example] eagleeyenebula
Line 29: Line 29:
 The **ai_data_masking.py** module helps developers and data scientists: The **ai_data_masking.py** module helps developers and data scientists:
  
-1. Implement column masking in pandas-style datasets with a flexible and extensible API.+   Implement column masking in pandas-style datasets with a flexible and extensible API.
  
-2. Simplify the process of anonymizing sensitive data such as **SSNs**, **credit card details**, +   Simplify the process of anonymizing sensitive data such as **SSNs**, **credit card details**, 
  
 **emails**, or **phone numbers**. **emails**, or **phone numbers**.
  
-3. Prevent accidental data leaks during data transfers, modeling, or exploratory analysis.+   Prevent accidental data leaks during data transfers, modeling, or exploratory analysis.
  
-4. Meet strict privacy and data protection standards with minimal effort.+   Meet strict privacy and data protection standards with minimal effort.
  
 ---- ----
Line 68: Line 68:
 The **mask_columns** method performs the following operations: The **mask_columns** method performs the following operations:
  
-1. Verifies that each input column exists in the provided DataFrame. +  * Verifies that each input column exists in the provided DataFrame. 
-2. Replaces the values in each specified column with the **placeholder string** (`"[MASKED]"` by default). + 
-3. Returns the masked DataFrame.+  * Replaces the values in each specified column with the **placeholder string** ("[MASKED]" by default). 
 + 
 +  * Returns the masked DataFrame.
  
 **Input Parameters:** **Input Parameters:**
-1. **data:** A Pandas DataFrame containing the data to mask. + 
-2. **columns:** A list of columns in the DataFrame that need to be masked.+  * **data:** A Pandas DataFrame containing the data to mask. 
 + 
 +  * **columns:** A list of columns in the DataFrame that need to be masked. 
  
 **Output:** **Output:**
-1. Returns the modified DataFrame with the specified columns masked. 
  
-Example Workflow: +  * Returns the modified DataFrame with the specified columns masked. 
-1. Import the **DataMasking** class. + 
-2. Specify the sensitive columns (**columns**). +**Example Workflow:** 
-3. Use the **mask_columns** method to replace sensitive values.+ 
 +  * Import the **DataMasking** class. 
 + 
 +  * Specify the sensitive columns (**columns**). 
 + 
 +  * Use the **mask_columns** method to replace sensitive values.
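
The operations listed for **mask_columns** can be sketched as below. This is a minimal illustration, not the module's source: the name **mask_columns_sketch**, the **placeholder** parameter, copying the input instead of modifying it, and skipping a missing column with a warning are all assumptions.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def mask_columns_sketch(data, columns, placeholder="[MASKED]"):
    """Hypothetical stand-in for DataMasking.mask_columns (assumed behaviour)."""
    if not isinstance(data, pd.DataFrame):
        raise TypeError("data must be a pandas DataFrame")

    masked = data.copy()  # assumption: leave the caller's DataFrame untouched
    for column in columns:
        if column not in masked.columns:
            # Assumption: a missing column is logged and skipped, not fatal.
            logger.warning("Column '%s' not found in DataFrame.", column)
            continue
        # Assigning a scalar overwrites every row of the column.
        masked[column] = placeholder
    return masked


data = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Email": ["alice@example.com", "bob@example.com"],
})

# 'SSN' is absent here, so only 'Email' is masked.
masked = mask_columns_sketch(data, columns=["Email", "SSN"])
print(masked)
```

Returning a copy keeps the unmasked data available to the caller, which is convenient in notebooks but means the sensitive values still live in memory until the original is discarded.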
  
 ==== 2. Error Handling and Logging ==== ==== 2. Error Handling and Logging ====
Line 90: Line 99:
  
 **Logging Examples:** **Logging Examples:**
-```plaintext+<code>
 INFO:root:Masking sensitive columns... INFO:root:Masking sensitive columns...
 WARNING:root:Column 'SSN' not found in DataFrame. WARNING:root:Column 'SSN' not found in DataFrame.
 ERROR:root:Failed to mask data: Invalid DataFrame input ERROR:root:Failed to mask data: Invalid DataFrame input
-```+</code>
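
The **LEVEL:root:message** shape of these lines matches the default format of Python's standard **logging** module when messages go to the root logger. The sketch below reproduces the three messages; capturing them in an in-memory stream is only for demonstration, and how the module itself configures logging is an assumption.

```python
import io
import logging

# Capture log output in memory so it can be inspected afterwards.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
# BASIC_FORMAT is the stdlib default: "%(levelname)s:%(name)s:%(message)s"
handler.setFormatter(logging.Formatter(logging.BASIC_FORMAT))

root = logging.getLogger()
root.setLevel(logging.INFO)  # the root logger defaults to WARNING
root.addHandler(handler)

# Module-level logging functions write to the root logger.
logging.info("Masking sensitive columns...")
logging.warning("Column '%s' not found in DataFrame.", "SSN")
logging.error("Failed to mask data: %s", "Invalid DataFrame input")

root.removeHandler(handler)
print(stream.getvalue(), end="")
```

Without raising the level to **INFO**, the first message would be dropped, because the root logger only emits **WARNING** and above by default.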
  
 ---- ----
Line 103: Line 113:
  
 ==== Required Libraries ==== ==== Required Libraries ====
-  * **`pandas`:** For managing structured datasets. +  * **pandas:** For managing structured datasets. 
-  * **`logging`:** For capturing warning, error, and info messages.+  * **logging:** For capturing warning, error, and info messages.
  
 ==== Installation ==== ==== Installation ====
 To install the required dependencies, run: To install the required dependencies, run:
-```bash+<code bash>
 pip install pandas pip install pandas
-```+</code>
  
 ---- ----
Line 116: Line 127:
 ===== Usage ===== ===== Usage =====
  
-Below are examples of how to use the `DataMasking` module for masking sensitive columns.+Below are examples of how to use the **DataMasking** module for masking sensitive columns.
  
 ==== Basic Example ==== ==== Basic Example ====
-Mask specific columns using the default placeholder `"[MASKED]"`.+Mask specific columns using the default placeholder **"[MASKED]"**.
  
-```python+<code python>
 import pandas as pd import pandas as pd
 from ai_data_masking import DataMasking from ai_data_masking import DataMasking
Line 131: Line 143:
     'SSN': ['123-45-6789', '987-65-4321', '567-89-1234']     'SSN': ['123-45-6789', '987-65-4321', '567-89-1234']
 }) })
- +</code> 
-# Mask sensitive columns+**Mask sensitive columns** 
 +<code>
 masked_data = DataMasking.mask_columns(data, columns=['Email', 'SSN']) masked_data = DataMasking.mask_columns(data, columns=['Email', 'SSN'])
  
 print(masked_data) print(masked_data)
-```+</code>
  
 **Output:** **Output:**
-```plaintext+ 
 +<code>
       Name            Email          SSN       Name            Email          SSN
 0    Alice         [MASKED]     [MASKED] 0    Alice         [MASKED]     [MASKED]
 1      Bob         [MASKED]     [MASKED] 1      Bob         [MASKED]     [MASKED]
 2  Charlie         [MASKED]     [MASKED] 2  Charlie         [MASKED]     [MASKED]
-```+</code>
  
 ---- ----
Line 151: Line 166:
  
 === 1. Using Custom Placeholders === === 1. Using Custom Placeholders ===
-Replace sensitive values with a custom placeholder string instead of `[MASKED]`.+Replace sensitive values with a custom placeholder string instead of **[MASKED]**.
  
-```python+<code python>
 # Custom placeholder # Custom placeholder
 def mask_columns_with_custom_placeholder(data, columns, placeholder="***REDACTED***"): def mask_columns_with_custom_placeholder(data, columns, placeholder="***REDACTED***"):
Line 164: Line 180:
  
 print(masked_data) print(masked_data)
-```+</code>
  
 **Output:** **Output:**
-```plaintext+<code>
       Name            Email         SSN       Name            Email         SSN
 0    Alice  alice@example.com  ***REDACTED*** 0    Alice  alice@example.com  ***REDACTED***
 1      Bob    bob@example.com  ***REDACTED*** 1      Bob    bob@example.com  ***REDACTED***
 2  Charlie  charlie@example.com  ***REDACTED*** 2  Charlie  charlie@example.com  ***REDACTED***
-```+</code>
  
 ---- ----
Line 179: Line 196:
 Mask data based on a condition (e.g., SSNs starting with specific prefixes). Mask data based on a condition (e.g., SSNs starting with specific prefixes).
  
-```python+<code python>
 # Condition-based masking # Condition-based masking
 def mask_conditionally(data, column, condition): def mask_conditionally(data, column, condition):
Line 189: Line 207:
  
 print(masked_data) print(masked_data)
-```+</code>
  
 **Output:** **Output:**
-```plaintext+<code>
       Name            Email                     SSN       Name            Email                     SSN
 0    Alice  alice@example.com  CONDITIONALLY MASKED 0    Alice  alice@example.com  CONDITIONALLY MASKED
 1      Bob    bob@example.com            987-65-4321 1      Bob    bob@example.com            987-65-4321
 2  Charlie  charlie@example.com            567-89-1234 2  Charlie  charlie@example.com            567-89-1234
-```+</code>
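
The body of **mask_conditionally** is not visible in this comparison view. One way to get the output shown, as a sketch only, is **Series.mask** with a boolean selector. The name **mask_where**, its **placeholder** parameter, and the "starts with 123" rule are assumptions chosen to match the sample output.

```python
import pandas as pd


def mask_where(data, column, condition, placeholder="CONDITIONALLY MASKED"):
    """Hypothetical helper: mask only the rows where condition(value) is true."""
    masked = data.copy()
    # Evaluate the predicate per value to build a boolean selector.
    matches = masked[column].map(condition).astype(bool)
    # Series.mask replaces values where the selector is True.
    masked[column] = masked[column].mask(matches, placeholder)
    return masked


data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "SSN": ["123-45-6789", "987-65-4321", "567-89-1234"],
})

# Assumed rule: mask SSNs that start with the prefix "123".
masked = mask_where(data, "SSN", lambda ssn: ssn.startswith("123"))
print(masked)
```

Only Alice's SSN matches the prefix, so the other two rows keep their original values.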
  
 ---- ----
  
 === 3. Integrating Data Masking into Pipelines === === 3. Integrating Data Masking into Pipelines ===
-Integrate `DataMasking` into a larger data transformation pipeline.+Integrate **DataMasking** into a larger data transformation pipeline.
  
-```python+<code python>
 from sklearn.pipeline import Pipeline from sklearn.pipeline import Pipeline
  
Line 213: Line 233:
     def transform(self, data):     def transform(self, data):
         return DataMasking.mask_columns(data, self.columns)         return DataMasking.mask_columns(data, self.columns)
- +</code> 
-# Example pipeline+**Example pipeline** 
 +<code>
 pipeline = Pipeline([ pipeline = Pipeline([
     ('masking', MaskingTransformer(columns=['Email', 'SSN']))     ('masking', MaskingTransformer(columns=['Email', 'SSN']))
 ]) ])
- +</code> 
-# Apply masking and other preprocessing+**Apply masking and other preprocessing** 
 +<code>
 masked_data = pipeline.named_steps['masking'].transform(data) masked_data = pipeline.named_steps['masking'].transform(data)
 print(masked_data) print(masked_data)
-```+</code>
  
 ---- ----
Line 230: Line 252:
    - Perform exploratory analysis to identify sensitive columns requiring masking.    - Perform exploratory analysis to identify sensitive columns requiring masking.
 2. **Use Custom Placeholders:** 2. **Use Custom Placeholders:**
-   - Replace sensitive values with descriptive placeholders for better clarity (e.g., `"[ANONYMIZED SSN]"`).+   - Replace sensitive values with descriptive placeholders for better clarity (e.g., **"[ANONYMIZED SSN]"**).
 3. **Mask Early:** 3. **Mask Early:**
    - Mask sensitive columns before any export or sharing of the dataset.    - Mask sensitive columns before any export or sharing of the dataset.
Line 242: Line 264:
  
 1. **Hashing for Obfuscation:** 1. **Hashing for Obfuscation:**
-   - Replace sensitive values with hashed tokens using libraries like `hashlib`.+   - Replace sensitive values with hashed tokens using libraries like **hashlib**.
  
 **Example: Hashing Columns** **Example: Hashing Columns**
-```python+<code python>
 import hashlib import hashlib
  
Line 255: Line 278:
  
 hashed_data = hash_columns(data, columns=["Email"]) hashed_data = hash_columns(data, columns=["Email"])
-```+</code>
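
The body of **hash_columns** is elided in this comparison view. The sketch below is a separate illustration under stated assumptions. It uses keyed hashing (HMAC-SHA256 from the standard **hmac** and **hashlib** modules) rather than a bare digest, because unkeyed hashes of low-entropy values such as SSNs can be reversed by enumerating candidates. The name **hash_columns_sketch**, the **key** parameter, and the demo key are hypothetical.

```python
import hashlib
import hmac

import pandas as pd


def hash_columns_sketch(data, columns, key):
    """Hypothetical helper: replace values with HMAC-SHA256 hex tokens."""
    hashed = data.copy()
    for column in columns:
        hashed[column] = hashed[column].astype(str).map(
            # Same input and key always give the same 64-character token.
            lambda value: hmac.new(
                key, value.encode("utf-8"), hashlib.sha256
            ).hexdigest()
        )
    return hashed


data = pd.DataFrame({
    "Email": ["alice@example.com", "bob@example.com", "alice@example.com"],
})

# Demo key only; a real key would come from a secret store (assumption).
hashed = hash_columns_sketch(data, columns=["Email"], key=b"demo-key-not-for-production")
print(hashed)
```

Because equal inputs map to equal tokens, hashed columns can still be used for joins and group-bys, which also means repeated values remain visibly linked.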
  
 2. **Encrypting Sensitive Columns:** 2. **Encrypting Sensitive Columns:**
-   - Use encryption for reversible masking (e.g., library `cryptography`).+   - Use encryption for reversible masking (e.g., library **cryptography**).
  
 3. **Handling Multilingual Text in Datasets:** 3. **Handling Multilingual Text in Datasets:**
Line 290: Line 313:
  
 ===== Conclusion ===== ===== Conclusion =====
-The **`ai_data_masking.py`** module provides fast, flexible, and secure masking capabilities for sensitive data. With its Pandas DataFrame integration, logging, and extensibility, this module is a powerful tool for ensuring data privacy in modern AI and data science workflows. Use it to safeguard sensitive columns and create secure datasets that meet the highest privacy standards.+The **ai_data_masking.py** module provides fast, flexible, and secure masking capabilities for sensitive data. With its Pandas DataFrame integration, logging, and extensibility, this module is a powerful tool for ensuring data privacy in modern AI and data science workflows. Use it to safeguard sensitive columns and create secure datasets that meet the highest privacy standards.
ai_data_masking.1748187808.txt.gz · Last modified: 2025/05/25 15:43 by eagleeyenebula