ai_data_masking
Differences
This shows you the differences between two versions of the page.
| ai_data_masking [2025/05/25 15:43] – [1. Masking Sensitive Columns] eagleeyenebula | ai_data_masking [2025/05/25 16:03] (current) – [Basic Example] eagleeyenebula | ||
|---|---|---|---|
| Line 29: | Line 29: | ||
| The **ai_data_masking.py** module helps developers and data scientists: | The **ai_data_masking.py** module helps developers and data scientists: | ||
| - | 1. Implement column masking in pandas-style datasets with a flexible and extensible API. | + | |
| - | 2. Simplify the process of anonymizing sensitive data such as **SSNs**, **credit card details**, | + | |
| **emails**, or **phone numbers**. | **emails**, or **phone numbers**. | ||
| - | 3. Prevent accidental data leaks during data transfers, modeling, or exploratory analysis. | + | |
| - | 4. Meet strict privacy and data protection standards with minimal effort. | + | |
| ---- | ---- | ||
| Line 68: | Line 68: | ||
| The **mask_columns** method performs the following operations: | The **mask_columns** method performs the following operations: | ||
| - | 1. Verifies that each input column exists in the provided DataFrame. | + | * Verifies that each input column exists in the provided DataFrame. |
| - | 2. Replaces the values in each specified column with the **placeholder string** (`" | + | |
| - | 3. Returns the masked DataFrame. | + | * Replaces the values in each specified column with the **placeholder string** (" |
| + | |||
| + | * Returns the masked DataFrame. | ||
| **Input Parameters: | **Input Parameters: | ||
| - | 1. **data:** A Pandas DataFrame containing the data to mask. | + | |
| - | 2. **columns: | + | * **data:** A Pandas DataFrame containing the data to mask. |
| + | |||
| + | * **columns: | ||
| **Output:** | **Output:** | ||
| - | 1. Returns the modified DataFrame with the specified columns masked. | ||
| - | Example Workflow: | + | * Returns the modified DataFrame with the specified columns masked. |
| - | 1. Import the **DataMasking** class. | + | |
| - | 2. Specify the sensitive columns (**columns**). | + | **Example Workflow:** |
| - | 3. Use the **mask_columns** method to replace sensitive values. | + | |
| + | * Import the **DataMasking** class. | ||
| + | |||
| + | * Specify the sensitive columns (**columns**). | ||
| + | |||
| + | * Use the **mask_columns** method to replace sensitive values. | ||
| ==== 2. Error Handling and Logging ==== | ==== 2. Error Handling and Logging ==== | ||
| Line 90: | Line 99: | ||
| **Logging Examples:** | **Logging Examples:** | ||
| - | ```plaintext | + | < |
| + | plaintext | ||
| INFO: | INFO: | ||
| WARNING: | WARNING: | ||
| ERROR: | ERROR: | ||
| - | ``` | + | </ |
| ---- | ---- | ||
| Line 103: | Line 113: | ||
| ==== Required Libraries ==== | ==== Required Libraries ==== | ||
| - | * **`pandas`:** For managing structured datasets. | + | * **pandas:** For managing structured datasets. |
| - | * **`logging`:** For capturing warning, error, and info messages. | + | * **logging: |
| ==== Installation ==== | ==== Installation ==== | ||
| To install the required dependencies, | To install the required dependencies, | ||
| - | ```bash | + | < |
| + | bash | ||
| pip install pandas | pip install pandas | ||
| - | ``` | + | </ |
| ---- | ---- | ||
| Line 116: | Line 127: | ||
| ===== Usage ===== | ===== Usage ===== | ||
| - | Below are examples of how to use the `DataMasking` module for masking sensitive columns. | + | Below are examples of how to use the **DataMasking** module for masking sensitive columns. |
| ==== Basic Example ==== | ==== Basic Example ==== | ||
| - | Mask specific columns using the default placeholder | + | Mask specific columns using the default placeholder |
| - | ```python | + | < |
| + | python | ||
| import pandas as pd | import pandas as pd | ||
| from ai_data_masking import DataMasking | from ai_data_masking import DataMasking | ||
| Line 131: | Line 143: | ||
| ' | ' | ||
| }) | }) | ||
| - | + | </ | |
| - | # Mask sensitive columns | + | # **Mask sensitive columns** |
| + | < | ||
| masked_data = DataMasking.mask_columns(data, | masked_data = DataMasking.mask_columns(data, | ||
| print(masked_data) | print(masked_data) | ||
| - | ``` | + | </ |
| **Output:** | **Output:** | ||
| - | ```plaintext | + | |
| + | < | ||
| + | plaintext | ||
| Name Email SSN | Name Email SSN | ||
| 0 Alice | 0 Alice | ||
| 1 Bob | 1 Bob | ||
| 2 Charlie | 2 Charlie | ||
| - | ``` | + | </ |
| ---- | ---- | ||
| Line 151: | Line 166: | ||
| === 1. Using Custom Placeholders === | === 1. Using Custom Placeholders === | ||
| - | Replace sensitive values with a custom placeholder string instead of `[MASKED]`. | + | Replace sensitive values with a custom placeholder string instead of **[MASKED]**. |
| - | ```python | + | < |
| + | python | ||
| # Custom placeholder | # Custom placeholder | ||
| def mask_columns_with_custom_placeholder(data, | def mask_columns_with_custom_placeholder(data, | ||
| Line 164: | Line 180: | ||
| print(masked_data) | print(masked_data) | ||
| - | ``` | + | </ |
| **Output:** | **Output:** | ||
| - | ```plaintext | + | < |
| + | plaintext | ||
| Name Email SSN | Name Email SSN | ||
| 0 Alice alice@example.com | 0 Alice alice@example.com | ||
| 1 Bob bob@example.com | 1 Bob bob@example.com | ||
| 2 Charlie | 2 Charlie | ||
| - | ``` | + | </ |
| ---- | ---- | ||
| Line 179: | Line 196: | ||
| Mask data based on a condition (e.g., SSNs starting with specific prefixes). | Mask data based on a condition (e.g., SSNs starting with specific prefixes). | ||
| - | ```python | + | < |
| + | python | ||
| # Condition-based masking | # Condition-based masking | ||
| def mask_conditionally(data, | def mask_conditionally(data, | ||
| Line 189: | Line 207: | ||
| print(masked_data) | print(masked_data) | ||
| - | ``` | + | </ |
| **Output:** | **Output:** | ||
| - | ```plaintext | + | < |
| + | plaintext | ||
| Name Email SSN | Name Email SSN | ||
| 0 Alice alice@example.com | 0 Alice alice@example.com | ||
| 1 Bob bob@example.com | 1 Bob bob@example.com | ||
| 2 Charlie | 2 Charlie | ||
| - | ``` | + | </ |
| ---- | ---- | ||
| === 3. Integrating Data Masking into Pipelines === | === 3. Integrating Data Masking into Pipelines === | ||
| - | Integrate | + | Integrate |
| - | ```python | + | < |
| + | python | ||
| from sklearn.pipeline import Pipeline | from sklearn.pipeline import Pipeline | ||
| Line 213: | Line 233: | ||
| def transform(self, | def transform(self, | ||
| return DataMasking.mask_columns(data, | return DataMasking.mask_columns(data, | ||
| - | + | </ | |
| - | # Example pipeline | + | # **Example pipeline** |
| + | < | ||
| pipeline = Pipeline([ | pipeline = Pipeline([ | ||
| (' | (' | ||
| ]) | ]) | ||
| - | + | </ | |
| - | # Apply masking and other preprocessing | + | # **Apply masking and other preprocessing** |
| + | < | ||
| masked_data = pipeline.named_steps[' | masked_data = pipeline.named_steps[' | ||
| print(masked_data) | print(masked_data) | ||
| - | ``` | + | </ |
| ---- | ---- | ||
| Line 230: | Line 252: | ||
| - Perform exploratory analysis to identify sensitive columns requiring masking. | - Perform exploratory analysis to identify sensitive columns requiring masking. | ||
| 2. **Use Custom Placeholders: | 2. **Use Custom Placeholders: | ||
| - | - Replace sensitive values with descriptive placeholders for better clarity (e.g., | + | - Replace sensitive values with descriptive placeholders for better clarity (e.g., |
| 3. **Mask Early:** | 3. **Mask Early:** | ||
| - Mask sensitive columns before any export or sharing of the dataset. | - Mask sensitive columns before any export or sharing of the dataset. | ||
| Line 242: | Line 264: | ||
| 1. **Hashing for Obfuscation: | 1. **Hashing for Obfuscation: | ||
| - | - Replace sensitive values with hashed tokens using libraries like `hashlib`. | + | - Replace sensitive values with hashed tokens using libraries like **hashlib**. |
| **Example: Hashing Columns** | **Example: Hashing Columns** | ||
| - | ```python | + | < |
| + | python | ||
| import hashlib | import hashlib | ||
| Line 255: | Line 278: | ||
| hashed_data = hash_columns(data, | hashed_data = hash_columns(data, | ||
| - | ``` | + | </ |
| 2. **Encrypting Sensitive Columns:** | 2. **Encrypting Sensitive Columns:** | ||
| - | - Use encryption for reversible masking (e.g., library | + | - Use encryption for reversible masking (e.g., library |
| 3. **Handling Multilingual Text in Datasets:** | 3. **Handling Multilingual Text in Datasets:** | ||
| Line 290: | Line 313: | ||
| ===== Conclusion ===== | ===== Conclusion ===== | ||
| - | The **`ai_data_masking.py`** module provides fast, flexible, and secure masking capabilities for sensitive data. With its Pandas DataFrame integration, | + | The **ai_data_masking.py** module provides fast, flexible, and secure masking capabilities for sensitive data. With its Pandas DataFrame integration, |
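
The code cells in this diff are cut off, so the following is only a minimal sketch of the behavior the page describes for **mask_columns** (check that each column exists, replace its values with the placeholder, return the masked DataFrame), together with a small **hashlib**-based variant. It is not the module's actual source. The standalone function signatures, the choice to skip missing columns with a warning, the use of SHA-256, and the sample SSN values are all assumptions made for illustration.

```python
import hashlib
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def mask_columns(data, columns, placeholder="[MASKED]"):
    """Return a copy of `data` with each listed column set to `placeholder`.

    Assumption: a missing column is logged and skipped rather than raising.
    """
    masked = data.copy()  # leave the caller's DataFrame untouched
    for column in columns:
        if column not in masked.columns:
            logger.warning("Column %r not found; skipping.", column)
            continue
        masked[column] = placeholder  # scalar is broadcast to every row
    return masked


def hash_columns(data, columns):
    """Return a copy of `data` with each listed column replaced by SHA-256 hex digests.

    Assumption: values are converted to str and encoded as UTF-8 before hashing.
    """
    hashed = data.copy()
    for column in columns:
        if column not in hashed.columns:
            logger.warning("Column %r not found; skipping.", column)
            continue
        hashed[column] = hashed[column].astype(str).map(
            lambda value: hashlib.sha256(value.encode("utf-8")).hexdigest()
        )
    return hashed


# Illustrative data only; the SSN values are made up.
data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Email": ["alice@example.com", "bob@example.com", "charlie@example.com"],
    "SSN": ["000-00-0001", "000-00-0002", "000-00-0003"],
})

print(mask_columns(data, ["Email", "SSN"]))
print(mask_columns(data, ["SSN"], placeholder="[REDACTED]"))
print(hash_columns(data, ["SSN"]))
```

Both helpers work on a copy, so the original DataFrame stays available to code that is allowed to see it. Note that unsalted hashes of low-entropy values such as SSNs can be recovered by brute force, so hashing in this form is obfuscation, not strong anonymization.
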
ai_data_masking.1748187808.txt.gz · Last modified: 2025/05/25 15:43 by eagleeyenebula
