--- ai_data_masking [2025/05/25 15:43] – [1. Masking Sensitive Columns] eagleeyenebula
+++ ai_data_masking [2025/05/25 16:03] (current) – [Basic Example] eagleeyenebula
@@ Line 29: / Line 29: @@
 The **ai_data_masking.py** module helps developers and data scientists:
-. Implement column masking in pandas-style datasets with a flexible and extensible API.
+   * Implement column masking in pandas-style datasets with a flexible and extensible API.
-. Simplify the process of anonymizing sensitive data such as **SSNs**, **credit card details**,
+   * Simplify the process of anonymizing sensitive data such as **SSNs**, **credit card details**,
 **emails**, or **phone numbers**.
-. Prevent accidental data leaks during data transfers, modeling, or exploratory analysis.
+   * Prevent accidental data leaks during data transfers, modeling, or exploratory analysis.
-. Meet strict privacy and data protection standards with minimal effort.
+   * Meet strict privacy and data protection standards with minimal effort.
 ----
@@ Line 68: / Line 68: @@
 The **mask_columns** method performs the following operations:
-. Verifies that each input column exists in the provided DataFrame.
+  * Verifies that each input column exists in the provided DataFrame.
-. Replaces the values in each specified column with the **placeholder string** (`"[MASKED]"` by default).
-. Returns the masked DataFrame.
+  * Replaces the values in each specified column with the **placeholder string** ("[MASKED]" by default).
+  * Returns the masked DataFrame.
 **Input Parameters:**
-. **data:** A Pandas DataFrame containing the data to mask.
-. **columns:** A list of columns in the DataFrame that need to be masked.
+  * **data:** A Pandas DataFrame containing the data to mask.
+  * **columns:** A list of columns in the DataFrame that need to be masked.
 **Output:**
-. Returns the modified DataFrame with the specified columns masked.
-Example Workflow:
+  * Returns the modified DataFrame with the specified columns masked.
-. Import the **DataMasking** class.
-. Specify the sensitive columns (**columns**).
+**Example Workflow:**
-. Use the **mask_columns** method to replace sensitive values.
+  * Import the **DataMasking** class.
+  * Specify the sensitive columns (**columns**).
+  * Use the **mask_columns** method to replace sensitive values.
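The operations listed above can be sketched roughly as follows. This is a minimal illustration, not the module's actual source: the function name `mask_columns_sketch`, the `placeholder` parameter name, the defensive copy, and the choice to log and skip a missing column (rather than raise) are all assumptions inferred from the description.

```python
import logging

import pandas as pd


def mask_columns_sketch(data, columns, placeholder="[MASKED]"):
    """Hypothetical stand-in for DataMasking.mask_columns (not the module's source)."""
    if not isinstance(data, pd.DataFrame):
        raise TypeError("data must be a pandas DataFrame")

    masked = data.copy()  # assumption: leave the caller's DataFrame untouched
    for column in columns:
        if column not in masked.columns:
            # assumption: a missing column is logged and skipped, not fatal
            logging.warning("Column '%s' not found in DataFrame.", column)
            continue
        # Assigning a scalar overwrites every row of the column
        masked[column] = placeholder
    return masked


df = pd.DataFrame({"Name": ["Alice", "Bob"], "SSN": ["123-45-6789", "987-65-4321"]})
result = mask_columns_sketch(df, ["SSN", "Phone"])
print(result)
```

Working on a copy keeps the unmasked data available to the caller; whether the real method copies or mutates in place is not stated in the page.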
 ==== 2. Error Handling and Logging ====
@@ Line 90: / Line 99: @@
 **Logging Examples:**
-```plaintext
+<code>
 INFO:root:Masking sensitive columns...
 WARNING:root:Column 'SSN' not found in DataFrame.
 ERROR:root:Failed to mask data: Invalid DataFrame input
-```
+</code>
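The message shapes above match the default `LEVEL:logger:message` layout of Python's `logging` module. Below is a small sketch of code that would emit the first two lines, assuming the module logs through the root logger. The helper `report_missing_columns` is hypothetical and takes plain lists of column names so that it runs without pandas.

```python
import io
import logging

# Route root-logger output to a buffer using the default "LEVEL:name:message" layout.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter(logging.BASIC_FORMAT))
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)


def report_missing_columns(available, requested):
    """Hypothetical helper: log each requested column that is absent."""
    logging.info("Masking sensitive columns...")
    missing = [name for name in requested if name not in available]
    for name in missing:
        logging.warning("Column '%s' not found in DataFrame.", name)
    return missing


missing = report_missing_columns(["Name", "Email"], ["Email", "SSN"])
log_text = buffer.getvalue()
print(log_text)
```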
 ----
@@ Line 103: / Line 113: @@
 ==== Required Libraries ====
-  * **`pandas`:** For managing structured datasets.
+  * **pandas:** For managing structured datasets.
-  * **`logging`:** For capturing warning, error, and info messages.
+  * **logging:** For capturing warning, error, and info messages.
 ==== Installation ====
 To install the required dependencies, run:
-```bash
+<code bash>
 pip install pandas
-```
+</code>
 ----
@@ Line 116: / Line 127: @@
 ===== Usage =====
-Below are examples of how to use the `DataMasking` module for masking sensitive columns.
+Below are examples of how to use the **DataMasking** module for masking sensitive columns.
 ==== Basic Example ====
-Mask specific columns using the default placeholder `"[MASKED]"`.
+Mask specific columns using the default placeholder **"[MASKED]"**.
-```python
+<code python>
 import pandas as pd
 from ai_data_masking import DataMasking
@@ Line 131: / Line 143: @@
     'SSN': ['123-45-6789', '987-65-4321', '567-89-1234']
 })
+</code>
-# Mask sensitive columns
+**Mask sensitive columns:**
+<code python>
 masked_data = DataMasking.mask_columns(data, columns=['Email', 'SSN'])
 print(masked_data)
-```
+</code>
 **Output:**
-```plaintext
+<code>
       Name            Email          SSN
     Alice         [MASKED]     [MASKED]
       Bob         [MASKED]     [MASKED]
   Charlie         [MASKED]     [MASKED]
-```
+</code>
 ----
@@ Line 151: / Line 166: @@
 === 1. Using Custom Placeholders ===
-Replace sensitive values with a custom placeholder string instead of `[MASKED]`.
+Replace sensitive values with a custom placeholder string instead of **[MASKED]**.
-```python
+<code python>
 # Custom placeholder
 def mask_columns_with_custom_placeholder(data, columns, placeholder="***REDACTED***"):
@@ Line 164: / Line 180: @@
 print(masked_data)
-```
+</code>
 **Output:**
-```plaintext
+<code>
       Name            Email         SSN
     Alice  alice@example.com  ***REDACTED***
       Bob    bob@example.com  ***REDACTED***
   Charlie  charlie@example.com  ***REDACTED***
-```
+</code>
 ----
@@ Line 179: / Line 196: @@
 Mask data based on a condition (e.g., SSNs starting with specific prefixes).
-```python
+<code python>
 # Condition-based masking
 def mask_conditionally(data, column, condition):
@@ Line 189: / Line 207: @@
 print(masked_data)
-```
+</code>
 **Output:**
-```plaintext
+<code>
       Name            Email                     SSN
     Alice  alice@example.com  CONDITIONALLY MASKED
       Bob    bob@example.com            987-65-4321
   Charlie  charlie@example.com            567-89-1234
-```
+</code>
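The hunk above elides the body of `mask_conditionally`, so the following is a separate, hedged sketch of one way to mask only the rows that satisfy a predicate. The name `mask_where`, its `placeholder` parameter, and the prefix check are illustrative assumptions, not the page's implementation.

```python
import pandas as pd


def mask_where(data, column, condition, placeholder="CONDITIONALLY MASKED"):
    """Hypothetical sketch: mask only the rows where condition(value) is true."""
    masked = data.copy()
    # Evaluate the predicate per value to build a boolean row selector
    hits = masked[column].map(condition).astype(bool)
    masked.loc[hits, column] = placeholder
    return masked


people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "SSN": ["123-45-6789", "987-65-4321", "567-89-1234"],
})
conditionally_masked = mask_where(people, "SSN", lambda ssn: ssn.startswith("123"))
print(conditionally_masked)
```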
 ----
 === 3. Integrating Data Masking into Pipelines ===
-Integrate `DataMasking` into a larger data transformation pipeline.
+Integrate **DataMasking** into a larger data transformation pipeline.
-```python
+<code python>
 from sklearn.pipeline import Pipeline
@@ Line 213: / Line 233: @@
     def transform(self, data):
         return DataMasking.mask_columns(data, self.columns)
+</code>
-# Example pipeline
+**Example pipeline:**
+<code python>
 pipeline = Pipeline([
     ('masking', MaskingTransformer(columns=['Email', 'SSN']))
 ])
+</code>
-# Apply masking and other preprocessing
+**Apply masking and other preprocessing:**
+<code python>
 masked_data = pipeline.named_steps['masking'].transform(data)
 print(masked_data)
-```
+</code>
 ----
@@ Line 230: / Line 252: @@
    - Perform exploratory analysis to identify sensitive columns requiring masking.
 . **Use Custom Placeholders:**
-   - Replace sensitive values with descriptive placeholders for better clarity (e.g., `"[ANONYMIZED SSN]"`).
+   - Replace sensitive values with descriptive placeholders for better clarity (e.g., **"[ANONYMIZED SSN]"**).
 . **Mask Early:**
    - Mask sensitive columns before any export or sharing of the dataset.
@@ Line 242: / Line 264: @@
 . **Hashing for Obfuscation:**
-   - Replace sensitive values with hashed tokens using libraries like `hashlib`.
+   - Replace sensitive values with hashed tokens using libraries like **hashlib**.
 **Example: Hashing Columns**
-```python
+<code python>
 import hashlib
@@ Line 255: / Line 278: @@
 hashed_data = hash_columns(data, columns=["Email"])
-```
+</code>
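Because the hunk above omits the body of `hash_columns`, here is a separate, hedged sketch of per-value hashing with `hashlib`. The names `hash_value` and `salt` are illustrative and not part of the module. With pandas, the same function could be applied to a column via `data["Email"].map(hash_value)`.

```python
import hashlib


def hash_value(value, salt=""):
    """Return a SHA-256 hex digest of str(value), optionally prefixed with a salt."""
    payload = f"{salt}{value}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


emails = ["alice@example.com", "bob@example.com", "alice@example.com"]
tokens = [hash_value(e, salt="demo-salt") for e in emails]

# Equal inputs give equal tokens, so joins and group-bys still work on the hashed column.
print(tokens[0] == tokens[2])
print(len(tokens[0]))  # SHA-256 hex digests are 64 characters long
```

Unsalted hashes of low-entropy values such as SSNs can be recovered by enumerating the input space, which is why the sketch accepts a salt that should be kept secret.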
 . **Encrypting Sensitive Columns:**
-   - Use encryption for reversible masking (e.g., library `cryptography`).
+   - Use encryption for reversible masking (e.g., library **cryptography**).
 . **Handling Multilingual Text in Datasets:**
@@ Line 290: / Line 313: @@
 ===== Conclusion =====
-The **`ai_data_masking.py`** module provides fast, flexible, and secure masking capabilities for sensitive data. With its Pandas DataFrame integration, logging, and extensibility, this module is a powerful tool for ensuring data privacy in modern AI and data science workflows. Use it to safeguard sensitive columns and create secure datasets that meet the highest privacy standards.
+The **ai_data_masking.py** module provides fast, flexible, and secure masking capabilities for sensitive data. With its Pandas DataFrame integration, logging, and extensibility, this module is a powerful tool for ensuring data privacy in modern AI and data science workflows. Use it to safeguard sensitive columns and create secure datasets that meet the highest privacy standards.