The ai_data_masking.py module provides a robust and flexible framework for masking or encrypting sensitive data within datasets. It aims to prevent data leakage, secure sensitive columns, and ensure compliance with data protection and privacy regulations such as GDPR, HIPAA, and CCPA.
This module allows users to:
The associated ai_data_masking.html file provides additional functionalities such as interactive tutorials, advanced encryption strategies, and visual demonstrations.
Data security is a critical concern in modern data workflows. The DataMasking class provides tools to safeguard sensitive columns in structured datasets by replacing their original values with masked placeholders. Masking sensitive data is especially useful in:
The ai_data_masking.py module helps developers and data scientists:
emails, or phone numbers.
The DataMasking module offers the following key features:
1. Masks specified columns in a dataset by replacing their values with a fixed placeholder (“[MASKED]” by default).
2. Accepts user-defined placeholders for more tailored anonymization needs.
3. Supports advanced masking rules through condition-based masking.
4. Logs errors gracefully if masking fails (e.g., invalid column names).
5. Works directly with Pandas DataFrames, the most commonly used format for structured data analysis in Python.
The DataMasking class provides a single core method: mask_columns(data, columns).
The mask_columns method performs the following operations:
Input Parameters:
Output:
Example Workflow:
The module includes robust error handling:
Logging Examples:
plaintext
INFO:root:Masking sensitive columns...
WARNING:root:Column 'SSN' not found in DataFrame.
ERROR:root:Failed to mask data: Invalid DataFrame input
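The module's source is not reproduced here, so the following is only a minimal sketch of how a method with the behavior and log messages described above might be structured. The class name `DataMaskingSketch`, the `placeholder` parameter, and the choice to return `None` on failure are assumptions made for illustration, not the module's actual implementation.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)


class DataMaskingSketch:
    """Hypothetical stand-in for DataMasking; not the module's actual code."""

    @staticmethod
    def mask_columns(data, columns, placeholder="[MASKED]"):
        logging.info("Masking sensitive columns...")
        try:
            if not isinstance(data, pd.DataFrame):
                raise TypeError("Invalid DataFrame input")
            masked = data.copy()  # leave the caller's DataFrame untouched
            for col in columns:
                if col in masked.columns:
                    masked[col] = placeholder
                else:
                    logging.warning("Column '%s' not found in DataFrame.", col)
            return masked
        except Exception as exc:
            logging.error("Failed to mask data: %s", exc)
            # Assumption: return None rather than the unmasked input,
            # so a failure cannot silently pass sensitive values through.
            return None


sample = pd.DataFrame({"Name": ["Alice"], "Email": ["alice@example.com"]})
result = DataMaskingSketch.mask_columns(sample, ["Email", "SSN"])
print(result)
```

In this sketch a missing column produces a warning and is skipped, while an input that is not a DataFrame produces an error log entry instead of an exception.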
The module requires the following Python packages:
To install the required dependencies, run:
bash
pip install pandas
Below are examples of how to use the DataMasking module for masking sensitive columns.
Mask specific columns using the default placeholder “[MASKED]”.
python
import pandas as pd
from ai_data_masking import DataMasking
# Create a sample dataset
data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com'],
    'SSN': ['123-45-6789', '987-65-4321', '567-89-1234']
})

# Mask sensitive columns
masked_data = DataMasking.mask_columns(data, columns=['Email', 'SSN'])
print(masked_data)
Output:
plaintext
      Name     Email       SSN
0    Alice  [MASKED]  [MASKED]
1      Bob  [MASKED]  [MASKED]
2  Charlie  [MASKED]  [MASKED]
Replace sensitive values with a custom placeholder string instead of [MASKED].
python
# Custom placeholder
def mask_columns_with_custom_placeholder(data, columns, placeholder="***REDACTED***"):
    data = data.copy()  # work on a copy so the caller's DataFrame is not modified
    for col in columns:
        if col in data.columns:
            data[col] = placeholder
    return data

masked_data = mask_columns_with_custom_placeholder(data, columns=["SSN"], placeholder="***REDACTED***")
print(masked_data)
Output:
plaintext
      Name                Email             SSN
0    Alice    alice@example.com  ***REDACTED***
1      Bob      bob@example.com  ***REDACTED***
2  Charlie  charlie@example.com  ***REDACTED***
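For more tailored anonymization, each column can be given its own placeholder. The sketch below is an illustration only: `mask_with_placeholders` and its dictionary argument are hypothetical and are not part of the DataMasking API.

```python
import pandas as pd


def mask_with_placeholders(data, placeholders):
    """Return a copy of `data` with each listed column set to its placeholder.

    `placeholders` maps column name -> placeholder string (hypothetical helper).
    """
    masked = data.copy()
    for col, placeholder in placeholders.items():
        if col in masked.columns:  # silently skip columns that are absent
            masked[col] = placeholder
    return masked


people = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Email": ["alice@example.com", "bob@example.com"],
    "SSN": ["123-45-6789", "987-65-4321"],
})
masked_people = mask_with_placeholders(
    people, {"Email": "[EMAIL REMOVED]", "SSN": "***-**-****"}
)
print(masked_people)
```

A format-preserving placeholder such as `***-**-****` can be useful when downstream code expects a value that still looks like the original field.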
Mask data based on a condition (e.g., SSNs starting with specific prefixes).
python
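# Condition-based masking
def mask_conditionally(data, column, condition):
    data = data.copy()  # work on a copy so the caller's DataFrame is not modified
    # `condition` receives the column as a Series and returns a boolean mask
    data.loc[condition(data[column]), column] = "CONDITIONALLY MASKED"
    return data

# Mask all SSNs starting with '123'
masked_data = mask_conditionally(data, "SSN", lambda col: col.str.startswith("123"))
print(masked_data)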
Output:
plaintext
      Name                Email                   SSN
0    Alice    alice@example.com  CONDITIONALLY MASKED
1      Bob      bob@example.com           987-65-4321
2  Charlie  charlie@example.com           567-89-1234
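A related rule-based variant is partial masking, where only part of each value is hidden. The sketch below keeps the last four digits of an identifier and masks the rest. The helper name `mask_all_but_last4` and the regular expression are assumptions for illustration and are not part of the DataMasking module.

```python
import re

import pandas as pd

# Matches any digit that is followed by at least four more digits,
# ignoring separators such as '-' in between.
_DIGIT_BEFORE_LAST_FOUR = re.compile(r"\d(?=(?:\D*\d){4})")


def mask_all_but_last4(series, mask_char="*"):
    """Replace every digit except the last four with `mask_char` (hypothetical helper)."""
    return series.map(
        lambda value: _DIGIT_BEFORE_LAST_FOUR.sub(mask_char, str(value)),
        na_action="ignore",  # leave missing values as they are
    )


ssns = pd.Series(["123-45-6789", "987-65-4321"])
partially_masked = mask_all_but_last4(ssns)
print(partially_masked.tolist())  # ['***-**-6789', '***-**-4321']
```

Partial masking keeps some of the original information, so whether it is acceptable depends on the privacy requirements of the dataset.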
Integrate DataMasking into a larger data transformation pipeline.
python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class MaskingTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, data, y=None):
        # Masking is stateless; fit is defined because Pipeline steps are expected to have it
        return self

    def transform(self, data):
        return DataMasking.mask_columns(data, self.columns)

# Example pipeline
pipeline = Pipeline([
    ('masking', MaskingTransformer(columns=['Email', 'SSN']))
])

# Apply the masking step
masked_data = pipeline.named_steps['masking'].transform(data)
print(masked_data)
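When scikit-learn is not otherwise needed, a lighter way to chain masking with other steps is pandas' own `DataFrame.pipe`. The following is a sketch under stated assumptions: `mask_step` is a hypothetical stand-in for `DataMasking.mask_columns` so that the example runs on its own, and `normalize_names` is an arbitrary example of an unrelated preprocessing step.

```python
import pandas as pd


def mask_step(data, columns, placeholder="[MASKED]"):
    """Stand-in for DataMasking.mask_columns so the example is self-contained."""
    masked = data.copy()
    for col in columns:
        if col in masked.columns:
            masked[col] = placeholder
    return masked


def normalize_names(data):
    """Example of an unrelated preprocessing step: trim and title-case names."""
    out = data.copy()
    out["Name"] = out["Name"].str.strip().str.title()
    return out


raw = pd.DataFrame({
    "Name": ["  alice ", "BOB"],
    "Email": ["alice@example.com", "bob@example.com"],
})

processed = (
    raw
    .pipe(mask_step, columns=["Email"])  # mask first, so later steps never see raw values
    .pipe(normalize_names)
)
print(processed)
```

Placing the masking step first means that every later step, including any logging it does, only ever receives masked values.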
1. Analyze Data Before Masking:
2. Use Custom Placeholders:
3. Mask Early:
4. Test the Masking Code on Subsets:
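Points 1 and 4 can be sketched together: scan a sample of the data for columns that look sensitive, mask a small subset, and check that no original value survives. The helper names `find_email_like_columns` and `assert_no_leak`, and the simplified email pattern, are assumptions for illustration and are not part of the DataMasking module.

```python
import re

import pandas as pd

# Deliberately simple pattern; real email validation is more involved.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def find_email_like_columns(data, sample_size=100):
    """Return names of columns whose first `sample_size` rows contain email-like values."""
    sample = data.head(sample_size)
    hits = []
    for col in sample.columns:
        values = sample[col].dropna().astype(str)
        if any(EMAIL_PATTERN.fullmatch(v) for v in values):
            hits.append(col)
    return hits


def assert_no_leak(original, masked, columns):
    """Fail if any original value of `columns` still appears in the masked frame."""
    for col in columns:
        leaked = set(original[col].dropna()) & set(masked[col].dropna())
        assert not leaked, f"Unmasked values remain in column {col!r}"


records = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Contact": ["alice@example.com", "bob@example.com"],
})
candidates = find_email_like_columns(records)

# Try the masking on a small subset before running it on the full dataset.
subset = records.head(1)
masked_subset = subset.copy()
for col in candidates:
    masked_subset[col] = "[MASKED]"
assert_no_leak(subset, masked_subset, candidates)
print(candidates)
```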
The DataMasking module can be extended to handle advanced requirements:
1. Hashing for Obfuscation:
Example: Hashing Columns
python
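import hashlib

def hash_columns(data, columns):
    data = data.copy()  # work on a copy so the caller's DataFrame is not modified
    for col in columns:
        if col in data.columns:
            # str() lets non-string values (e.g. numbers) be hashed without an AttributeError
            data[col] = data[col].apply(lambda x: hashlib.sha256(str(x).encode()).hexdigest())
    return data

hashed_data = hash_columns(data, columns=["Email"])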
2. Encrypting Sensitive Columns:
3. Handling Multilingual Text in Datasets:
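The source gives no detail for item 3, so the following reflects an assumption about one relevant concern. When hashing text that contains non-ASCII characters, visually identical strings can be encoded with different code points and therefore produce different digests. Normalizing to a single Unicode form first avoids this. The helper name `hash_text` is hypothetical.

```python
import hashlib
import unicodedata


def hash_text(value, form="NFC"):
    """Normalize to a single Unicode form, then return the SHA-256 hex digest."""
    normalized = unicodedata.normalize(form, str(value))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


composed = "Jos\u00e9"     # 'é' as a single code point
decomposed = "Jose\u0301"  # 'e' followed by a combining acute accent

print(composed == decomposed)                        # False: different code points
print(hash_text(composed) == hash_text(decomposed))  # True: same digest after NFC
```

Without the normalization step, the same name entered on two different systems could hash to two different values, which would break joins or deduplication on the hashed column.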
The DataMasking module can be used with:
The following improvements are planned or could enhance this module:
1. Dynamic Masking:
2. Reversible Encryption Masking:
3. Integration with Emerging Privacy Libraries:
The ai_data_masking.py module provides flexible masking for sensitive columns in Pandas DataFrames, with logging and room for extension through custom placeholders, conditional rules, and hashing. Use it to safeguard sensitive columns and to support compliance with regulations such as GDPR, HIPAA, and CCPA.