Introduction
ai_data_masking.py is a dedicated script in the G.O.D. Framework that protects sensitive data by applying masking techniques. It supports compliance with data privacy laws and security best practices by replacing or obfuscating personally identifiable information (PII) and other confidential data elements.
Purpose
- Data Privacy: Comply with regulations like GDPR, HIPAA, and CCPA by securing sensitive data.
- Risk Mitigation: Minimize the exposure of sensitive data in non-secure environments.
- Data Transformation: Enable the use of production-like data in non-production environments without exposing the real sensitive values.
- Enhanced Security: Safeguard confidential data from unauthorized access.
Key Features
- Attribute-Based Masking: Mask certain attributes (e.g., SSN, credit card numbers, email addresses) selectively.
- Format Preservation: Replace sensitive information while preserving its original structure (e.g., masking phone numbers as XXX-XXX-1234).
- Custom Rules: Allow users to define masking rules for domain-specific data.
- Pseudonymization: Transform sensitive data into non-identifiable tokens that can be used instead.
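The pseudonymization feature is described above but not shown in the script. One possible approach, sketched here under stated assumptions, is a keyed hash (HMAC-SHA256) that maps each value to a stable token. The function name pseudonymize, the SECRET_KEY constant, and the tok_ prefix are hypothetical and are not part of ai_data_masking.py.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real deployment would load it
# from a secrets manager rather than hard-coding it.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(value, key=SECRET_KEY, length=12):
    """Return a deterministic token for a value using HMAC-SHA256."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncating shortens the token but raises the chance of collisions.
    return "tok_" + digest[:length]


if __name__ == "__main__":
    first = pseudonymize("user1@example.com")
    again = pseudonymize("user1@example.com")
    other = pseudonymize("user2@domain.com")
    print(first == again)  # the same input yields the same token
    print(first == other)  # different inputs yield different tokens
```

Because the token is deterministic, joins and group-by operations on the masked column still work, while the original value cannot be read back from the token without the key and a guess at the input.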
Logic and Implementation
The script operates by detecting target attributes in a dataset, applying specified masking rules to obfuscate sensitive information, and returning the masked dataset. Below are the key steps:
- Identify attributes requiring masking (e.g., PII fields).
- Apply masking or pseudonymization techniques based on predefined rules.
- Ensure masked values retain compatibility with downstream systems.
import re
import random  # not used by the default rules below


class DataMasker:
    def __init__(self, rules=None):
        """
        Initialize the DataMasker with custom or default rules.
        :param rules: A dictionary mapping attribute names to masking functions.
        """
        self.rules = rules if rules else {
            "email": self.mask_email,
            "phone": self.mask_phone,
            "credit_card": self.mask_credit_card,
        }

    def mask_email(self, value):
        """
        Mask an email address, retaining the domain.
        """
        domain = value.split("@")[-1]
        return "masked_user@" + domain

    def mask_phone(self, value):
        """
        Mask a phone number while keeping the last four digits.
        Example: '123-456-7890' -> 'XXX-XXX-7890'
        """
        # The lookahead skips separators, so a digit is masked whenever
        # at least four more digits follow it anywhere in the string.
        return re.sub(r"\d(?=(?:\D*\d){4})", "X", value)

    def mask_credit_card(self, value):
        """
        Mask all digits except the last four in a credit card number.
        Example: '1234-5678-1234-5678' -> 'XXXX-XXXX-XXXX-5678'
        """
        return re.sub(r"\d(?=(?:\D*\d){4})", "X", value)

    def mask_dataset(self, dataset, columns_to_mask):
        """
        Apply masking rules to the specified columns in the dataset.
        :param dataset: A dictionary of lists or a pandas DataFrame.
        :param columns_to_mask: A list of columns to mask.
        :return: Masked dataset.
        """
        masked_data = dataset.copy()
        for column in columns_to_mask:
            # Skip columns that are missing or have no masking rule.
            if column in masked_data and column in self.rules:
                rule = self.rules[column]
                values = masked_data[column]
                if hasattr(values, "apply"):
                    # pandas Series: apply the rule element-wise.
                    masked_data[column] = values.apply(rule)
                else:
                    # Plain list: build a new list so the input is untouched.
                    masked_data[column] = [rule(v) for v in values]
        return masked_data


if __name__ == "__main__":
    # Sample dataset
    sample_data = {
        "email": ["user1@example.com", "user2@domain.com"],
        "phone": ["123-456-7890", "987-654-3210"],
        "credit_card": ["1234-5678-1234-5678", "4321-8765-4321-8765"],
    }
    masker = DataMasker()
    masked_result = masker.mask_dataset(sample_data, ["email", "phone", "credit_card"])
    print(masked_result)
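The first step listed above, identifying attributes that require masking, is left to the caller in the script, which expects the column names to be passed in. A minimal sketch of one way to guess them from the values follows. The detect_pii_columns helper, the PII_PATTERNS table, and the 0.8 threshold are assumptions made for illustration, and the patterns cover only the formats used in the sample data.

```python
import re

# Illustrative patterns only; real data usually needs stricter validation.
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
    "credit_card": re.compile(r"^\d{4}-\d{4}-\d{4}-\d{4}$"),
}


def detect_pii_columns(dataset, threshold=0.8):
    """Map column name -> guessed PII type for a dict of lists."""
    detected = {}
    for column, values in dataset.items():
        if not values:
            continue
        for pii_type, pattern in PII_PATTERNS.items():
            matches = sum(
                1 for v in values if isinstance(v, str) and pattern.match(v)
            )
            # Flag the column when most of its values fit the pattern.
            if matches / len(values) >= threshold:
                detected[column] = pii_type
                break
    return detected


if __name__ == "__main__":
    data = {
        "contact": ["user1@example.com", "user2@domain.com"],
        "mobile": ["123-456-7890", "987-654-3210"],
        "name": ["Ann", "Bob"],
    }
    print(detect_pii_columns(data))
```

The result maps each flagged column to a rule name, so it could be used to choose which masking function to apply to which column. Pattern matching of this kind produces false positives and misses, so a manual review of the detected columns is still advisable.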
Dependencies
This script relies mainly on Python's standard library; pandas is an optional third-party package:
- re: For regular expressions to format and mask strings.
- random: For generating pseudonymized or randomized values.
- pandas (optional): Can be used for tabular datasets requiring masking.
How to Use This Script
Follow these steps to use ai_data_masking.py effectively:
- Integrate the script into your preprocessing pipeline.
- Define masking rules for specific attributes or use the provided defaults.
- Load the dataset and specify which columns require masking.
- Run the mask_dataset method to generate a masked version of the dataset.
# Example Usage
from ai_data_masking import DataMasker

data = {
    "email": ["user3@gmail.com", "sample@yahoo.com"],
    "phone": ["345-678-9012", "789-123-4567"],
    "credit_card": ["5678-1234-5678-9012", "4321-9876-4321-1234"],
}
masker = DataMasker()
masked_data = masker.mask_dataset(data, ["email", "phone", "credit_card"])
print(masked_data)
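The step about defining masking rules for specific attributes can be illustrated with a short sketch. Rules are plain callables keyed by column name, mirroring the dictionary that the DataMasker constructor accepts. The mask_ssn and mask_name functions and the apply_rules helper are hypothetical; apply_rules only stands in for mask_dataset so that the example runs on its own.

```python
import re


def mask_ssn(value):
    """Mask a US Social Security number, keeping the last four digits."""
    return re.sub(r"\d(?=(?:\D*\d){4})", "X", value)


def mask_name(value):
    """Keep the first letter of each word and mask the rest."""
    return " ".join(word[0] + "*" * (len(word) - 1) for word in value.split())


# Column name -> masking function, the same shape as DataMasker's rules.
custom_rules = {"ssn": mask_ssn, "name": mask_name}


def apply_rules(dataset, rules):
    """Hypothetical stand-in for mask_dataset on a dict of lists."""
    return {
        column: [rules[column](v) for v in values] if column in rules else list(values)
        for column, values in dataset.items()
    }


if __name__ == "__main__":
    records = {
        "ssn": ["123-45-6789"],
        "name": ["Ada Lovelace"],
        "city": ["London"],
    }
    print(apply_rules(records, custom_rules))
```

With the class itself, the same dictionary would be passed as DataMasker(rules=custom_rules). Note that the constructor as written replaces the default rules when a dictionary is supplied rather than merging with them, so any default rule that is still needed has to be included explicitly.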
Role in the G.O.D. Framework
- Preprocessing: Applied within pipelines such as ai_automated_data_pipeline.py to ensure privacy before analysis.
- Monitoring and Reporting: Works alongside ai_data_monitoring_reporting.py to log and verify masked data.
- Security Compliance: Used in conjunction with ai_data_privacy_manager.py for data regulation adherence.
Future Enhancements
- Dynamically Generated Rules: Enable automatic inference of masking rules based on dataset patterns.
- Advanced Tokenization: Use reversible algorithms to allow controlled restoration of masked data.
- International Standards Compliance: Expand masking options to address regional data privacy laws.
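The reversible tokenization idea listed above could take several forms. One simple possibility, sketched here purely as an assumption about how it might look, is a lookup vault that issues random tokens and keeps the mapping needed to restore the original values. The TokenVault class is hypothetical and is not part of ai_data_masking.py; a real vault would need persistent, access-controlled storage rather than in-memory dictionaries.

```python
import secrets


class TokenVault:
    """In-memory token vault (illustrative only)."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        # Reuse the existing token so each value maps to exactly one token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        # Raises KeyError when the token was not issued by this vault.
        return self._token_to_value[token]


if __name__ == "__main__":
    vault = TokenVault()
    token = vault.tokenize("1234-5678-1234-5678")
    print(token)
    print(vault.detokenize(token))
```

Unlike the masking functions in the script, which discard the original digits, this approach keeps them recoverable, so controlling who may call detokenize becomes the main security concern.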