G.O.D. Framework

Script: ai_data_masking.py - Securing Sensitive Data

Introduction

ai_data_masking.py is a dedicated script in the G.O.D. Framework that protects sensitive data by applying masking techniques. It supports compliance with data privacy laws and security best practices by replacing or obfuscating personally identifiable information (PII) and other confidential data elements.

Purpose

Key Features

Logic and Implementation

The script takes a dataset and a list of target columns, applies the masking rule registered for each column to obfuscate sensitive information, and returns the masked dataset. The key steps are:

  1. Identify attributes requiring masking (e.g., PII fields).
  2. Apply masking or pseudonymization techniques based on predefined rules.
  3. Ensure mask values retain compatibility with downstream systems.
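
Step 1 is left to the caller in this script: mask_dataset receives the column names explicitly. As a rough sketch of how candidate columns might be identified automatically, the snippet below flags a column when most of its values match a pattern. The names PII_PATTERNS and detect_pii_columns, the patterns themselves, and the 0.8 threshold are assumptions for illustration and are not part of ai_data_masking.py; it also assumes the dataset is a dictionary of lists.

```python
import re

# Hypothetical patterns for illustration; real data will need stricter rules.
# Order matters: the card pattern is checked before the looser phone pattern.
PII_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "credit_card": re.compile(r"(?:\d[ -]?){13,19}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{6,}\d"),
}


def detect_pii_columns(dataset, threshold=0.8):
    """Return {column: pii_type} for columns whose values mostly match a pattern."""
    detected = {}
    for column, values in dataset.items():
        values = [str(v) for v in values]
        if not values:
            continue
        for pii_type, pattern in PII_PATTERNS.items():
            # Count values that match the pattern in full, not just a substring.
            hits = sum(1 for v in values if pattern.fullmatch(v))
            if hits / len(values) >= threshold:
                detected[column] = pii_type
                break
    return detected


sample = {
    "contact": ["user1@example.com", "user2@domain.com"],
    "mobile": ["123-456-7890", "987-654-3210"],
    "card": ["1234-5678-1234-5678", "4321-8765-4321-8765"],
    "city": ["Paris", "Oslo"],
}
print(detect_pii_columns(sample))
```

Pattern-based detection of this kind is only a first pass; a column list reviewed by a person remains the safer input to mask_dataset.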

            import re

            class DataMasker:
                def __init__(self, rules=None):
                    """
                    Initialize the DataMasker with custom or default rules.
                    :param rules: A dictionary specifying masking rules for different attributes.
                    """
                    self.rules = rules if rules else {
                        "email": self.mask_email,
                        "phone": self.mask_phone,
                        "credit_card": self.mask_credit_card
                    }

                def mask_email(self, value):
                    """
                    Mask an email address, retaining the domain.
                    """
                    domain = value.split("@")[-1]
                    return "masked_user@" + domain

                def mask_phone(self, value):
                    """
                    Mask a phone number while keeping the last four digits.
                    Example: '123-456-7890' -> 'XXX-XXX-7890'
                    """
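                    # Mask every digit that is followed by at least four more digits,
                    # allowing separators such as '-' in between.
                    return re.sub(r"\d(?=(?:\D*\d){4})", "X", value)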

                def mask_credit_card(self, value):
                    """
                    Mask all digits except the last four in a credit card number.
                    Example: '1234-5678-1234-5678' -> 'XXXX-XXXX-XXXX-5678'
                    """
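                    # Mask every digit that is followed by at least four more digits,
                    # allowing separators such as '-' in between.
                    return re.sub(r"\d(?=(?:\D*\d){4})", "X", value)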

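                def mask_dataset(self, dataset, columns_to_mask):
                    """
                    Apply masking rules to the specified columns in the dataset.
                    :param dataset: A dictionary of lists or a pandas DataFrame.
                    :param columns_to_mask: A list of columns to mask.
                    :return: A masked copy; the input dataset is not modified.
                    """
                    masked_data = dataset.copy()
                    for column in columns_to_mask:
                        if column in masked_data:
                            rule = self.rules[column]
                            values = masked_data[column]
                            if hasattr(values, "apply"):
                                # pandas Series: apply the rule element-wise.
                                masked_data[column] = values.apply(rule)
                            else:
                                # Plain list: build a new list of masked values.
                                masked_data[column] = [rule(v) for v in values]
                    return masked_data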

            if __name__ == "__main__":
                # Sample dataset
                sample_data = {
                    "email": ["user1@example.com", "user2@domain.com"],
                    "phone": ["123-456-7890", "987-654-3210"],
                    "credit_card": ["1234-5678-1234-5678", "4321-8765-4321-8765"]
                }

                masker = DataMasker()
                masked_result = masker.mask_dataset(sample_data, ["email", "phone", "credit_card"])

                print(masked_result)
            
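Step 2 mentions pseudonymization, which the DataMasker class above does not implement: mask_email maps every address at a domain to the same masked_user value, so distinct users become indistinguishable. Where downstream systems need to join or group on the masked field (step 3), a keyed hash is one common option. The sketch below is an assumption about how that could look, not part of the script; pseudonymize, pseudonymize_email, and the secret_key parameter are hypothetical names.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key, length=12):
    """Map a value to a stable token: the same input and key give the same token."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def pseudonymize_email(value, secret_key):
    """Replace the local part with a keyed token and keep the domain."""
    local, _, domain = value.rpartition("@")
    return f"user_{pseudonymize(local, secret_key)}@{domain}"


key = b"demo-key"  # Illustration only; a real key should come from a secret store.
first = pseudonymize_email("user1@example.com", key)
second = pseudonymize_email("user2@example.com", key)
print(first, second)
```

Because the token depends only on the input and the key, the same address always maps to the same pseudonym, while different addresses map to different ones. A function like pseudonymize_email could be registered under the "email" key of the rules dictionary by wrapping it so that it takes a single value argument.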

Dependencies

The masking logic relies on the re module from Python's standard library for regular-expression substitution. pandas is needed only if the dataset is passed as a DataFrame.

How to Use This Script

Follow these steps to use ai_data_masking.py effectively:

  1. Integrate the script into your preprocessing pipeline.
  2. Define masking rules for specific attributes or use the provided defaults.
  3. Load the dataset and specify which columns require masking.
  4. Call the mask_dataset method to generate a masked copy of the dataset.

            # Example Usage
            from ai_data_masking import DataMasker

            data = {
                "email": ["user3@gmail.com", "sample@yahoo.com"],
                "phone": ["345-678-9012", "789-123-4567"],
                "credit_card": ["5678-1234-5678-9012", "4321-9876-4321-1234"]
            }

            masker = DataMasker()
            masked_data = masker.mask_dataset(data, ["email", "phone", "credit_card"])
            print(masked_data)
            
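Step 2 also allows custom masking rules instead of the defaults. A rule is a callable that takes one value and returns its masked form, keyed by column name. The sketch below shows two such rules; mask_ssn, mask_name, and the apply_rules helper are hypothetical names introduced for illustration, and apply_rules only stands in for mask_dataset on a dictionary of lists so the snippet runs on its own.

```python
import re


def mask_ssn(value):
    """Keep the last four digits of an SSN-style string such as '123-45-6789'."""
    return re.sub(r"\d(?=(?:\D*\d){4})", "X", value)


def mask_name(value):
    """Reduce a full name to initials, e.g. 'Ada Lovelace' becomes 'A. L.'."""
    return " ".join(part[0].upper() + "." for part in value.split())


# Rules map a column name to a callable, the shape DataMasker.__init__ accepts.
custom_rules = {"ssn": mask_ssn, "name": mask_name}


def apply_rules(dataset, rules):
    """Hypothetical stand-in for mask_dataset on a dict of lists."""
    return {
        column: [rules[column](v) for v in values] if column in rules else list(values)
        for column, values in dataset.items()
    }


records = {"name": ["Ada Lovelace"], "ssn": ["123-45-6789"], "city": ["London"]}
masked_records = apply_rules(records, custom_rules)
print(masked_records)
```

With the class from this script, the same dictionary would be passed as DataMasker(rules=custom_rules). Note that __init__ uses a supplied dictionary in place of the defaults rather than merging the two, so rules for email, phone, and credit_card are not available unless they are included explicitly.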

Role in the G.O.D. Framework

Future Enhancements