G.O.D. Framework

Script: ai_data_registry.py - Data and Metadata Registry

Introduction

The ai_data_registry.py script implements a centralized registry of datasets and their metadata for the G.O.D. framework. Each dataset is recorded under a unique ID together with a metadata dictionary (for example its source, schema, and last-updated date), so downstream components can look up which datasets exist and how they are structured from a single place.

Purpose

Key Features

Logic and Implementation

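At the core of ai_data_registry.py is a lightweight file-based store. The implementation below keeps the whole registry in a single JSON file that maps each dataset ID to its metadata dictionary; a small database such as SQLite could fill the same role. Every operation loads the file, reads or updates the dictionary, and writes it back when something has changed: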


            import json
            import os

            class DataRegistry:
                def __init__(self, registry_file='data_registry.json'):
                    """
                    Initializes the data registry with a specified storage file.
                    :param registry_file: Path to the file storing datasets' metadata.
                    """
                    self.registry_file = registry_file
                    self._initialize_registry()

                def _initialize_registry(self):
                    """
                    Ensure the registry file is initialized.
                    """
                    if not os.path.exists(self.registry_file):
                        with open(self.registry_file, 'w') as file:
                            json.dump({}, file)

                def register_dataset(self, dataset_id, metadata):
                    """
                    Register a dataset in the registry.
                    An existing entry with the same ID is overwritten.
                    :param dataset_id: Unique identifier for the dataset.
                    :param metadata: Dictionary containing metadata (e.g., schema, source).
                    """
                    registry = self._load_registry()
                    registry[dataset_id] = metadata
                    self._save_registry(registry)
                    print(f"Dataset '{dataset_id}' registered successfully.")

                def get_metadata(self, dataset_id):
                    """
                    Retrieve metadata for a specific dataset.
                    :param dataset_id: Unique identifier of the dataset.
                    :return: Metadata dictionary, or None if the dataset is not registered.
                    """
                    registry = self._load_registry()
                    # Return None rather than an error string so callers can
                    # tell a missing dataset apart from real metadata.
                    return registry.get(dataset_id)

                def list_datasets(self):
                    """
                    List all datasets registered in the system.
                    :return: List of dataset IDs.
                    """
                    registry = self._load_registry()
                    return list(registry.keys())

                def _load_registry(self):
                    """
                    Load registry data from file.
                    :return: Dictionary containing the registry data.
                    """
                    with open(self.registry_file, 'r') as file:
                        return json.load(file)

                def _save_registry(self, registry):
                    """
                    Save updated registry data to file.
                    """
                    with open(self.registry_file, 'w') as file:
                        json.dump(registry, file, indent=4)

            if __name__ == "__main__":
                # Example usage
                registry = DataRegistry()

                # Register a new dataset
                registry.register_dataset(
                    "dataset_001",
                    metadata={
                        "source": "external API",
                        "schema": {"id": "int", "name": "string", "value": "float"},
                        "last_updated": "2023-11-01",
                    }
                )

                # List all datasets
                print("Registered datasets:", registry.list_datasets())

                # Retrieve metadata
                print("Metadata for dataset_001:", registry.get_metadata("dataset_001"))
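
The description above names SQLite as an alternative to the JSON file. The following is a minimal sketch of what such a backend might look like, not part of ai_data_registry.py itself. The class name SQLiteDataRegistry, the table layout, and the db_path default are assumptions made for illustration; the sketch also returns None for an unknown dataset ID, which is a choice made here rather than something the script prescribes.

```python
import json
import sqlite3


class SQLiteDataRegistry:
    """Hypothetical SQLite-backed variant with the same public methods as DataRegistry."""

    def __init__(self, db_path="data_registry.db"):
        # db_path is an assumed default; ":memory:" keeps the registry in RAM.
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS datasets ("
            "dataset_id TEXT PRIMARY KEY, metadata TEXT NOT NULL)"
        )
        self.conn.commit()

    def register_dataset(self, dataset_id, metadata):
        # Metadata is stored as a JSON string. Re-registering an ID replaces
        # its row, matching the overwrite behaviour of the JSON version.
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO datasets (dataset_id, metadata) VALUES (?, ?)",
                (dataset_id, json.dumps(metadata)),
            )

    def get_metadata(self, dataset_id):
        row = self.conn.execute(
            "SELECT metadata FROM datasets WHERE dataset_id = ?", (dataset_id,)
        ).fetchone()
        # None signals an unknown dataset (an assumption of this sketch).
        return json.loads(row[0]) if row else None

    def list_datasets(self):
        rows = self.conn.execute("SELECT dataset_id FROM datasets ORDER BY dataset_id")
        return [r[0] for r in rows]


registry = SQLiteDataRegistry(":memory:")
registry.register_dataset("dataset_001", {"source": "external API"})
print(registry.list_datasets())
print(registry.get_metadata("dataset_001"))
```

Compared with rewriting one JSON file on every change, a database updates single rows and handles transactions itself, at the cost of a file that is no longer human-readable.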
            

Dependencies

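The script uses only two modules from the Python standard library:

  * json: serializes the registry dictionary to the registry file and reads it back.
  * os: checks whether the registry file already exists before creating it.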

How to Use This Script

Follow these steps to use ai_data_registry.py for managing datasets:

  1. Initialize the registry (creates an empty file if not present).
  2. Use the register_dataset() method to add new datasets with unique identifiers and metadata.
  3. Retrieve metadata for specific datasets using get_metadata().
  4. List all registered datasets with list_datasets().

            # Example usage
            from ai_data_registry import DataRegistry

            registry = DataRegistry("my_registry.json")  # step 1
            registry.register_dataset("dataset_123", {"source": "file.csv", "schema": {"col1": "int"}})  # step 2
            print(registry.get_metadata("dataset_123"))  # step 3
            print(registry.list_datasets())  # step 4
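
To make the storage format concrete, the sketch below reproduces the text that the example above would leave in my_registry.json, assuming the file did not exist beforehand. It does not import the script; it only applies the same json serialization with indent=4 that _save_registry() uses. The variable names registry_content and file_text are illustrative.

```python
import json

# The registry content after the example: one dataset ID mapped to its metadata.
registry_content = {
    "dataset_123": {"source": "file.csv", "schema": {"col1": "int"}},
}

# _save_registry() calls json.dump(registry, file, indent=4);
# json.dumps with the same arguments yields the same text as a string.
file_text = json.dumps(registry_content, indent=4)
print(file_text)
# Prints:
# {
#     "dataset_123": {
#         "source": "file.csv",
#         "schema": {
#             "col1": "int"
#         }
#     }
# }
```

Because the file is plain indented JSON, it can be inspected or corrected by hand, which is one practical reason to prefer it for small registries.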
            

Role in the G.O.D. Framework

Future Enhancements