====== ai_data_registry ======
  * Ensure retrieval of standardized information about datasets, including their structure, size, and source.
  * Assist with automation in **ETL** workflows, ensuring datasets are persistently tracked across transformations and pipelines.
  * Facilitate compliance and governance by maintaining comprehensive dataset logs, metadata, and time records.
  * **Dataset Registration:**
    - Easily add datasets to the registry with custom metadata.
    - Automatically tags datasets with **date_added** timestamps.
  * **Persistent Storage:**
    - Datasets and metadata are saved in **JSON** format for reliability and readability.
    - Supports a default catalog file (**data_registry.json**) or custom file paths.
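The features above can be sketched as a minimal catalog class. This is a hypothetical illustration of the described behavior (JSON persistence plus automatic **date_added** tagging), not the module's actual implementation:

<code python>
import json
import os
from datetime import datetime

class DataCatalog:
    """Hypothetical minimal registry: JSON-backed, with date_added tagging."""

    def __init__(self, registry_path="data_registry.json"):
        self.registry_path = registry_path

    def load_catalog(self):
        # Return the persisted registry, or an empty dict if no file exists yet
        if os.path.exists(self.registry_path):
            with open(self.registry_path) as f:
                return json.load(f)
        return {}

    def save_catalog(self, catalog):
        # Persist the registry as human-readable JSON
        with open(self.registry_path, "w") as f:
            json.dump(catalog, f, indent=2)

    def add_entry(self, name, metadata):
        # Tag the entry with a date_added timestamp, then persist it
        catalog = self.load_catalog()
        entry = dict(metadata)
        entry["date_added"] = datetime.now().isoformat()
        catalog[name] = entry
        self.save_catalog(catalog)
</code>

The key design choice is that every write goes straight to disk, so the JSON file is always the single source of truth for the registry.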
The **DataCatalog** class provides three main functions for dataset registry management:
  * **add_entry:** Adds a dataset and its metadata to the registry.
  * **load_catalog:** Loads the registered datasets and their metadata from storage.
  * **save_catalog:** Saves the current registry to persistent storage.
==== Adding a Dataset ====
The **add_entry** method accepts a dataset name and metadata as inputs. Metadata can be any dictionary describing meaningful attributes (e.g., size, source, dependencies).

**Example Metadata Record** (values are illustrative):

<code json>
{
  "source": "internal_api",
  "size": "500MB",
  "dependencies": ["raw_logs"]
}
</code>
----
Add a dataset with metadata and manage the registry. The dataset name and metadata values below are illustrative.

<code python>
from ai_data_registry import DataCatalog

# Initialize the catalog
catalog = DataCatalog()

# Add a dataset entry
catalog.add_entry("sales_data", {
    "source": "internal_api",
    "size": "500MB",
    "description": "Monthly sales records"
})

# Load catalog data
data = catalog.load_catalog()
print("Catalog contents:", data)
</code>
----
You can add more intricate metadata, such as tags and related datasets, to help track high-level attributes of your datasets. The values below are illustrative.

<code python>
catalog.add_entry("customer_data", {
    "source": "crm_export",
    "size": "1.2GB",
    "tags": ["pii", "quarterly"],
    "related_datasets": ["sales_data"]
})
</code>
=== 2. Custom Storage Paths ===
By default, the **DataCatalog** saves the registry in **data_registry.json**. You can configure it to use a different file path when needed. The path and entry below are illustrative.

<code python>
# Initialize the catalog with a custom file path
catalog = DataCatalog(registry_path="/data/catalogs/custom_registry.json")

# Add an entry to the catalog
catalog.add_entry("inventory_data", {
    "source": "warehouse_db",
    "size": "250MB"
})

# Output the catalog contents
print(catalog.load_catalog())
</code>
Combine **DataCatalog** with versioning capabilities. This allows you to track the progress of specific dataset versions directly in your pipeline. The helper below is a sketch; names and values are illustrative.

<code python>
def versioned_entry(catalog, dataset_name, version, **metadata):
    """
    Add a dataset entry tagged with an explicit version.
    """
    versioned_metadata = {
        "version": version,
        **metadata
    }
    catalog.add_entry(dataset_name, versioned_metadata)

# Initialize catalog
catalog = DataCatalog()

# Add a versioned entry
versioned_entry(
    catalog,
    "sales_data",
    version="v2.0",
    source="internal_api",
    size="750MB"
)

# Output the catalog
print(catalog.load_catalog())
</code>
**Expected Output** (timestamp will vary):

<code json>
{
  "sales_data": {
    "version": "v2.0",
    "source": "internal_api",
    "size": "750MB",
    "date_added": "2025-05-25T19:55:00"
  }
}
</code>
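Since **load_catalog** returns a plain dictionary, reading back a recorded version is a simple lookup. The data below mirrors the illustrative output above:

<code python>
# A catalog dict as load_catalog might return it (illustrative values)
catalog_data = {
    "sales_data": {"version": "v2.0", "source": "internal_api", "size": "750MB"}
}

# Safely fetch one dataset's recorded version
version = catalog_data.get("sales_data", {}).get("version")
print(version)  # → v2.0
</code>

Using chained `.get()` calls avoids a `KeyError` when the dataset has not been registered yet.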
==== Best Practices ====
To get the most out of **DataCatalog**, follow these best practices:

  * **Use metadata consistently:** Ensure that fields such as **source**, **size**, and **tags** are present on all datasets.
  * **Secure your registry file:** Restrict access to the catalog JSON file to prevent unauthorized modification.
  * **Version your datasets:** Use versioning (e.g., a **version** field in metadata) to track how datasets evolve over time.
  * **Automate registry updates:** Integrate registry updates into your pipeline automation so that entries stay current.
==== Extensibility ====
==== Conclusion ====
The **DataCatalog** module is a scalable and flexible solution for managing metadata registries. With support for versioning, extensibility, and persistent storage, it provides the tools you need for clean and structured data management, whether you are working on small-scale or enterprise-level pipelines.
ai_data_registry.1748200725.txt.gz · Last modified: 2025/05/25 19:18 by eagleeyenebula
