G.O.D. Framework

Script: ai_clustering.py - Unsupervised Learning Clustering Module

Introduction

The ai_clustering.py script is a dedicated module for unsupervised clustering. It uses K-Means, DBSCAN, and hierarchical clustering to segment datasets into groups, and it supports data exploration, customer segmentation, anomaly detection, and feature engineering within the G.O.D. Framework.

Purpose

Key Features

Logic and Implementation

The script supports several clustering algorithms and processes data in four steps:

  1. Accepts a dataset (e.g., a Pandas DataFrame or NumPy array).
  2. Standardizes each feature to zero mean and unit variance (StandardScaler) so that all features contribute equally to distance calculations.
  3. Applies clustering algorithms based on specified parameters.
  4. Returns the cluster labels together with an evaluation metric (the silhouette score).
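
Step 2 matters because distance-based algorithms such as K-Means and DBSCAN let a feature measured on a large scale dominate the result. The following is a minimal sketch of that effect, not part of the script itself; the two columns (an income-like and an age-like feature) and the helper pairwise_distance are hypothetical and exist only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is on a large scale (income-like),
# column 1 is on a small scale (age-like).
raw = np.array([
    [30_000.0, 25.0],
    [30_500.0, 60.0],
    [90_000.0, 26.0],
])


def pairwise_distance(points, i, j):
    """Euclidean distance between rows i and j of points."""
    return float(np.linalg.norm(points[i] - points[j]))


# StandardScaler rescales every column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(raw)

# Ratio of distance(row 0, row 1) to distance(row 0, row 2).
# Rows 0 and 1 differ mostly in the small-scale column,
# rows 0 and 2 differ mostly in the large-scale column.
raw_ratio = pairwise_distance(raw, 0, 1) / pairwise_distance(raw, 0, 2)
scaled_ratio = pairwise_distance(scaled, 0, 1) / pairwise_distance(scaled, 0, 2)

print(f"Distance ratio before scaling: {raw_ratio:.4f}")
print(f"Distance ratio after scaling:  {scaled_ratio:.4f}")
```

Before scaling, the large gap in the second column barely registers next to the first column; after scaling, the two distances are of comparable size, so both features influence the clustering.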

Here is an example of core functionality:


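            from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
            from sklearn.preprocessing import StandardScaler
            from sklearn.metrics import silhouette_score
            import matplotlib.pyplot as plt

            def perform_clustering(data, algorithm="kmeans", **kwargs):
                """
                Performs clustering using the specified algorithm.
                :param data: Dataset to cluster (NumPy array or DataFrame).
                :param algorithm: Clustering algorithm ('kmeans', 'dbscan', 'hierarchical').
                :param kwargs: n_clusters (kmeans, hierarchical); eps, min_samples (dbscan).
                :return: Tuple of (cluster labels, silhouette score). The score is NaN
                         when fewer than two clusters are found.
                """
                # Standardize features to zero mean and unit variance.
                scaler = StandardScaler()
                normalized_data = scaler.fit_transform(data)

                if algorithm == "kmeans":
                    n_clusters = kwargs.get("n_clusters", 3)
                    # n_init is set explicitly so results do not depend on the scikit-learn version.
                    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
                elif algorithm == "dbscan":
                    eps = kwargs.get("eps", 0.5)
                    min_samples = kwargs.get("min_samples", 5)
                    model = DBSCAN(eps=eps, min_samples=min_samples)
                elif algorithm == "hierarchical":
                    # Assumption: agglomerative (bottom-up) hierarchical clustering.
                    n_clusters = kwargs.get("n_clusters", 3)
                    model = AgglomerativeClustering(n_clusters=n_clusters)
                else:
                    raise ValueError(f"Unsupported clustering algorithm: {algorithm!r}")

                labels = model.fit_predict(normalized_data)

                # DBSCAN labels noise points as -1; leave them out of the score.
                # The silhouette score is undefined for fewer than two clusters.
                mask = labels != -1
                n_found = len(set(labels[mask]))
                if 2 <= n_found < mask.sum():
                    silhouette = silhouette_score(normalized_data[mask], labels[mask])
                else:
                    silhouette = float("nan")
                return labels, silhouette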

            def visualize_clusters(data, labels):
                """
                Visualizes clusters using a scatter plot.
                :param data: 2D NumPy array of shape (n_samples, 2).
                :param labels: Cluster assignments.
                """
                plt.scatter(data[:, 0], data[:, 1], c=labels, cmap="viridis")
                plt.title("Cluster Visualization")
                plt.show()

            if __name__ == "__main__":
                import numpy as np
                # Example dataset: Random points
                sample_data = np.random.rand(100, 2)

                labels, score = perform_clustering(sample_data, algorithm="kmeans", n_clusters=3)
                print(f"Silhouette Score: {score:.2f}")
                visualize_clusters(sample_data, labels)
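
Since the silhouette score is returned alongside the labels, one common way to choose n_clusters for K-Means is to try several values and keep the one with the highest score. The sketch below does this directly with scikit-learn rather than through the script; the helper name best_k_by_silhouette, the candidate range, and the synthetic make_blobs data are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def best_k_by_silhouette(data, candidates=range(2, 7), random_state=42):
    """Return (best_k, scores), where scores maps each k to its silhouette score."""
    scaled = StandardScaler().fit_transform(data)
    scores = {}
    for k in candidates:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(scaled)
        scores[k] = silhouette_score(scaled, labels)
    # Higher silhouette means tighter, better-separated clusters.
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic data: three well-separated blobs (illustrative only).
blobs, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

best_k, scores = best_k_by_silhouette(blobs)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Best k: {best_k}")
```

On real data the scores are rarely this clear-cut, so the highest silhouette value is better treated as a starting point than as a definitive answer.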
            

Dependencies

How to Use This Script

  1. Load a preprocessed, numerical dataset.
  2. Select a clustering algorithm (e.g., K-Means or DBSCAN).
  3. Set parameters such as the number of clusters (n_clusters) or the DBSCAN density parameters (eps, min_samples).
  4. Generate cluster labels and evaluate them with the silhouette score.
  5. Use visualize_clusters to plot and interpret the clusters.

            # Example use case with DBSCAN (reuses sample_data from the example above)
            dbscan_labels, dbscan_score = perform_clustering(
                sample_data, algorithm="dbscan", eps=0.3, min_samples=3
            )
            print(f"Silhouette Score (DBSCAN): {dbscan_score:.2f}")
            visualize_clusters(sample_data, dbscan_labels)
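
When interpreting DBSCAN output, note that scikit-learn assigns the label -1 to noise points, so the number of distinct labels is not the number of clusters. The sketch below shows one way to summarize a DBSCAN result; the helper summarize_dbscan and the synthetic data are hypothetical, and scaling is skipped to keep the example short.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs


def summarize_dbscan(labels):
    """Count clusters and noise points in a DBSCAN label array."""
    labels = np.asarray(labels)
    noise_mask = labels == -1  # scikit-learn marks noise with -1
    n_clusters = len(set(labels[~noise_mask].tolist()))
    return {
        "n_clusters": n_clusters,
        "n_noise": int(noise_mask.sum()),
        "noise_fraction": float(noise_mask.mean()),
    }


# Synthetic data: two tight blobs plus one far-away outlier (illustrative only).
points, _ = make_blobs(
    n_samples=60,
    centers=[[0, 0], [5, 5]],
    cluster_std=0.2,
    random_state=0,
)
points = np.vstack([points, [[20.0, -20.0]]])

labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(points)
summary = summarize_dbscan(labels)
print(summary)
```

A high noise fraction usually suggests that eps is too small or min_samples too large for the data, which is worth checking before reading much into the silhouette score.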
            

Role in the G.O.D. Framework