AI Anomaly Detection
The AI Anomaly Detection system is a Python-based utility that identifies outliers in numerical datasets using the mean and standard deviation. It flags data points that deviate significantly from the rest of the dataset.
Overview
The detect_anomalies() function analyzes numerical datasets to:
- Calculate statistical metrics such as mean, variance, and standard deviation.
- Identify anomalies that fall outside a defined threshold (e.g., 3 standard deviations from the mean).
- Log the detection process, providing insights into detected anomalies.
Typical uses include:
- Monitoring data streams in real time.
- Preprocessing data before model training.
- Identifying unusual behaviors or patterns in datasets.
Features
1. Statistical Anomaly Detection
The core anomaly detection mechanism is based on statistical outlier detection. It calculates the mean and standard deviation of the dataset to determine a range within which most data points fall. Any point outside this range is classified as an anomaly.
Threshold for Anomalies: Data points are considered anomalies if they fall outside the range:
[mean - (3 * standard deviation), mean + (3 * standard deviation)]
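The range above can be sketched in a few lines. This is a minimal illustration, not the utility's actual code: the helper name `sigma_bounds` and its `k` parameter are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the source does not say which variant is used.

```python
from statistics import fmean, pstdev
from typing import Sequence, Tuple


def sigma_bounds(data: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (mean - k * std_dev, mean + k * std_dev) for the data."""
    mean = fmean(data)
    std_dev = pstdev(data)  # assumption: population standard deviation
    return mean - k * std_dev, mean + k * std_dev


lower, upper = sigma_bounds([100, 102, 98, 101, 99])
print(f"Normal range: [{lower:.2f}, {upper:.2f}]")
```

Any value below `lower` or above `upper` would be treated as an anomaly under this rule.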
2. Logging Information
The function integrates Python's logging module to track its operations, including:
- When anomaly detection begins.
- Any errors or edge cases (e.g., empty datasets).
- List of anomalies detected.
Example Log Messages (values are illustrative):
INFO: Detecting anomalies in the data…
INFO: Anomalies detected: [120, -45]
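The two features above can be combined in a short sketch. The source does not show the function body, so everything here is an assumption made for illustration: the name `detect_anomalies_sketch`, the module-level `THRESHOLD` constant, the warning text for empty input, and the choice of population standard deviation.

```python
import logging
from statistics import fmean, pstdev
from typing import List

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

THRESHOLD = 3  # number of standard deviations (the documented default)


def detect_anomalies_sketch(data: List[float]) -> List[float]:
    """Hypothetical stand-in for detect_anomalies(); not the actual source."""
    logging.info("Detecting anomalies in the data...")
    if not data:
        # Edge case: nothing to analyze (the message text is an assumption)
        logging.warning("Empty dataset; nothing to analyze.")
        return []
    mean = fmean(data)
    std_dev = pstdev(data)  # assumption: population standard deviation
    anomalies = [x for x in data if abs(x - mean) > THRESHOLD * std_dev]
    logging.info("Anomalies detected: %s", anomalies)
    return anomalies


# 19 identical readings plus one extreme value
readings = [10.0] * 19 + [500.0]
print(detect_anomalies_sketch(readings))
```

With 20 readings the single extreme value sits about 4.4 standard deviations from the mean, so it is the only point returned.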
Function Details
detect_anomalies()
This function takes a list of numeric data points as input and returns a list of values that qualify as anomalies.
Signature:
```python
from typing import List

def detect_anomalies(data: List[float]) -> List[float]:
    """
    Detect anomalies in the dataset by identifying outliers.

    :param data: List of numeric data points
    :return: List of anomalies detected
    """
```
Examples
1. Basic Anomaly Detection
Input Example:
```python
data = [10, 12, 15, 10, 11, 14, 120, 12, 9, -45]
anomalies = detect_anomalies(data)
print(f"Anomalies: {anomalies}")
```
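Output (under the documented 3-standard-deviation rule):
Anomalies: []
Explanation:
- The dataset has a mean of 16.8 and a population standard deviation of about 38.4, so the 3-standard-deviation range is roughly [-98.3, 131.9].
- Data points 120 and -45 both lie inside that range (about 2.7 and 1.6 standard deviations from the mean), so neither is flagged at the default threshold. The two extreme values inflate the standard deviation themselves, which hides them.
- With a threshold of 2 standard deviations (range roughly [-59.9, 93.5]), 120 is flagged and -45 is not.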
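The statistics for this dataset can be checked with the standard library. This is a minimal sketch, not the function's internals: it assumes the population standard deviation and prints the sample standard deviation for comparison.

```python
from statistics import fmean, pstdev, stdev

data = [10, 12, 15, 10, 11, 14, 120, 12, 9, -45]

mean = fmean(data)
pop_std = pstdev(data)    # divides by n
sample_std = stdev(data)  # divides by n - 1

print(f"mean={mean:.1f}, population std={pop_std:.2f}, sample std={sample_std:.2f}")

# Distance of each extreme value from the mean, in population standard deviations
for x in (120, -45):
    print(f"{x}: {abs(x - mean) / pop_std:.2f} standard deviations from the mean")
```

Neither value is more than 3 standard deviations from the mean for this dataset, whichever variant of the standard deviation is used.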
2. Handling Edge Cases
The detect_anomalies() function includes safeguards to handle incomplete or invalid input data.
Example: Empty Dataset:
```python
data = []
anomalies = detect_anomalies(data)
print(f"Anomalies: {anomalies}")
```
Output:
Anomalies: []
Explanation: The function immediately returns an empty list if the dataset is empty.
Example: All Data Within Range:
```python
data = [100, 102, 98, 101, 99]
anomalies = detect_anomalies(data)
print(f"Anomalies: {anomalies}")
```
Output:
Anomalies: []
Explanation: No values fall beyond 3 standard deviations from the mean, so no anomalies are detected.
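The source does not specify how `detect_anomalies()` treats invalid entries such as `None` or NaN. One option is to filter the input before calling it. The helper below, `clean_numeric`, is hypothetical and only sketches that idea.

```python
import math
from typing import Iterable, List


def clean_numeric(values: Iterable[object]) -> List[float]:
    """Keep only finite real numbers; drop None, NaN, infinities, and non-numeric items."""
    cleaned = []
    for v in values:
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            continue  # skip None, strings, booleans, and other non-numeric types
        if math.isfinite(v):
            cleaned.append(float(v))
    return cleaned


raw = [100, None, 102.5, float("nan"), "n/a", 98, float("inf")]
print(clean_numeric(raw))  # [100.0, 102.5, 98.0]
```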
3. Advanced Example: High Variance Dataset
Input Data:
```python
data = [100, 150, 200, 1000, 105, 210, 980, 115, 195]
anomalies = detect_anomalies(data)
print(f"Anomalies: {anomalies}")
```
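Output (under the documented 3-standard-deviation rule):
Anomalies: []
Explanation: The mean is about 339.4 and the population standard deviation about 350.0. The values 1000 and 980 are only about 1.9 and 1.8 standard deviations above the mean, so neither is flagged, even at a threshold of 2. In a small, high-variance dataset the extreme values inflate the standard deviation enough to hide themselves.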
4. Real-Time Anomaly Detection
With some modifications, the detect_anomalies() function can be adapted for real-time data stream monitoring.
Framework for Live Data Streams:
```python
import random
import time

# Simulating real-time data collection
# (detect_anomalies is assumed to be defined or imported above)
def stream_anomaly_detection():
    data_stream = []
    while True:  # runs until interrupted (Ctrl+C)
        new_data = random.randint(50, 150)  # Simulate normal range
        if random.random() > 0.95:  # Simulate anomaly (about 5% of readings)
            new_data = random.randint(-500, 500)
        data_stream.append(new_data)

        # Check anomalies every 10 data points
        if len(data_stream) % 10 == 0:
            anomalies = detect_anomalies(data_stream)
            print(f"Latest Anomalies: {anomalies}")

        time.sleep(1)

stream_anomaly_detection()
```
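The loop above re-scans an ever-growing list. One possible modification is a fixed-size sliding window that judges each new reading against the recent readings that preceded it. The sketch below is an illustration under assumptions: the name `windowed_anomalies`, the `window` and `k` parameters, and the minimum baseline of 10 readings are all hypothetical, and a deterministic sequence replaces the random simulation so the result is reproducible.

```python
from collections import deque
from statistics import fmean, pstdev
from typing import Deque, Iterable, Iterator, Tuple


def windowed_anomalies(
    stream: Iterable[float], window: int = 50, k: float = 3.0
) -> Iterator[Tuple[int, float]]:
    """Yield (index, value) for readings more than k std devs from the window mean."""
    recent: Deque[float] = deque(maxlen=window)  # old readings drop off automatically
    for i, value in enumerate(stream):
        if len(recent) >= 10:  # wait for a minimal baseline before judging
            mean, std_dev = fmean(recent), pstdev(recent)
            if std_dev > 0 and abs(value - mean) > k * std_dev:
                yield i, value
        recent.append(value)


# Deterministic stand-in for a sensor: values cycle between 97 and 103
readings = [100 + (i % 7) - 3 for i in range(200)]
readings[120] = 400  # inject one obvious spike
print(list(windowed_anomalies(readings)))
```

Because each reading is compared with the window before it is added, a spike cannot inflate the standard deviation used to judge itself.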
Advanced Usage
1. Custom Thresholds
By default, the function uses 3 standard deviations as the threshold for anomaly detection. To customize this, modify the following part of the function:
```python
anomalies = [x for x in data if abs(x - mean) > THRESHOLD * std_dev]
```
Example Custom Threshold:
```python
# THRESHOLD must be visible to detect_anomalies (e.g. defined in the same module)
THRESHOLD = 2  # Using 2 standard deviations instead of 3
data = [12, 15, 18, 10, 140]
anomalies = detect_anomalies(data)
print(f"Anomalies with Threshold={THRESHOLD}: {anomalies}")
```
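An alternative to editing the function body is to pass the threshold as an argument. The variant below is a hypothetical sketch (`detect_with_threshold` and its `threshold` parameter are not part of the documented API), and it assumes the population standard deviation.

```python
from statistics import fmean, pstdev
from typing import List


def detect_with_threshold(data: List[float], threshold: float = 3.0) -> List[float]:
    """Hypothetical variant: flag points more than `threshold` std devs from the mean."""
    if len(data) < 2:
        return []  # not enough data to estimate spread
    mean = fmean(data)
    std_dev = pstdev(data)  # assumption: population standard deviation
    return [x for x in data if abs(x - mean) > threshold * std_dev]


data = [10, 12, 15, 10, 11, 14, 120, 12, 9, -45]
for k in (3.0, 2.0, 1.5):
    print(f"threshold={k}: {detect_with_threshold(data, threshold=k)}")
```

For this dataset the loop reports no anomalies at 3, only 120 at 2, and both 120 and -45 at 1.5, which shows how sensitive the result is to the chosen threshold.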
2. Batch Detection for Multiple Data Sets
Use the anomaly detection function to analyze multiple datasets in one script, automating the reporting process.
Example:
```python
datasets = [
    [10, 12, 14, 18, 200],
    [90, 92, 91, 89, 700],
    [101, 105, 110, 500],
]

for idx, data in enumerate(datasets):
    anomalies = detect_anomalies(data)
    print(f"Dataset {idx + 1}: {anomalies}")
```
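Output (under the documented 3-standard-deviation rule):
```
Dataset 1: []
Dataset 2: []
Dataset 3: []
```
Each of these datasets has at most five values. A single point in a set of n values can deviate from the mean by at most sqrt(n - 1) population standard deviations (2.0 for n = 5), so the default threshold of 3 cannot flag anything here. Batches this small need a lower threshold or more data.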
—
3. Combining with Visualization
For deeper insights, combine the detection function with visualization tools to plot anomalies on a graph.
Example with Matplotlib:
```python
import matplotlib.pyplot as plt

data = [10, 12, 15, 10, 11, 14, 120, 12, 9, -45]
anomalies = detect_anomalies(data)

# Indices of anomalous points; y-values are read back from data so that
# x and y always have the same length, even when values repeat
anomaly_idx = [i for i, val in enumerate(data) if val in anomalies]

# Plotting the dataset and anomalies
plt.plot(data, label="Data", marker="o")
plt.scatter(anomaly_idx, [data[i] for i in anomaly_idx], color="red", label="Anomalies", zorder=3)
plt.title("Anomaly Detection in Dataset")
plt.xlabel("Index")
plt.ylabel("Value")
plt.legend()
plt.show()
```
—
Applications
1. Sensor Data Monitoring: Detect unusual readings in sensor datasets, such as temperature fluctuations or pressure changes.
2. Finance and Fraud Detection: Identify fraudulent transactions or outliers in financial datasets. Detect anomalous patterns in trading or purchase history.
3. Preprocessing for AI Pipelines: Flag and handle anomalous data points before model training to improve model robustness and accuracy.
—
Best Practices
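1. Normalize Data: Before combining values from sources with different units or scales into one dataset, convert them to a common scale. Rescaling a single dataset as a whole does not change which points exceed the threshold.
2. Adjust Thresholds: Tune the threshold to the dataset. For small datasets, or when extreme values inflate the standard deviation, consider lowering the threshold to 2 standard deviations or less, and expect more false positives as the threshold drops.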
3. Visualization: Combine detection results with visualizations for better interpretability.
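One reason thresholds may need adjusting: in a dataset of n values, no single point can lie more than sqrt(n - 1) population standard deviations from the mean, so a 3-standard-deviation rule cannot flag anything when n is 10 or fewer. The sketch below illustrates that bound. The helper `max_z_score` is hypothetical and assumes the population standard deviation.

```python
import math
from statistics import fmean, pstdev
from typing import Sequence


def max_z_score(data: Sequence[float]) -> float:
    """Largest distance from the mean, in population standard deviations."""
    mean, std_dev = fmean(data), pstdev(data)
    if std_dev == 0:
        return 0.0  # all values identical
    return max(abs(x - mean) for x in data) / std_dev


# Worst case: one extreme value among otherwise identical readings
for n in (5, 8, 20):
    sample = [10.0] * (n - 1) + [1_000_000.0]
    print(f"n={n}: max z-score={max_z_score(sample):.3f}, bound={math.sqrt(n - 1):.3f}")
```

However large the extreme value is made, the printed z-score does not exceed the bound, so the threshold should be chosen with the dataset size in mind.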
—
Conclusion
The AI Anomaly Detection utility provides a simple, statistics-based mechanism for outlier detection in numerical datasets, with applications ranging from real-time monitoring to preprocessing for AI pipelines. Threshold adjustments and visualization allow it to be tailored to the dataset at hand.
