Post 4: Training a Variational Autoencoder for Anomaly Detection

Welcome to Part 4 of my series on building an intelligent log analysis system. In Part 3, I focused on parsing raw logs into a structured format for machine learning. Today, I will explore how our system uses a Variational Autoencoder (VAE) to detect anomalies in the processed data. I will also highlight how the model evolves over time to become smarter.

If you’re new here, check out Part 1 and Part 2 for an overview and setup guide.


The Role of the Variational Autoencoder (VAE)

A VAE is a deep learning model trained to compress its input into a compact latent representation and then reconstruct it. Because ours is trained on historical logs that are overwhelmingly normal, it learns to reconstruct normal patterns well. When it encounters something unusual (an anomaly), the reconstruction error (how far the reconstructed data deviates from the original) spikes. This makes it an excellent tool for anomaly detection.

Here’s how the VAE works in our pipeline:

  1. Training: The model learns patterns from historical logs.
  2. Evaluation: It compares recent logs to its learned patterns and identifies anomalies based on reconstruction error.
  3. Retraining: It incorporates new data periodically to improve its understanding of normal patterns.

Let’s take a closer look at how the VAE is implemented.


The VAE Model Implementation

Below is the core implementation of the VAE used in this system:

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super(VAE, self).__init__()
        # Encoder
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2_mean = nn.Linear(hidden_dim, latent_dim)
        self.fc2_log_var = nn.Linear(hidden_dim, latent_dim)
        # Decoder
        self.fc3 = nn.Linear(latent_dim, hidden_dim)
        self.fc4 = nn.Linear(hidden_dim, input_dim)
    
    def encode(self, x):
        h = torch.relu(self.fc1(x))
        return self.fc2_mean(h), self.fc2_log_var(h)
    
    def reparameterize(self, mean, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mean + eps * std
    
    def decode(self, z):
        h = torch.relu(self.fc3(z))
        return torch.sigmoid(self.fc4(h))
    
    def forward(self, x):
        mean, log_var = self.encode(x)
        z = self.reparameterize(mean, log_var)
        return self.decode(z), mean, log_var

This architecture includes:

  • Encoder: Compresses input data into a latent space.
  • Decoder: Reconstructs the data from the latent space.
  • Reparameterization Trick: Samples the latent vector as mean + eps * std (with eps drawn from a standard normal), so gradients can flow through the stochastic sampling step during backpropagation.
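
To sanity-check the architecture, here's a minimal sketch of instantiating the model and running a forward pass. The dimensions below are placeholders for illustration, not the values used in the pipeline; in practice, input_dim matches the width of the feature vectors produced by parse_logs.py. Also note that because the decoder ends in a sigmoid, inputs should be scaled to the [0, 1] range:

import torch

# Placeholder dimensions for illustration only
model = VAE(input_dim=64, hidden_dim=32, latent_dim=8)

x = torch.rand(16, 64)  # a batch of 16 feature vectors, already in [0, 1]
reconstructed, mean, log_var = model(x)
print(reconstructed.shape)  # torch.Size([16, 64])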

Daily Anomaly Detection

Our daily script (run_vae.sh) processes the latest logs, evaluates them against the current model, and sends a summary of detected anomalies.

Key steps in the daily run:

  1. Copy and Parse Logs: The latest logs are parsed into a structured format using parse_logs.py.
  2. Load the VAE Model:
    • If a previously trained model exists, it is loaded and evaluated for compatibility with the new data.
    • If the structure of the data has changed (e.g., new services appear in logs), the model is reinitialized (see the sketch after this list).
  3. Detect Anomalies:
    • The model calculates reconstruction errors for each log entry.
    • Entries with errors above a threshold are flagged as anomalies.
  4. Send Summary: A summary email is sent with the top anomalies for review.
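
The compatibility check in step 2 can be as simple as comparing the saved model's input width to the width of the newly parsed data. Here's a hedged sketch of that idea; the file name vae_model.pt, the train_data variable, and the hidden/latent sizes are assumptions for illustration, not necessarily what run_vae.sh does:

import os
import torch

MODEL_PATH = "vae_model.pt"  # assumed filename

input_dim = train_data.shape[1]  # width of the newly parsed feature vectors
model = VAE(input_dim, hidden_dim=32, latent_dim=8)

if os.path.exists(MODEL_PATH):
    # Assumes the state_dict was saved with torch.save(model.state_dict(), MODEL_PATH)
    state = torch.load(MODEL_PATH)
    # fc1 is the first encoder layer; its weight width is the old input_dim
    if state["fc1.weight"].shape[1] == input_dim:
        model.load_state_dict(state)
    else:
        print("Log structure changed; reinitializing the model.")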

Implementation for Anomaly Detection:

# Define the loss function
def vae_loss_function(reconstructed, original, mean, log_var):
    reconstruction_loss = nn.functional.mse_loss(reconstructed, original, reduction='sum')
    kl_divergence = -0.5 * torch.sum(1 + log_var - mean.pow(2) - log_var.exp())
    return reconstruction_loss + kl_divergence

# Detect anomalies
print("Detecting anomalies...")
model.eval()
threshold = 0.05  # Adjust based on testing; with reduction='sum' the error scales with input_dim
anomalies = []

with torch.no_grad():
    for i in range(test_data.shape[0]):
        original = test_data[i]
        reconstructed, _, _ = model(original.unsqueeze(0))
        reconstruction_error = nn.functional.mse_loss(reconstructed, original.unsqueeze(0), reduction='sum').item()
        if reconstruction_error > threshold:
            anomalies.append((i, reconstruction_error))
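
Note that vae_loss_function above is used during training rather than detection; detection relies on the reconstruction error alone. For completeness, here's a minimal training-loop sketch showing where the loss fits. The optimizer choice, learning rate, epoch count, and batching are assumptions (and train_data is assumed to be a tensor of parsed feature vectors); the actual scripts may differ:

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.train()
for epoch in range(20):  # epoch count is a placeholder
    total_loss = 0.0
    for batch in torch.split(train_data, 64):  # simple fixed-size batches
        optimizer.zero_grad()
        reconstructed, mean, log_var = model(batch)
        loss = vae_loss_function(reconstructed, batch, mean, log_var)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch + 1}: avg loss = {total_loss / train_data.shape[0]:.4f}")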

Weekly Retraining

The weekly retraining process (retrain_vae.sh) aggregates logs from the past week and updates the model to reflect new patterns.

Key steps in the weekly run:

  1. Aggregate Logs:
    • Combines current and past logs, including compressed ones (a sketch of this step follows after this list).
    • Ensures the model learns from the latest patterns without forgetting older data.
  2. Retrain the VAE:
    • Reinitializes the model to adapt to any structural changes in the logs.
    • Incorporates new logs to refine its understanding of normal behavior.
  3. Save the Updated Model:
    • The updated model is stored for use in subsequent daily runs.
    • This ensures the system becomes progressively smarter over time.
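
For step 1, the aggregation boils down to reading both plain and gzip-rotated log files before handing them to the parser. Here's a minimal sketch assuming a typical rotation scheme; the directory and file pattern are placeholders, not the paths retrain_vae.sh actually uses:

import glob
import gzip

log_lines = []
for path in sorted(glob.glob("/var/log/myapp/app.log*")):  # placeholder path
    # gzip-rotated files (e.g., app.log.1.gz) need gzip.open; the rest use plain open
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as f:
        log_lines.extend(f.readlines())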

Conclusion

In this article, I introduced the Variational Autoencoder (VAE) and its role in detecting anomalies from structured log data. I also discussed how the daily detection and weekly retraining processes keep the model accurate and adaptive over time.

In the next post, I will dive into fine-tuning the system for optimal performance, exploring topics like adjusting thresholds, categorizing anomalies, and integrating visualizations.

