Welcome back to this series on building a smarter log analysis system. In this post, we’ll focus on the crucial step of parsing logs for machine learning: transforming raw log files into a structured format that our VAE (Variational Autoencoder) can process for anomaly detection. If you’re joining now, I recommend reviewing Part 1 and Part 2 to get up to speed.

Why Parse Logs?

Raw log files are messy and unstructured. For machine learning algorithms to process and analyze logs effectively, we need to:

  1. Extract key features (e.g., timestamp, hostname, service, message).
  2. Normalize data (e.g., scaling numerical values).
  3. Encode categorical fields (e.g., one-hot encoding).

This structured data enables the VAE to focus on patterns and anomalies, rather than struggling with noise or irrelevant information.
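
To make this concrete, here is a minimal sketch of how a single log line maps to structured fields. The sample entry and values are made up for illustration; the full script below does the same thing across an entire file.

import re

# A hypothetical mail.log entry (illustrative only)
raw = "Jan 15 08:32:10 mailserver postfix/smtpd: disconnect from unknown[203.0.113.5]"

# Split the line into the fields the VAE will later consume
pattern = r"(\w+\s+\d+\s+\d+:\d+:\d+)\s+(\S+)\s+(\S+):\s+(.*)"
timestamp, hostname, service, message = re.match(pattern, raw).groups()

print(timestamp)  # "Jan 15 08:32:10" -> later split into hour/minute and scaled
print(service)    # "postfix/smtpd"   -> later one-hot encoded
print(message)    # free-text message -> later reduced to a keyword category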


The Updated parse_logs.py Script

The parse_logs.py script has been improved to handle dynamic input and output files using command-line arguments. This makes it more flexible and easier to integrate into automated workflows, such as daily and weekly processing runs.

Here’s the updated script:

import pandas as pd
import re
import argparse
from sklearn.preprocessing import StandardScaler

# Argument parser for dynamic input and output files
parser = argparse.ArgumentParser(description="Parse and process log files.")
parser.add_argument("--input", required=True, help="Path to the input log file.")
parser.add_argument("--output", required=True, help="Path to the output CSV file.")
args = parser.parse_args()

# Read logs from the input file
logs = []
with open(args.input, 'r') as file:
    for line in file:
        logs.append(line.strip())

# Convert to DataFrame for easier manipulation
df = pd.DataFrame(logs, columns=['log_entry'])

# Function to extract features from log lines
def extract_features(log_entry):
    # Regex pattern to parse log lines
    pattern = r"(\w+\s+\d+\s+\d+:\d+:\d+)\s+(\S+)\s+(\S+):\s+(.*)"
    match = re.match(pattern, log_entry)
    if match:
        timestamp, hostname, service, message = match.groups()
        return timestamp, hostname, service, message
    return None, None, None, None

# Apply the feature extraction function
df[['timestamp', 'hostname', 'service', 'message']] = df['log_entry'].apply(
    lambda x: pd.Series(extract_features(x))
)

# Parse timestamps into datetime objects
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%b %d %H:%M:%S', errors='coerce')

# Extract hour and minute
df['hour'] = df['timestamp'].dt.hour.fillna(0).astype(int)
df['minute'] = df['timestamp'].dt.minute.fillna(0).astype(int)

# Simplify the message field
df['message'] = df['message'].str.extract(
    r'(Disconnected|Connection closed|TLS handshaking|auth failed|relay access denied|deferred|bounced|timeout)',
    expand=False
)
df['message'] = df['message'].fillna('other')  # Replace unmatched entries with 'other'

# One-hot encode categorical fields like service and message
df_encoded = pd.get_dummies(df, columns=['service', 'message'])

# Scale numerical features
numerical_features = ['hour', 'minute']
scaler = StandardScaler()
df_encoded[numerical_features] = scaler.fit_transform(df_encoded[numerical_features])

# Save the processed data to the specified output file
df_encoded.to_csv(args.output, index=False)

# Display a summary of the processed data
print("Processed and Encoded Data:")
print(df_encoded.head())
print(f"Number of columns after encoding: {len(df_encoded.columns)}")

How It Works

  1. Dynamic Input and Output:
    • Specify the input log file with --input and the processed CSV output file with --output.
    • Example: python parse_logs.py --input mail.log --output processed_logs.csv.
  2. Feature Extraction:
    • Extracts key fields like timestamp, hostname, service, and message using regex.
    • Simplifies the message field to focus on significant keywords (e.g., Disconnected, timeout, auth failed).
  3. Data Transformation:
    • Converts categorical fields (service, message) into one-hot encoded features.
    • Normalizes numerical fields (hour, minute) using StandardScaler (see the sketch after this list).
  4. Integration with Workflows:
    • The script integrates seamlessly into our daily and weekly workflows using run_vae.sh and retrain_vae.sh.
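
For a feel of what the data transformation step produces, here is a minimal sketch of one-hot encoding and scaling on a tiny hand-made frame (the rows and values are illustrative, not taken from a real log):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny illustrative frame with the same column names the script produces
sample = pd.DataFrame({
    'hour':    [8, 8, 23],
    'minute':  [5, 47, 59],
    'service': ['postfix/smtpd', 'dovecot', 'postfix/smtpd'],
    'message': ['Disconnected', 'auth failed', 'other'],
})

# One-hot encode the categorical fields, then scale the numerical ones
encoded = pd.get_dummies(sample, columns=['service', 'message'])
encoded[['hour', 'minute']] = StandardScaler().fit_transform(encoded[['hour', 'minute']])

print(encoded.columns.tolist())
# ['hour', 'minute', 'service_dovecot', 'service_postfix/smtpd',
#  'message_Disconnected', 'message_auth failed', 'message_other']

Each categorical value becomes its own 0/1 column, and hour/minute end up centered around zero, which is the kind of input the VAE expects.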

Key Benefits of the Update

  • Reusable: This approach allows the script to process any log file without hardcoding filenames.
  • Automated: Fits naturally into automated scripts and cron jobs (a rough sketch of such a call follows this list).
  • Scalable: Handles large datasets efficiently, ensuring smooth anomaly detection workflows.
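
As a rough illustration of that automation, a daily job could call the parser with a date-stamped output file before handing the result to the VAE. This is a hedged sketch, not the actual contents of run_vae.sh; the paths and file names are assumptions.

import subprocess
from datetime import date

# Assumed locations; adjust to wherever your logs and outputs actually live
log_file = "/var/log/mail.log"
output_file = f"processed_logs_{date.today():%Y%m%d}.csv"

# Parse today's logs into the structured CSV the VAE consumes
subprocess.run(
    ["python", "parse_logs.py", "--input", log_file, "--output", output_file],
    check=True,
)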

Next Steps

In the next post, I will dive into how the VAE leverages this processed data to detect anomalies and how the system evolves over time to become smarter.

