How Kaizen Scaled Their Logging Infrastructure with Graylog Enterprise

Discover how Kaizen transformed their centralized logging operations, solved scaling challenges, and optimized performance with Graylog Enterprise support and features.

Hello, my name is Marinos Yoris, and I am the SRE Team Lead at Kaizen. I’m part of the team that uses and maintains the Graylog cluster at Kaizen.

Today, I’m going to walk you through our journey with Graylog: the challenges we faced and how we overcame them with the help of the Graylog team.

Introduction to Kaizen

Kaizen is an online betting company operating in more than 16 countries under the brands “Stoiximan” and “Betano.”

Why We Needed Graylog

We required a centralized logging tool to:

  • Visualize logs from applications

  • Retain logs for 2 weeks to 1 month

  • Efficiently search logs

  • Monitor application health

Initial Graylog Setup

In 2012, we implemented the open-source version of Graylog with Elasticsearch. It started as a small cluster but grew rapidly as business needs grew.

Scaling Challenges

By 2023, during high-traffic periods, we faced:

  • Slow search performance

  • Frequent disruptions (weekly outages)

  • Log loss during downtime

This was critical because we rely heavily on real-time monitoring.

Transition to Graylog Enterprise

At the end of 2023, we:

  • Collaborated with the Graylog team

  • Migrated to Graylog Enterprise

  • Focused on cluster redesign, optimization, and adoption of Enterprise features such as Illuminate, audit functionality, and advanced filtering, backed by 24/7 support

Migration and Optimization Process

Steps taken:

  • Evaluated the existing cluster

  • Planned migration and cluster creation

  • Migrated non-production environments first, then production

Key optimizations included:

  • Removing unnecessary replicas

  • Adjusting shard sizes

  • Optimizing pipelines and stream rules

  • Increasing batch sizes for output

  • Fine-tuning process and output buffer processors

These changes resulted in:

  • Faster processing (from 5 s to 2 s per log)

  • No outages for months

  • More stable cluster performance
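To illustrate why the replica, shard-size, and batch-size optimizations listed above matter, here is a minimal Python sketch of the arithmetic behind them. It is not Kaizen's actual configuration: the function names are invented for illustration, and the index size, target shard size, and batch sizes are hypothetical values. Only the 750,000 messages-per-second figure comes from this article.

```python
import math


def shards_per_index(index_size_gb: float, target_shard_gb: float) -> int:
    """Primary shards needed so each shard stays at or below the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def stored_size_gb(primary_size_gb: float, replicas: int) -> float:
    """Total on-disk size: one primary copy plus `replicas` extra copies."""
    return primary_size_gb * (1 + replicas)


def bulk_requests_per_second(messages_per_second: float, batch_size: int) -> float:
    """Write requests per second if every request carries a full batch."""
    return messages_per_second / batch_size


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the shape of the trade-offs.
    index_size_gb = 600.0    # assumed size of one index
    target_shard_gb = 30.0   # assumed target shard size

    print("primary shards:", shards_per_index(index_size_gb, target_shard_gb))
    for replicas in (1, 0):
        print(f"replicas={replicas}: {stored_size_gb(index_size_gb, replicas):.0f} GB on disk")

    # 750,000 msg/s is the peak rate quoted in this article; batch sizes are assumed.
    for batch_size in (500, 5000):
        rate = bulk_requests_per_second(750_000, batch_size)
        print(f"batch_size={batch_size}: {rate:.0f} bulk requests/s")
```

Under these assumptions, dropping a replica halves the stored volume, and a tenfold larger batch cuts the number of write requests tenfold. The real trade-offs (durability without replicas, memory use with larger batches) depend on the cluster and are not captured here.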

Current Cluster Scale

  • Processes 15–20 TB of logs daily

  • Retains ~200 TB of logs at a time

  • Handles spikes of up to 750,000 messages per second

  • Runs on upgraded cluster resources (more cores and storage)
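As a quick sanity check on the figures above, the short Python sketch below derives the implied retention window and average ingest rate. It assumes decimal units (1 TB = 1,000,000 MB) and that the ~200 TB retained corresponds directly to the daily volume, ignoring replicas, compression, and index overhead. The helper names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def retention_days(retained_tb: float, daily_tb: float) -> float:
    """Days of logs that fit in the retained volume at a constant daily rate."""
    return retained_tb / daily_tb


def average_mb_per_second(daily_tb: float) -> float:
    """Average ingest rate in MB/s, assuming 1 TB = 1,000,000 MB."""
    return daily_tb * 1_000_000 / SECONDS_PER_DAY


if __name__ == "__main__":
    retained_tb = 200.0            # "~200 TB of logs at a time"
    for daily_tb in (15.0, 20.0):  # "15–20 TB of logs daily"
        days = retention_days(retained_tb, daily_tb)
        rate = average_mb_per_second(daily_tb)
        print(f"{daily_tb:.0f} TB/day: about {days:.1f} days retained, "
              f"about {rate:.1f} MB/s on average")
```

Under these assumptions the numbers work out to roughly 10 to 13 days of retention and an average of roughly 174 to 231 MB/s, which is in line with the retention requirement of about 2 weeks stated earlier.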

Main Use Cases

  1. Production Issue Investigation

  2. Real-Time Monitoring

  3. Application Debugging

  4. Customer Activity Logging (for troubleshooting and regulatory needs)

Future Plans

  • Scale to handle over 1 million logs per second

  • Upgrade to Graylog 6.1

  • Implement Illuminate for network logs in production

Thank you for your attention. We look forward to continued growth with Graylog!