You’ve made the decision to implement a centralized log management solution because you know that it’s going to save you time and money in the long term. However, to get the most bang for your log management buck, you need to understand how the different parts of your log management deployment work. Once you understand each resource, you can implement a more efficient log management architecture.
Considerations For Centralized Logging Architectures
As you start the research process, you should understand what you need from your centralized log management solution, what you want from it, and how you plan to use it.
You’re likely implementing a centralized log management tool because your different teams need to know what’s happening in your environment. In complex, connected, cloud environments, an issue routed to the IT help desk may be the beginning of a security incident investigation.
All your log management architecture decisions begin with your use cases because those drive the log data you need to ingest. Collecting too much log data can increase costs without providing a return on investment.
To optimize your data collection, you should engage stakeholders across all the teams using your centralized log management solution so that you can build out meaningful use cases designed for collaboration.
Historical Data Management
You typically use log data for real-time visibility into your environment for IT operations or security use cases. However, you shouldn’t underestimate historical data’s value, either.
Consider the following historical data use cases:
- Management wants to know when an issue began.
- You notice a similarity between a current and past issue.
- You want to review trends over time.
While you need historical data, you may not need all the raw data because storing large volumes of data becomes cost prohibitive. When determining your log management architecture, you need to know your ingest rate, log size, and available storage.
Determining Disk Space
Your storage configurations directly relate to your disk space. Managing transaction logs alone can become overwhelming, depending on the amount of log data you decide to collect and store. You may choose to collect all log messages from all locations or limit collection to the information that’s most important to your use cases.
For example, if your company implemented a microservices architecture with Docker containers, you can rapidly fill up your available disk space. When you have gigabytes or terabytes of incoming logs every day, you want to index and rotate logs so that you can save disk space and search faster.
Your log rotation strategy and configuration can include:
- Delete: removes the index from disk to save space, usually used for operational data
- Close: stops writing data to the index while maintaining its searchability
- Nothing: leaves the index open and on disk until manual removal
- Archive: compresses files and sets up backend storage
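The rotation strategies above can be sketched as a simple policy lookup. This is a hypothetical illustration, not any particular product’s API; the strategy names and the `RotationPolicy` class are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class RotationPolicy:
    strategy: str      # "delete", "close", "nothing", or "archive"
    max_age_days: int  # rotate indices older than this

def rotation_action(index_age_days: int, policy: RotationPolicy) -> str:
    """Decide what happens to an index of a given age under a policy."""
    if index_age_days < policy.max_age_days:
        return "keep"  # still within the retention window
    actions = {
        "delete": "remove index from disk",
        "close": "stop writes, keep index searchable",
        "nothing": "leave index open until manual removal",
        "archive": "compress files and move to backend storage",
    }
    return actions[policy.strategy]

print(rotation_action(45, RotationPolicy("archive", 30)))
# compress files and move to backend storage
```

In a real deployment, the equivalent logic usually lives in the log management tool’s index lifecycle settings rather than in application code.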
Log Retention Requirements
Your log management architecture also relies on the audit and compliance data retention requirements. Your solution should provide archiving that allows you to set a records retention timeframe and enable you to access data when necessary.
Choosing Your Hosting Model
Each company is different, so your hosting model should match your organization’s goals.
If you’re a small organization, an on-premises hosting model might be attractive, especially if you aren’t already leveraging cloud technologies like microservices or Software-as-a-Service (SaaS) apps.
However, with an on-premises hosting model, you need to consider the following:
- Storage costs
- Maintenance costs
- Staffing needs
Although you might want to start small with your own on-premises environment, you can easily start to run out of disk space, which undermines your overarching business objectives.
If you have long-term plans for scaling your log management use cases, then cloud hosting may be a better fit. As with everything cloud-based, you don’t have to worry about the maintenance concerns because they become the provider’s responsibility.
However, you should consider the compute resources required for processing the log files. If you don’t make an up-front decision about log retention, then you might find the compute costs surprising.
If you don’t think that you can manage either an on-premises or cloud-based architecture yourself, you might want to look into a purpose-built centralized log management solution.
A centralized log management solution can be deployed on-premises or in the cloud, so you have the flexibility you need. Additionally, since the storage and compute costs are built into the pricing model, you don’t have to worry about controlling those costs on a month-to-month basis.
The Different Log Management Resources
At a high level, centralized log management appears to be a single tool that does all the work. However, just like a car’s engine consists of a lot of smaller parts, your log management tool has various interconnected resources.
Your log management solution consists of:
- Frontend server and processing engine
- Search engine and log storage
- Configuration database
The frontend server hosts the user interface that you’re working with. This can be your internally built solution or a third-party centralized log management solution. This is also the technology that does the background work of collecting, parsing, normalizing, aggregating, correlating, and analyzing data. In other words, this is where all the magic happens.
When you look at this from an architectural point of view, you want to focus on disk I/O, since the frontend must both ingest data and forward processed log data to log storage. The central processing unit (CPU) power is also important, because it enables the analytics and end-user experience.
Search Engine and Log Storage
Logs are indexed and stored in this component, which serves two functions. It contains a search engine that fields query requests from the frontend server against the stored logs. If you’re using a cloud-based solution, it’s probably running a search engine like OpenSearch.
If this is an on-premises solution, this is the resource that drives your investigations, so you want as much RAM and the fastest disks possible. The ingested, indexed messages are also stored here, so you may need more than one search engine node if you have a lot of log data.
Finally, your solution will also provide a database that stores all meta information and configuration data. You’ll need long-term storage capabilities and probably want some kind of data backup. However, this database doesn’t need as many resources as the other two components.
As you develop your log management strategy and plans, you need to consider various factors impacting log ingestion. Further, you should consider not only current-state needs but incorporate long-term IT plans.
(Graylog Sr. Solutions Engineer Chris Black: Graylog Reference Architectures)
As your IT environment becomes more complex, your log management architecture will need to scale, so you should consider things like:
- Amount of device and application logs ingested
- Number of processing engine or centralized log management servers
- Number of search servers
- CPU cores for servers
- RAM required for servers
Velocity: Events Per Second (EPS)
EPS is the number of events that a device generates per second. To determine EPS, look at the number of events each device generates over a fixed period of seconds during normal, everyday usage.
The calculation might look like this:
(Total Number of Events/Total Number of Devices)/Total Number of Seconds
Additionally, you need to estimate high-volume or “peak” usage, such as during a security incident, which will generate more events than you’re used to.
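The EPS calculation above can be sketched in a few lines. The device count, event totals, and the 3x peak multiplier are illustrative assumptions, not recommendations.

```python
def average_eps(total_events: int, total_devices: int, total_seconds: int) -> float:
    """(Total Number of Events / Total Number of Devices) / Total Number of Seconds,
    i.e. average events per second per device."""
    return (total_events / total_devices) / total_seconds

# Illustrative numbers: 100 devices emitting 43,200,000 events over one day.
per_device_eps = average_eps(43_200_000, 100, 86_400)
fleet_eps = per_device_eps * 100  # scale back up to fleet-wide EPS
peak_eps = fleet_eps * 3          # assumed 3x headroom for incident spikes

print(per_device_eps, fleet_eps, peak_eps)  # 5.0 500.0 1500.0
```

Sizing against the peak figure rather than the average helps ensure the pipeline keeps up during an incident, which is exactly when you need the logs most.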
The volume calculation is directly related to all the work you did determining your use cases and the event log data you needed to collect. For example, if you decided to collect all event log data, a file might be as large as 2GB. However, if you narrowed your log message fields, you reduced the file size.
Now you know:
- the number of events per second
- the average bytes per file
- that there are 86,400 seconds in a day
To calculate daily volume you can use this equation:
EPS x Average Bytes per File x 86,400
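The daily volume equation is easy to turn into a quick estimator. The 500 EPS and 500-bytes-per-message figures below are illustrative assumptions.

```python
SECONDS_PER_DAY = 86_400

def daily_volume_bytes(eps: float, avg_bytes: float) -> float:
    """Daily ingest volume: EPS x average bytes x 86,400 seconds in a day."""
    return eps * avg_bytes * SECONDS_PER_DAY

# Illustrative: 500 EPS at an average of 500 bytes per log message.
volume = daily_volume_bytes(500, 500)
print(f"{volume / 1024**3:.1f} GiB/day")  # 20.1 GiB/day
```

Running this for both your average and peak EPS figures gives you the range your storage and network need to absorb.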
Structure Logs Appropriately
The format that you use to store files directly impacts your hardware and storage sizing.
If you’re using an on-premises log server, you want to have enough room to grow without spending too much money up front. Some considerations include:
- Format used for storing files
- Log compression ratio
- Short-term versus long-term data storage requirements
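The considerations above combine into a rough storage estimate. The 90-day retention and assumed 10:1 compression ratio are placeholders; substitute your own compliance window and measured ratio.

```python
def required_storage_gb(daily_gb: float, retention_days: int,
                        compression_ratio: float = 1.0) -> float:
    """Estimate on-disk storage for a retention window.
    compression_ratio is stored size / raw size (e.g. 0.1 for 10:1)."""
    return daily_gb * retention_days * compression_ratio

# Illustrative: 20 GB/day raw, 90-day retention, assumed 10:1 compression
print(required_storage_gb(20, 90, 0.1))  # 180.0 (GB)
```

A common refinement is to keep a short window uncompressed for fast searches and apply compression only to the long-term archive tier.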
Cloud Server Sizing
If you decide to use the cloud, then you have additional scalability and flexibility, not just for searching your logs but also for storing them. However, this also means you need to understand your architecture a little bit differently.
The key issue to consider in a cloud deployment is the RAM difference required for search compared to processing. Your search engine server/node should have double the RAM of your primary endpoint node.
Consider the following architectures:
- 20-50GB log ingestion: 2 frontend servers with 32GB RAM per server, 2 backend servers with 16GB RAM per server
- 50-100GB log ingestion: 3 frontend servers with 32GB RAM per server, 2 backend servers with 16GB RAM per server
- 100-300GB log ingestion: 4 frontend servers with 32GB RAM per server, 3 backend servers with 16GB RAM per server
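The example tiers above can be encoded as a simple lookup to pick a starting architecture from your estimated daily ingest. The tier boundaries come from the list above; everything else in the function is an illustrative sketch, not official sizing guidance.

```python
# (max daily GB ingested, frontend servers, backend servers)
TIERS = [
    (50,  2, 2),
    (100, 3, 2),
    (300, 4, 3),
]

def suggest_architecture(daily_gb: float) -> str:
    """Map an estimated daily ingest volume to an example architecture."""
    for max_gb, frontends, backends in TIERS:
        if daily_gb <= max_gb:
            return (f"{frontends} frontend servers (32GB RAM each), "
                    f"{backends} backend servers (16GB RAM each)")
    return "over 300GB/day: size a custom cluster"

print(suggest_architecture(75))
# 3 frontend servers (32GB RAM each), 2 backend servers (16GB RAM each)
```

Feeding in the daily volume you calculated earlier gives a first-pass cluster shape that you can then validate against real ingest and search load.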
Graylog: Flexible, Scalable Solutions for Operations and Security
Whether delivered as a self-managed or cloud experience, Graylog offers you a powerful, flexible, and seamless centralized log management solution that gives you the security analytics and operations tools you need. IT operations and security teams can collaborate more efficiently and effectively using our powerful, lightning-fast features and intuitive UI to improve key metrics like mean time to investigate and mean time to recover.
With our out-of-the-box content, you can start getting valuable data from your logs quicker. We help you manage day-to-day tasks and respond to the unexpected by increasing visibility across your IT environment and ensuring you can search volumes of log data in seconds.