Cloud environments generate high volumes of security data, leaving security teams swimming in it — and many feel like they need a life preserver to improve their incident response capabilities.
Enter security data lakes. As the costs associated with data retention become overwhelming, organizations are embracing the idea of security data lakes and data warehouses. These repositories enable organizations to store large amounts of security data, typically types not immediately required for search and analysis. While they reduce total cost of ownership, organizations should still understand why they need a security data lake strategy to optimize their investment and realize the benefits they expect.
Why Do Organizations Use Security Data Lakes?
Security data lakes enable organizations to centralize their collected security data using a cost-effective, long-term storage solution. Security data lakes, like Amazon Security Lake, offer various benefits, including:
- Cost reductions: storage without the licensing costs associated with traditional security information and event management (SIEM) systems
- Scalability: faster investigations by scaling up computing resources and enabling collaboration
- Flexibility: management for different data types, including structured, unstructured, and semi-structured data
- High-fidelity alerts: correlation across more data reduces the number of false alerts
- Compliance: historical data retention to meet legal, regulatory, and internal policy requirements
- Improved decision-making: data storage and processing to leverage security analytics
What is Data Lake Architecture?
Data lake architecture supports ingesting data across various sources and formats so organizations can achieve real-time storage and analysis.
Data sources
A security data lake acts as a centralized repository, storing data from various sources, including:
- Software-as-a-Service (SaaS) applications
- Cloud environments, like AWS CloudWatch and Azure
- Network devices, like firewalls
- Endpoint detection and response (EDR) tools
By centralizing security telemetry from numerous tools and environments, security data lakes improve visibility and facilitate thorough security investigations.
Data Ingestion
Data ingestion is the process of importing data into a data lake. As part of this process, the data should be parsed and normalized, providing a standardized format to facilitate correlation and analysis.
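As a minimal sketch of this normalization step, the snippet below maps source-specific field names onto a shared schema so records from different tools can be correlated. The field names and the two source types are illustrative assumptions, not a real vendor schema:

```python
# Hypothetical field mappings for two log sources; real sources and
# target schemas (e.g., OCSF) will differ.
FIELD_MAPS = {
    "firewall": {"ts": "timestamp", "srcip": "source_ip", "act": "action"},
    "edr": {"event_time": "timestamp", "host_ip": "source_ip", "verdict": "action"},
}

def normalize(record: dict, source: str) -> dict:
    """Map source-specific field names onto a shared schema."""
    mapping = FIELD_MAPS[source]
    out = {mapping[k]: v for k, v in record.items() if k in mapping}
    out["source"] = source
    return out

fw = normalize({"ts": "2024-05-01T12:00:00Z", "srcip": "10.0.0.5", "act": "deny"}, "firewall")
edr = normalize({"event_time": "2024-05-01T12:00:03Z", "host_ip": "10.0.0.5", "verdict": "block"}, "edr")
# Both records now share "timestamp", "source_ip", and "action" keys,
# so they can be correlated on source_ip.
```

Because both records end up with the same keys, a single query on `source_ip` can now pull matching firewall and EDR events together.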
Data Storage And Processing
As a centralized repository, the security data lake enables organizations to retain massive and diverse data sets. They can achieve efficient data management with automation that enables them to balance storage costs, system performance, and data access.
Why design a security data lake strategy?
A security data lake strategy centers on creating a scalable, centralized repository to safeguard organizational assets from hidden and unknown threats. Unlike traditional SIEM solutions, security data lakes are designed to handle the sheer volume of security data affordably, providing both storage and compute scalability.
While a security data lake enables cost savings and improves data accessibility, you should build an implementation strategy that considers the challenges organizations face.
Choosing a solution
While implementing a security data lake comes with benefits, you should choose a solution that focuses on security data. Many technologies offer data transformation, but not all of them understand the nuances of security telemetry. Using the same tool for business intelligence and security data can make the normalization process more time-consuming and expensive.
Parsing data
Organizations integrate various technologies into their environments, and many of them use proprietary formats that make data parsing a challenge. Without automatic parsing, you may struggle to create a standardized data format.
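One common mitigation is a best-effort parser with a fallback that preserves the raw message instead of dropping it. The key=value pattern below is a stand-in for one vendor's format; real proprietary formats typically need per-vendor parsers:

```python
import re

# Hypothetical pattern for a vendor's key=value log line; real proprietary
# formats vary and often need dedicated per-vendor parsers.
KV_PATTERN = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_line(line: str) -> dict:
    """Best-effort parse; fall back to keeping the raw line so nothing is lost."""
    fields = {k: v.strip('"') for k, v in KV_PATTERN.findall(line)}
    if not fields:
        return {"raw": line, "parse_status": "unparsed"}
    fields["parse_status"] = "parsed"
    return fields

parsed = parse_line('src=10.1.2.3 dst=10.9.8.7 action="allow"')
fallback = parse_line("##vendor binary blob##")
```

Tagging each record with a `parse_status` field makes it easy to measure parser coverage and revisit unparsed sources later.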
Choosing a format
As you ingest data, you want to transform it into a format usable by its ultimate destination. Depending on your chosen destination, you may still find yourself constrained to a vendor's proprietary format, and during normalization the technology may drop log fields that do not fit the data structure.
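One way to avoid losing those fields is to keep anything the target schema does not cover in a catch-all bucket rather than dropping it. The schema and field names below are assumptions for illustration:

```python
# Hypothetical target schema; anything the schema does not cover is kept
# in an "additional_fields" bucket instead of being dropped.
SCHEMA = {"timestamp", "source_ip", "action"}

def to_schema(record: dict) -> dict:
    """Split a record into schema fields and preserved extras."""
    core = {k: v for k, v in record.items() if k in SCHEMA}
    extra = {k: v for k, v in record.items() if k not in SCHEMA}
    if extra:
        core["additional_fields"] = extra
    return core

event = to_schema({"timestamp": "2024-05-01T12:00:00Z", "source_ip": "10.0.0.5",
                   "action": "deny", "rule_id": "fw-114", "vlan": 30})
# event["additional_fields"] == {"rule_id": "fw-114", "vlan": 30}
```

The extras remain queryable for investigations even though they are outside the standardized schema.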
Key Considerations for a Data Lake Strategy
As the volume of logs continues to increase, organizations turn to security data lakes for cost and data management. When implementing a security data lake, organizations should build a strategy to optimize costs and resources.
Align Strategy with Business Goals
Your security data lake strategy should align with your organization’s business and security objectives. Your organization’s business goals act as the foundation for your security program. By linking your security data lake strategy to these intended outcomes, you can make informed, purposeful decisions about data ingestion. Some considerations include:
- Compliance requirements for your company’s industry vertical
- Organizational risk tolerance and current security controls
- Threat landscape and attack surface
Choose your Cloud Storage
When selecting a cloud storage solution, consider factors like:
- Overall complexity
- Management overhead
- Native capabilities, like parsing and normalization
- Total cost of ownership
While public cloud solutions offer low-cost storage for large data volumes, you should also consider the cost and time required to process security data before you can use it.
Define Clear Objectives
Setting clear objectives in a data lake strategy ensures alignment with organizational goals and tracks progress effectively. Clear goals facilitate the implementation process and maximize value from data assets. Some considerations when defining objectives include:
- Determining key performance indicators, like mean time to detect (MTTD) and mean time to investigate (MTTI)
- Compliance outcomes, like identifying and remediating compliance gaps
- Cross-functional collaborations, like ensuring that IT teams and security analysts have the data and access they need to work together
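To make a KPI like MTTD concrete, the sketch below computes it from pairs of occurrence and detection timestamps. The incident data is invented for illustration:

```python
from datetime import datetime

# Illustrative incident data: (time the incident occurred, time it was detected).
incidents = [
    ("2024-05-01T08:00:00", "2024-05-01T09:30:00"),
    ("2024-05-02T10:00:00", "2024-05-02T10:45:00"),
]

def mean_time_to_detect(rows):
    """MTTD in minutes: average gap between occurrence and detection."""
    deltas = [
        (datetime.fromisoformat(d) - datetime.fromisoformat(o)).total_seconds() / 60
        for o, d in rows
    ]
    return sum(deltas) / len(deltas)

mttd = mean_time_to_detect(incidents)  # 67.5 minutes for the sample data
```

Tracking this number over time shows whether your data lake investment is actually shortening detection.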
Define Use Cases
Use cases act as blueprints for a successful data lake strategy, addressing specific business challenges. Some use cases to consider include:
- Threat hunting
- Insider threat detection and investigation
- Detection and response for credential-based attacks, like credential stuffing
- API attack detection and investigation
- Compliance reporting
Choose Your Data Sources
Your data sources should help you achieve the defined objectives and respond to your use cases. Some log sources that you might want to consider include:
- Security
  - Firewalls
  - Endpoint security (EDR, AV, etc.)
  - Web proxies/gateways
  - LDAP/Active Directory
  - IDS
  - DNS
  - DHCP
  - Servers
  - Workstations
  - NetFlow
- Ops
  - Applications
  - Network devices
  - Servers
  - Packet capture/network recorder
- DevOps
  - Application logs
  - Load balancer logs
  - Automation system logs
  - Business logic
Configure Data Ingestion
Data pipelines serve as the structured framework for evaluating, modifying, and routing log data through various processing steps. When you configure data ingestion, you should consider the following:
- Source format, like Windows Event logs, journald, application data
- Complexity of rule creation
- Ability to secure inputs
As you configure your data ingestion, you want to ensure that you create reusable rules so you can share them across different pipelines.
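Reusable rules can be modeled as small functions that either transform an event or drop it, composed into per-destination pipelines. The rule names and field values here are illustrative assumptions:

```python
# A minimal sketch of reusable pipeline rules shared across pipelines;
# field names and values are illustrative.
def drop_debug(event):
    """Drop low-value debug events by returning None."""
    return None if event.get("level") == "debug" else event

def tag_firewall(event):
    """Enrich firewall events with a category for later routing."""
    if event.get("source") == "firewall":
        event = {**event, "category": "network"}
    return event

def run_pipeline(event, rules):
    for rule in rules:
        event = rule(event)
        if event is None:  # a rule dropped the event
            return None
    return event

# The same rule objects back two different pipelines.
siem_rules = [drop_debug, tag_firewall]
archive_rules = [tag_firewall]  # keep debug events in the archive

out = run_pipeline({"source": "firewall", "level": "info"}, siem_rules)
```

Because each rule is self-contained, adding a new destination means composing a new list rather than rewriting parsing logic.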
Implement Robust Security Measures
Security data contains sensitive information, so you should know how you plan to implement security controls and data governance as part of your strategy. Some considerations include:
- Access controls that limit users to only what they need to complete job functions
- Data encryption to make security data unusable if someone gains unauthorized access
- Data obfuscation to remove or change values for sensitive data in log messages
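As a sketch of the obfuscation idea, the snippet below pseudonymizes a username with a truncated hash and masks the host portion of IPv4 addresses before a message reaches long-term storage. The patterns and truncation length are assumptions; a production approach would use keyed hashing and vetted masking rules:

```python
import hashlib
import re

IP_RE = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.\d{1,3}\.\d{1,3}\b")

def obfuscate(message: str, username: str) -> str:
    """Replace the username with a pseudonym and mask the host part of IPs."""
    # Pseudonymize the username with a truncated SHA-256 digest (illustrative;
    # use a keyed hash in practice so tokens can't be brute-forced).
    token = hashlib.sha256(username.encode()).hexdigest()[:12]
    message = message.replace(username, f"user-{token}")
    # Mask the host portion of IPv4 addresses, keeping the /16 for analysis.
    return IP_RE.sub(r"\1.\2.x.x", message)

masked = obfuscate("login failure for alice from 192.168.14.7", "alice")
# masked no longer contains "alice" or the full source IP
```

Keeping the network prefix and a stable pseudonym preserves correlation value (same user, same subnet) without exposing the raw identifiers.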
Optimize for scalability
Storage tiering helps you balance cost, performance, and access by categorizing data into tiers:
- Hot: high-use data necessary for easy access and search
- Warm: data not requiring frequent access but that can be valuable during a search, like logs from recent weeks
- Cold: less critical data that can be archived as compressed files
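A tiering policy like the one above can be as simple as an age threshold per tier. The 30- and 90-day cutoffs below are assumptions; tune them to your access patterns and retention requirements:

```python
from datetime import date, timedelta

# Illustrative tiering thresholds (days); adjust to your own policy.
HOT_DAYS, WARM_DAYS = 30, 90

def storage_tier(log_date: date, today: date) -> str:
    """Assign a storage tier based on the age of the data."""
    age = (today - log_date).days
    if age <= HOT_DAYS:
        return "hot"    # frequently searched, fastest storage
    if age <= WARM_DAYS:
        return "warm"   # occasional access, cheaper storage
    return "cold"       # archived as compressed files

today = date(2024, 6, 1)
tiers = [storage_tier(today - timedelta(days=d), today) for d in (5, 60, 200)]
# tiers == ["hot", "warm", "cold"]
```

Automating this assignment at ingest or via scheduled lifecycle jobs is what lets the lake stay affordable as volumes grow.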
Graylog Security and Graylog Cloud: Cost-effective security data management
Graylog Cloud provides centralized storage of event log data for up to 90 days and archives data for up to 1 year, enabling you to optimize your security data while still meeting data retention compliance requirements. Using Graylog Cloud with Graylog Security enables you to create a cost-effective, scalable security data management solution that integrates security analytics with built-in security content to improve your threat detection and incident response (TDIR) capabilities.
Our data routing ensures optimal storage by using index field type profiles to create hot, warm, and cold storage tiers. Our warm storage tier enables less expensive remote or on-premises storage options while still delivering the same lightning-fast, robust search experience as data in hot storage. Take advantage of the Data Warehouse to route data that does not need to live in hot storage into lower-cost storage for future use or compliance.
To learn how Graylog can streamline security data management and improve total cost of ownership, contact us today.