Cloud environments generate high volumes of security data, leaving security teams swimming in it — and many feel like they need a life preserver to improve their incident response capabilities.
Enter security data lakes. As the costs associated with data retention become overwhelming, organizations are embracing the idea of security data lakes and data warehouses. These repositories enable organizations to store large amounts of security data, typically types not immediately required for search and analysis. While they reduce total cost of ownership, organizations should still understand why they need a security data lake strategy to optimize their investment and realize the benefits they expect.
Why Do Organizations Use Security Data Lakes?
Security data lakes enable organizations to centralize their collected security data using a cost-effective, long-term storage solution. Security data lakes, like Amazon Security Lake, offer various benefits, including:
- Cost reductions: storage without the licensing costs associated with traditional security information and event management (SIEM) systems
- Scalability: faster investigations by scaling up computing resources and enabling collaboration
- Flexibility: management for different data types, including structured, unstructured, and semi-structured data
- High-fidelity alerts: correlation across more data reduces the number of false alerts
- Compliance: historical data retention to meet legal, regulatory, and internal policy requirements
- Improved decision-making: data storage and processing to leverage security analytics
What is Data Lake Architecture?
Data lake architecture supports ingesting data across various sources and formats so organizations can achieve real-time storage and analysis.
Data sources
A security data lake acts as a centralized repository, storing data from various sources, including:
- Software-as-a-Service (SaaS) applications
- Cloud environments, like AWS CloudWatch and Azure
- Network devices, like firewalls
- Endpoint detection and response (EDR) tools
By centralizing security telemetry from numerous tools and environments, security data lakes improve visibility and facilitate thorough security investigations.
Data Ingestion
Data ingestion is the process of importing data into a data lake. As part of this process, the data should be parsed and normalized, providing a standardized format to facilitate correlation and analysis.
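As a minimal sketch of this normalization step, the snippet below maps source-specific field names onto a shared schema so records from different tools can be correlated. The field names and the two source types are illustrative assumptions, not a real vendor schema:

```python
# Hypothetical field mappings for two log sources; real sources and
# target schemas (e.g., OCSF) will differ.
FIELD_MAPS = {
    "firewall": {"ts": "timestamp", "srcip": "source_ip", "act": "action"},
    "edr": {"event_time": "timestamp", "host_ip": "source_ip", "verdict": "action"},
}

def normalize(record: dict, source: str) -> dict:
    """Map source-specific field names onto a shared schema."""
    mapping = FIELD_MAPS[source]
    out = {mapping[k]: v for k, v in record.items() if k in mapping}
    out["source"] = source
    return out

fw = normalize({"ts": "2024-05-01T12:00:00Z", "srcip": "10.0.0.5", "act": "deny"}, "firewall")
edr = normalize({"event_time": "2024-05-01T12:00:03Z", "host_ip": "10.0.0.5", "verdict": "block"}, "edr")
# Both records now share "timestamp", "source_ip", and "action" keys,
# so they can be correlated on source_ip.
```

Because both records end up with the same keys, a single query on `source_ip` can now pull matching firewall and EDR events together.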
Data Storage And Processing
As a centralized repository, the security data lake enables organizations to retain massive and diverse data sets. They can achieve efficient data management with automation that enables them to balance storage costs, system performance, and data access.
Why design a security data lake strategy?
A security data lake strategy centers on creating a scalable, centralized repository to safeguard organizational assets from hidden and unknown threats. Unlike traditional SIEM solutions, security data lakes are designed to handle the sheer volume of security data affordably, providing both storage and compute scalability.
While a security data lake enables cost savings and improves data accessibility, you should build an implementation strategy that considers the challenges organizations face.
Choosing a solution
While implementing a security data lake comes with benefits, you should choose a solution that focuses on security data. Many technologies offer data transformation, but not all of them understand the nuances of security telemetry. Using the same tool for business intelligence and security data can make the normalization process more time-consuming and expensive.
Parsing data
Organizations integrate various technologies into their environments, and many of them use proprietary formats that make data parsing a challenge. Without automatic parsing, you may struggle to create a standardized data format.
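One common mitigation is a best-effort parser with a fallback that preserves the raw message instead of dropping it. The key=value pattern below is a stand-in for one vendor's format; real proprietary formats typically need per-vendor parsers:

```python
import re

# Hypothetical pattern for a vendor's key=value log line; real proprietary
# formats vary and often need dedicated per-vendor parsers.
KV_PATTERN = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_line(line: str) -> dict:
    """Best-effort parse; fall back to keeping the raw line so nothing is lost."""
    fields = {k: v.strip('"') for k, v in KV_PATTERN.findall(line)}
    if not fields:
        return {"raw": line, "parse_status": "unparsed"}
    fields["parse_status"] = "parsed"
    return fields

parsed = parse_line('src=10.1.2.3 dst=10.9.8.7 action="allow"')
fallback = parse_line("##vendor binary blob##")
```

Tagging each record with a `parse_status` field makes it easy to measure parser coverage and revisit unparsed sources later.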
Choosing a format
As you ingest data, you want to transform it into a format usable by its ultimate destination. Depending on your chosen destination, you may still find yourself constrained to a vendor's proprietary format, and during normalization the technology may drop log fields that do not fit the data structure.
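One way to avoid losing those fields is to keep anything the target schema does not cover in a catch-all bucket rather than dropping it. The schema and field names below are assumptions for illustration:

```python
# Hypothetical target schema; anything the schema does not cover is kept
# in an "additional_fields" bucket instead of being dropped.
SCHEMA = {"timestamp", "source_ip", "action"}

def to_schema(record: dict) -> dict:
    """Split a record into schema fields and preserved extras."""
    core = {k: v for k, v in record.items() if k in SCHEMA}
    extra = {k: v for k, v in record.items() if k not in SCHEMA}
    if extra:
        core["additional_fields"] = extra
    return core

event = to_schema({"timestamp": "2024-05-01T12:00:00Z", "source_ip": "10.0.0.5",
                   "action": "deny", "rule_id": "fw-114", "vlan": 30})
# event["additional_fields"] == {"rule_id": "fw-114", "vlan": 30}
```

The extras remain queryable for investigations even though they are outside the standardized schema.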
Key Considerations for a Data Lake Strategy
As the volume of logs continues to increase, organizations turn to security data lakes for cost and data management. When implementing a security data lake, organizations should build a strategy to optimize costs and resources.
Align Strategy with Business Goals
Your security data lake strategy should align with your organization’s business and security objectives. Your organization’s business goals act as the foundation for your security program. By linking your security data lake strategy to these intended outcomes, you can make informed, purposeful decisions about data ingestion. Some considerations include:
- Compliance requirements for your company’s industry vertical
- Organizational risk tolerance and current security controls
- Threat landscape and attack surface
Choose your Cloud Storage
When selecting a cloud storage solution, consider factors like:
- Overall complexity
- Management overhead
- Native capabilities, like parsing and normalization
- Total cost of ownership
While public cloud solutions offer low-cost storage for large data volumes, you should also consider the cost and time required to process security data before you can use it.
Define Clear Objectives
Setting clear objectives in a data lake strategy ensures alignment with organizational goals and tracks progress effectively. Clear goals facilitate the implementation process and maximize value from data assets. Some considerations when defining objectives include:
- Determining key performance indicators, like mean time to detect (MTTD) and mean time to investigate (MTTI)
- Compliance outcomes, like identifying and remediating compliance gaps
- Cross-functional collaborations, like ensuring that IT teams and security analysts have the data and access they need to work together
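To make a KPI like MTTD concrete, the sketch below computes it from pairs of occurrence and detection timestamps. The incident data is invented for illustration:

```python
from datetime import datetime

# Illustrative incident data: (time the incident occurred, time it was detected).
incidents = [
    ("2024-05-01T08:00:00", "2024-05-01T09:30:00"),
    ("2024-05-02T10:00:00", "2024-05-02T10:45:00"),
]

def mean_time_to_detect(rows):
    """MTTD in minutes: average gap between occurrence and detection."""
    deltas = [
        (datetime.fromisoformat(d) - datetime.fromisoformat(o)).total_seconds() / 60
        for o, d in rows
    ]
    return sum(deltas) / len(deltas)

mttd = mean_time_to_detect(incidents)  # 67.5 minutes for the sample data
```

Tracking this number over time shows whether your data lake investment is actually shortening detection.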
Define Use Cases
Use cases act as blueprints for a successful data lake strategy, addressing specific business challenges. Some use cases to consider include:
- Threat hunting
- Insider threat detection and investigation
- Detection and response for credential-based attacks, like credential stuffing
- API attack detection and investigation
- Compliance reporting
Choose Your Data Sources
Your data sources should help you achieve the defined objectives and respond to your use cases. Some log sources that you might want to consider include:
- Security
  - Firewalls
  - Endpoint security (EDR, AV, etc.)
  - Web proxies/gateways
  - LDAP/Active Directory
  - IDS
  - DNS
  - DHCP
  - Servers
  - Workstations
  - NetFlow
- Ops
  - Applications
  - Network devices
  - Servers
  - Packet capture/network recorder
- DevOps
  - Application logs
  - Load balancer logs
  - Automation system logs
  - Business logic
Configure Data Ingestion
Data pipelines serve as the structured framework for evaluating, modifying, and routing log data through various processing steps. When you configure data ingestion, you should consider the following:
- Source format, like Windows Event logs, journald, application data
- Complexity of rule creation
- Ability to secure inputs
As you configure your data ingestion, you want to ensure that you create reusable rules so you can share them across different pipelines.
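Reusable rules can be modeled as small functions that either transform an event or drop it, composed into per-destination pipelines. The rule names and field values here are illustrative assumptions:

```python
# A minimal sketch of reusable pipeline rules shared across pipelines;
# field names and values are illustrative.
def drop_debug(event):
    """Drop low-value debug events by returning None."""
    return None if event.get("level") == "debug" else event

def tag_firewall(event):
    """Enrich firewall events with a category for later routing."""
    if event.get("source") == "firewall":
        event = {**event, "category": "network"}
    return event

def run_pipeline(event, rules):
    for rule in rules:
        event = rule(event)
        if event is None:  # a rule dropped the event
            return None
    return event

# The same rule objects back two different pipelines.
siem_rules = [drop_debug, tag_firewall]
archive_rules = [tag_firewall]  # keep debug events in the archive

out = run_pipeline({"source": "firewall", "level": "info"}, siem_rules)
```

Because each rule is self-contained, adding a new destination means composing a new list rather than rewriting parsing logic.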
Implement Robust Security Measures
Security data contains sensitive information, so you should know how you plan to implement security controls and data governance as part of your strategy. Some considerations include:
- Access controls that limit users to only what they need to complete job functions
- Data encryption to make security data unusable if someone gains unauthorized access
- Data obfuscation to remove or change values for sensitive data in log messages
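As a sketch of the obfuscation idea, the snippet below pseudonymizes a username with a truncated hash and masks the host portion of IPv4 addresses before a message reaches long-term storage. The patterns and truncation length are assumptions; a production approach would use keyed hashing and vetted masking rules:

```python
import hashlib
import re

IP_RE = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.\d{1,3}\.\d{1,3}\b")

def obfuscate(message: str, username: str) -> str:
    """Replace the username with a pseudonym and mask the host part of IPs."""
    # Pseudonymize the username with a truncated SHA-256 digest (illustrative;
    # use a keyed hash in practice so tokens can't be brute-forced).
    token = hashlib.sha256(username.encode()).hexdigest()[:12]
    message = message.replace(username, f"user-{token}")
    # Mask the host portion of IPv4 addresses, keeping the /16 for analysis.
    return IP_RE.sub(r"\1.\2.x.x", message)

masked = obfuscate("login failure for alice from 192.168.14.7", "alice")
# masked no longer contains "alice" or the full source IP
```

Keeping the network prefix and a stable pseudonym preserves correlation value (same user, same subnet) without exposing the raw identifiers.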
Optimize for scalability
Storage tiering helps you balance cost, performance, and access by categorizing data into tiers:
- Hot: high-use data necessary for easy access and search
- Warm: data not requiring frequent access but that can be valuable during a search, like logs from recent weeks
- Cold: less critical data that can be archived as compressed files
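A tiering policy like the one above can be as simple as an age threshold per tier. The 30- and 90-day cutoffs below are assumptions; tune them to your access patterns and retention requirements:

```python
from datetime import date, timedelta

# Illustrative tiering thresholds (days); adjust to your own policy.
HOT_DAYS, WARM_DAYS = 30, 90

def storage_tier(log_date: date, today: date) -> str:
    """Assign a storage tier based on the age of the data."""
    age = (today - log_date).days
    if age <= HOT_DAYS:
        return "hot"    # frequently searched, fastest storage
    if age <= WARM_DAYS:
        return "warm"   # occasional access, cheaper storage
    return "cold"       # archived as compressed files

today = date(2024, 6, 1)
tiers = [storage_tier(today - timedelta(days=d), today) for d in (5, 60, 200)]
# tiers == ["hot", "warm", "cold"]
```

Automating this assignment at ingest or via scheduled lifecycle jobs is what lets the lake stay affordable as volumes grow.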
Graylog Security and Graylog Cloud: Cost-effective security data management
Graylog Cloud provides centralized storage of event log data for up to 90 days and archives data for up to 1 year, enabling you to optimize your security data while still meeting data retention compliance requirements. Using Graylog Cloud with Graylog Security enables you to create a cost-effective, scalable security data management solution that integrates security analytics with built-in security content to improve your threat detection and incident response (TDIR) capabilities.
Our data routing ensures optimal storage by using index field type profiles to create hot, warm, and cold storage tiers. Our warm storage tier enables less expensive remote or on-premises storage options while still delivering the same lightning-fast, robust search experience as data in hot storage. Take advantage of the Data Warehouse to route data that does not need to live in hot storage into lower-cost storage for future use or compliance.
To learn how Graylog can streamline security data management and improve total cost of ownership, contact us today.