Creating an Incident Response Process

In today’s cyber threat landscape, almost everyone is one bad day away from a security incident. While not every incident becomes a data breach, security teams need to be prepared for anything. Just like that one friend who has a spreadsheet to help them organize the minute tasks associated with a project, security teams need to have a prepared list of steps to take during an incident.

While your incident response plan acts as the overarching strategy, your incident response process outlines the steps and documentation necessary for investigating, containing, remediating, and recovering from an incident more efficiently.

What is an incident response plan?

An incident response plan is a strategic, step-by-step guide that defines security analyst responsibilities and outlines actions they take during an incident. The incident response plan will include policies and procedures for:

Preparation: testing processes
Identification: building detections
Investigation: tracing suspicious activity to compromised assets
Containment: strategies for preventing additional damage
Remediation: identifying root cause and fixing impacted systems
Recovery: restoring affected systems to pre-incident state
Lessons learned: reviewing incident response for any potential areas of improvement

What are the differences between an incident response plan and incident response processes?

Although the incident response plan and the processes inform one another, they are different and serve different purposes. The plan provides a framework for how organizations define the processes.

The incident response processes are the ongoing, continuous activities for managing the incident response life cycle from risk assessment and incident handling to lessons learned and continuous improvement.

What are the primary incident response frameworks?

Incident response frameworks provide the structure for the organization’s processes. Both the National Institute of Standards and Technology (NIST) and the industry organization SANS Institute have frameworks based on cybersecurity best practices. Additionally, the OODA Loop, which stands for Observe, Orient, Decide, and Act, has gained popularity in recent years.

NIST Special Publication 800-61r3

In April 2024, NIST released the initial public draft of “Incident Response Recommendations and Considerations for Cybersecurity Risk Management,” the update to its 2012 “Computer Security Incident Handling Guide.”

To better align with NIST Cybersecurity Framework (CSF) 2.0 Functions, the agency updated the incident response framework by treating it as a continuous feedback loop rather than a linear process. The original NIST incident response framework outlined the life cycle as follows:

Preparation
Detection and analysis
Containment, eradication, and recovery
Post-incident activity

In the original model, a continuous feedback loop existed between the detection and analysis phase and the containment, eradication, and recovery phase. Security teams were expected to update the preparation stage only after reviewing post-incident activity, like lessons learned.

Under the new model based on the six CSF 2.0 functions, the incident response framework now clearly separates Preparation as the ongoing daily activities of a security program, consisting of the NIST CSF 2.0 Govern, Identify, and Protect functions.

The incident response lifecycle phases focus on the Detect, Respond, and Recover functions. However, unlike before, all functions continuously inform one another.

Within the Detect, Respond, and Recover functions, NIST focuses on more than thirty High Priority CSF elements.

SANS Institute Incident Handler’s Handbook

Although the SANS Institute Incident Handler’s Handbook is organized differently, it contains similar steps. SANS identifies the following parts of the incident life cycle:

Preparation: establishing policies, creating a plan/strategy, communicating the policies and processes, documenting the incident, assigning responsibilities, providing access controls, choosing tools, and training responsible parties
Identification: detecting anomalous activity in the environment and reporting it to the responsible parties
Containment: isolating the threat and affected systems to reduce impact
Eradication: removing malicious content from systems and remediating vulnerabilities that attackers exploited
Recovery: bringing affected systems back into the production environment after testing, monitoring, and validating them to verify that they have not been reinfected or otherwise compromised
Lessons learned: summarizing and reporting on who detected the incident, when they detected it, its scope, the process for containment and eradication, recovery activities, areas of effectiveness, and areas of improvement

The OODA Loop

Although John Boyd originally developed the OODA Loop for the military, security teams increasingly apply it to cybersecurity incidents as a decision-making framework. The OODA Loop focuses on four actions:

Observe: monitor for abnormal activity
Orient: apply threat intelligence to gain situational awareness
Decide: identify next steps to minimize damage and recover quickly
Act: contain threats, remediate affected systems, recover affected systems, review response effectiveness
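As a rough illustration of how the four actions chain together, here is a minimal Python sketch of a single pass through the loop. Everything in it is an assumption made for the example: the event fields (host, source_ip, failed_logins), the threshold of five failed logins, the threat intelligence set, and the two possible decisions. A real team would replace each function with its own tooling and criteria.

```python
# Hypothetical threat intelligence: an IP from a documentation-only range.
KNOWN_BAD_IPS = {"203.0.113.7"}


def observe(events):
    """Observe: keep only events that look abnormal (assumed threshold)."""
    return [e for e in events if e["failed_logins"] >= 5]


def orient(event, known_bad_ips):
    """Orient: enrich the event with threat intelligence context."""
    return {**event, "matches_intel": event["source_ip"] in known_bad_ips}


def decide(event):
    """Decide: pick the next step based on the enriched event."""
    return "isolate_host" if event["matches_intel"] else "monitor"


def act(event, decision):
    """Act: return a description of the action; a real system would execute it."""
    return f"{decision}:{event['host']}"


def ooda_pass(events, known_bad_ips):
    """Run one Observe -> Orient -> Decide -> Act pass over a batch of events."""
    actions = []
    for event in observe(events):
        enriched = orient(event, known_bad_ips)
        actions.append(act(enriched, decide(enriched)))
    return actions


sample_events = [
    {"host": "a", "source_ip": "203.0.113.7", "failed_logins": 9},
    {"host": "b", "source_ip": "198.51.100.2", "failed_logins": 6},
    {"host": "c", "source_ip": "203.0.113.7", "failed_logins": 1},
]
print(ooda_pass(sample_events, KNOWN_BAD_IPS))
```

Because the loop repeats, the output of one pass (for example, a newly isolated host) becomes part of what the next Observe step sees.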

Checklists to Help Security Teams Organize Incident Response Processes

With the right detections, you can identify a security incident. For example, if you build high-fidelity Sigma detections aligned to the MITRE ATT&CK framework, you can start investigating the incident, remediating the systems, and recovering faster.

The last three stages of the incident response lifecycle often require hands-on-keyboard tasks. Without well-defined processes, security teams often struggle to risk-rate alerts appropriately, which can cause cascading delays across the entire incident. The longer the triage, investigation, and containment phases take, the more time attackers have to do damage to the environment.

To get started thinking about what processes you need, you may want to consider the following checklists.

Triage and Investigation

During the triage and investigation phase, you need to make fast decisions about:

What the alert tells you
How much impact the potential incident could have
What caused the incident

During this phase, you should engage in the following activities:

Prioritizing incidents based on impact to continued business operations
Assigning responsibilities in a ticketing or tracking system to prove oversight and governance
Gathering forensic evidence to identify the incident’s root cause

Some critical activities include:

Performing initial incident analysis to understand the incident’s nature and potential impact
Assigning an incident severity level using predefined criteria to prioritize response activities
Creating a ticket or case in the incident tracking system to document activities
Gathering additional information through interviews with or documentation from impacted parties
Analyzing logs, system configurations, network traffic, and other data sources to identify potential indicators of compromise (IoCs)
Following proper forensic procedures to collect and preserve evidence for future law enforcement or legal documentation
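To make "predefined criteria" concrete, the following Python sketch shows one way a team might turn triage answers into a severity level. The criteria, weights, and thresholds are purely illustrative assumptions, not values from NIST, SANS, or any product; each organization should define its own based on business impact.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class TriageInput:
    """Answers gathered during initial analysis (hypothetical criteria)."""
    business_critical_asset: bool  # does the incident touch a system the business depends on?
    sensitive_data_involved: bool  # is regulated or confidential data in scope?
    confirmed_malicious: bool      # has an analyst or tool confirmed malicious activity?
    affected_hosts: int            # number of systems known to be affected so far


def assign_severity(t: TriageInput) -> Severity:
    """Map triage answers to a severity level using assumed weights."""
    score = 0
    if t.business_critical_asset:
        score += 2
    if t.sensitive_data_involved:
        score += 2
    if t.confirmed_malicious:
        score += 1
    if t.affected_hosts > 10:  # assumed threshold for "widespread"
        score += 1

    if score >= 5:
        return Severity.CRITICAL
    if score >= 3:
        return Severity.HIGH
    if score >= 1:
        return Severity.MEDIUM
    return Severity.LOW


def prioritize(queue):
    """Order (ticket_id, TriageInput) pairs so the most severe come first."""
    return sorted(queue, key=lambda item: assign_severity(item[1]), reverse=True)
```

Writing the criteria down as code or as a table has the same benefit: two analysts looking at the same alert should arrive at the same severity.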

At minimum, security teams should collect incident data and documentation like:

Incident details
- Date of alert
- Time of alert
- Affected system/network
- Initial observations
Logs and alerts from security tools for information about suspicious or malicious activity, including data from:
- Identity and Access Management (IAM)
- Endpoint detection and response (EDR)
- User and Entity Behavior Analytics (UEBA)
- Operating systems
Network traffic data for insight into anomalies, including data from:
- Intrusion detection system (IDS)/intrusion prevention system (IPS)
- Firewall log data
- Network devices
- Data downloads and uploads
- Netflow logs
System configuration information, including from:
- Vulnerability scanner data
- Software-as-a-Service (SaaS) configurations
Threat intelligence and IoC information, including from:
- Government alerts, like from the Cybersecurity and Infrastructure Security Agency (CISA)
- Paid feeds
- Vendor vulnerability notices
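The checklist above maps naturally onto a structured record. The Python sketch below (3.9+) is one possible shape for it, assuming four evidence categories that mirror the groups in the list; the class names, field names, and category labels are hypothetical and would normally follow whatever your ticketing or case management system uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumed category labels mirroring the checklist groups above.
REQUIRED_CATEGORIES = frozenset({"logs", "network", "configuration", "threat_intel"})


@dataclass
class EvidenceItem:
    category: str     # one of REQUIRED_CATEGORIES
    source: str       # e.g. "EDR", "IDS/IPS", "firewall", "CISA alert"
    description: str  # what was collected and why it matters
    collected_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class IncidentRecord:
    alert_time: datetime         # date and time of the alert
    affected_systems: list[str]  # affected systems or networks
    initial_observations: str
    evidence: list[EvidenceItem] = field(default_factory=list)

    def add_evidence(self, category: str, source: str, description: str) -> EvidenceItem:
        """Attach a piece of evidence, rejecting unknown categories."""
        if category not in REQUIRED_CATEGORIES:
            raise ValueError(f"unknown evidence category: {category}")
        item = EvidenceItem(category, source, description)
        self.evidence.append(item)
        return item

    def missing_categories(self) -> set[str]:
        """Return the checklist groups that have no evidence yet."""
        return set(REQUIRED_CATEGORIES) - {e.category for e in self.evidence}
```

A check like missing_categories() gives the incident lead a quick view of which data sources still need to be pulled before the investigation moves on.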

Containment and Eradication

During the containment and eradication phase, you should engage in the following activities:

Isolating the threat
Removing malware
Deleting compromised accounts or account data
Updating vulnerable systems with security patches

Some critical activities include:

Isolating the affected systems or network from the rest of the environment
Disconnecting infected devices from the network or disabling their network access
Rerouting or filtering network traffic
Closing vulnerable ports and mail servers
Scanning for malware and removing malicious software or files
Using a sandbox or system restore to eliminate malware threats
Identifying compromised user accounts and revoking access rights
Changing passwords for affected accounts
Implementing multi-factor authentication for affected accounts
Identifying exploited vulnerabilities and applying security updates

At minimum, security teams should collect incident data and documentation like:

Details about containment and eradication activities
- Date and time for each action
- Systems that actions were taken on
- What actions were taken and why
Record of malware analysis, including:
- Malware type
- Malware behavior
- Remediation activities taken
Record of remediation activities, including:
- Device or system updated
- Security updates applied to device or system
- Date when each security update was applied to the device or system
Changes made to user accounts, including:
- Affected user IDs
- Changes to passwords
- Changes to user privileges
Communications between team members, including:
- Incident reports
- Emails
- Meeting minutes
Network and system logs showing containment and eradication activities
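One lightweight way to capture the date, system, action, and reason for each containment step is an append-only action log. The Python sketch below is an assumption about how such a log could look, not a feature of any particular tool; the class names, field names, and the JSON Lines export are all illustrative choices.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ContainmentAction:
    timestamp: str  # ISO 8601 timestamp in UTC
    system: str     # system the action was taken on
    action: str     # what was done
    reason: str     # why it was done
    analyst: str    # who did it


class ActionLog:
    """Append-only record of containment and eradication actions."""

    def __init__(self):
        self._entries = []

    def record(self, system, action, reason, analyst, when=None):
        """Add an entry; 'when' defaults to the current UTC time."""
        when = when or datetime.now(timezone.utc)
        entry = ContainmentAction(when.isoformat(), system, action, reason, analyst)
        self._entries.append(entry)
        return entry

    def for_system(self, system):
        """Return every action taken on one system, in the order recorded."""
        return [e for e in self._entries if e.system == system]

    def to_jsonl(self):
        """Serialize the log as JSON Lines for archiving or later review."""
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self._entries)
```

Recording the reason alongside each action matters most during the lessons learned review, when the team needs to reconstruct why a system was isolated or an account was disabled.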

Graylog: Streamline Your Incident Response Processes

With Graylog Security, you can use prebuilt content to map security events to MITRE ATT&CK. By combining Sigma rules and MITRE ATT&CK, you can create high-fidelity alerting rules that enable robust threat detection, lightning-fast investigations, and streamlined threat hunting. For example, with Graylog’s security analytics, you can monitor user activity for anomalous behavior indicating a potential security incident. By mapping this activity to the MITRE ATT&CK Framework, you can detect and investigate adversary attempts at using Valid Accounts to gain Initial Access, mitigating risk by isolating compromised accounts earlier in the attack path and reducing impact.

Graylog’s risk scoring capabilities enable you to streamline your threat detection and incident response (TDIR) by aggregating and correlating the severity of the log message and event definitions with the associated asset, reducing alert fatigue and allowing security teams to focus on high-value, high-risk issues.

Jeff Darrington

Jeff Darrington is Graylog's Director, Technical Marketing. He is a long-time Graylog OS user with extensive experience in IT Operations, IT product solutions deployment in Firewalls, Networking, VOIP, Physical security Controls, and many others.
