Monitoring with AWS CloudWatch

For organizations that run workloads on AWS, CloudWatch is a “must-have.” Operations teams use data from their AWS environments to identify service issues and maintain availability. With CloudWatch and its associated products, organizations can monitor AWS resource usage and optimize their environments. However, while CloudWatch integrates with some third-party services, its native capabilities are often limited, so many organizations look for other ways to gain comprehensive visibility.

Monitoring with AWS CloudWatch is critical to managing applications and resources, but you should understand its use cases and limitations.

What is CloudWatch in AWS?

CloudWatch is an Amazon Web Services (AWS) service that provides observability so that IT teams gain real-time insights about operational health, including:

Application monitoring
Response to performance changes
Resource use optimization

With CloudWatch, organizations can:

Monitor AWS resources
Track custom metrics
Aggregate system, application, and custom AWS log files
Create alarms based on metrics
Build workflows that automate responses to resource changes

Operations and security teams use CloudWatch to identify and resolve potential issues before they impact system-wide performance.
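
As a rough illustration of what tracking a custom metric involves, the Python sketch below builds a request payload shaped like the input to the CloudWatch PutMetricData API. The namespace `MyApp/Checkout`, the metric name `FailedPayments`, and the helper `build_metric_payload` are hypothetical, and nothing in the sketch contacts AWS.

```python
from datetime import datetime, timezone


def build_metric_payload(namespace, name, value, unit="Count", dimensions=None):
    """Build keyword arguments shaped like a PutMetricData request.

    Illustrative helper only; it does not send anything to AWS.
    """
    datum = {
        "MetricName": name,
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
        # Dimensions are name/value pairs that identify what the metric describes.
        "Dimensions": [
            {"Name": key, "Value": val} for key, val in (dimensions or {}).items()
        ],
    }
    return {"Namespace": namespace, "MetricData": [datum]}


payload = build_metric_payload(
    "MyApp/Checkout",   # hypothetical namespace
    "FailedPayments",   # hypothetical metric name
    3,
    dimensions={"Environment": "staging"},
)
print(payload["Namespace"], payload["MetricData"][0]["MetricName"])
```

Assuming the boto3 SDK and valid credentials are available, a payload like this could be passed to `boto3.client("cloudwatch").put_metric_data(**payload)`.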

What is the difference between CloudWatch and CloudTrail?

Although CloudWatch and CloudTrail are both AWS-supplied monitoring tools, they address different use cases and therefore provide different data.

Purpose

While both enable visibility, they serve distinct purposes:

CloudWatch: central logging and monitoring service that is activated by default
CloudTrail: auditing service for resource changes made by users and applications; it keeps a short event history by default, but ongoing delivery of audit logs requires creating a trail

Data collected

Based on their purpose, the two services collect different data:

CloudWatch: metrics and log files with information about the activity and performance of AWS services and resources
CloudTrail: API calls from AWS Console, CLI, third-party applications, and other AWS Services with details about the request, response, and user identity

Data delivery

Although both provide data about cloud activity, their delivery times differ significantly:

CloudWatch: log data updated every five seconds with metrics data delivered in 1-minute or 5-minute periods
CloudTrail: delivery within fifteen minutes of the API call

CloudWatch Features and Capabilities

CloudWatch offers various features and capabilities that enable customers to gain deeper insights, automate actions, and manage resources.

CloudWatch Logs

CloudWatch Logs centralizes AWS system, application, and service logs, making them easy to view, search, filter, and archive. Its primary features include:

CloudWatch Logs Insights: interactive log data search and analysis
Live Tail: a streaming list of new log events that users can view, filter, and highlight in real time to detect and resolve issues
Amazon EC2 instance monitoring: log data for applications and systems that can be turned into customized metrics
AWS CloudTrail event monitoring: integration with CloudTrail for notifications about defined API activity
Data protection policies: auditing and masking sensitive data in logs based on defined data identifiers
Log retention: storing logs indefinitely or for a retention period that matches compliance requirements
Archiving: sending rotated and non-rotated log data off host and into the log service
Route 53 DNS queries: log information about public DNS queries that Route 53 receives
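
To give a sense of what interactive search with CloudWatch Logs Insights looks like, the sketch below assembles a simple query string in Python. The helper `build_insights_query` is hypothetical; the query uses the `fields`, `filter`, `sort`, and `limit` commands of the Logs Insights query language, and running it against a real log group would require the AWS console or an SDK.

```python
def build_insights_query(pattern, limit=20):
    """Return a Logs Insights query that finds the most recent matching events."""
    # Escape forward slashes so the pattern stays inside the /.../ delimiters.
    safe = pattern.replace("/", "\\/")
    return " | ".join([
        "fields @timestamp, @message",
        f"filter @message like /{safe}/",
        "sort @timestamp desc",
        f"limit {limit}",
    ])


query = build_insights_query("ERROR", limit=5)
print(query)
# fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 5
```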

CloudWatch Metrics

CloudWatch Metrics fall into two categories:

Basic monitoring: default setting provided at no extra charge
Detailed monitoring: additional monitoring available for some services, which incurs charges

Amazon EC2 sends the following categories of metrics to CloudWatch:

Instance metrics
CPU credit metrics
Dedicated Host metrics
Amazon EBS metrics for Nitro-based instances
Status check metrics
Traffic mirroring metrics
Auto Scaling group metrics
Amazon EC2 usage metrics

Amazon EventBridge

Formerly called CloudWatch Events, EventBridge is the updated version that enables users to connect applications with data from various sources, including internally built applications, Software-as-a-Service (SaaS) applications, and AWS services.

EventBridge processes events in two ways:

Event buses: receive events and deliver them to various targets
Pipes: receive events from a single source and deliver them to a single target

Often, organizations combine buses and pipes.
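
Rules on an event bus decide which events reach a target by comparing each event against an event pattern. The Python sketch below models that idea in a deliberately simplified way: it handles only exact-value matching, whereas EventBridge itself supports additional operators. The `matches` function and the instance ID are hypothetical.

```python
def matches(pattern, event):
    """Return True if the event satisfies a simplified event pattern.

    Only exact-value matching is modeled: a list in the pattern means
    "the event field must equal one of these values", and a nested dict
    is matched recursively.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Pattern for EC2 instances that enter the "stopped" state.
pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}

event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},  # made-up ID
}

print(matches(pattern, event))  # True
```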

CloudWatch Alarms

AWS offers two types of alarms:

Metric alarm: watches a single CloudWatch metric, or the result of a math expression based on metrics, and performs one or more actions when the value crosses a threshold over a number of time periods
Composite alarm: a rule expression that evaluates the states of other metric and composite alarms

When creating an alarm, users must specify three settings:

Period: length of time, in seconds, over which the metric is evaluated to create each data point
Evaluation Periods: number of most recent periods, or data points, evaluated when determining the alarm state
Datapoints to Alarm: number of data points within the Evaluation Periods that must breach the threshold to move the alarm into the ALARM state
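
The way these three settings interact can be sketched in a few lines of Python. This is a simplified model for a greater-than-threshold alarm, not CloudWatch's actual evaluation logic: it assumes one aggregated value per period and ignores CloudWatch's configurable handling of missing data. The function name and sample values are hypothetical.

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return "ALARM", "OK", or "INSUFFICIENT_DATA" for a greater-than alarm.

    `datapoints` holds one aggregated value per period, oldest first.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        # Simplification: too few data points to evaluate the window.
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in recent if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Average CPU utilization per period, oldest first (made-up values).
cpu = [42.0, 55.0, 91.0, 88.0, 63.0, 95.0]

# 3 of the last 5 data points exceed 80, so a "3 out of 5" alarm fires.
print(alarm_state(cpu, threshold=80.0, evaluation_periods=5, datapoints_to_alarm=3))  # ALARM
print(alarm_state(cpu, threshold=80.0, evaluation_periods=5, datapoints_to_alarm=4))  # OK
```

Setting Datapoints to Alarm lower than Evaluation Periods, as in the first call, lets the alarm tolerate brief recoveries inside the window while still reacting to sustained pressure.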

CloudWatch Dashboards

Dashboards are customizable home pages in the CloudWatch console that provide visibility into items such as:

Metrics and alarms to assess resources and applications
Operational playbooks to help teams respond to incidents
Critical resource and application measurements

Organizations can use dashboards to gain cross-account, cross-Region observability. With this customized view, teams can share dashboards that display real-time data, which improves cross-functional communication.
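
A dashboard is defined by a JSON document that lists its widgets. As a minimal sketch, the Python below builds such a document with one metric widget per Region. The `metric_widget` helper, the instance IDs, and the Region choices are hypothetical, and the widget layout is only one of many possible arrangements.

```python
import json


def metric_widget(title, region, metric, x=0, y=0):
    """Return one metric widget definition for a dashboard body."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "region": region,     # each widget can point at its own Region
            "metrics": [metric],  # [namespace, metric name, dimension name, dimension value]
            "period": 300,
            "stat": "Average",
        },
    }


dashboard_body = json.dumps({
    "widgets": [
        metric_widget(
            "CPU (us-east-1)", "us-east-1",
            ["AWS/EC2", "CPUUtilization", "InstanceId", "i-0123456789abcdef0"],  # made-up ID
        ),
        metric_widget(
            "CPU (eu-west-1)", "eu-west-1",
            ["AWS/EC2", "CPUUtilization", "InstanceId", "i-0fedcba9876543210"],  # made-up ID
            x=12,
        ),
    ]
}, indent=2)
print(dashboard_body)
```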

Challenges of Using CloudWatch

Many companies use CloudWatch because it’s a service included with their AWS account that enables them to:

Create visualizations that make monitoring their AWS environments easier
Improve total cost of ownership by automating activities
Optimize applications and resources
Gain insight into key issues like CPU, capacity, and memory utilization

However, despite these benefits, many organizations struggle to use CloudWatch effectively.

Cost

Although CloudWatch is a native tool, it becomes increasingly expensive as the organization’s environment grows, making large-scale monitoring and logging cost-inefficient. For example, organizations on the free tier are limited to:

Basic monitoring metrics
10 detailed metrics
1 million API requests
10 alarm metrics
3 Custom Dashboards
5 GB of data, including ingestion, archiving, and data scanned by queries
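
To see how quickly a growing environment outruns these allowances, the sketch below compares a month of usage against the limits listed above. The usage figures are invented for illustration, the limits are copied from this list rather than from current AWS pricing, and the sketch does not attempt to calculate cost.

```python
# Limits as listed above; treat them as a snapshot, not current AWS pricing.
FREE_TIER_LIMITS = {
    "detailed_metrics": 10,
    "api_requests": 1_000_000,
    "alarm_metrics": 10,
    "custom_dashboards": 3,
    "log_data_gb": 5,
}


def over_free_tier(usage):
    """Return each usage category that exceeds its limit, with the overage."""
    return {
        name: used - FREE_TIER_LIMITS[name]
        for name, used in usage.items()
        if name in FREE_TIER_LIMITS and used > FREE_TIER_LIMITS[name]
    }


monthly_usage = {  # hypothetical numbers for a mid-sized environment
    "detailed_metrics": 25,
    "api_requests": 400_000,
    "alarm_metrics": 12,
    "custom_dashboards": 3,
    "log_data_gb": 40,
}
print(over_free_tier(monthly_usage))
# {'detailed_metrics': 15, 'alarm_metrics': 2, 'log_data_gb': 35}
```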

Query limitations

Although CloudWatch Metrics Insights enables you to query metric data, its limits create challenges. For example, Amazon documents the following limits:

Ability to query only the most recent three hours of data
Inability to process more than 10,000 metrics with a single query
Limit of 500 time series returned by a single query
Limit of 75 Metrics Insights alarms per Region
No support for high-resolution data
One query per GetMetricData operation
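
The sketch below shows how two of these limits might be handled in client code: it rejects a start time older than three hours and always builds a request containing exactly one Metrics Insights query. The helper name and the `q1` identifier are hypothetical, the three-hour value is taken from the list above, and the request is only constructed, not sent.

```python
from datetime import datetime, timedelta, timezone

MAX_LOOKBACK = timedelta(hours=3)  # limit as stated in the list above


def build_metrics_insights_request(query, start, end, period=60):
    """Build GetMetricData-style arguments for a single Metrics Insights query."""
    now = datetime.now(timezone.utc)
    if now - start > MAX_LOOKBACK:
        raise ValueError("start time is older than the three-hour query window")
    if end <= start:
        raise ValueError("end time must be after start time")
    return {
        # Exactly one entry, matching the one-query-per-operation limit.
        "MetricDataQueries": [{"Id": "q1", "Expression": query, "Period": period}],
        "StartTime": start,
        "EndTime": end,
    }


# Ten EC2 instances with the highest average CPU utilization.
query = (
    'SELECT AVG(CPUUtilization) FROM SCHEMA("AWS/EC2", InstanceId) '
    "GROUP BY InstanceId ORDER BY AVG() DESC LIMIT 10"
)
now = datetime.now(timezone.utc)
request = build_metrics_insights_request(query, now - timedelta(hours=1), now)
print(request["MetricDataQueries"][0]["Expression"])
```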

Resource intensive

To track resource use with CloudWatch, organizations need to install the CloudWatch Agent on their servers. However, the CloudWatch Agent can be resource-intensive and consume significant CPU for several reasons, including:

Use of wildcard symbols when monitoring a large number of files
Collecting too many metrics during a timeframe
Collecting too many metrics across various processes, filters, and patterns with the procstat plugin
Monitoring too many large-sized log files without rotating them
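
One way to catch the first of these causes before deployment is to scan the agent configuration for wildcard log paths. The sketch below uses a Python dictionary shaped like a CloudWatch Agent configuration file; the file paths, log group names, and the `wildcard_log_paths` helper are made up for illustration.

```python
# Fragment shaped like a CloudWatch Agent configuration; paths and names are made up.
agent_config = {
    "agent": {"metrics_collection_interval": 60},
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {"file_path": "/var/log/app/*.log", "log_group_name": "app-logs"},
                    {"file_path": "/var/log/nginx/access.log", "log_group_name": "nginx-access"},
                ]
            }
        }
    },
    "metrics": {
        "metrics_collected": {
            # procstat collects per-process metrics for processes matching the pattern.
            "procstat": [{"pattern": "nginx", "measurement": ["cpu_usage", "memory_rss"]}]
        }
    },
}


def wildcard_log_paths(config):
    """Return the monitored log paths that contain a wildcard."""
    entries = (
        config.get("logs", {})
        .get("logs_collected", {})
        .get("files", {})
        .get("collect_list", [])
    )
    return [entry["file_path"] for entry in entries if "*" in entry["file_path"]]


print(wildcard_log_paths(agent_config))  # ['/var/log/app/*.log']
```

A similar check could count the procstat entries and measurements to flag configurations that collect more process metrics than needed.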

Graylog: Reduced Cost and Time for Monitoring AWS CloudWatch Data

Graylog’s solution enables you to gain the full value of your CloudWatch logs by helping you overcome the challenges associated with them. Using the AWS Kinesis/CloudWatch input, you can stream CloudWatch Logs, CloudWatch Flow logs, and Kinesis Raw Logs to Graylog for comprehensive AWS monitoring.

By leveraging Graylog Cloud, you can reliably handle the occasional spikes in data, allowing you to monitor your AWS environments consistently. Further, by aggregating all log data across your environment in Graylog, you can correlate events for more meaningful insights. For example, by correlating load balancer and CloudWatch application logs, you can visualize distribution across Availability Zones and source IPs for more precise availability monitoring and faster remediation.

To learn how Graylog can help you gain greater insight into your environment, contact us today.

Jeff Darrington

Jeff Darrington is Graylog's Director, Technical Marketing. He is a long-time Graylog OS user with extensive experience in IT operations and in deploying IT product solutions for firewalls, networking, VoIP, physical security controls, and many other areas.
