Developing an application is like composing a song. You know your intended outcome, and the creation is what gives you the jolt of adrenaline to keep going. However, your job isn’t over once you push the application live. You need to monitor and maintain it to ensure performance and cost optimization.
AWS Lambda forwards metrics to CloudWatch once a function finishes processing an event. Through the CloudWatch console, you can set alarms and build visualizations with these metrics. By understanding the key AWS Lambda metrics available, you can more effectively monitor and maintain your applications to ensure continued cost optimization and a positive end-user experience.
What are the different types of Lambda metrics?
Since AWS manages the memory, CPU, network, and other resources that run your code, managing costs directly relates to your application’s performance and efficiency. To help you manage your Lambda functions, AWS provides basic serverless monitoring through the CloudWatch platform.
The four primary metrics that you can monitor are:
- Invocation metrics: the number of times a function or service is called for assessing usage patterns and understanding demands on the service
- Performance metrics: how well a function operates under different conditions for insights about service efficiency and reliability
- Concurrency metrics: number of simultaneous executions for understanding system capability and ability to effectively scale during peak load
- Asynchronous invocation metrics: how services process requests that happen at different times for tracking successful executions, failed executions, and processing times
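All of these metrics are published to the AWS/Lambda namespace in CloudWatch, so you can pull them programmatically as well as through the console. As a minimal sketch (assuming boto3 is available and using a hypothetical function name), the parameters below could be passed to CloudWatch’s GetMetricStatistics API:

```python
from datetime import datetime, timedelta, timezone

def lambda_metric_request(function_name, metric_name, hours=1):
    """Build GetMetricStatistics parameters for one Lambda metric.

    Pass the resulting dict to a CloudWatch client, e.g.
    boto3.client("cloudwatch").get_metric_statistics(**params).
    """
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Lambda",      # all Lambda metrics live in this namespace
        "MetricName": metric_name,      # e.g. "Invocations", "Errors", "Throttles"
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,                  # aggregate into 5-minute buckets
        "Statistics": ["Sum"],
    }

# "my-function" is a placeholder for one of your own Lambda functions
params = lambda_metric_request("my-function", "Invocations")
```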
3 Important Function Invocation Metrics
Each function invocation represents a request to execute your code. These metrics provide insight into service reliability and operational efficiency by helping you understand resource allocation and load-balancing needs.
Invocations
Invocations is the total number of times that a function executes or “is called.” Although this metric doesn’t distinguish between successful and failed invocations, it provides insights into:
- Function performance
- Usage periods
- Traffic trends
- Resource allocation
- Cost optimization
For example, if the invocation count drops, you might have underlying system architecture or dependency issues.
Errors
Errors tracks the number of invocations that result in a function error; dividing this count by the total invocation count gives you the error rate. This metric can help identify issues like:
- Timeouts
- Configuration errors
- Coding exceptions
For example, a sudden error rate spike could indicate a Lambda function issue or a misconfigured AWS service.
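To track the error rate rather than the raw count, you can combine the Errors and Invocations metrics with CloudWatch metric math. A minimal sketch of the queries you might pass to the GetMetricData API (e.g. via boto3’s get_metric_data); the function name is hypothetical:

```python
def error_rate_queries(function_name):
    """GetMetricData queries computing error rate = Errors / Invocations * 100."""
    def metric(metric_name, query_id):
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,  # only return the derived rate, not the raw series
        }
    return [
        metric("Errors", "errors"),
        metric("Invocations", "invocations"),
        {
            "Id": "error_rate",
            "Expression": "(errors / invocations) * 100",  # metric math expression
            "Label": "Error rate (%)",
            "ReturnData": True,
        },
    ]

queries = error_rate_queries("my-function")  # placeholder function name
```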
Dead-letter errors
Dead-letter errors are events that Lambda failed to send to a dead-letter queue (DLQ), the temporary storage location for messages that a system cannot process. Sending failed events to the DLQ enables you to keep and analyze failed event data to identify underlying issues. Dead-letter errors can help you identify problems that might not be easily traced.
These errors may raise concerns about potential data loss and can provide insight into issues like:
- Improper permissions
- Misconfigured resources
- Message size limits
8 Important Performance Metrics
Performance metrics help you evaluate your Lambda function’s event-processing effectiveness. With the insights these metrics generate, you can optimize your Lambda functions to enhance overall application performance.
Duration
Duration measures the time (in milliseconds) from the invocation of an AWS Lambda function until it completes execution. This metric supports percentile statistics so you can filter out outliers to better understand performance.
Monitoring this metric can help you:
- Identify code inefficiencies
- Detect latency issues with external dependencies
- Evaluate the effectiveness of memory allocation
Billed Duration
Billed duration rounds the Duration metric up to the nearest millisecond and directly impacts costs. Regularly monitoring billed duration enables you to understand how optimizing performance can improve cost management.
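To see why billed duration matters for cost, note that Lambda compute charges scale with GB-seconds: allocated memory multiplied by billed seconds. A rough sketch, using an illustrative (not quoted) per-GB-second rate:

```python
def invocation_cost(billed_ms, memory_mb, rate_per_gb_second=0.0000166667):
    """Estimate the compute cost of a single invocation.

    rate_per_gb_second is illustrative only; check current AWS pricing
    for your region and architecture.
    """
    gb_seconds = (memory_mb / 1024) * (billed_ms / 1000)  # GB allocated x seconds billed
    return gb_seconds * rate_per_gb_second

# A 512 MB function billed for 120 ms consumes 0.06 GB-seconds
cost = invocation_cost(120, 512)
```

Halving either the memory allocation or the billed duration halves the estimated cost, which is why these two knobs dominate Lambda cost tuning.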
Init duration
Captured separately from the overall duration metric, init duration uses milliseconds to measure how long a function takes to initialize during a cold start, indicating how long the runtime environment takes to prepare for execution.
Monitoring init duration helps:
- Optimize function code or configurations to improve startup speed
- Identify potential bottlenecks related to function initialization
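Beyond the CloudWatch metric, each cold start’s init duration also appears in the REPORT summary line that Lambda writes to CloudWatch Logs. A small sketch that extracts it (the request ID and timing values below are hypothetical sample data):

```python
import re

def init_duration_ms(report_line):
    """Return the Init Duration from a Lambda REPORT log line.

    Returns None for warm starts, where no Init Duration is reported.
    """
    match = re.search(r"Init Duration: ([\d.]+) ms", report_line)
    return float(match.group(1)) if match else None

# Hypothetical cold-start REPORT line
line = ("REPORT RequestId: 3f8a0000-0000-0000-0000-000000000000 "
        "Duration: 102.25 ms Billed Duration: 103 ms "
        "Memory Size: 128 MB Max Memory Used: 52 MB "
        "Init Duration: 213.55 ms")
print(init_duration_ms(line))  # 213.55
```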
Memory size and max memory used
Typically, you want to use these two metrics together to understand how your function uses – or wastes – memory. Memory size uses megabytes to indicate the total memory allocated to a function. Max memory used refers to the peak memory consumption during a function’s invocation, also expressed in megabytes.
Monitoring these metrics can provide insights into:
- Excessive memory, resulting in wasted resources and increased costs
- Insufficient memory allocation, resulting in prolonged execution times
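The comparison between the two metrics can be collapsed into a single utilization percentage; a minimal sketch (the interpretation thresholds you choose are up to you, not AWS guidance):

```python
def memory_utilization(max_memory_used_mb, memory_size_mb):
    """Percentage of allocated memory the function actually used at its peak."""
    return 100.0 * max_memory_used_mb / memory_size_mb

# A 128 MB function peaking at 52 MB uses ~40% of its allocation:
# likely room to downsize and save cost.
util = memory_utilization(52, 128)
```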
Post runtime extensions duration
This metric uses milliseconds to track the time that Lambda extensions spend after the function handler finishes executing. It provides insight into the additional time extensions introduce when performing tasks like:
- Sending logs, metrics, or traces to external services
- Cleaning up resources
- Interacting with other AWS resources
Monitoring this metric enables you to optimize performance by:
- Identifying where extensions add too much time
- Fine-tuning serverless applications to enhance functionalities
Iterator age
Iterator age uses milliseconds to measure the time between when records arrive and when a function processes them, which is especially important for streaming sources like Kinesis or DynamoDB. A high iterator age indicates that incoming data volumes surpass the function’s processing capability, creating a backlog of unprocessed records.
Monitoring this metric can help identify issues like:
- Prolonged execution duration of the function
- Insufficient stream shards
- Invocation errors
Latency (P50, P90, P99)
Latency is monitored using percentile distributions because they represent the spread of response times more accurately than averages, capturing the end-user experience more effectively.
The three most common percentile metrics are:
- P50: 50th percentile, a baseline for typical performance
- P90: 90th percentile, often triggering performance issue alerts
- P99: 99th percentile, insights into extreme outliers
You can use these metrics to identify issues that might not otherwise be apparent, enabling more effective troubleshooting and optimization.
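CloudWatch computes these percentiles for you (for example, via extended statistics on the Duration metric), but the underlying idea is easy to sketch with Python’s standard library. The latency samples below are hypothetical:

```python
import statistics

def latency_percentiles(samples_ms):
    """Compute p50/p90/p99 from raw latency samples.

    statistics.quantiles with n=100 returns the 99 percentile cut points
    using the default "exclusive" method.
    """
    qs = statistics.quantiles(samples_ms, n=100)
    return {"p50": qs[49], "p90": qs[89], "p99": qs[98]}

# Hypothetical latencies of 1..100 ms, one sample each
samples = list(range(1, 101))
p = latency_percentiles(samples)
```

Note how p99 sits near the very top of the distribution: a handful of slow outliers can move it dramatically while leaving p50 untouched, which is exactly why it is used to surface extreme cases.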
Offset Lag
Offset Lag is a metric specific to Amazon MSK and self-managed Apache Kafka when they are event sources for Lambda functions. By measuring the total number of messages waiting in the queue to be sent to a target Lambda function, it provides visibility into how well polling keeps up.
Monitoring this metric can help identify issues like:
- Undesirable congestion
- Inefficiencies in data streams
8 Important Concurrency Metrics
Concurrency metrics provide insight into how many simultaneous requests a function is handling. Lambda provisions a separate execution environment instance for each concurrent request, increasing the number of execution environments until you reach your account’s concurrency limit.
Concurrent executions
This metric is the total number of function instances actively processing events. By showing how close you are to your regional or reserved concurrency limit, it can help prevent throttling that can undermine other functions’ performance.
Properly managing concurrent executions enables efficient resource allocation and improves applications’ responsiveness.
Unreserved concurrent executions
Unreserved concurrent executions tracks the total concurrency that remains unallocated, providing insight into resource distribution during peak workloads. Consistently depleting unreserved concurrency may indicate function or workload inefficiencies.
For example, if specific functions regularly exhaust the available unreserved concurrency during traffic spikes, you may need to distribute the workload more evenly across multiple functions to enhance performance.
Claimed Account Concurrency
This metric provides visibility into the total concurrent executions across all functions in your account. If traffic surpasses the established concurrency limit, some requests may be throttled, undermining reliability and efficiency.
Throttles
The Throttles metric tracks the total number of invocation requests that are rejected due to a lack of available function instances or an exceeded concurrent execution limit. Throttled invocation requests are not reflected in the standard Invocations or Errors metrics.
By monitoring this metric, you can take actions that improve application reliability and speed, like:
- Reserving concurrency
- Optimizing execution time
- Increasing the concurrent execution limit
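Because throttled requests don’t show up in the Invocations or Errors metrics, an explicit alarm is a common safeguard. As a sketch, the parameters below could be passed to CloudWatch’s PutMetricAlarm API (e.g. via boto3’s put_metric_alarm); the function name, alarm name, and threshold are illustrative:

```python
def throttle_alarm_params(function_name):
    """Build PutMetricAlarm parameters that fire when any invocation is throttled.

    Pass to boto3.client("cloudwatch").put_metric_alarm(**params).
    """
    return {
        "AlarmName": f"{function_name}-throttles",  # illustrative naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "Throttles",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 0,                              # alert on any throttle at all
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",          # no data points means no throttles
    }

params = throttle_alarm_params("my-function")  # placeholder function name
```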
Provisioned concurrent executions
If you have provisioned concurrency enabled, this metric tells you the number of function instances actively processing events for a specific version or alias. Monitoring this metric can provide insights into underlying issues with the Lambda function or its dependencies on other services.
Provisioned concurrency utilization
Provisioned concurrency utilization assesses how effectively the function uses its allocated provisioned concurrency. Monitoring this metric supports cost management with insights about:
- Low utilization that suggests reducing or disabling provisioned concurrency for cost savings
- Need for more provisioned concurrency if the function consistently reaches thresholds
- Issues with the function or upstream services
Provisioned concurrency invocations
Provisioned concurrency invocations measures the total executions of a Lambda function that runs on provisioned concurrency. This metric is distinct from standard invocation metrics in that it exclusively counts invocations operating on provisioned concurrency. Monitoring this metric can provide insights into:
- Peak demand periods
- Overall function performance to identify areas of optimization
- Potential cost savings from reducing or disabling provisioned concurrency
Provisioned concurrency spillover invocations
This metric tells you when a Lambda function exceeds its provisioned number of concurrent invocations. When a function exceeds this threshold, it runs on non-provisioned concurrency, increasing the likelihood of cold starts that impact performance and response times.
Monitoring this spillover metric can help identify:
- Ways to change configurations to better align with traffic demands
- Underlying issues with the function or an upstream service
- Opportunities for improving responsiveness and reliability
4 Important Asynchronous Invocation Metrics
In AWS Lambda, asynchronous invocation means that the invoking application can proceed without waiting for the function to finish executing, improving application performance. Some examples of asynchronous services include:
- Amazon Simple Email Service (SES)
- Amazon Simple Notification Service (SNS)
- Amazon S3
Asynchronous events received
This metric is the number of events that Lambda successfully queues for processing, giving you insight into the events that the function receives. If this metric and the invocations metric don’t match, you might want to look for issues like dropped events or potential queue backlogs.
Destination Delivery Failures
Delivery errors may occur during asynchronous invocations if Lambda cannot send events to their designated destinations, like the DLQ. These errors can occur for reasons like:
- Permission errors
- Misconfigured resources
- Size limitations
- Unsupported destination types
Asynchronous Event Age
This metric tracks the time an event spends waiting in the queue, providing insight into issues like:
- Incorrect triggers
- Function misconfigurations
- Throttling
Setting alarms for thresholds can help investigate when queue backlog occurs, especially when comparing this metric with:
- Errors: to identify function errors
- Throttles: to identify concurrency issues
Asynchronous Events Dropped
This metric tracks the total number of events that a Lambda function fails to process and ultimately discards. Some reasons a function might drop events include:
- Exceeding the maximum age
- Reaching the attempt retry limit
- Hitting a concurrency limit
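You can control two of these drop conditions directly through the function’s event invoke configuration. A sketch of parameters for Lambda’s PutFunctionEventInvokeConfig API (e.g. via boto3’s put_function_event_invoke_config); the function name and values shown are illustrative:

```python
def async_invoke_config(function_name, max_age_seconds=3600, max_retries=1):
    """Build PutFunctionEventInvokeConfig parameters for async event handling.

    Pass to boto3.client("lambda").put_function_event_invoke_config(**config).
    Lambda accepts an event age of 60-21600 seconds and 0-2 retry attempts.
    """
    return {
        "FunctionName": function_name,
        "MaximumEventAgeInSeconds": max_age_seconds,  # drop events older than this
        "MaximumRetryAttempts": max_retries,          # retries before discarding
    }

config = async_invoke_config("my-function")  # placeholder function name
```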
Graylog: Security and Operations Monitoring for Insights into AWS Environments
Using Graylog’s CloudWatch inputs, you can integrate your AWS Lambda monitoring directly into your overarching security and operations monitoring. Graylog’s purpose-built solution provides lightning-fast search capabilities and flexible integrations that allow your team to collaborate more efficiently.
Graylog Operations provides a cost-efficient solution for IT ops so that organizations can implement robust infrastructure monitoring while staying within budget. With our solution, IT ops can analyze historical data regularly to identify potential slowdowns or system failures while creating alerts that help anticipate issues.
Since you can easily share Dashboards and searches with Graylog’s cloud platform, you have the ability to capture, manage, and share knowledge consistently across DevOps, operations, and security.
With Graylog’s security analytics and anomaly detection capabilities, you get the cybersecurity platform you need without the complexity that makes your team’s job harder. With our powerful, lightning-fast features and intuitive user interface, you can lower your labor costs while reducing alert fatigue and getting the answers you need – quickly.
Our prebuilt search templates, dashboards, correlated alerts, and dynamic look-up tables enable you to get immediate value from your logs while empowering your security team.