November 27, 2023 By Michael Riordan and Greg Ziemiecki

Lightning-fast troubleshooting for AWS: How to find the root cause fast with Sumo Logic

It’s time to stop firefighting. With Sumo Logic’s AWS Observability, companies like Snoop have been able to simplify data collection, achieve unified visibility across AWS accounts and regions and leverage machine learning to troubleshoot — fast.

This re:Invent, we’re excited to showcase how our capabilities for AWS have evolved. Offering a unified approach to monitoring and troubleshooting for AWS, Sumo Logic lets DevOps and SRE teams improve the reliability of their services and cut troubleshooting toil in just a few clicks.

Looking for lightning-speed troubleshooting? Here’s how Sumo Logic can help you find the root cause and reclaim your time.

Your starting point: a unified view of your AWS environment

In the fast-paced world of e-commerce, timely order processing and inventory updates are crucial for maintaining customer satisfaction. But what happens when an efficient, serverless architecture starts showing intermittent delays? 

Here, the order processing and inventory update system for our e-commerce site uses Amazon SQS to queue orders, AWS Lambda for the core business logic, and Amazon RDS as the persistent data store. Customers report intermittent delays when placing orders and during checkout.

To understand what might be going wrong, you first need a centralized view of your AWS environment that brings together your relevant logs and metrics. With AWS Observability, you unlock a comprehensive view across your AWS accounts, regions and individual namespaces. This content is provided out of the box after deploying the solution via the CloudFormation template or Terraform.

Detecting issues with pre-built alerts

AWS Observability comes with pre-built alerts for different AWS services, including Amazon SQS, AWS Lambda, and Amazon RDS. These alerts can notify you about the issue with the e-commerce site. In our example, the “Amazon SQS - Message processing not fast enough” alert was triggered.

From the alert, you can determine the characteristics of the issue: how often it triggers, how long it has been unresolved, and other relevant details. You can also see how long messages wait in the queue before they are processed.

High-speed troubleshooting in action

Now, with this knowledge, the troubleshooting begins.

You start your investigation by diving into SQS, where messages from the Order Processing Service are queued. CloudWatch metrics for SQS provide the first clues.

You observe that the NumberOfMessagesSent is much higher than NumberOfMessagesReceived, indicating that messages are being queued faster than they are being consumed. The ApproximateAgeOfOldestMessage metric shows that some messages have been in the queue for a long time, which could indicate a bottleneck.
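The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the Sumo Logic product: the function name, the 300-second age threshold, and the sample values are all assumptions, and the inputs are expected to be per-period values you have already exported for the three CloudWatch metrics.

```python
from dataclasses import dataclass


@dataclass
class QueueHealth:
    backlog_growth: float      # messages sent minus messages received over the window
    oldest_message_age: float  # worst ApproximateAgeOfOldestMessage seen, in seconds
    falling_behind: bool       # True when the backlog grows and messages are getting old


def assess_queue(sent, received, oldest_age_seconds, max_age_seconds=300.0):
    """Compare per-period SQS metric values taken over the same time window.

    sent / received: per-period sums of NumberOfMessagesSent / NumberOfMessagesReceived.
    oldest_age_seconds: per-period maximums of ApproximateAgeOfOldestMessage.
    max_age_seconds: hypothetical threshold; tune it to your own latency target.
    """
    backlog_growth = sum(sent) - sum(received)
    oldest = max(oldest_age_seconds, default=0.0)
    falling_behind = backlog_growth > 0 and oldest > max_age_seconds
    return QueueHealth(backlog_growth, oldest, falling_behind)


# Made-up sample values for three consecutive periods.
health = assess_queue(
    sent=[1200, 1350, 1400],
    received=[900, 880, 860],
    oldest_age_seconds=[120.0, 410.0, 780.0],
)
print(health)
```

A positive backlog growth combined with a rising oldest-message age is the pattern the paragraph above describes: producers are outpacing consumers.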

Next, you turn your attention to AWS Lambda, responsible for processing SQS messages to update your inventory. Log entries give evidence of prolonged function execution and timeouts, suggesting potential issues with the Lambda function's efficiency or resource allocation.
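To make the Lambda evidence concrete, here is a rough Python sketch that scans log lines for timeouts and for invocations running close to their memory limit. It assumes the standard REPORT line and the "Task timed out" message that Lambda writes to CloudWatch Logs; the sample lines, request IDs, and numbers are invented for illustration, and the function name is hypothetical.

```python
import re

# REPORT line that Lambda emits at the end of each invocation (fields are tab-separated).
REPORT_RE = re.compile(
    r"REPORT RequestId:\s*(?P<request_id>\S+)\s+"
    r"Duration:\s*(?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration:\s*[\d.]+ ms\s+"
    r"Memory Size:\s*(?P<memory_mb>\d+) MB\s+"
    r"Max Memory Used:\s*(?P<max_used_mb>\d+) MB"
)
TIMEOUT_RE = re.compile(r"Task timed out after (?P<seconds>[\d.]+) seconds")


def summarize_lambda_logs(lines):
    """Count timeouts and track the slowest run and the highest memory usage ratio."""
    timeouts = 0
    slowest_ms = 0.0
    peak_memory_ratio = 0.0
    for line in lines:
        if TIMEOUT_RE.search(line):
            timeouts += 1
        match = REPORT_RE.search(line)
        if match:
            slowest_ms = max(slowest_ms, float(match["duration_ms"]))
            ratio = int(match["max_used_mb"]) / int(match["memory_mb"])
            peak_memory_ratio = max(peak_memory_ratio, ratio)
    return {
        "timeouts": timeouts,
        "slowest_ms": slowest_ms,
        "peak_memory_ratio": peak_memory_ratio,
    }


# Illustrative log lines only; the IDs and values are made up.
sample_lines = [
    "REPORT RequestId: req-1\tDuration: 842.17 ms\tBilled Duration: 843 ms\tMemory Size: 128 MB\tMax Memory Used: 97 MB",
    "2023-11-09T01:44:58.000Z req-2 Task timed out after 3.00 seconds",
    "REPORT RequestId: req-2\tDuration: 3000.00 ms\tBilled Duration: 3000 ms\tMemory Size: 128 MB\tMax Memory Used: 126 MB",
]
summary = summarize_lambda_logs(sample_lines)
print(summary)
```

A memory ratio near 1.0 alongside timeouts hints that resource allocation, and not only the downstream database, may be worth reviewing.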

Here, Sumo Logic’s out-of-the-box dashboards for AWS Lambda error analysis point to the log entries that record these timeouts.


Because the Lambda function interacts with an Amazon RDS instance, checking RDS would be your next step. 

The RDS performance metrics show high CPU utilization and errors related to database locks.

Again, Sumo Logic’s out-of-the-box dashboards for Amazon RDS error log analysis help locate the specific error log messages that confirm the database issue.

2023-11-09T01:45:00Z [ERROR] Deadlock found when trying to get lock; try restarting transaction
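If you export error log lines like the one above, a small script can show whether deadlocks cluster around the times customers reported delays. The sketch below assumes each line starts with an ISO-8601 timestamp, as in the entry shown; the function name is hypothetical, and every sample line except the first is invented.

```python
from collections import Counter

DEADLOCK_TEXT = "Deadlock found when trying to get lock"


def deadlocks_per_minute(lines):
    """Count deadlock errors per minute, keyed by the timestamp truncated to the minute."""
    counts = Counter()
    for line in lines:
        if DEADLOCK_TEXT not in line:
            continue
        timestamp = line.split(" ", 1)[0]  # e.g. 2023-11-09T01:45:00Z
        counts[timestamp[:16]] += 1        # keep YYYY-MM-DDTHH:MM
    return counts


sample_lines = [
    "2023-11-09T01:45:00Z [ERROR] Deadlock found when trying to get lock; try restarting transaction",
    # The remaining lines are hypothetical, added only to exercise the function.
    "2023-11-09T01:45:12Z [ERROR] Deadlock found when trying to get lock; try restarting transaction",
    "2023-11-09T01:46:03Z [Note] Unrelated message",
    "2023-11-09T01:47:30Z [ERROR] Deadlock found when trying to get lock; try restarting transaction",
]
per_minute = deadlocks_per_minute(sample_lines)
for minute, count in sorted(per_minute.items()):
    print(minute, count)
```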

A closer look at the out-of-the-box RDS slow query log analysis dashboard reveals suboptimal queries that significantly drag down performance.

# Query_time: 899.00 Lock_time: 0.594385 Rows_sent: 45 Rows_examined: 54392
SELECT * FROM inventory;
  

You can see that the culprit is a full table scan caused by a missing index.
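One quick way to spot this pattern yourself is to compare Rows_examined with Rows_sent in the slow query log header. The following Python sketch parses the header line shown above; the function name and the derived ratio field are assumptions introduced for illustration.

```python
import re

# Matches the statistics header that precedes each query in the slow query log.
HEADER_RE = re.compile(
    r"# Query_time:\s*(?P<query_time>[\d.]+)\s+"
    r"Lock_time:\s*(?P<lock_time>[\d.]+)\s+"
    r"Rows_sent:\s*(?P<rows_sent>\d+)\s+"
    r"Rows_examined:\s*(?P<rows_examined>\d+)"
)


def parse_slow_query_header(line):
    """Return the header statistics as a dict, or None if the line is not a header."""
    match = HEADER_RE.search(line)
    if match is None:
        return None
    stats = {
        "query_time": float(match["query_time"]),
        "lock_time": float(match["lock_time"]),
        "rows_sent": int(match["rows_sent"]),
        "rows_examined": int(match["rows_examined"]),
    }
    # Rows read per row returned; a large value suggests a scan instead of an index lookup.
    stats["examined_per_row_sent"] = stats["rows_examined"] / max(stats["rows_sent"], 1)
    return stats


header = "# Query_time: 899.00 Lock_time: 0.594385 Rows_sent: 45 Rows_examined: 54392"
stats = parse_slow_query_header(header)
print(round(stats["examined_per_row_sent"], 1))  # 1208.7 rows examined per row returned
```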

Having examined each component of the serverless architecture, you can now address the delays. As next steps, you can adjust the Lambda function's timeout settings and increase its memory allocation. You can also add an index to the inventory table in the RDS database to speed up the problematic query.
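To see why an index helps, the sketch below uses Python's built-in sqlite3 module as a local stand-in; it is not MySQL on Amazon RDS, and the output of EXPLAIN differs between engines. The `sku` and `quantity` columns, the `WHERE` filter, and the index name are hypothetical, since the post does not show which column the slow query filters on.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for the RDS instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (id INTEGER PRIMARY KEY, sku TEXT, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO inventory (sku, quantity) VALUES (?, ?)",
    [(f"SKU-{i:05d}", i % 50) for i in range(5000)],
)

# Hypothetical lookup; the filter column is an assumption.
QUERY = "SELECT * FROM inventory WHERE sku = ?"


def query_plan(connection):
    """Return SQLite's plan description for QUERY as a single string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, ("SKU-00042",)).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan detail text


plan_before = query_plan(conn)  # expected to report a scan of the whole table
conn.execute("CREATE INDEX idx_inventory_sku ON inventory (sku)")
plan_after = query_plan(conn)   # expected to report a search using the new index

print("before:", plan_before)
print("after: ", plan_after)
```

On MySQL, the equivalent check would be running EXPLAIN on the real query before and after creating the index.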

It’s time to reclaim your time

Without a unified view of your AWS environment and the ability to pivot between services and centralized logs, getting to the root cause of this issue could have been extremely difficult, if not impossible.

Looking to reclaim your time? Get started today with AWS Observability, which you can deploy in minutes via the CloudFormation template or Terraform. Learn more and start your trial here.

Have questions? We’ll be hosting a special webinar on Dec. 11, where attendees can hear tips and tricks for implementing Sumo Logic for AWS troubleshooting from our product lead, Greg Ziemiecki. Register now to attend the workshop.

And last but not least, if you’re at re:Invent, swing by booth #789 to see our powerful monitoring and troubleshooting capabilities firsthand. 

Michael Riordan and Greg Ziemiecki

Senior Product Marketing Manager | Senior Technical Product Manager
