Sumo Logic: Using Its Own Advanced Tools
As a developer of next-generation log management and analytics, Sumo Logic faces some of the same challenges in its daily operations that its customers deal with: coping with the complexities of today’s IT infrastructure, meeting customer expectations for reliability of service and application performance, and quickly diagnosing and fixing operational problems such as service interruptions or slow response times.
By applying its core technologies and tools to its own business, Sumo Logic realizes benefits in three areas:
- Software development and quality assurance
- Problem resolution
- Customer service
Sumo Logic is a next-generation log management and analytics service that greatly reduces the cost and complexity of managing log data in enterprise IT organizations and extracting strategic operational insights from that data.
Software Development and Quality Assurance
Sumo Logic’s most common uses for its own log management and analytics service involve software debugging and application monitoring.
“Our use cases for software debugging generally fall into two categories,” says Stefan Zier, Cloud Infrastructure Architect for Sumo Logic. “The first type are cases in which there’s a specific behavior that we know of, such as an inability of a customer to log into the service, or a query that didn’t finish fast enough. In these instances, we want to view all the components of the architecture that were touched by this session and what they did during that session. The other category consists of cases where we want the system to tell us about things that we haven’t detected previously and are starting to occur repeatedly. The Sumo Logic solution lets us do both of these things very quickly.”
Zier points to a unique Sumo Logic capability called a “summarize query,” which is used by the company’s product development team as well as with customers. The summarize query “is a proprietary algorithm that figures out which log messages are similar to one another and then groups and counts them,” Zier says. “You can say, ‘Show me all the errors and sort them by how frequently they’ve occurred.’ Moreover, you can have the service send you the result of that discovery repeatedly, say at 9:00 a.m. every day. So when software developers come in in the morning, they have an e-mail that shows them error messages over the last 24 hours. This enables them to see which ones require immediate attention or others that merit further inspection at-a-glance.”
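The idea of grouping similar log messages and counting them can be illustrated with a small sketch. This is not Sumo Logic's proprietary algorithm, which the source does not describe. It is a deliberately naive stand-in that assumes two messages are "similar" when they match after variable parts (e-mail addresses and numbers) are masked out. The names `signature` and `summarize` and the masking rules are hypothetical.

```python
import re
from collections import Counter

# Hypothetical masking rules: variable parts of a message are replaced
# with placeholders so that otherwise identical messages collapse together.
_MASKS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "<EMAIL>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def signature(message: str) -> str:
    """Return the message with its variable parts masked out."""
    for pattern, placeholder in _MASKS:
        message = pattern.sub(placeholder, message)
    return message


def summarize(messages):
    """Group messages by signature and return (signature, count) pairs,
    most frequent first."""
    return Counter(signature(m) for m in messages).most_common()


if __name__ == "__main__":
    logs = [
        "ERROR login failed for user 17",
        "ERROR login failed for user 42",
        "WARN queue depth 900",
        "ERROR login failed for user 8",
    ]
    for sig, count in summarize(logs):
        print(count, sig)
```

Sorting by frequency is what makes a daily report like the one Zier describes readable at a glance: the most frequent error signatures rise to the top, while rare ones remain visible further down.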
The ability to quickly understand behavior in an application has beneficial impacts both within the company and with customers. “By applying our own service, we gain the advantage of a very short feedback loop to implement improvements and new ideas into the product while it is still under development,” says Zier. “As a result, before issuing product releases to customers, we can identify issues and very quickly make fixes.”
All machines in Sumo Logic’s deployment run instances of the Sumo Logic collector, the component responsible for log collection. The collectors are deployed in a fully automated fashion, which makes them suitable for highly elastic cloud environments such as Sumo Logic’s.
Operations & Problem Resolution
One of the most common challenges facing enterprise-scale IT organizations and SaaS providers with highly distributed cloud environments is diagnosing the root cause of an operational problem. The task becomes far more difficult in complex, distributed computing environments because each device in the network (such as a server, router or firewall) generates its own log data. This data contains information essential to identifying and solving an operational problem. But with logs stored in many different places, each generating a huge stream of data every day, finding the right information quickly can be very difficult.
Sumo Logic makes this job easy by centralizing all log data in one location, and applying highly sophisticated query and analytic tools to comb through gigabytes or terabytes of data to rapidly produce operational insights.
“Like many of our customers, we run a complex, distributed SaaS application,” says Zier. “It consists of a large number of nodes that work in concert to provide the service, so one user interaction touches many of these machines. That’s a challenging environment in which to understand what’s going on, and it’s a task we handle with our own product. Each of the machines runs one of our collectors, and all the collectors report data into the Sumo Logic service. So we can see what the application is doing by logging in and querying this log data from a single, centralized vantage point.
“As we look at customer or production issues, it’s essential to get results in near real-time,” Zier adds. “At Sumo Logic, developers not only write code but operate the actual production service as well. Every week, two developers are on pager duty. If something unusual happens in the middle of the night, they are alerted via their cell phones. Having a service like ours at their disposal, they can easily and quickly drill down into the details and see what actually needs to be done. In this instance, real-time or near real-time data is essential because there’s an incident you’re responding to and it’s useless if you get data about that incident an hour from now. You really want to be able to query that incident in real-time.”
Mean-time-to-resolution—a critical metric when application availability and service-level agreements are at stake—is greatly reduced “by not having to log into a large number of machines and search for logs,” Zier notes. “The sheer volume of log data makes it almost impossible to do that by hand. When you can execute a search within seconds and get real-time data about what’s going on, you can quickly resolve the issue.”
The summarize query, which shows how frequently certain types of log messages occur, is one of the first features Sumo Logic likes to show its customers. “Our customers and prospects become very focused on the results of that query,” Zier notes. “Oftentimes they comment, ‘Hey, I’ve never seen this log message before; what does it mean?’ It’s instrumental in discovering things they haven’t seen before and determining the frequency with which they happen. When you’re just looking at a single log file someplace, you may see it as a one-off event. When you see it in context across all these machines, and notice that it occurs frequently, it’s a whole different story.”
One of the strengths of the Sumo Logic solution is its ability to boil down huge amounts of data into a few meaningful insights. Those insights might be important trends, anomalies not otherwise identified, or imminent threats. “For example, if a log-in failure is happening often, that potentially points to a brute-force security attack,” Zier says. “A summarize query will tell you, for example, that there have been 1.5 million log-in attempts. And that’s something that stands out, and makes you pay attention right away.”
Customer Service
One of the advantages of the Sumo Logic service is that it gives customers a way to quickly diagnose and solve problems without setting up and maintaining on-premises systems or appliances that inevitably will not be up to the task of log management in today’s complex IT infrastructures. But this doesn’t mean customers never call Sumo Logic with questions. In these cases, Sumo Logic again applies its own tools.
“One day, a customer called in and said they were unable to log into the site,” Zier recalls. “We drilled down, and observed that somebody else in the customer’s account had reset the person’s password. That probably was not a potential cause that anyone would ever have looked for. But with our tool, we just gave it the e-mail address to search for and it showed us all the things that happened related to that e-mail address. One of them said, ‘user X changed the password for this e-mail address.’ We determined a co-worker inadvertently reset the password. We were able to resolve that within a few minutes.”
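The kind of lookup Zier describes, searching centralized log data for everything related to one e-mail address, can be sketched as follows. This is only an illustration under stated assumptions: the `LogRecord` type, its fields, and the `events_for` function are hypothetical and do not represent the Sumo Logic query language or API. The sketch assumes that records from every host have already been collected in one place and that timestamps are ISO 8601 strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical centralized log record."""
    timestamp: str  # ISO 8601, so lexical order matches time order
    host: str
    message: str


def events_for(records, needle: str):
    """Return every record mentioning `needle` (case-insensitive),
    ordered by timestamp regardless of which host produced it."""
    needle = needle.lower()
    hits = [r for r in records if needle in r.message.lower()]
    return sorted(hits, key=lambda r: r.timestamp)


if __name__ == "__main__":
    records = [
        LogRecord("2012-05-01T09:15:00Z", "web-2",
                  "login failed for bob@example.com"),
        LogRecord("2012-05-01T09:02:00Z", "auth-1",
                  "user X changed the password for bob@example.com"),
        LogRecord("2012-05-01T09:05:00Z", "web-1",
                  "login ok for alice@example.com"),
    ]
    for r in events_for(records, "bob@example.com"):
        print(r.timestamp, r.host, r.message)
```

Because the records from all hosts sit in one collection, a single pass surfaces the password change on one machine next to the failed login on another, which is the correlation that resolved the support call in the anecdote above.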
Zier notes that easy access to relevant information can translate into giving customers a quick, specific answer. “In many customer service situations, you can get a boilerplate answer that says ‘here are some steps you can try.’ But that seldom works. It’s a lot of trial and error. At Sumo Logic, we’re able to see the specific data as it affects that customer. So if someone files a support ticket, we can directly say, ‘Here’s what happened, and here’s how you resolve it,’ without any doubt, because we have the data. When a customer asks us something, we can figure out what happened in the application as they interacted with it.”
By applying its own groundbreaking tools and technology to its own operations, Sumo Logic reaps benefits in two major areas. Internally, it sees shorter development cycles, quicker releases of new features, and faster resolution of problems that could affect critical functions. Externally, Sumo Logic’s log management and analytics service helps the company exceed customer expectations by assuring a reliable service, resolving support problems quickly (or preventing them altogether), and delivering new insights that lead to tangible improvements in customers’ IT operations.