
June 11, 2024 By Joe Kim

From “rebooting” to reliable and secure applications: Optimizing the customer experience

New standards for observability and reliability


Not so long ago in my career, it was relatively acceptable for infrastructure or development teams to solve a problem by rebooting a server or just “turning things off and on again.” It didn’t matter what caused the problem or how long the reboot would keep things working, provided they were fixed for now.

Security teams were always held to a different standard. It wasn’t good enough to say that things were now secure or that they didn’t know why or how a security incident had occurred. Security teams were accountable for knowing what happened, how it was fixed, and how that same vulnerability could be prevented in the future.

Increasingly, engineering teams responsible for the reliability and performance of critical, customer-facing applications are being held to this same level of accountability. This is where adopting a DevSecOps practice can be crucial to an organization.

Why are standards for observability and reliability increasing?

As organizations have embraced digital transformation, more and more of an organization’s revenue, reputation, and overall success are tied to mission-critical applications. Whether it’s a payment portal, a shopping cart, or a single button that allows a user to share an experience with others, it’s vital that your technology works and – from the end user’s perspective – works flawlessly.

Whether your application goes down because of a bug that slipped unnoticed into production or because an unidentified vulnerability was exploited, the result is similar: the end user’s experience suffers. And that’s just the beginning. We often hear stories about the cost of downtime, with estimates ranging from over $6k per minute to $16k per minute (roughly $1 million per hour) or more, depending on your industry and size.

This cost can come from lost revenue, tarnished reputation, diminished customer trust and confidence, and other brand impacts. In fact, as technology has expanded into almost every industry, customers have higher expectations for their digital experiences, and it’s even easier for them to change solutions after a negative experience.
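To put the per-minute estimates quoted above in perspective, here is a minimal sketch of the arithmetic. The function name and constants are illustrative, and the figures are the rough industry estimates mentioned earlier, not measurements from any specific organization.

```python
def downtime_cost(minutes: float, cost_per_minute: float) -> float:
    """Return the estimated cost (USD) of an outage lasting `minutes`."""
    if minutes < 0 or cost_per_minute < 0:
        raise ValueError("minutes and cost_per_minute must be non-negative")
    return minutes * cost_per_minute


# Per-minute estimates quoted in the text (USD); treat them as rough bounds.
LOW_ESTIMATE_PER_MINUTE = 6_000
HIGH_ESTIMATE_PER_MINUTE = 16_000

hourly_low = downtime_cost(60, LOW_ESTIMATE_PER_MINUTE)
hourly_high = downtime_cost(60, HIGH_ESTIMATE_PER_MINUTE)

print(f"One hour of downtime: ${hourly_low:,.0f} to ${hourly_high:,.0f}")
```

At the high end, 60 minutes at $16k per minute comes to $960,000, which is where the "roughly $1 million per hour" figure comes from.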

A recent IDC ROI report on Sumo Logic’s business value showed an 82% reduction in unplanned downtime. When every moment of delay costs money, reputation, and valued customers, reliability, security, and observability are non-negotiable.

That said, chaos and unpredictability are widespread. No matter how well you build and secure your business, you’ll need to anticipate the unknown and prepare for increasing complexity. That is the foundation of modern observability and security.

What are the new standards for reliability and observability?

As a CEO, the first question I’ll ask if there’s a problem is, of course, “How can I help?” But soon thereafter, it’s “What happened?” and “How can we make sure this doesn’t happen again?”

In the security world, these questions are typically answered by investigating audit logs, conducting forensic analysis, identifying the root cause, and analyzing how the organization should modify its security posture to avoid similar incidents. Now, observability teams need to do the same.

Uncovering the root cause of infrastructure and application reliability issues has become increasingly challenging. Maintaining highly resilient, cloud-native, cost-effective applications at scale in a way that meets the expectations of modern customers is well beyond a human-scale problem. It is a machine-scale problem and arguably an AI-scale problem for bleeding-edge applications.

Optimizing reliability with accountability requires the right telemetry, AI and machine learning, as well as real-time customer journey insights mapped to clear business objectives. Ultimately, business outcomes matter most, and engineers have used SRE practices to reverse engineer what Sumo Logic calls reliability management.

  • Telemetry - Metrics give you directional input on where an issue is occurring, and traces give you directional input on what part of your stack may be contributing to the issue as it relates to customer transactions, but logs are the critical telemetry that provides the atomic-level insights to identify the actual root cause of your reliability issues.

  • AI and ML - When we say that apps are complex and only growing more complicated, just think of Netflix with over 1000 microservices or the staggering rate of changes with Amazon’s tens of millions of deployments per year. Organizations grapple with so much data that AI must assist in all parts of the monitoring and troubleshooting lifecycle including data correlation, anomaly detection, root cause analysis, change intelligence, and even incident remediation to minimize MTTR in a way that enables organizations to serve their customers and grow their business.

  • Real-time insights - Technical teams need true reliability management to build powerful log and metric-based SLIs to measure and report on reliability, customer impact, and business objectives. Without measuring how customers are impacted, you can only report on metrics that don’t map well to business objectives. For example, reporting that 25 percent of customers in EMEA had a negative experience purchasing a product, and five percent of those customers left the website with their cart open, is a much better way to communicate business impact against objectives than reporting a “five-minute outage with the CartService”.
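As a rough illustration of the log-based SLIs described in the last bullet, the sketch below computes a customer-impact figure and a request success ratio from structured log records. The record fields (`region`, `customer`, `status`), the sample data, and the function names are all hypothetical; this is not Sumo Logic’s query language or API, just one way the idea could be expressed.

```python
# Hypothetical structured log records for purchase requests.
# Field names and values are illustrative only.
RECORDS = [
    {"region": "EMEA", "customer": "c1", "status": 200},
    {"region": "EMEA", "customer": "c2", "status": 503},
    {"region": "EMEA", "customer": "c2", "status": 200},
    {"region": "EMEA", "customer": "c3", "status": 200},
    {"region": "EMEA", "customer": "c4", "status": 200},
    {"region": "AMER", "customer": "c5", "status": 500},
]


def customer_impact(records, region):
    """Fraction of distinct customers in `region` with at least one failed request."""
    seen, affected = set(), set()
    for record in records:
        if record["region"] != region:
            continue
        seen.add(record["customer"])
        if record["status"] >= 500:  # treat 5xx responses as failures
            affected.add(record["customer"])
    return len(affected) / len(seen) if seen else 0.0


def success_ratio(records, region):
    """Fraction of requests in `region` that did not fail with a 5xx status."""
    statuses = [r["status"] for r in records if r["region"] == region]
    if not statuses:
        return 1.0  # no traffic means nothing failed
    return sum(status < 500 for status in statuses) / len(statuses)


emea_impact = customer_impact(RECORDS, "EMEA")
emea_success = success_ratio(RECORDS, "EMEA")
print(f"EMEA customers affected: {emea_impact:.0%}")
print(f"EMEA request success ratio: {emea_success:.0%}")
```

In this sample, one of four EMEA customers hit a failure, so the customer-impact figure is 25 percent even though 80 percent of requests succeeded. The two numbers answer different questions, which is why the customer-centric one tends to communicate business impact more directly.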

Sumo Logic is built to support teams across the DevSecOps lifecycle. Sometimes this work is centralized purely in observability teams or security teams, but we’ve learned from experience that customers who center their work on shared data built on logs are typically the most successful.

As observability teams evolve and enhance monitoring, organizations need to hold development, security and operations teams to the same standards of reliability. I’m excited to see how improved observability and security evolve.

Learn more about monitoring and troubleshooting with Sumo Logic’s SaaS Log Analytics Platform.

Joe Kim

President & CEO

Joe Kim is the President & CEO of Sumo Logic, with over two decades of operating executive experience in the application, infrastructure, and security industries. He is passionate about helping customers address complex challenges through the delivery of powerful and efficient technologies and innovations.

Before joining Sumo Logic, Joe was a senior operating partner for Francisco Partners Consulting (FPC), assisting in deal thesis, assessing product-market-fit and technology readiness, and helping portfolio companies create value for customers and shareholders through advisory, board, and mentorship activities. Prior to FPC, Joe served as the chief technology and product officer at Citrix, where he was responsible for strategy, development, and delivery of the company’s $3.2B portfolio of products. Joe has held other senior executive roles at SolarWinds, Hewlett Packard Enterprise, and General Electric. Joe currently serves on the Board of Directors of SmartBear and Andela. Joe holds a B.S. in Computer Science, Criminology and Law studies from Marquette University. During his spare time, Joe enjoys spending time with his family.
