Root cause analysis (RCA) is a method of problem-solving used to investigate known problems and identify their antecedent and underlying causes. While the name seems to imply that issues have a singular cause, this is not always the case. Problems may have a single cause or multiple causes stemming from deficiencies in products, people, processes or other factors.
- Root causes can be divided into physical, human, and organizational causes.
- The universal four-step process of a root cause analysis is identification and description, chronology, differentiation and causal graphing.
- The two most important tools and methods for RCA in cloud computing environments are the "Five Whys" method, where one repeatedly asks "Why?", and fishbone (Ishikawa) diagrams built on the five Ms, where investigators look at man, machine, material, method, and measurement as possible causes.
- You can use a tool like Root Cause Explorer from Sumo Logic, which helps on-call staff, DevOps, and infrastructure engineers accelerate troubleshooting and root cause isolation for incidents in their apps and microservices running on AWS, public cloud hosts, and Kubernetes.
Root cause analysis is implemented as an investigative tool in a variety of industries.
For IT organizations, root cause analysis is a key aspect of the cyber security incident response process. When a security breach occurs, SecOps teams must collaborate quickly to determine where the breach originated, isolate the vulnerability that caused it, and initiate corrective and preventive actions so that the vulnerability cannot be exploited again. Root causes can be divided into three types:
Physical - arise when a tangible part of a system breaks down. Examples include hardware failures, system errors during boot, malfunctioning tools, and other component failures.
Human - arise from human errors or mistakes. Examples include a person who lacks the skills to operate systems properly, does not know the tools, introduces a programming error, or tries to perform a task with the wrong tools.
Organizational - arise from administrative issues. Examples include a team lead who gives team members incorrect instructions, an organization that selects the wrong people to perform tasks, or an organization that does not manage or support its staff correctly.
There are many benefits to conducting a root cause analysis:
Reduce the number of errors that occur from the same root causes.
Implement tools and solutions to address future issues.
Implement tools to log and monitor for potential future issues.
Enable your team to address issues faster.
When investigating a cyber security incident, security operations teams must act quickly to identify and isolate the event's root cause. The basic outline of the RCA process is identical across industries, regardless of the tools that individual practitioners choose to implement:
1. Identification and description
The first step to a successful root cause analysis is the accurate characterization of a problem. If the problem is poorly understood, it may be difficult to isolate the underlying causes correctly. Accurate event descriptions also play an important role in RCA. The starting point for a successful analysis should be a collection of accurate event descriptions detailing everything that happened in connection with the problem.
2. Chronology
Once IT operators have identified the problem and its associated events, they should arrange those events in chronological order, as in a timeline or sequence of events. This makes it easier to establish causal relationships between events connected to the problem. Organizations that use security analytics software can automate the collection of event logs and the integration of logs from multiple sources into a single, standardized format and platform. This streamlines the RCA process and helps these organizations reach step three of RCA much sooner.
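As a minimal sketch of the chronology step, the Python example below merges events from two sources into one timeline sorted by time. The event records, field names (`ts`, `epoch`, `msg`) and source names are hypothetical and chosen only for illustration; real log formats vary, and a security analytics platform would handle this normalization for you.

```python
from datetime import datetime, timezone

# Hypothetical events from two sources that record time differently.
firewall_events = [
    {"ts": "2024-03-01T10:15:00+00:00", "msg": "outbound connection to unknown host"},
    {"ts": "2024-03-01T10:05:00+00:00", "msg": "inbound connection on port 22"},
]
server_events = [
    {"epoch": 1709287800, "msg": "new process started from /tmp"},
]


def normalize(source, event):
    """Convert one raw event into a common shape with a timezone-aware time."""
    if "ts" in event:
        when = datetime.fromisoformat(event["ts"])
    else:
        when = datetime.fromtimestamp(event["epoch"], tz=timezone.utc)
    return {"time": when, "source": source, "msg": event["msg"]}


def build_timeline(sources):
    """Merge events from all sources and sort them chronologically."""
    events = [
        normalize(name, event)
        for name, batch in sources.items()
        for event in batch
    ]
    return sorted(events, key=lambda e: e["time"])


timeline = build_timeline({"firewall": firewall_events, "server": server_events})
for event in timeline:
    print(event["time"].isoformat(), event["source"], event["msg"])
```

Normalizing every timestamp to a timezone-aware value before sorting is the important design choice here: events from different systems can only be ordered reliably once they share one clock reference.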
3. Differentiation
Investigators incorporate additional contextual data surrounding the events to understand how the events are correlated. When a cyber security event is detected, security operators must analyze dependencies between events to distinguish between root causes, causal factors and non-causal factors within the system. Enterprise security analytics tools use a data analysis technique called event correlation, which filters through high volumes of computer logs from various sources and pinpoints the events most likely to be connected to the problem.
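The idea behind event correlation can be illustrated with a deliberately simplified Python sketch. It keeps only events that occurred on the same host as the incident and within a fixed time window before it. The `correlate` function, the 30-minute window, and the sample events are all assumptions made for this example; commercial tools use far richer signals than host and time.

```python
from datetime import datetime, timedelta


def correlate(events, incident, window=timedelta(minutes=30)):
    """Return events on the incident's host that fall within `window` before it."""
    start = incident["time"] - window
    return [
        e for e in events
        if e["host"] == incident["host"] and start <= e["time"] <= incident["time"]
    ]


# Hypothetical incident and surrounding log events.
incident = {"time": datetime(2024, 3, 1, 10, 30), "host": "db-01"}
events = [
    {"time": datetime(2024, 3, 1, 10, 20), "host": "db-01", "msg": "definition update failed"},
    {"time": datetime(2024, 3, 1, 10, 25), "host": "web-02", "msg": "routine cache flush"},
    {"time": datetime(2024, 3, 1, 8, 0), "host": "db-01", "msg": "scheduled backup finished"},
]

candidates = correlate(events, incident)
for event in candidates:
    print(event["time"], event["msg"])
```

The output is a list of candidate causal factors, not a verdict. An investigator still has to decide which candidates are root causes, which are contributing factors, and which are coincidental.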
4. Causal graphing
In the final step of the RCA process, investigators are encouraged to produce a causal graph, diagram or another visual interpretation of the result of the RCA process. Causal graphing illustrates a sequence of key events that begins with the root causes and ends with the problem. This exercise demonstrates the logical pathway that was followed to determine how the problem occurred.
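A causal graph can also be captured as data. The sketch below, written in Python, stores a hypothetical incident as a mapping from each cause to its direct effects, then finds the root causes (nodes that nothing else causes) and traces the pathway from a root cause to the problem. The incident details and the function names `root_causes` and `paths_to` are illustrative assumptions, and the traversal assumes the graph contains no cycles.

```python
# Hypothetical incident: each key is a cause, each value lists its direct effects.
causal_graph = {
    "no patching schedule": ["unpatched web server"],
    "unpatched web server": ["attacker gains shell access"],
    "shared admin password": ["attacker gains shell access"],
    "attacker gains shell access": ["customer data exfiltrated"],
}


def root_causes(graph):
    """Return causes that are not the effect of any other node."""
    effects = {effect for targets in graph.values() for effect in targets}
    return sorted(cause for cause in graph if cause not in effects)


def paths_to(graph, start, problem, path=()):
    """Return every causal path from `start` to `problem` (assumes no cycles)."""
    path = path + (start,)
    if start == problem:
        return [path]
    found = []
    for effect in graph.get(start, []):
        found.extend(paths_to(graph, effect, problem, path))
    return found


problem = "customer data exfiltrated"
for cause in root_causes(causal_graph):
    for path in paths_to(causal_graph, cause, problem):
        print(" -> ".join(path))
```

Each printed line is one sequence of key events that begins with a root cause and ends with the problem, which is the same information a drawn causal graph conveys.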
While the general process for root cause analysis remains consistent across industries, investigators differ in the tools and techniques they use to get to the underlying source of a problem. Even security operators who can automate much of the RCA process with security analytics applications must be familiar with root cause analysis methodologies to interpret the causes of security events accurately. Here are the two most important tools and methods for RCA in cloud computing environments:
Five whys of root cause analysis
The "Five Whys" method of root cause analysis is an investigative technique that encourages the practitioner to ask "Why?" repeatedly, following the chain of causation down to the underlying cause of an incident, event or problem. When a problem is observed, we can rarely get to the root cause after a single iteration of asking, "Why did this happen?" We may have to go through several layers of questioning to understand the root cause of an event and identify an opportunity for corrective action. Use this example as a template for conducting Five Whys RCA:
Problem statement: The company data server was infected with malware.
Why? The server was not updated with the latest malware definitions for our anti-malware application.
Why? The automated server that deploys the updates is not operational.
Why? The automated server broke last month, and it hasn't been repaired or replaced.
Why? The person responsible for approving the repair or replacement is on vacation, and there was inadequate communication about who should cover change approvals.
Why? Lack of process.
Solution: Create a process to ensure that repairs can be approved, even when the normal approving person is away.
This simple example illustrates the depth of questioning required to isolate the root cause.
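Teams that want to keep Five Whys results in a consistent, reviewable form could record them as structured data. The Python sketch below is one possible way to do that; the `FiveWhys` class and its methods are hypothetical names invented for this example, and the answers are taken from the malware example above.

```python
from dataclasses import dataclass, field


@dataclass
class FiveWhys:
    """Records a problem statement and the chain of answers to 'Why?'."""
    problem: str
    answers: list = field(default_factory=list)

    def ask_why(self, answer):
        """Append the answer to the next 'Why?' and return self for chaining."""
        self.answers.append(answer)
        return self

    @property
    def root_cause(self):
        """The last recorded answer is treated as the root cause."""
        return self.answers[-1] if self.answers else None

    def report(self):
        lines = [f"Problem: {self.problem}"]
        lines += [f"Why #{i}: {a}" for i, a in enumerate(self.answers, start=1)]
        return "\n".join(lines)


analysis = (
    FiveWhys("The company data server was infected with malware.")
    .ask_why("The server was not updated with the latest malware definitions.")
    .ask_why("The automated server that deploys the updates is not operational.")
    .ask_why("The automated server broke last month and hasn't been repaired or replaced.")
    .ask_why("The approver is on vacation and nobody was assigned to cover change approvals.")
    .ask_why("Lack of process.")
)
print(analysis.report())
print("Root cause:", analysis.root_cause)
```

The sketch does not enforce exactly five questions. In practice the chain stops when the answer points to something the organization can correct, which may take fewer or more than five iterations.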
Fishbone/Ishikawa diagram analysis
A fishbone diagram is a visual graphing tool that encourages the investigator to identify potential causes for a problem from various sources. By prompting investigators to consider several categories of causes rather than fixating on the first explanation, it helps them reach the root cause more quickly. The leading framework for fishbone diagrams is the five Ms, where investigators look at:
Man: Human factors that could have caused the problem
Machine: Hardware or technical causal factors
Material: Causal factors stemming from material issues, including consumables and information
Method: Causal factors arising from breakdowns in process or methodology
Measurement: Causal factors arising from inaccuracies in measurement tools or inspections
Environmental causal factors are also frequently investigated as an additional category in a Fishbone/Ishikawa diagram analysis.
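Before drawing a fishbone diagram, it can help to sort candidate causes into the categories above. The Python sketch below groups causes under the five Ms plus an Environment category and prints a plain-text outline. The `build_fishbone` function and the sample causes are hypothetical and exist only to illustrate the grouping.

```python
# The five Ms from the list above, plus the commonly added Environment category.
CATEGORIES = ("Man", "Machine", "Material", "Method", "Measurement", "Environment")


def build_fishbone(problem, causes):
    """Group (category, cause) pairs under each category of the diagram."""
    bones = {category: [] for category in CATEGORIES}
    for category, cause in causes:
        if category not in bones:
            raise ValueError(f"Unknown category: {category}")
        bones[category].append(cause)
    return {"problem": problem, "bones": bones}


# Hypothetical candidate causes collected during a brainstorming session.
fishbone = build_fishbone(
    "Checkout service outage",
    [
        ("Man", "on-call engineer unfamiliar with the deployment tool"),
        ("Machine", "load balancer node ran out of memory"),
        ("Method", "no rollback procedure documented"),
        ("Measurement", "alert threshold set too high to fire"),
        ("Machine", "disk nearly full on the database host"),
    ],
)

print(f"Problem: {fishbone['problem']}")
for category, items in fishbone["bones"].items():
    print(f"  {category}:")
    for item in items:
        print(f"    - {item}")
```

Categories that end up empty, such as Material and Environment in this example, are worth a second look, since an empty bone may mean nobody has investigated that area yet.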
Team members use different toolsets and troubleshoot between staging and production environments, so it is best to gather evidence from event streams and log files to gain visibility across the entire stack. Sumo Logic provides application observability and multi-cloud observability solutions to help. You can use a tool like Root Cause Explorer from Sumo Logic, which helps on-call staff, DevOps, and infrastructure engineers accelerate troubleshooting and root cause isolation for incidents in their apps and microservices running on AWS, public cloud hosts, and Kubernetes.