Of Causality and Reasoning . . . OpsCruise’s Automated Root Cause Analysis

Automating the root cause analysis (RCA) process can sound like the quest in the now-classic Monty Python movie. What Ops and SRE teams have wanted forever is to eliminate the 'war room': automating fault isolation in real time, getting to the cause before a major fallout, and quickly fixing the problem. The shift to microservices and serverless applications has only increased the need for automation.

Causal Analysis is Harder in Microservices Applications
Consider how Ops teams typically run RCA today. It begins with an alert event or events, typically a service level objective (SLO) breach that affects a customer-facing service. After a problem alert is detected, logs and events are reviewed and then the investigation begins. This involves walking through traces surrounding the areas of the alerts to find dependencies up and down the service chains.

With microservice architectures, besides more obfuscation with multiple tiers of virtualization and more ephemeral application structure, there are more dependencies to explore including shared infrastructure or services, such as database calls, and Kubernetes (K8s) related issues such as autoscaling.

With so many causal sources at each layer of the application stack, the number of possible combinations creates more discrete failure sources than can be enumerated. Throwing ML at the problem to learn all possible cases is not feasible. Nor is simple correlation between different telemetry, such as logs, metrics, or traces, sufficient; it can lead to false causal inferences. The net effect is that Ops and Dev teams have to spend an enormous amount of time sifting through a very high-dimensional problem space.
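To see why enumeration does not scale, a back-of-the-envelope count helps. The sketch below uses hypothetical layer sizes (not figures from this article) and a deliberately naive upper bound:

```python
# Illustrative only: rough upper bound on candidate fault combinations in a
# layered microservices stack. Layer sizes are hypothetical.

def candidate_combinations(components_per_layer):
    """Naive bound: in each layer, either one of its n components is
    implicated or none is, so the space grows multiplicatively."""
    total = 1
    for n in components_per_layer:
        total *= (n + 1)  # n faulty-component choices, plus "none"
    return total - 1      # exclude the all-healthy case

# e.g. 20 containers, 5 K8s controllers, 8 infra services, 10 shared deps
print(candidate_combinations([20, 5, 8, 10]))  # -> 12473
```

Even these modest numbers yield thousands of candidate combinations, which is why blind enumeration (or training ML on every case) is impractical.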

A Knowledge-Augmented Approach

Our approach at OpsCruise to automate RCA builds on two broad principles:

  1. Because there are too many possible causal domains, constraining the search space requires a systematic method that builds on prior experience and expertise, domain knowledge, and heuristics.

  2. That method must enable automation of the complete RCA process, end to end.

We believe there are four key components needed for such an automated system.

  1. Experiential Learning: we need to learn from the prior experience of subject matter experts (SMEs) in analysis and diagnosis, i.e., where to start and what questions to ask based on similar scenarios. Using such curated knowledge not only reduces the search space but also selects the most likely and proven place to begin the investigation. Prior learning needs to be embedded in some form of knowledge-based system.

  2. Multi-Stage Decisions: an effective RCA process follows a logical sequence of queries, i.e., asking the "5 Whys" in sequence, where the answer to one Why leads to the next Why while eliminating unlikely causes. For example, understanding dependencies in a chain of connected microservices can reveal whether there is forward or backward pressure that causes a chain of anomalies or alerts. A notable advantage of using such a decision tree is that the sequence of choices made at each step provides the explicit reasoning underlying the cause identification, unlike more opaque correlation-based techniques.

  3. Domain Knowledge: at each step of the decision process, deciding which events, telemetry, configuration, or other probes to check benefits from domain-specific knowledge. Using domain knowledge also reduces the cardinality problem of too many variables, known and unknown. In a single cloud application there can be tens to hundreds of individual systems with different purposes, designs, and operational elements. An Ops SME has much of this in her head when she is presented with a symptom of a failure. Therefore, understanding how the systems work and why they do what they do is required.

  4. Predictive Explanations: a service could be in a different state at any observed time, and knowing the current state and the expected next state provides hints as to what could have gone wrong. We can leverage information related to the primary predictive indicators (symptoms) of the problem, i.e., deviations from expected behavior. Behavioral models of the applications are particularly useful here. For example, if a service failed and the behavior model indicates higher CPU usage and more disk writes, then we need to check what caused the additional writes (and the associated increase in CPU usage).
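The first two components, curated experience and a staged decision process, can be sketched as a small decision tree whose traversal is itself the explanation. All questions, checks, and observations below are invented for illustration; a real system would query live telemetry rather than a dict:

```python
# Hypothetical sketch: a curated knowledge base as a decision tree whose
# answered questions form an explicit reasoning trail.

DECISION_TREE = {
    "question": "Is the SLO breach local to one service?",
    "check": lambda obs: len(obs["anomalous_services"]) == 1,
    "yes": {"conclusion": "Inspect that service's recent changes"},
    "no": {
        "question": "Do the anomalies share a dependency path?",
        "check": lambda obs: obs["shared_path"],
        "yes": {"conclusion": "Walk the path upstream to the deepest anomaly"},
        "no": {"conclusion": "Check shared infrastructure (DB, K8s autoscaling)"},
    },
}

def diagnose(node, observations, trail=None):
    """Walk the tree; the trail of answered questions is the explanation."""
    trail = trail if trail is not None else []
    if "conclusion" in node:
        return node["conclusion"], trail
    answer = bool(node["check"](observations))
    trail.append((node["question"], answer))
    return diagnose(node["yes"] if answer else node["no"], observations, trail)

obs = {"anomalous_services": ["B3", "C4"], "shared_path": True}
conclusion, reasoning = diagnose(DECISION_TREE, obs)
print(conclusion)  # -> Walk the path upstream to the deepest anomaly
print(reasoning)   # the explicit "why" chain behind the conclusion
```

Unlike a correlation score, the returned trail shows exactly which questions were asked and how they were answered, which is the explainability advantage described above.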

How Does This Work?

Consider the small microservices segment below with containers A1, B1-B3, C1-C4, and D1-D2. Suppose we detect a latency SLO breach at the ingress container A1 and, at the same time, find anomaly alerts in containers B3 and C4.


In a typical scenario, one would have to examine the alerts for these three services, check the logs to dig into each anomaly, then pull up the traces for each of the possible paths that contain A1, B3, and C4, and:

  1. Examine dependencies in the trace to eliminate the paths that are not causing the issue.

  2. Identify if B3 or C4 or both are causing the increased latency at the ingress.

  3. Verify the causal relationship between B3 and C4: if one of them, say B3, is determined to be the cause, what was the reason for the anomaly in that container?

  4. Check conditions such as logs, events, or even metrics history and the details of the anomaly in B3, to find what changed.

  5. Determine what corrective remediation steps have to be taken.
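The first of these steps, narrowing down the candidate trace paths, can be sketched as a graph search. The topology below is assumed for illustration (only the container names come from the article); a real system would derive the graph from instrumented traces or flow data:

```python
# Sketch: keep only the dependency paths from the ingress that pass through
# every alerted container. Edges here are illustrative, not from real traces.

from itertools import chain

GRAPH = {  # caller -> callees, loosely matching the A1/B*/C*/D* segment
    "A1": ["B1", "B3"],
    "B1": ["B2"],
    "B2": ["B3", "D1"],
    "B3": ["C4"],
    "C4": ["D2"],
}

def all_paths(graph, node, path=None):
    """Enumerate every root-to-leaf call path (fine for small segments)."""
    path = (path or []) + [node]
    children = graph.get(node, [])
    if not children:
        return [path]
    return list(chain.from_iterable(all_paths(graph, c, path) for c in children))

def suspect_paths(graph, root, alerted):
    """Discard paths that do not contain all alerted containers."""
    return [p for p in all_paths(graph, root) if set(alerted) <= set(p)]

for p in suspect_paths(GRAPH, "A1", {"B3", "C4"}):
    print(" -> ".join(p))
```

This mirrors step 1 above: the path through D1 is eliminated immediately because it cannot explain anomalies in both B3 and C4.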

While finding the offending dependency paths can be done using different methods, such as instrumented code tracing or, preferably, real-time flow tracing, isolating the problem source usually requires significant manual effort. Instead, an automated system would kick off the analysis immediately after the ingress latency breach is detected and do the following:

  1. Identify the path that caused the latency increase, for example, it might be the path A1-B1-B2-B3-C4.

  2. Confirm contributors to the breach are B3 and C4, and analyze their predictive anomaly insights.

  3. From B3's anomaly insights, find that higher outbound request counts led to more demand on C4, resulting in the latter's higher latency. Isolate the problem to B3.

  4. Check all relevant probes surrounding B3 at the container, K8s, infrastructure, and application layers, and note that an image change was made in B3 that most likely caused the increased request counts to C4.

  5. Notify Ops that the B3 image change may need to be rolled back.
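The five automated steps above can be sketched as a simple pipeline over a telemetry snapshot. All data, field names, and selection logic below are hypothetical placeholders, not OpsCruise's actual implementation:

```python
# Hedged sketch of the automated RCA sequence as a five-step pipeline.
# The snapshot fields and the "upstream-most contributor" heuristic are
# invented for illustration.

def run_rca(snapshot):
    # 1. Identify the path behind the latency increase.
    path = snapshot["breach_path"]
    # 2. Confirm which services on the path raised predictive anomalies.
    contributors = [s for s in path if s in snapshot["anomalies"]]
    # 3. Isolate the upstream-most contributor as the likely cause.
    cause = contributors[0]
    # 4. Check surrounding probes (config, K8s, infra) for recent changes.
    evidence = snapshot["recent_changes"].get(cause, "no change found")
    # 5. Produce an actionable notification for Ops.
    return f"{cause}: {evidence}; consider rollback"

snapshot = {
    "breach_path": ["A1", "B1", "B2", "B3", "C4"],
    "anomalies": {"B3", "C4"},
    "recent_changes": {"B3": "image updated 10m before breach"},
}
print(run_rca(snapshot))
# -> B3: image updated 10m before breach; consider rollback
```

The point of the sketch is the shape of the workflow: each stage consumes the previous stage's output, so the whole chain runs without a human in the loop.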

Net Result and Conclusion

In the scenario above, with traditional tools and approaches, getting to the cause may take multiple teams and several hours of forensics. With this approach, even a junior Ops engineer or a less-experienced SRE would be more effective at carrying out the RCA. The above is only one example of a class of scenarios that SRE and DevOps teams have to address today. It is not hard to imagine how other problem classes and their associated decision workflows could be added to the knowledge base to extend automated analysis, significantly reducing toil and improving time to resolution.

References

Gan et al, ‘An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems,’ ASPLOS 2019, April 13–17, 2019. https://www.csl.cornell.edu/~delimitrou/papers/2019.asplos.microservices.pdf

J. Jackson, ‘Debugging Microservices: Lessons from Google, Facebook, Lyft,’ July 3, 2018. https://thenewstack.io/debugging-microservices-lessons-from-google-facebook-lyft/

The Five Whys, https://en.wikipedia.org/wiki/Five_whys
