Automating the root cause analysis (RCA) process can feel like the quest in the now-classic Monty Python movie. What Ops and SRE teams have wanted forever is to eliminate the war room: automating fault isolation in real time, getting to the cause before a major fallout, and quickly fixing the problem. The shift to microservices and serverless applications has only increased the need for this automation.
Causal Analysis is Harder in Microservices Applications
Consider how Ops teams typically run RCA today. It begins with an alert event or events, typically a service level objective (SLO) breach affecting a customer-facing service. After the alert is detected, logs and events are reviewed and the investigation begins: walking through traces surrounding the areas of the alerts to find dependencies up and down the service chains.
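The manual workflow above can be sketched in a few lines. This is a minimal illustration, not a real tool: the in-memory `LOGS` and `TRACES` structures, service names, and timestamps are all hypothetical stand-ins for queries against a log store and a trace backend.

```python
from datetime import datetime, timedelta

# Hypothetical telemetry; a real investigation queries a log/trace store.
LOGS = [
    {"ts": datetime(2023, 1, 1, 12, 0, 5), "service": "checkout", "msg": "timeout calling payments"},
    {"ts": datetime(2023, 1, 1, 12, 0, 7), "service": "payments", "msg": "db connection pool exhausted"},
]
TRACES = {                         # downstream dependencies per service
    "checkout": ["payments", "inventory"],
    "payments": ["orders-db"],
    "inventory": [],
    "orders-db": [],
}

def logs_around(alert_ts, window_s=60):
    """Step 1: pull logs in a window around the alert."""
    lo, hi = alert_ts - timedelta(seconds=window_s), alert_ts + timedelta(seconds=window_s)
    return [e for e in LOGS if lo <= e["ts"] <= hi]

def downstream_chain(service, graph):
    """Step 2: walk the dependency chain below the alerting service."""
    seen, stack = [], [service]
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.append(s)
            stack.extend(graph.get(s, []))
    return seen

alert_ts = datetime(2023, 1, 1, 12, 0, 6)
nearby_logs = logs_around(alert_ts)                 # logs near the SLO breach
chain = downstream_chain("checkout", TRACES)        # services to inspect next
```

Even in this toy form, the cost of the manual approach is visible: every service in `chain` is another round of log reading and trace inspection.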
Microservice architectures add further obfuscation: multiple tiers of virtualization, a more ephemeral application structure, and many more dependencies to explore, including shared infrastructure and services such as database calls, and Kubernetes (K8s) related issues such as autoscaling.
With so many causal sources at each layer of the application stack, the number of possible combinations yields far too many discrete failure sources to enumerate. Throwing ML at the problem to learn all possible cases is not feasible. Nor is simple correlation between different telemetry types such as logs, metrics, and traces, which can lead to false causal inferences. The net effect is that Ops and Dev teams have to spend an enormous amount of time sifting through a very high-dimensional problem space.
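Some quick arithmetic shows why enumeration breaks down. The per-layer counts below are illustrative assumptions, not measurements; even at this modest scale, the number of interacting fault combinations explodes.

```python
from math import comb

# Illustrative (assumed) counts of candidate fault sources per stack layer.
layers = {"infrastructure": 20, "kubernetes": 15, "service": 40, "code": 60}

single_sources = sum(layers.values())   # discrete single-fault sources
pairs = comb(single_sources, 2)         # faults that interact in pairs
triples = comb(single_sources, 3)       # three-way interactions

print(single_sources, pairs, triples)   # 135, 9045, 400995
```

With just 135 single sources, three-way interactions already exceed 400,000 cases, which is why a search-space-constraining strategy matters more than brute-force learning.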
Our approach at OpsCruise to automate RCA builds on two broad principles:
Because there are too many possible causal domains, we need a systematic method to constrain the search space, one that builds on prior experience and expertise, domain knowledge, and heuristics, and that enables automation of the complete RCA process.
We believe there are four key components needed for such an automated system.
Consider the small microservices segment below, with containers A1, B1-B3, C1-C4, and D1-D2. In this case the following situation occurs: we detect a latency SLO breach at the ingress node A1 and, at the same time, find anomaly alerts in containers B3 and C4.
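This scenario can be made concrete with a small dependency graph. The edges below are an assumed topology for the A/B/C/D segment described above; the key step is filtering the full set of root-to-leaf paths down to only those passing through every alerting container.

```python
# Assumed topology for the segment: A1 fans out to the B tier, and so on.
GRAPH = {
    "A1": ["B1", "B2", "B3"],
    "B1": ["C1", "C2"], "B2": ["C2", "C3"], "B3": ["C3", "C4"],
    "C1": ["D1"], "C2": ["D1"], "C3": ["D2"], "C4": ["D2"],
    "D1": [], "D2": [],
}
ALERTS = {"A1", "B3", "C4"}  # SLO breach at the ingress plus two anomalies

def all_paths(node, graph, path=None):
    """Enumerate every root-to-leaf dependency path from `node`."""
    path = (path or []) + [node]
    children = graph.get(node, [])
    if not children:
        yield path
        return
    for child in children:
        yield from all_paths(child, graph, path)

# Keep only the paths that pass through every alerting container.
suspect = [p for p in all_paths("A1", GRAPH) if ALERTS <= set(p)]
print(suspect)  # [['A1', 'B3', 'C4', 'D2']]
```

Of the six root-to-leaf paths in this graph, only one contains all three alerting containers, so the investigation collapses from the whole segment to a single chain.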
In a typical scenario, one would have to examine each alert for these three services, check the logs to dig into each anomaly, and then pull up the traces for each of the possible paths containing A1, B3, and C4.
While finding the offending dependency paths can be done using different methods such as instrumented code tracing or, preferably, real-time flow tracing, isolating the problem source usually requires significant manual effort. Instead, an automated system would kick off the analyses immediately after the ingress latency breach is detected and do the following:
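A minimal sketch of that automated pipeline, with assumed function names and data shapes (not a real API): on the ingress breach, gather concurrent anomaly alerts, restrict attention to dependency paths through alerting nodes, then use a knowledge base to rank candidates, deepest alerting container first.

```python
def automated_rca(ingress_alert, get_alerts, get_paths, knowledge_base):
    """Triggered as soon as the ingress latency SLO breach is detected."""
    ingress = ingress_alert["node"]
    # 1. Gather all anomaly alerts raised in the same time window.
    alerts = {ingress} | get_alerts(ingress_alert["window"])
    # 2. Restrict attention to dependency paths through alerting nodes.
    suspect = [p for p in get_paths(ingress) if alerts <= set(p)]
    # 3. Rank candidate causes deepest-first, annotating from the
    #    knowledge base of known failure classes.
    ranked, seen = [], set()
    for path in suspect:
        for node in reversed(path):
            if node in alerts and node != ingress and node not in seen:
                seen.add(node)
                ranked.append((node, knowledge_base.get(node, "anomaly detected")))
    return ranked

# Hypothetical inputs standing in for live telemetry services.
paths = [["A1", "B1", "C1"], ["A1", "B3", "C4"]]
result = automated_rca(
    {"node": "A1", "window": (0, 60)},
    get_alerts=lambda window: {"B3", "C4"},
    get_paths=lambda node: paths,
    knowledge_base={"C4": "DB connection saturation"},
)
print(result)  # [('C4', 'DB connection saturation'), ('B3', 'anomaly detected')]
```

The deepest-first ranking encodes a simple heuristic: the downstream-most alerting container is the most likely origin, with upstream alerts treated as symptoms propagating back toward the ingress.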
In the scenario above, with traditional tools and approaches, getting to the cause may take multiple teams and several hours of forensics. With this approach, even a junior Ops engineer or a less experienced SRE could carry out the RCA effectively. The above is only one example of a class of scenarios that SRE and DevOps teams have to address today. It is not hard to imagine how other problem classes and their associated decision workflows could be added to the knowledge base, extending automated analysis, significantly reducing toil, and improving time to resolution.