A 2022 survey on the state of Kubernetes shows it is now a mainstream technology for software development and cloud adoption, with nearly half of current users expecting to grow their number of clusters by more than 50%. At the same time, ensuring performance guarantees is a significant challenge for Ops and SRE teams when the application topology changes continually, services are added or deleted, the code (‘image’) behind a service changes, and the number of containers is large. Because of the many dependencies between microservices, and between application components, Kubernetes, and the underlying infrastructure, diagnosing problems is more complex in K8s applications. Even a small, seemingly innocuous change in a Kubernetes deployment can lead to an application slowdown or, worse, a crash of the business service. To paraphrase an old adage, ‘for want of a nail the battle can be lost.’
Finding an exact root cause in a large K8s application is neither easy nor deterministic, given the high cardinality of interacting objects and the dynamism and scale of the environment. Ops and SRE teams have limited real-time visibility into the state of the application as they dig through metrics, logs, traces, and deployment changes.
What an effective automated root cause analysis (RCA) system can do is quickly narrow down the areas of the application, point the Ops team to the few components or objects that are the likely cause -- and surface the data and insights relevant to that fault domain. We take inspiration from how expert SREs solve problems in war rooms: they draw on their extensive experience and on different sources of information, including:
In effect, an automated RCA system is an “SME in a box” that determines the overall ‘state’ of the application across all telemetry and eliminates components that are unlikely to be responsible for the performance problem. We have implemented automated RCA as a dynamic decision AI engine that:
The dynamic decision engine makes assessments across all telemetry, dependencies, alert details and their analyses, configuration, and more, to eliminate areas that are not relevant to the problem and isolate the fault domain. An example will illustrate how automated RCA works.
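To make the idea concrete, here is a minimal sketch in Python of the core elimination step -- this is our simplification, not OpsCruise's actual engine, and the dependency edges and anomaly flags are illustrative: starting from the component that breached its SLO, walk its dependency graph and keep only the components that are themselves anomalous; everything else is eliminated from the fault domain.

```python
from collections import deque

def isolate_fault_domain(breached, depends_on, is_anomalous):
    """Keep only anomalous components reachable from the SLO-breaching one.

    breached:     name of the component that breached its SLO
    depends_on:   dict mapping a component to its downstream dependencies
    is_anomalous: dict of per-component flags (e.g. from ML behavior models)
    """
    fault_domain, seen, queue = [], {breached}, deque([breached])
    while queue:
        comp = queue.popleft()
        if is_anomalous.get(comp, False):
            fault_domain.append(comp)          # still a suspect
        for dep in depends_on.get(comp, []):   # follow the causal path
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return fault_domain                        # everything else is eliminated

# Illustrative topology loosely based on the walkthrough that follows
deps = {"nginx": ["cartserver"], "cartserver": ["cartcache"], "cartcache": []}
flags = {"nginx": True, "cartserver": True, "cartcache": True}
print(isolate_fault_domain("nginx", deps, flags))  # ['nginx', 'cartserver', 'cartcache']
```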
We show how automated RCA isolates the cause of a performance slowdown in a sample application that is not instrumented for traces. The slowdown is detected using flow metrics from eBPF (note that OpsCruise uses only open-source and OTel monitoring sources): as shown below, the ingress service ‘nginx’ exceeds its preset SLO of 4 seconds. Automated RCA dynamically creates a high-latency-path chart that contains the anomalous services and containers.
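As a rough illustration of the trigger (the real detection runs on eBPF flow metrics; the sample window and the simple p95 check below are assumptions), an ingress latency sample whose high percentile exceeds the 4-second SLO is what kicks off the RCA:

```python
SLO_SECONDS = 4.0  # preset SLO on the ingress service ('nginx' in the example)

def breaches_slo(latency_samples_s, percentile=0.95):
    """Rough high-percentile check over a recent window of request latencies (seconds)."""
    if not latency_samples_s:
        return False
    ordered = sorted(latency_samples_s)
    high = ordered[int(percentile * (len(ordered) - 1))]
    return high > SLO_SECONDS

# A recent window where slow requests push p95 past the SLO -> trigger RCA
print(breaches_slo([0.8, 1.2, 4.6, 5.1, 4.9, 0.9]))  # True
```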
It is not surprising that the high-latency path shows four other anomalous components (services and containers), discovered by OpsCruise’s anomaly detection mechanism, which does not require setting thresholds or selecting metrics to monitor.
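The production mechanism is a learned ML behavior model; purely to illustrate what “no manually set thresholds” means, here is a toy baseline learned from a metric’s own history (a robust z-score, which is our simplification, not OpsCruise’s model):

```python
import statistics

def is_anomalous(history, current, sensitivity=3.0):
    """Flag a value that deviates strongly from the metric's own learned
    baseline (median and MAD); no per-metric threshold is configured."""
    baseline = statistics.median(history)
    mad = statistics.median(abs(x - baseline) for x in history) or 1e-9
    return abs(current - baseline) / mad > sensitivity

# The baseline comes from normal behavior; the only knob is a generic sensitivity.
print(is_anomalous([120, 118, 125, 119, 122, 121], 0))  # True
```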
Turning to these other alerts, we start with the first one, on ‘cartcache’. The alert, raised by the ML from its learned behavior model, shows increased errors in the container and that all traffic (L4 bytes or packets) to the next container (the “supply side”) has dropped to zero.
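A hypothetical check mirroring what this alert surfaced (the flow-record shape and the ‘cartserver’ edge are assumptions, not an OpsCruise schema): errors are up while L4 bytes to the supply-side container have gone to zero.

```python
def supply_side_silent(flow_records, src="cartcache"):
    """flow_records: iterable of dicts such as
    {"src": "cartcache", "dst": "cartserver", "bytes": 0, "errors": 7}."""
    outbound = [f for f in flow_records if f["src"] == src]
    total_bytes = sum(f["bytes"] for f in outbound)
    total_errors = sum(f.get("errors", 0) for f in outbound)
    return total_errors > 0 and total_bytes == 0

flows = [{"src": "cartcache", "dst": "cartserver", "bytes": 0, "errors": 7}]
print(supply_side_silent(flows))  # True -> follow the causal path downstream
```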
The next alert is on the ‘cartserver’ service, and it indicates an immediate problem: there are no pods behind the service.
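A minimal sketch of the underlying Kubernetes check, using the official Python client (the namespace is an assumption): a Service whose Endpoints object has no ready addresses has no pods serving it.

```python
from kubernetes import client, config

def service_has_no_pods(name="cartserver", namespace="default"):
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    eps = client.CoreV1Api().read_namespaced_endpoints(name, namespace)
    ready = [addr for s in (eps.subsets or []) for addr in (s.addresses or [])]
    return len(ready) == 0     # True -> nothing is serving this Service

print(service_has_no_pods())
```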
Following the causal path, checking the container and pods behind ‘cartserver’ reveals more specifics on the reason for the anomaly: the container is stuck in a Pending state due to an ImagePullBackOff alert detected from Kubernetes.
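The signal behind this step is visible directly in the pod’s container statuses; a sketch with the Python client follows (the label selector is an assumption):

```python
from kubernetes import client, config

def image_pull_backoff_pods(namespace="default", selector="app=cartserver"):
    config.load_kube_config()
    pods = client.CoreV1Api().list_namespaced_pod(namespace, label_selector=selector)
    stuck = []
    for pod in pods.items:
        for cs in (pod.status.container_statuses or []):
            waiting = cs.state.waiting
            if waiting and waiting.reason in ("ImagePullBackOff", "ErrImagePull"):
                stuck.append((pod.metadata.name, cs.image, waiting.reason))
    return stuck  # e.g. [(pod name, image, 'ImagePullBackOff')]

print(image_pull_backoff_pods())
```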
Finally, the RCA fishbone tab indicates the real cause of the image backoff error: a bad image name was deployed, creating a startup failure.
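Outside the UI, the same root cause can be confirmed by comparing the image configured in the Deployment spec with the Kubernetes events explaining the failed pull; a sketch follows (the deployment name and namespace are assumptions):

```python
from kubernetes import client, config

def deployed_image_and_pull_errors(name="cartserver", namespace="default"):
    config.load_kube_config()
    dep = client.AppsV1Api().read_namespaced_deployment(name, namespace)
    images = [c.image for c in dep.spec.template.spec.containers]
    events = client.CoreV1Api().list_namespaced_event(
        namespace, field_selector="reason=Failed")
    messages = [e.message for e in events.items
                if name in (e.involved_object.name or "")]
    return images, messages  # a typo'd image name here is the root cause

images, messages = deployed_image_and_pull_errors()
print(images, messages[:1])
```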
The full causal chain shown above was detected automatically by the RCA system once the initial SLO breach was flagged, as the decision engine searched dependencies, related alerts, events, and information provided by the ML models. There was no need for Ops to dig through different alerts, events, metrics, or flows, to construct the causal spatial dependencies, or to find the contextual links between them. Recall also that no traces were available here, even though traces are often what most Ops teams rely on to diagnose and solve performance problems.
We believe that troubleshooting performance issues in K8s applications requires an automated RCA system that can quickly focus the Ops team on the few components or objects that are the likely cause -- and surface the data and insights relevant to the fault domain. In effect, automated RCA acts like an expert SRE who builds on domain and diagnostic-process knowledge and pulls together all available information: telemetry from metrics to logs and events (and traces), along with configuration, to isolate the cause. The benefit is a significant decrease in manual effort (“toil”) and time to resolution.