The ultimate goal for Ops (and SRE) teams is to stay on top of their applications before a problem degrades service levels or an outage impacts customers. Meeting this goal requires detecting problems early and, where possible, proactively tracing a problem back to its source. At a deeper level, it requires Ops to build as complete an understanding of the application state as possible in real time, and to extract the insights needed to quickly identify likely problem sources, isolate the actual cause, and run remediation actions.
None of this is new in IT systems. However, modern microservices-based cloud applications create even bigger challenges for Ops.
As noted in our earlier blogs (see ‘Are Microservices Turning Applications into Complex Networks’ and ‘Microservices: An Explosion of Metrics and Few Insights?’), microservice applications are complex and often large-scale distributed systems. Identifying emerging problems and isolating their causes becomes an order of magnitude more difficult given the dependencies between multiple interacting services (components such as containers), with each service in turn depending on underlying orchestration and infrastructure services.
The typical diagnostics process for identifying a root cause, whether long established or only now emerging, may include some or all of the following steps:
As can be seen, a heterogeneous set of observability objects may have to be considered in the causal analysis process, as shown in the Ishikawa diagram below.
Typical sequence of observability-object processing for problem detection and fault isolation
Unfortunately, many of the analysis steps in the above process limit Ops teams' effectiveness, let alone their ability to get to root causes in real time.
There are a number of limitations in the current approach to actionable Observability for cloud applications. These stem from the following:
Nature of Alerts: Alerting on metrics, a critical step in problem or anomaly detection, is based on thresholds. Thresholds are usually set manually, or derived from a historical range or statistical distribution. Unfortunately, threshold-based problem detection can generate a large number of false alerts, because thresholds do not necessarily represent the valid operating ranges of an application component.
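As a rough illustration of this limitation, the sketch below (all data and names are hypothetical) compares a manually set static threshold with a simple distribution-based baseline: the static threshold misses a latency spike that a statistical check against the component's own operating range catches.

```python
import statistics

# Hypothetical latency samples (ms) for one service; values are illustrative.
samples = [102, 98, 110, 105, 95, 380, 101, 99]

STATIC_THRESHOLD_MS = 500  # manually chosen, far above normal operating range

def static_alerts(values, threshold):
    """Flag values above a fixed, manually set threshold."""
    return [v for v in values if v > threshold]

def baseline_alerts(values, k=2):
    """Flag values more than k standard deviations above the sample mean,
    i.e. outside the component's observed operating range."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if v > mean + k * stdev]

print(static_alerts(samples, STATIC_THRESHOLD_MS))  # [] -- the 380 ms spike is missed
print(baseline_alerts(samples))                     # [380] -- flagged against the baseline
```

The same mechanism cuts both ways: a threshold set too low produces false alerts, while one set too high (as here) misses real anomalies; a baseline derived from observed behavior adapts to the component.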
Type of Processing: Threshold-based alerts, even when they are not false alerts, result in post-facto diagnostics. Similarly, processes such as tracing can only be applied after traces have been sampled, collected, and analyzed offline, so they cannot support real-time issue detection. Both aspects reduce the possibility of early, real-time problem detection and the prospect of proactive intervention and remediation.
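The sampling issue can be made concrete with a minimal sketch of head-based trace sampling (names and the sampling rate are illustrative): the keep/drop decision is made when the request starts, so an unsampled request that later fails leaves no trace behind to analyze.

```python
import random

SAMPLE_RATE = 0.01  # keep 1% of traces, a common default in tracing setups

def head_sampled(trace_id_seed, rate=SAMPLE_RATE):
    """Decide at request start whether this trace will ever be recorded.

    The decision is deterministic per trace id and made before the
    request's outcome is known -- a failing request that falls in the
    other 99% is simply never traced.
    """
    rng = random.Random(trace_id_seed)
    return rng.random() < rate

# Roughly 1% of 10,000 hypothetical requests ever produce a trace.
kept = sum(head_sampled(i) for i in range(10_000))
```

Because the decision precedes the failure, trace data is inherently post-facto: it can explain some incidents after the fact, but cannot drive real-time detection.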
Sequence of Processing: The sequence in which observability objects are processed has a significant impact on the efficacy of problem isolation and root cause analysis. For example, many processes and rule books are based on examining log-based alerts first, then finding the relevant metrics, and then identifying the topological dependencies. With this order, we lose the opportunity to detect emerging problems in real time, because we wait for preset alerts to be triggered.
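To make the ordering concern concrete, here is a minimal sketch (all data structures, names, and the anomaly check are hypothetical) contrasting a logs-first, alert-driven order with a metrics-first order that scans metric streams continuously and walks topological dependencies directly.

```python
def logs_first_pipeline(alerts, metrics, topology):
    """Reactive order: nothing happens until a log-based alert fires;
    only then are metrics and topology consulted."""
    for alert in alerts:
        service = alert["service"]
        yield {
            "service": service,
            "metrics": metrics.get(service, []),
            "deps": topology.get(service, []),
        }

def metrics_first_pipeline(metrics, topology):
    """Proactive order: scan metric streams for emerging anomalies,
    then walk the dependency graph -- no alert needs to fire first."""
    for service, values in metrics.items():
        mean = sum(values) / len(values)
        if max(values) > 2 * mean:  # crude stand-in for an anomaly detector
            yield {"service": service, "deps": topology.get(service, [])}

# Hypothetical observability objects for two services.
metrics = {"checkout": [100, 100, 100, 900], "cart": [50, 52, 49, 51]}
topology = {"checkout": ["cart", "db"], "cart": ["db"]}

flagged = list(metrics_first_pipeline(metrics, topology))
```

In this toy example the metrics-first pass surfaces the checkout anomaly and its dependencies immediately, while the logs-first pass would sit idle until a preset alert on checkout was eventually triggered.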
Repeatability and Consistency: The current processing pipeline relies on a number of manual steps. For example, exploring spatial and temporal dependencies and reviewing trace data is a manually intensive process. If the steps were instead repeatable and consistent, the process would be less error-prone, more scalable, and automatable.
Beyond the above limitations, the mechanisms for collecting the different Observability objects are also a challenge, since the objects are often obtained from multiple proprietary, isolated systems; e.g., metrics and traces are often collected via proprietary agents and code instrumentation.
A more effective approach to early problem detection and causal analysis is clearly needed - one that enables proactive and autonomous problem resolution and reduces performance outages.
In the last few years, Observability metrics for cloud applications have become far more available, thanks to the rapid adoption of CNCF monitoring and observability systems around Kubernetes (K8s). We can now leverage the open-source telemetry of Prometheus, Jaeger, and Loki, and build a more effective Observability practice around the three pillars of metrics, logs, and traces, albeit with some qualifiers.
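As one concrete example of leveraging this open telemetry, the sketch below builds an instant-query URL for the Prometheus HTTP API. The `/api/v1/query` endpoint and the `query` parameter are standard Prometheus; the in-cluster address and the histogram metric name are assumptions for illustration.

```python
import urllib.parse

PROM_URL = "http://prometheus:9090"  # assumed in-cluster service address

def instant_query_url(base, promql):
    """Build a Prometheus HTTP API instant-query URL (GET /api/v1/query)."""
    return f"{base}/api/v1/query?{urllib.parse.urlencode({'query': promql})}"

# P99 request latency per service over 5 minutes -- a typical query against
# the metrics pillar (metric name assumed to follow common conventions).
promql = (
    "histogram_quantile(0.99, "
    "sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))"
)
url = instant_query_url(PROM_URL, promql)
```

An HTTP GET on `url` would return a JSON body with a `data.result` vector of per-service quantiles; equivalent pulls from Loki (logs) and Jaeger (traces) supply the other two pillars.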
Our approach is based on building a detailed understanding of the application, of both its structure and behavior (see “Real-Time Application Maps for Proactive and Actionable Visibility”). This entails using the following features of the application to extract key insights:
Steps in Effective Observability
By cohesively integrating and processing the different Observability objects in the correct order and context, we can not only gain the most unified visibility into the state of the application, but also proactively detect and isolate problems within it.