The movement to the cloud has fundamentally changed how applications are written and deployed. Cloud-native applications, built from containerized microservices, will by some estimates account for 90% of all applications by 2022.
Microservices applications are characterized not only by large scale in the number of heterogeneous components (containers or pods, nodes or VMs), but also by highly dynamic structure, transient workloads, and continuous change within application components. Legacy monitoring systems that pre-date Kubernetes were built on disparate, closed, and proprietary silos of data collectors and their associated analyses. Ops teams had to build up and train on multiple dedicated, specialized tools for logs, metrics, and traces.
Understanding and managing the performance of microservices applications, where Ops teams receive an order of magnitude more signals, is far more challenging. Teams cannot scale their headcount at the pace of new application roll-outs, and they over-provision cloud resources because they lack visibility. The old ways of monitoring are no longer enough!
The Kubernetes and cloud ecosystem from the CNCF offers many free monitoring tools, such as Prometheus for real-time metrics and Fluentd and Loki for logs, along with telemetry tools that can provide the information needed for actionable observability. In its April 2020 report, Gartner estimated that 'by 2025, 50% of new cloud-native application monitoring will use open-source instrumentation instead of vendor-specific agents for improved interoperability, up from 5% in 2019.'
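To make "open-source instrumentation" concrete, here is a minimal sketch of the pattern Prometheus popularized: an application exposes its own counters on a /metrics endpoint in the plain-text exposition format, which a Prometheus server then scrapes on a schedule. This illustrative example uses only the Python standard library rather than the official prometheus_client package, and the metric name is hypothetical.

```python
# Illustrative sketch of Prometheus-style instrumentation, stdlib only:
# serve a /metrics endpoint in the text exposition format that a
# Prometheus server would scrape. Metric names here are hypothetical.
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request

REQUEST_COUNT = 0  # a real app would use prometheus_client's Counter


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUEST_COUNT
        REQUEST_COUNT += 1  # count scrapes as stand-in "requests"
        body = (
            "# HELP http_requests_total Total HTTP requests seen.\n"
            "# TYPE http_requests_total counter\n"
            f"http_requests_total {REQUEST_COUNT}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence per-request logging


# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Simulate one Prometheus scrape of the endpoint.
port = server.server_address[1]
scraped = urllib.request.urlopen(f"http://127.0.0.1:{port}/metrics").read().decode()
print(scraped)
server.shutdown()
```

Because the format is an open standard, any scraper, not just Prometheus, can consume these metrics, which is exactly the interoperability advantage the Gartner estimate points to.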
OpsCruise has shown that, using data from such open-source tools providing metrics, events, flows, and logs, one can build detailed insight into application operations, and then use that insight to proactively identify emerging problems, isolate their sources, and determine corrective actions.
Given the increased complexity of cloud applications, in both the volume of signals and the rate of change, a purely reactive response to application problems is becoming less acceptable. There is an imperative need for 'early warning systems' that reduce the war-room effort Ops faces.
Legacy monitoring tools are fine for viewing dashboards and setting threshold-based alerts. Unfortunately, while Ops teams can be showered with alerts, many of which are false, those alerts often neither identify nor help isolate the cause of a problem, leaving remediation to significant human effort.
What Ops needs is to detect problems early and take corrective actions that minimize war rooms. In short, Ops teams need closed-loop control of their applications!
OpsCruise’s approach is to build a model-based control system for cloud applications. Because these are complex, heterogeneous, and dynamic distributed systems, a multi-stage processing sequence is required, and each stage poses a different problem statement requiring a different solution, as shown below.
The objective of each stage in this seven-step AI/ML pipeline, known as Cruise Control, is to gain progressively deeper visibility into the application and understanding across its components, so that emerging problems can be detected and analyzed, faults isolated, and the required corrective actions determined.
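The stages of Cruise Control itself are proprietary, but the core idea behind any model-based early-warning stage, learn a model of normal behavior, then flag deviations from it, can be sketched in a few lines. This is an illustrative toy (a simple z-score detector on a latency metric), not OpsCruise's actual algorithm; the function names and thresholds are assumptions.

```python
# Illustrative sketch (not OpsCruise's algorithm): learn a per-metric
# baseline from a recent window of samples, then flag new samples that
# deviate by more than k standard deviations -- a minimal "behavior model".
from statistics import mean, stdev


def build_baseline(samples):
    """Fit the simplest possible behavior model: mean and std deviation."""
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, k=3.0):
    """Flag a sample that falls outside k sigma of the learned baseline."""
    mu, sigma = baseline
    return abs(value - mu) > k * sigma


# Training window of request latencies (ms) during normal operation.
latencies_ms = [102, 98, 105, 99, 101, 97, 103, 100]
baseline = build_baseline(latencies_ms)

print(is_anomalous(101, baseline))  # normal sample -> False
print(is_anomalous(450, baseline))  # latency spike -> True
```

A production pipeline would replace the mean/std model with richer learned models per component and correlate anomalies across the application graph, but the closed-loop principle, model, compare, act, is the same.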
In production, OpsCruise has proven it can significantly improve the observability of cloud applications, with the following business outcomes:
50% SRE productivity improvement. OpsCruise can put organizations on the path to supporting more applications at higher release velocity with existing staff. Organizations have been able to improve their SRE/Dev ratios by as much as 50%.
40% more alerts handled by L1. Because OpsCruise alerts are highly enriched and prescriptive, fewer need to be escalated to expensive L2/L3 resources for resolution.
20% fewer cloud resources. A lack of performance understanding leads to over-sized instances and pod allocations in Kubernetes. OpsCruise has enabled organizations to make adjustments using as much as 20% fewer resources without impacting performance or increasing risk.