Cloud-native applications are complex, distributed, spatiotemporal systems with highly dynamic behavior profiles. OpsCruise's AI engine, 'Cruise Control', runs as a multi-stage analytics pipeline that uses tailored, patented, knowledge-augmented machine learning (ML) techniques to detect and isolate problems so that corrective actions can be taken.
OpsCruise automatically builds ML-based predictive models of every service and component of the application. These learned models are then used in predictive mode, in real time, to detect the early onset of performance problems without the need to set thresholds, whether manually or from statistical outliers. Such preemptive problem detection enables Ops to take proactive actions, reducing MTTD and MTTR.
OpsCruise uses ML techniques tailored to the analytical needs of each processing stage. To capture context, curated knowledge about the application, Kubernetes, and the infrastructure is incorporated in each stage. The ML techniques are augmented with known facts, e.g., the application structure, the meaning of container metrics, and the implications of Kubernetes configuration. Knowledge-augmented ML creates more transparent white-box or gray-box models that allow Ops to execute granular corrective actions not possible with black-box models.
OpsCruise provides automated causal reasoning that builds on insights from its behavior models, the application topology and dependencies, and an extensible knowledge base of fault isolation diagnostics. The root cause analysis (RCA) process includes both a local microservice analysis and a global dependency analysis to isolate the problem source, significantly reducing manual war room effort.
Cruise Control executes as a real-time, multi-stage pipeline that runs from the ingestion of metrics and other application-related data through to fault isolation, enabling corrective actions.
Data from metrics, flows, logs, events, and configuration are collected from open source monitoring frameworks and used to build the Application Graph (App Graph) that represents the application structure and topology. Predefined mapping rules allow the App Graph to be built automatically without user input.
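As a rough illustration of how an application graph might be derived from observed flows, here is a minimal sketch. It is not OpsCruise's implementation: the flow record shape, the `ip_to_service` lookup (standing in for the predefined mapping rules), and the function name `build_app_graph` are all assumptions made for the example.

```python
def build_app_graph(flows, ip_to_service):
    """Return {service: set of services it calls} from (src_ip, dst_ip) flows.

    ip_to_service is a hypothetical stand-in for predefined mapping rules
    that resolve a network endpoint to the service that owns it.
    """
    graph = {}
    for src_ip, dst_ip in flows:
        # Fall back to the raw address when no mapping rule matches.
        src = ip_to_service.get(src_ip, src_ip)
        dst = ip_to_service.get(dst_ip, dst_ip)
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())  # make sure leaf services appear as nodes
    return graph


# Illustrative data only.
ip_to_service = {
    "10.0.0.1": "frontend",
    "10.0.0.2": "cart",
    "10.0.0.3": "redis",
}
flows = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.2", "10.0.0.3"),
    ("10.0.0.1", "10.0.0.9"),  # unmapped endpoint keeps its address
]
app_graph = build_app_graph(flows, ip_to_service)
```

Because the graph is rebuilt purely from observed traffic and the mapping table, no user input is needed once the rules exist, which is the property the paragraph above describes.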
As traffic flows are detected, all telemetry data are mapped onto the App Graph, and operational flow tracing provides real-time views of service interactions and performance. Dynamic changes in the application are automatically rendered in the visualization.
Monitored data collected over a 24 to 48 hour period is used to model the workload of each application microservice at scale. This ML-based model, which has a predefined template for Kubernetes containers and any microservice, is built without Ops involvement. It is then used predictively at runtime to detect the onset of performance problems across the application without the need for Ops to set or tune thresholds or rely on statistical outliers. If Ops provides simple binary feedback, it is captured to incrementally improve the model and anomaly detection.
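To make the idea of a learned behavior model concrete, the sketch below fits a simple relationship between request rate and latency from training data, then flags observations that deviate from what the model expects for the current workload. This is a deliberately simple substitute (ordinary least squares with a tolerance taken from the training residuals), not OpsCruise's patented technique; the class name `BehaviorModel`, the 1.5 margin, and the feedback factors are assumptions chosen for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


class BehaviorModel:
    """Hypothetical per-service model: expected latency given request rate."""

    def __init__(self, request_rates, latencies):
        self.slope, self.intercept = fit_line(request_rates, latencies)
        residuals = [
            abs(y - self.predict(x)) for x, y in zip(request_rates, latencies)
        ]
        # Assumed margin: 1.5x the largest deviation seen during training.
        self.tolerance = 1.5 * max(residuals)

    def predict(self, rate):
        return self.intercept + self.slope * rate

    def is_anomalous(self, rate, latency):
        # Compare against the expectation for *this* workload level,
        # not against a fixed latency threshold.
        return abs(latency - self.predict(rate)) > self.tolerance

    def feedback(self, was_real_problem):
        # Binary feedback: loosen after a false alarm, tighten after a hit.
        self.tolerance *= 0.9 if was_real_problem else 1.1


# Illustrative training window (request rate in req/s, latency in ms).
model = BehaviorModel([100, 200, 300, 400], [12, 21, 32, 41])
```

With this training data, a 41 ms latency is expected at 400 req/s but a 40 ms latency at 100 req/s is flagged, which a single static latency threshold could not express. Calling `model.feedback(False)` after a false alarm widens the tolerance slightly.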
Insights are extracted from the model and the Application Graph structure, along with leading indicators that explain the problem within the service. A sophisticated ML interpretation algorithm then extracts the leading causes of the local microservice problem. When multiple anomalies occur simultaneously, the results of the local analyses are combined with the application structure for global causal analysis.
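One simple way to picture the global step is a dependency-graph heuristic: among the services currently flagged as anomalous, prefer those that do not depend, directly or transitively, on any other anomalous service. The sketch below shows that heuristic only; it is an assumption for illustration and not the causal analysis OpsCruise actually performs. The names `reachable` and `candidate_root_causes` are hypothetical.

```python
def reachable(depends_on, start):
    """Return every service that `start` depends on, directly or transitively."""
    seen = set()
    stack = list(depends_on.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, ()))
    return seen


def candidate_root_causes(depends_on, anomalous):
    """Anomalous services with no anomalous service among their dependencies."""
    roots = set()
    for svc in anomalous:
        # Exclude the service itself in case the graph contains a cycle.
        downstream = reachable(depends_on, svc) - {svc}
        if not (downstream & anomalous):
            roots.add(svc)
    return roots


# Illustrative topology: frontend -> cart -> redis, frontend -> catalog.
depends_on = {
    "frontend": {"cart", "catalog"},
    "cart": {"redis"},
    "catalog": set(),
    "redis": set(),
}
roots = candidate_root_causes(depends_on, {"frontend", "cart", "redis"})
```

In this example the anomalies in `frontend` and `cart` are treated as likely symptoms, and `redis` is the only candidate left for closer local analysis.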
To isolate and understand the source and nature of the problem, OpsCruise uses an extensible knowledge base that codifies fault diagnostic processes specific to microservice applications. Problems that arise in the infrastructure, Kubernetes, or the application are analyzed using current state information, the results of the predictive models and anomaly analysis, and the application configuration. The automated process eliminates the need for Ops to collect the related data and perform the manual steps used today to pinpoint the cause.
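An extensible diagnostic knowledge base can be pictured as a list of rules, each pairing a condition on the collected state with a diagnosis. The sketch below is a hypothetical illustration of that structure; the rule contents, the state field names (`last_termination_reason`, `latency_anomaly`, `throttled_periods`), and the `diagnose` function are assumptions, not OpsCruise's actual knowledge base. `OOMKilled` is the termination reason Kubernetes reports when a container is killed for exceeding its memory limit.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DiagnosticRule:
    name: str
    applies: Callable[[dict], bool]  # condition over collected state
    diagnosis: str


# Hypothetical rules; new entries can be appended to extend the knowledge base.
KNOWLEDGE_BASE = [
    DiagnosticRule(
        name="oom_kill",
        applies=lambda s: s.get("last_termination_reason") == "OOMKilled",
        diagnosis="Container exceeded its memory limit; review the limit.",
    ),
    DiagnosticRule(
        name="cpu_throttling",
        # Combines a model-detected anomaly with current state information.
        applies=lambda s: s.get("latency_anomaly", False)
        and s.get("throttled_periods", 0) > 0,
        diagnosis="Latency anomaly coincides with CPU throttling; review the CPU limit.",
    ),
]


def diagnose(state, rules=KNOWLEDGE_BASE):
    """Return the names of all rules whose condition holds for `state`."""
    return [rule.name for rule in rules if rule.applies(state)]


# Illustrative state for one container.
findings = diagnose({"latency_anomaly": True, "throttled_periods": 42})
```

Keeping rules as data rather than hard-coded branches is what makes such a knowledge base extensible: adding a new diagnostic means adding one more entry.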
The causal analysis contains enough granular detail for Ops to execute a corrective action. Ops carries out the remediation using an “uber” operator or via scripted steps, per the organization’s internal process. Changes in the application estate are then detected in Stage 1, and pipeline execution is restarted. This closed-loop control enables autonomous operations.
This stage is the actual execution of the recommended remediation action. OpsCruise acts as an uber Kubernetes Operator in some cases and as an action script executor in others.