OpsCruise Cruise Control

Contextual ML for Application Performance


Cloud-native applications are complex distributed spatiotemporal systems with highly dynamic behavior profiles. To provide proactive, actionable insights, from detecting performance problems to recommending incident remediation, OpsCruise's AI engine, Cruise Control, uses several patented contextual machine learning (ML) techniques. These augment the analytical approaches OpsCruise uses for application behavior modeling and control. Cruise Control's multi-stage contextual ML engines run different types of ML and analytics processes in each of seven stages, as shown in the pipeline below.

[Figure: The Cruise Control pipeline]

Our machine learning. Explained.

Maps all data collected from the open data collectors into a normalized form to auto-build a multilayer Application Graph from configuration, metrics, and network data. The Application Graph captures the structural topology and directional dependencies across all three layers: the application layer (microservices), the Kubernetes (orchestration) layer, and the infrastructure (cloud) layer.
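To make the idea of a multilayer graph with directional dependencies concrete, here is a minimal sketch in Python. It is not OpsCruise's data model: the `ApplicationGraph` class, its method names, and the entity names are all hypothetical, chosen only to show how entities from three layers can live in one directed graph.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# The three layers named in the text.
LAYERS = ("application", "kubernetes", "infrastructure")


@dataclass
class ApplicationGraph:
    """Hypothetical directed graph whose nodes are tagged with a layer."""
    layer_of: dict = field(default_factory=dict)
    edges: dict = field(default_factory=lambda: defaultdict(set))

    def add_entity(self, name, layer):
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        self.layer_of[name] = layer

    def add_dependency(self, source, target):
        # A directed edge: source depends on target.
        for name in (source, target):
            if name not in self.layer_of:
                raise KeyError(f"unknown entity: {name}")
        self.edges[source].add(target)

    def dependencies(self, name):
        """Return every entity reachable from name, across all layers."""
        seen, stack = set(), [name]
        while stack:
            for nxt in self.edges.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


# Illustrative entities only; real names would come from the collectors.
graph = ApplicationGraph()
graph.add_entity("checkout", "application")
graph.add_entity("payments", "application")
graph.add_entity("pod/payments-0", "kubernetes")
graph.add_entity("node/vm-1", "infrastructure")
graph.add_dependency("checkout", "payments")
graph.add_dependency("payments", "pod/payments-0")
graph.add_dependency("pod/payments-0", "node/vm-1")
```

Calling `graph.dependencies("checkout")` walks from a microservice down through the orchestration layer to the infrastructure it ultimately runs on, which is the kind of cross-layer question a single graph makes easy to ask.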

Enriches and updates the Application Graph and Time Series Database with the most recent metrics and event data for each entity across all three layers. This provides the most recent known state of the application and its infrastructure.

Builds the application behavior model of the complete application using a priori curated knowledge and specialized Cruise Control ML engines: 

Application Modeling: novel implicit supervised learning is used to build a predictive model of each component of the application. The algorithm handles highly complex non-convex behavior, is multivariate and tolerant of imprecision, can learn incrementally, and is extremely scalable.

Behavior-driven Anomaly Detection: once deployed in runtime, Cruise Control can detect changes that result from internal failures, code changes or other unexpected changes in the nature of requests between services. An adjunct explanation algorithm uses ML to analyze information from the model to identify causal aspects within a component. Only occasional feedback on the correctness of the anomaly detection is required to improve its efficacy.
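The general shape of behavior-driven detection (learn what a component normally does given its inputs, then flag deviations from that prediction) can be sketched as follows. This is a deliberately simple stand-in that uses a one-variable least-squares line, not Cruise Control's patented algorithm; the function names, the threshold factor `k`, and the training data are assumptions made for illustration.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def make_detector(xs, ys, k=3.0):
    """Return a function that flags points far from the learned behavior."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    threshold = k * pstdev(residuals)  # k standard deviations of training error

    def is_anomalous(x, y):
        return abs(y - (slope * x + intercept)) > threshold

    return is_anomalous


# Hypothetical training data: request rate (req/s) versus latency (ms).
rates = [10, 20, 30, 40, 50, 60]
latencies = [21, 39, 62, 80, 101, 119]
is_anomalous = make_detector(rates, latencies)

high_latency_under_high_load = is_anomalous(60, 120)
high_latency_under_low_load = is_anomalous(10, 120)
```

The two calls at the end show why context matters: the same 120 ms latency is consistent with the learned behavior at 60 req/s but not at 10 req/s, something a fixed latency threshold could not distinguish.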

Launched whenever Stage 3 detects an anomaly and initiates its analysis:

Automated Causal Analysis and Fault Isolation: Cruise Control's ML-based dependency engine can determine the root cause of problems when multiple anomalies or incidents occur. This structured causal analysis is based on a combination of a priori knowledge and relationships between components.

Cruise Control's anomaly detection and causal analysis eliminate most false positives, reduce false correlations, and uncover hard-to-detect problems such as long-chain dependency problems.
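One common heuristic for dependency-based fault isolation is to treat an anomalous component as a root-cause candidate only if nothing it depends on, directly or transitively, is also anomalous. The sketch below illustrates that heuristic; it is an assumption about how such an analysis could work, not a description of Cruise Control's dependency engine. The service names are hypothetical, and the dependency map is assumed to be acyclic.

```python
def reachable(deps, start):
    """All components that start depends on, directly or transitively."""
    seen, stack = set(), [start]
    while stack:
        for nxt in deps.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def root_causes(deps, anomalous):
    """Anomalous components with no anomalous component downstream."""
    roots = set()
    for name in anomalous:
        downstream = reachable(deps, name) - {name}
        if not (downstream & anomalous):
            roots.add(name)
    return roots


# Hypothetical call graph: each key depends on the services in its list.
deps = {
    "frontend": ["cart", "catalog"],
    "cart": ["db"],
    "catalog": ["db"],
    "db": [],
}

simple_case = root_causes(deps, {"frontend", "cart", "db"})
long_chain_case = root_causes(deps, {"frontend", "db"})
```

In the second call, `cart` looks healthy even though it sits between the two anomalous components; following transitive dependencies still isolates `db`, which is the long-chain situation the text refers to.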

Classifies the enriched anomaly into an existing, adaptable knowledge-based taxonomy. Over time, with exposure to more varieties of anomalies, the problem classification is designed to address a greater breadth of problem types.

Provides recommended actions. It defines corrective actions that specify the necessary changes to configurations, resources, or services. The recommendation is created using a formal grammar that can be understood and acted upon by downstream systems, either OpsCruise itself or external automation systems.
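The value of a formal grammar is that a downstream system can parse a recommendation mechanically and reject anything malformed. OpsCruise's actual grammar is not shown in this document, so the sketch below invents a tiny one purely for illustration: an action verb, a `kind/name` target, and optional `key=value` parameters. The verbs, the `Recommendation` class, and `parse_recommendation` are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical grammar:  <action> <kind>/<name> [<key>=<value> ...]
RECOMMENDATION = re.compile(
    r"^(?P<action>scale|restart|set)\s+"
    r"(?P<target>[a-z]+/[a-z0-9-]+)"
    r"(?P<params>(?:\s+[a-z_]+=[^\s=]+)*)$"
)


@dataclass(frozen=True)
class Recommendation:
    action: str
    target: str
    params: dict


def parse_recommendation(text):
    """Parse one recommendation line, or raise ValueError if it is malformed."""
    match = RECOMMENDATION.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid recommendation: {text!r}")
    # Each parameter token has the form key=value.
    params = dict(p.split("=", 1) for p in match["params"].split())
    return Recommendation(match["action"], match["target"], params)


rec = parse_recommendation("scale deployment/payments replicas=5")
```

Because the result is structured data rather than free text, an automation system could dispatch on `rec.action` and `rec.target` without any guesswork about what the recommendation means.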

Once the recommendation has been provided, the system monitors its efficacy so that it can be analyzed. A closed-loop feedback action to earlier stages is taken to improve the recommendation.

This stage executes the recommended remediation action. OpsCruise acts as an uber Kubernetes Operator in some cases and as an action script executor in others.