FULL STACK HARMONY
By analyzing provisioning, configuration, metrics, event and flow information in real time, OpsCruise discovers all elements of the application estate, along with their interdependencies and interactions. For example, it can trace the flow from the ELB ingress controller all the way through to a terminal storage service such as DynamoDB or Spanner.
LIVE OPERATIONAL FLOW
OpsCruise gathers metrics on the interactions between components using high-performance mechanisms, without instrumenting the application. This lets Operations teams view the behavior of the system in its entirety and lets other technology and application teams focus on their own components. Performance problems are isolated as anomalies by modeling the operational range behavior of each component.
SUPPORT FOR BUSINESS METRICS
The application developer can emit metrics using standard tooling and have them sent to Prometheus. These application-emitted metrics can be absorbed into the object model and thereby become part of the availability and performance behavior profiles. Such metrics can be attached to any element in the object model automatically; for example, a metric emitted by a Java program can be attached to the metrics from its container. Logical business metrics that describe the application as a whole, rather than a single component, can also be auto-discovered and managed.
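As a rough illustration of what "emitting a metric" can look like, the sketch below renders a business counter in the Prometheus text exposition format using only the Python standard library. The `Counter` class, the metric name `orders_placed_total` and the `container` and `region` labels are hypothetical and chosen for illustration; a real application would normally use an official Prometheus client library, and how OpsCruise attaches such a metric to its container is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Counter:
    """A tiny stand-in for a Prometheus counter (hypothetical helper, not a real client API)."""
    name: str
    help_text: str
    values: dict = field(default_factory=dict)  # label tuple -> running total

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        # Each distinct label combination is its own time series.
        key = tuple(sorted(labels.items()))
        self.values[key] = self.values.get(key, 0.0) + amount

    def render(self) -> str:
        """Render the counter in the Prometheus text exposition format."""
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        for key, total in sorted(self.values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            series = f"{self.name}{{{label_str}}}" if label_str else self.name
            lines.append(f"{series} {total}")
        return "\n".join(lines) + "\n"


# Hypothetical business metric, labeled with the container that emitted it.
orders = Counter("orders_placed_total", "Orders placed by the shop front end.")
orders.inc(container="checkout-7d9f", region="us-east-1")
orders.inc(2, container="checkout-7d9f", region="us-east-1")
print(orders.render())
```

Labels such as `container` are one plausible way for a collector to correlate an application metric with the infrastructure element that produced it.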
TIME TRAVEL WITH TOPOLOGY
The topology of legacy systems did not change very much, so scrolling the metrics back in time was adequate for viewing the status of the application. In modern applications the topology changes frequently, so troubleshooting also requires a history of the topology. OpsCruise offers Time Travel with Topology, which greatly improves the agility of the war room; for example, a Deployment can be rolled back without problems to the version captured at a Snapshot.
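The idea of a topology history can be sketched as a list of timestamped snapshots that is searched by time. This is a minimal sketch under stated assumptions, not a description of how OpsCruise stores topology: the `TopologyHistory` class, the edge representation and the service names are all hypothetical.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class TopologyHistory:
    """Timestamped topology snapshots (hypothetical structure for illustration)."""
    times: list = field(default_factory=list)      # snapshot timestamps, ascending
    snapshots: list = field(default_factory=list)  # one frozenset of (source, target) edges each

    def record(self, timestamp: float, edges: set) -> None:
        # Assumes snapshots are recorded in chronological order.
        self.times.append(timestamp)
        self.snapshots.append(frozenset(edges))

    def at(self, timestamp: float) -> frozenset:
        """Return the topology that was in effect at the given time."""
        index = bisect.bisect_right(self.times, timestamp) - 1
        if index < 0:
            raise LookupError("no snapshot at or before this time")
        return self.snapshots[index]


history = TopologyHistory()
history.record(100, {("ingress", "cart"), ("cart", "db")})
history.record(200, {("ingress", "cart"), ("cart", "cache"), ("cache", "db")})

# Compare the topology before and after the change at t=200.
before, after = history.at(150), history.at(250)
print("added:", sorted(after - before))
print("removed:", sorted(before - after))
```

Diffing two points in time like this shows which dependencies appeared or disappeared around an incident, which metrics alone cannot reveal.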
CURATED APP REGISTRY
Today’s applications are built using a combination of well-known open source tools (e.g., Kafka) and cloud services. OpsCruise maintains a performance profile registry of such popular tools, which it uses to automatically ingest, map, analyze and predict when such a component is detected in the environment. This curated knowledge leverages the wisdom of the open source community.
Monitoring metrics alone does not provide insight into the current state and health of the application. A unique feature of OpsCruise’s platform is that it gains an in-depth understanding of the application by building predictive dynamic behavior models of all of its components.
To provide predictive insights into the application operation, OpsCruise builds an operational behavior model of the application that captures the normal operating region of the application, component by component. The model is dynamic since it accounts for the current state of the application, including all driving inputs and interacting flows into the application. The model incorporates both seasonal changes and changes that occur due to an image change within the microservice.
The model building is completely automated and runs at scale with very light and infrequent supervision. Using information from the Application Understanding, OpsCruise identifies the relevant metrics for each application component, collects all related data, and builds the behavior model using a novel ML technique. The modeling process runs at scale for all containers of the application without any involvement of the Ops team. To validate the model, OpsCruise periodically solicits binary feedback from the Ops team.
The application behavior modeling provides numerous benefits to Ops. First, the model exposes which key metrics influence the application behavior. OpsCruise then extracts the metrics that matter and should be monitored, so the Ops team does not need to decide which metrics to monitor. Second, using the model predictively, OpsCruise can determine whether the application is operating correctly, so Ops does not need to guess which thresholds to set to detect anomalies. Third, as the application changes (services added, changed or deleted), the model is continuously updated to reflect the new correct operating behavior without Ops involvement.
By using the learned behavior models to check each application component’s health, OpsCruise can detect emerging problems across the application without the need to set thresholds. To achieve this, OpsCruise infers the correct operating region of each component from behavior observed at runtime. Once the normal and correct region of operation is known, OpsCruise’s ML runtime system, Cruise Control, can detect whether the component is deviating from its expected behavior for the current demands. This avoids the guesswork involved in setting thresholds and the assumption that historical trends are always the best indicator of correct behavior.
Because anomaly detection is not based on static or historical thresholds, there are far fewer noisy or false alerts for Ops to contend with.
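The contrast between a static threshold and a demand-aware operating region can be sketched with a deliberately simple stand-in model. The example below fits an ordinary least-squares line of latency against request rate and flags points that fall outside a residual band. This is not OpsCruise's ML technique, which the text does not describe; the function names, the band width and the training data are assumptions made for illustration.

```python
from statistics import mean, pstdev


def fit_operating_region(demand, response, width=3.0):
    """Fit response ~ slope * demand + intercept, plus a residual band around the line."""
    d_mean, r_mean = mean(demand), mean(response)
    slope = (sum((d - d_mean) * (r - r_mean) for d, r in zip(demand, response))
             / sum((d - d_mean) ** 2 for d in demand))
    intercept = r_mean - slope * d_mean
    residuals = [r - (slope * d + intercept) for d, r in zip(demand, response)]
    return slope, intercept, width * pstdev(residuals)


def is_anomalous(model, demand, response):
    """True when the response is outside the expected region for this level of demand."""
    slope, intercept, band = model
    return abs(response - (slope * demand + intercept)) > band


# Hypothetical training data: requests per second and observed latency in ms.
rates = [100, 200, 300, 400, 500]
latency = [21, 39, 61, 79, 101]
model = fit_operating_region(rates, latency)

# 100 ms is high in absolute terms but expected at 500 requests per second.
print(is_anomalous(model, 500, 100))
# 60 ms would pass a static 80 ms threshold, yet it is abnormal at 100 requests per second.
print(is_anomalous(model, 100, 60))
```

A static 80 ms threshold would raise a false alert on the first observation and miss the second; conditioning on demand reverses both outcomes.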
AVOID FALSE NEGATIVES
With the behavior model running in predictive mode at runtime, OpsCruise detects the emergence of problems by noting deviations from expected behavior given the current state, i.e., the component is outside the expected region of operation for the demands or requests, as predicted by the application behavior model. Since these changes can be detected before a service breaches its SLO, these anomaly alerts can be used by Ops to take proactive corrective actions.
PROACTIVE PRESCRIPTIVE RECOMMENDATION
Once it detects an anomaly, OpsCruise uses a novel ML-based explanation algorithm to analyze the anomaly and determine which metrics, including resources or services, are its likely causes. This analysis yields granular, prescriptive recommendations that reach Ops as notifications through the existing incident management system, so the problem can be remedied before it has an impact.
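The explanation algorithm itself is not described here, so the sketch below substitutes a much simpler technique to convey the idea of ranking likely causes: each metric is scored by how many standard deviations its current value sits from its recent baseline. The function name, metric names and values are hypothetical.

```python
from statistics import mean, pstdev


def rank_suspect_metrics(baseline, current):
    """Rank metrics by distance of the current value from its baseline, in standard deviations."""
    scores = {}
    for name, history in baseline.items():
        spread = pstdev(history) or 1e-9  # avoid division by zero for flat metrics
        scores[name] = abs(current[name] - mean(history)) / spread
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical recent history and current readings for one anomalous component.
baseline = {
    "cpu_utilization": [0.40, 0.42, 0.38, 0.41],
    "db_connection_wait_ms": [5, 6, 5, 4],
    "heap_used_mb": [510, 500, 505, 515],
}
current = {"cpu_utilization": 0.43, "db_connection_wait_ms": 48, "heap_used_mb": 512}

for name, score in rank_suspect_metrics(baseline, current):
    print(f"{name}: {score:.1f} standard deviations from baseline")
```

In this made-up data the database connection wait time ranks first by a wide margin, which is the kind of pointer a prescriptive recommendation could be built on.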
FAULT ISOLATION IN LONG CHAINS
Microservice applications can often surface multiple alerts, including anomalies and SLO breaches, that span long chains of application components. OpsCruise is unique in its ability to isolate the root cause in such cases. This is possible because OpsCruise understands the application topology and its interdependencies and can eliminate false alerts by verifying the behavioral correctness of each component across the service chain.
NO AGENTS, NO KUBERNETES SIDECARS
Traditional monitoring systems require proprietary agents to be deployed on every host or sidecars to be included in every container orchestrated by Kubernetes. OpsCruise instead leverages open source instrumentation and frameworks.
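One simple way to picture fault isolation over a dependency chain: among all alerting components, keep only those with no alerting component downstream of them, since upstream alerts are plausibly symptoms. This is a minimal sketch of that heuristic, not OpsCruise's actual method; the function name, the dependency map and the service names are hypothetical.

```python
def isolate_root_causes(dependencies, anomalous):
    """Return anomalous components that have no anomalous component downstream of them."""
    def has_anomalous_downstream(component, seen):
        for dep in dependencies.get(component, []):
            if dep in seen:
                continue  # guard against dependency cycles
            seen.add(dep)
            if dep in anomalous or has_anomalous_downstream(dep, seen):
                return True
        return False

    return sorted(c for c in anomalous if not has_anomalous_downstream(c, {c}))


# Hypothetical service chain: each component maps to the components it calls.
dependencies = {
    "ingress": ["frontend"],
    "frontend": ["cart", "catalog"],
    "cart": ["redis"],
    "catalog": ["postgres"],
}
alerts = {"ingress", "frontend", "cart", "redis"}

print(isolate_root_causes(dependencies, alerts))
```

Here four alerts along the chain collapse to a single candidate, `redis`, because every other alerting component depends on something that is itself alerting.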
NO APP CHANGES
No code changes or instrumentation are required in your applications. OpsCruise harnesses eBPF tracing and other networking techniques to automatically capture L4/L7 data from the network stack in the operating system and correlate it with namespaces, tags, and environment characteristics from your private/public cloud infrastructure, Kubernetes and Docker.
OpsCruise was architected from Day 1 to be native to K8s, open standards and open source monitoring tools (e.g., Prometheus, Fluentd, Jaeger, Grafana). While OpsCruise can support traditional VM and physical hosts, it is K8s- and container-centric in its design, visualization and workflow.
LEVERAGE YOUR TELEMETRY BEYOND MONITORING
In line with open standards and best practices, OpsCruise does not need to be the long-term store for your metrics, logs and traces. Modern enterprises collect this data centrally, once, for multiple use cases beyond monitoring, including security analytics, chargeback, capacity planning and user experience management. OpsCruise selectively analyzes this data in real time for trending, fault isolation and causal analysis.
OpsCruise operates with negligible resource overhead, so you can safely deploy it in any production environment without impacting your application. OpsCruise captures flow data statistics without being in the data path. The OpsCruise service runs an extremely efficient machine-learning processing pipeline that provides aggregate views as well as behavior profiles of every component in the application estate.
WORKS WITH YOUR EXISTING TOOLS
OpsCruise integrates with your existing monitoring, ticketing and incident management tools.