Highly enriched alerts. Happy SRE/DevOps Teams.

OpsCruise eliminates the 'war room' by isolating problems in real time and routing them to the right infrastructure or application team for fast resolution.

Automated fault isolation

OpsCruise isolates issues before they become problems and engages the right team.

It starts with profiling your application's behavior.

To truly determine whether your cloud application is running correctly, you need to know how it is expected to behave under its current conditions. Think of driving a car: you know it will speed up when you give it more gas and pick up more slowly when you take on extra passengers. You don't set a priori thresholds on the car’s speed or acceleration to know there is a problem. Similarly, OpsCruise knows whether a container is working as expected under different conditions by automatically profiling and learning application behavior with its specialized 'Cruise Control' ML algorithms.

Anomaly detection minimizing false positives.

OpsCruise is focused on assuring your application performance. To enable proactive intervention, OpsCruise uses its behavioral analytics to continuously monitor for, and predict, anomalous behavior. Because it does not rely on fixed thresholds, it raises very few false alerts. At the same time, behavioral analytics detects errant behavior that signals an incipient issue that may need attention. Further, OpsCruise does not escalate every alert; it processes alerts with curated knowledge to eliminate the alert storms common with traditional monitoring tools.

Causation, beyond correlation.

The classic example of correlation without causation is that rates of violent crime and murder are strongly correlated with ice cream sales in summer, yet ice cream does not cause crime. Unfortunately, many AIOps solutions fail to strictly enforce the rule that ‘correlation does not imply causation.’ To avoid this trap, OpsCruise applies two principles: first, it leverages context around the application, namely its structure and topology as well as its intra- and inter-dependencies; second, it uses curated knowledge of IT operations to determine where a likely cause lies, for example by detecting noisy neighbors. Grounding causal analysis in this context minimizes false conclusions. Finally, OpsCruise exposes the contextual details behind its conclusions, so you have full transparency into the process.
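The noisy-neighbor case shows how topology context differs from plain correlation. The following is a small, hypothetical Python sketch, not OpsCruise's method. It assumes each container record carries the node it runs on plus a learned CPU baseline; the `Container` fields, the `noisy_neighbor_suspects` helper, and the 2x ratio are illustrative assumptions. A container whose CPU spiked at the same moment but runs on a different node correlates in time, yet it is excluded because it cannot compete for the same resources.

```python
from dataclasses import dataclass

# Hypothetical data model: each container knows which node it runs on and
# carries a learned CPU baseline (cores) alongside its current usage.
@dataclass
class Container:
    name: str
    node: str
    cpu_baseline: float
    cpu_now: float

def noisy_neighbor_suspects(victim, containers, ratio=2.0):
    """Return co-located containers whose CPU use exceeds ratio x their own baseline."""
    return [
        c.name
        for c in containers
        if c is not victim
        and c.node == victim.node          # topology context: must share the node
        and c.cpu_now > ratio * c.cpu_baseline
    ]

containers = [
    Container("checkout", "node-a", 1.0, 1.1),   # the anomalous service
    Container("batch-job", "node-a", 0.5, 3.0),  # co-located, CPU spiked
    Container("cache", "node-a", 1.0, 1.2),      # co-located, normal
    Container("reports", "node-b", 0.5, 3.0),    # spiked too, but on another node
]

print(noisy_neighbor_suspects(containers[0], containers))  # ['batch-job']
```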

Live Operational Tracing to fault isolate.

Extending causal analysis to isolate the fault when multiple anomalies are detected requires determining how the containers exhibiting those anomalies might be related. To expose and monitor these interactions, OpsCruise provides “live” Operational Tracing, so you can see how pairs of neighboring containers interact. For example, if one container is placing more load on another, or vice versa, OpsCruise relates that load to the performance and resource metrics so you can determine which container is the more likely cause of the problem. Most importantly, OpsCruise’s Operational Tracing requires neither code instrumentation nor heavyweight tracing infrastructure.

See long-chain dependencies.

A challenging problem in fault isolation for microservice applications is long-chain dependencies: a problem that occurs in one service can create problems in a service many hops away. To enable fault isolation in such situations, OpsCruise employs a novel solution: automated tracing of the path, or chain, across the containers that span these anomalies, together with an associated Service Performance dashboard. These provide immediate visibility into all performance metrics and the state of the services along the traced path. Using the Service Performance dashboard with Operational Tracing, you can quickly narrow down the source of the problem, even in the case of long-chain dependencies.
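The idea of tracing a chain of anomalies can be sketched roughly in Python. The sketch assumes the call graph is already known as a caller-to-callees mapping and that anomaly detection has already flagged a set of services. It walks downstream from an anomalous entry point, stays on anomalous services, and returns the longest such path; the last service on that path is a reasonable first candidate for where the problem originated. `trace_anomalous_chain` and the service names are hypothetical, and this is not OpsCruise's implementation.

```python
# Hypothetical sketch: walk caller -> callee edges, staying on services already
# flagged as anomalous, and return the longest such chain from the entry point.

def trace_anomalous_chain(calls, anomalous, start):
    """Return the longest path from `start` that passes only through anomalous callees."""
    best = [start]

    def walk(node, path):
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for callee in calls.get(node, []):
            # Skip healthy services and avoid looping on cyclic call graphs.
            if callee in anomalous and callee not in path:
                path.append(callee)
                walk(callee, path)
                path.pop()

    walk(start, [start])
    return best

calls = {
    "frontend": ["cart", "catalog"],
    "cart": ["checkout"],
    "checkout": ["payment"],
    "payment": ["db"],
    "catalog": ["db"],
}
anomalous = {"frontend", "cart", "checkout", "payment"}

chain = trace_anomalous_chain(calls, anomalous, "frontend")
print(" -> ".join(chain))  # frontend -> cart -> checkout -> payment
```

Exhaustive depth-first search keeps the sketch short. On large call graphs, a real system would need something cheaper, because longest-path search is exponential in the worst case.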