A Modern Scalable Observability Architecture for Your Modern Apps
OpsCruise was born in the cloud. Its future-safe architecture treats Kubernetes as a first-class object and taps into a range of monitoring tools, configuration data, and network flows for its visualization and analysis.
OpsCruise ingests information from the application, Kubernetes orchestration, and infrastructure layers. It does this by tapping into the monitoring and configuration environment with lightweight gateway Pods that are added to the Kubernetes cluster.
OpsCruise collects a diverse set of telemetry data for analysis from open source tools and cloud infrastructure, but those tools and services continue to be the long-term archival store.
Key Aspects of the Architecture
Easy installation using Kubernetes and related popular tools like Helm
No impact to your application workloads and worker nodes
No changes to your application code, no added sidecars
All connections to the Kubernetes API and the Cloud accounts originate only from your environment
All credentials remain within Kubernetes and are not sent to the OpsCruise SaaS servers
No personally identifiable information (PII) or protected health information (PHI) leaves the premises, since the ingested operational data is primarily the output of system-level instrumentation
You maintain control of your instrumented data (metrics, logs, traces, etc.), which lowers storage and access costs and allows the data to be used for other purposes and by other tools
The gateway Pods communicate with Kubernetes, cloud, metrics and logs environments and send information in compressed and secure messages to the OpsCruise SaaS backend. The user accesses the UI using a standard browser. All connections are outbound and use SSL/TLS.
OpsCruise emits highly enriched alerts that are sent to popular services such as email, PagerDuty, Slack and ServiceNow.
Look Ma, No Agents
Agentless Telemetry and Configuration Ingestion
A wide variety of telemetry is available on the many elements that comprise a modern application. All of this is now available using open standards and tools: metrics from Prometheus, logs from Loki or ELK, traces from Jaeger, configurations and state from Kubernetes, and flow from eBPF and Istio.
In addition, specific application component metrics can be obtained from additional sources. For example, Kafka metrics from the Kafka exporter, RDS metrics from AWS CloudWatch and Redis metrics from Azure Monitor.
OpsCruise gathers all this information from these sources seamlessly without any proprietary agents colocated with the applications. This reduces the security risk profile, creates no performance impact, and makes it easy to deploy.
Weave no more
Real-time and Dynamic Topology
An abundance of telemetry is not enough. The topology of the environment, with its interrelationships and dependencies, is also required. Teams spend endless hours chasing down recent deployment details, stitching together information across layers, and relying on tribal knowledge of the application. But entities are ephemeral. Nodes come and go, and containers show up and are gone in a flash, so all the effort has to be redone. OpsCruise addresses this need with a Weaving Engine that coherently integrates all the information to create a dynamic, real-time graph of interrelationships across the application, the orchestration, and the infrastructure.
For example, suppose a container communicates with an IP address outside the cluster, on the internet. OpsCruise recognizes that the destination IP address is the host IP address of an AWS RDS instance backed by a MySQL database, and the Weaving Engine connects the two. Like synapses, such connections are mapped and joined continually, even when the IP addresses change.
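The RDS example can be sketched in a few lines of Python. This is only an illustration of the idea, not the OpsCruise Weaving Engine: the class TopologyGraph, its methods, and the entity names are all hypothetical, and the IP-to-resource index is assumed to be fed from cloud and Kubernetes configuration data.

```python
from dataclasses import dataclass, field


@dataclass
class TopologyGraph:
    """Hypothetical, minimal stand-in for a dynamic topology graph."""
    edges: set = field(default_factory=set)       # (source entity, destination entity)
    ip_index: dict = field(default_factory=dict)  # IP address -> entity name

    def register(self, entity: str, ip: str) -> None:
        # Drop any stale IP for this entity so the mapping follows IP changes.
        self.ip_index = {k: v for k, v in self.ip_index.items() if v != entity}
        self.ip_index[ip] = entity

    def observe_flow(self, source: str, dest_ip: str) -> str:
        # Resolve the raw destination IP to a known entity, if there is one.
        dest = self.ip_index.get(dest_ip, f"unknown:{dest_ip}")
        self.edges.add((source, dest))
        return dest


graph = TopologyGraph()
graph.register("rds/orders-mysql", "10.0.5.12")      # learned from cloud config (assumed)
graph.observe_flow("pod/checkout", "10.0.5.12")      # flow seen on the network
graph.register("rds/orders-mysql", "10.0.9.40")      # the instance's IP changes
graph.observe_flow("pod/checkout", "10.0.9.40")      # still resolves to the same entity
```

Because edges are stored between named entities rather than IP addresses, the checkout-to-database relationship survives the address change without being redrawn.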
The character of a Container
Current monitoring mechanisms treat containers and other elements as black box appendages with a few signals that indicate health. Typically, this means checking the use of resources such as CPU or disk, or the number of Pod restarts. This leads Ops to detect problems by relying on thresholds, which have always proven inadequate in both the pre-container and post-container worlds. OpsCruise recognizes that each container has a unique behavior. A holistic learning approach is used to profile the behavior of each container.
This model is built from predefined templates of curated metrics and information about the element. By aggregating data from multiple instances of the application component, i.e., replicas, a single representative model is built. This behavior model for each element is adapted continually to improve its fidelity. Lightweight supervision is also provided to capture feedback from Ops to improve the models.
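As a rough illustration of pooling replicas into one representative model, the sketch below builds a simple per-metric profile (mean and standard deviation). The actual OpsCruise models are not described in this document and are likely richer; the function name build_profile, the metric name, and the sample values are assumptions made for the example.

```python
from statistics import mean, pstdev


def build_profile(replica_samples):
    """Pool samples from all replicas into one per-metric profile.

    replica_samples maps replica name -> {metric name -> list of samples}.
    """
    pooled = {}
    for samples in replica_samples.values():
        for metric, values in samples.items():
            pooled.setdefault(metric, []).extend(values)
    # One representative summary per metric, shared by every replica.
    return {
        metric: {"mean": mean(values), "std": pstdev(values)}
        for metric, values in pooled.items()
    }


profile = build_profile({
    "checkout-7d9f-a": {"cpu_cores": [0.20, 0.30]},
    "checkout-7d9f-b": {"cpu_cores": [0.25, 0.25]},
})
```

Calling build_profile again as new samples arrive is one simple way to keep such a profile adapting over time.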
Am I Okay?
Anomaly Detection Conditions
When an instance of an element is created, the telemetry starts arriving. The OpsCruise system weaves it into the application graph. In the process, the associated model (if it exists) is loaded into Cruise Control, the runtime ML engine. As telemetry data is ingested, Cruise Control uses the model to predict the behavior of the element.
Model predictions are compared with the observed telemetry. Any deviations that signal an anomalous condition are recorded as an Alert. This predictive modeling approach determines when the container is outside its correct operational range (which the model learned from demand requests, resource consumption, incoming flows, and generated requests or flows) without relying on manual thresholds or misleading statistical outliers.
This predictive analysis is done in parallel for all elements (e.g., containers, nodes, etc.) in real time, at the sampling rate of the telemetry. Predictive means decreased MTTD, or even avoidance of a brownout or outage, since we are not waiting for an SLO threshold to be breached, by which time it is too late to act! Knowing that things are not OK before they go off the rails gives Ops the edge to take preventive action.
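To make the predict-and-compare idea concrete, here is a minimal sketch that swaps in a deliberately simple technique: an ordinary least-squares line that predicts CPU use from request rate. It is not the Cruise Control model, and every name and number in it (fit_line, is_anomalous, the tolerance factor k, the sample data) is an assumption for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def residual_spread(xs, ys, slope, intercept):
    """Root-mean-square error of the fitted line on the training data."""
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return (sum(r * r for r in residuals) / len(residuals)) ** 0.5


def is_anomalous(x, y, slope, intercept, spread, k=3.0):
    """Flag an observation that deviates from the prediction by more than k spreads."""
    predicted = slope * x + intercept
    return abs(y - predicted) > k * spread


# Learned behavior: requests per second versus CPU percent (made-up samples).
requests = [100, 200, 300, 400]
cpu = [11, 19, 31, 39]
slope, intercept = fit_line(requests, cpu)
spread = residual_spread(requests, cpu, slope, intercept)

# 30% CPU would pass a static "alert above 40%" rule,
# but it is far too high for only 100 requests per second.
suspicious = is_anomalous(100, 30, slope, intercept, spread)
```

The point of the comparison is that the same CPU reading can be normal under heavy demand and anomalous under light demand, which a fixed threshold cannot express.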
In addition to the anomalous scenarios, the model can identify differences between the behaviors of changed containers. OpsCruise can compare blue-green versions using these models and determine whether the new deployment will work, significantly improving agility.
Start with Why
RCA and Fault Domain Isolation
Once an Alert is noted, the locus of the problem and its source need to be identified: Is it an infrastructure problem? An orchestration-related issue? A network problem? Has the demand increased to unsupportable levels? Are resources saturated? Is an application component out of its operational range? The real-time weaving of the environment and the behavioral analysis work together to make it easy for the SRE/Ops person to triage the situation. All the information on everything related to the problem is brought together in one place and contextually linked.
The ML model includes interpretation analysis that provides explanations indicating the likely problem areas, cutting down random walks by Ops. Real-time interaction flows allow Ops to traverse upstream and downstream toward possible problem areas. Alert flood management removes redundancies and irrelevant dependencies, thereby reducing distractions.
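Traversing upstream and downstream from an alerting element amounts to a walk over the dependency graph. The sketch below shows one way to do that with a breadth-first search; the function name reachable, the edge list, and the service names are hypothetical and do not describe the OpsCruise implementation.

```python
from collections import deque


def reachable(edges, start, direction="downstream"):
    """Return the entities reachable from start, nearest first.

    edges is a list of (caller, callee) pairs. "downstream" follows calls
    outward from start; "upstream" follows them back toward the callers.
    """
    adjacency = {}
    for src, dst in edges:
        a, b = (src, dst) if direction == "downstream" else (dst, src)
        adjacency.setdefault(a, []).append(b)

    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


edges = [
    ("frontend", "checkout"),
    ("checkout", "orders-db"),
    ("checkout", "payments"),
]
suspects = reachable(edges, "checkout")               # dependencies of the alerting service
affected = reachable(edges, "checkout", "upstream")   # callers that may feel the impact
```

Listing entities nearest first keeps the most directly connected candidates at the top when triaging.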
The RCA engine performs analysis and marks elements that are working correctly, greatly reducing work for the Ops team. The Why becomes easier to find!
At a later point, the entire topology and state of the environment can be "time-traveled" back to the moment the problem was identified, for use in postmortems.
OpsCruise Supports & Integrates with a Broad Range of Cloud Infrastructure, Technologies and DevOps Tools.