Enterprise-Class Prometheus Monitoring

OpsCruise extends the popular open source CNCF Prometheus monitoring framework with large scale object and topology visualization, behavioral analysis and anomaly detection.

Prometheus monitoring

The future of application monitoring starts here.

Originated by engineers at SoundCloud and Google, Prometheus is an open source, community-driven project for monitoring modern cloud-native applications and Kubernetes. Key drivers for the rapid adoption of Prometheus include the following:

  • Community adoption. Prometheus is a graduated CNCF project, with a rich community of contributors and users - more than 25,000 stars on GitHub.

  • Growing ecosystem. Mirroring the strong community adoption, many application and service developers have built native Prometheus exporters. Further, popular open source dashboarding tools such as Grafana have built-in support for Prometheus dashboards. 

  • Rich data model and query language. Prometheus incorporates a multi-dimensional data model, with time series data identified by metric names and key/value pairs. Further, a flexible query language leverages this dimensionality.
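To make the data model concrete, here is a minimal sketch of how a time series can be identified by a metric name plus label key/value pairs, and how a label matcher selects along those dimensions. It is a toy in-memory illustration, not Prometheus's storage engine; the helper names (`series_key`, `append`, `select`) and the sample metric and labels are assumptions chosen for the example.

```python
from typing import Dict, List, Tuple

# A series is identified by its metric name plus a set of label key/value pairs.
Labels = Tuple[Tuple[str, str], ...]
SeriesKey = Tuple[str, Labels]


def series_key(name: str, **labels: str) -> SeriesKey:
    """Build a hashable identity: metric name + sorted label pairs."""
    return (name, tuple(sorted(labels.items())))


# Toy in-memory store (hypothetical): series identity -> (timestamp, value) samples.
store: Dict[SeriesKey, List[Tuple[float, float]]] = {}


def append(name: str, ts: float, value: float, **labels: str) -> None:
    """Add one sample to the series identified by name and labels."""
    store.setdefault(series_key(name, **labels), []).append((ts, value))


def select(name: str, **matchers: str) -> List[SeriesKey]:
    """Return series with this name whose labels equal every matcher."""
    return [
        key
        for key in store
        if key[0] == name
        and all(dict(key[1]).get(k) == v for k, v in matchers.items())
    ]


# Three series share one metric name but differ in their label values.
append("http_requests_total", 1.0, 10, method="GET", status="200")
append("http_requests_total", 1.0, 2, method="GET", status="500")
append("http_requests_total", 1.0, 4, method="POST", status="200")

get_series = select("http_requests_total", method="GET")
```

Selecting on `method="GET"` picks out two of the three series. The same idea underlies a PromQL selector such as `http_requests_total{method="GET"}`.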

Prometheus Challenges

Limitations with Prometheus in enterprise production deployments

Enterprises looking to use Prometheus as their monitoring solution face some challenges, including:

OpsCruise-Powered Prometheus

A new friend of Prometheus users

OpsCruise is a cloud and container-native platform, focused on extracting granular data that is enriched and automatically tagged to provide rich context to facilitate monitoring, detection, troubleshooting and forensics at scale. It automatically detects applications running within containers and extracts relevant metrics, maps service topologies with service golden signals, and goes beyond monitoring to incorporate deep troubleshooting workflows.

Our approach

Service-oriented workflow and topology for problem isolation

Detect bottlenecks visually using the LiveAppMap Service Topology.

OpsCruise integrates natively with all orchestrators and container platforms, extracting service labels and constructing an optimized data model to aggregate data for services. The OpsCruise service topology shows inter-service communication without any instrumentation or configuration. The topology map can be overlaid with user-instrumented Prometheus metrics such as latency and error counts.

“I was having an issue with latency and needed to isolate what was wrong. The application-level dashboard showed me that the URL related to one of the services had slowed. I opened the service topology for that service and there it was - an outbound connection to an external cloud service that was the bottleneck. Literally in 3 clicks.” – SRE, Fintech

Real-time operational tracing.

Real-time signals such as response time (latency), request count and error count are key to monitoring the performance of your applications. Traditionally these metrics are gathered through code instrumentation or large distributed tracing projects. OpsCruise provides these signals in several ways today: via custom metrics, via application component metrics for MongoDB, Elastic, NGINX, HAProxy or Istio, or by decomposing protocols such as HTTP/S and JDBC on the fly. Users can observe trends in which URLs have high error counts or which SQL queries are slow, to isolate where to look next.
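As a rough illustration of what these signals look like once requests have been observed, the sketch below aggregates request count, error count and average latency per URL. It is not OpsCruise's implementation. The `Request` record, the `golden_signals` function and the rule that a status of 500 or above counts as an error are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Request:
    """One observed HTTP request (hypothetical record shape)."""
    url: str
    status: int
    latency_ms: float


def golden_signals(requests: Iterable[Request]) -> Dict[str, Dict[str, float]]:
    """Aggregate request count, error count and mean latency per URL."""
    counts = defaultdict(int)
    errors = defaultdict(int)
    total_latency = defaultdict(float)
    for r in requests:
        counts[r.url] += 1
        total_latency[r.url] += r.latency_ms
        if r.status >= 500:  # assumption: only 5xx responses count as errors
            errors[r.url] += 1
    return {
        url: {
            "request_count": counts[url],
            "error_count": errors[url],
            "avg_latency_ms": total_latency[url] / counts[url],
        }
        for url in counts
    }


observed = [
    Request("/checkout", 200, 120.0),
    Request("/checkout", 500, 900.0),
    Request("/search", 200, 40.0),
]
signals = golden_signals(observed)
```

Sorting the result by `error_count` or `avg_latency_ms` gives the "which URL should I look at next" view described above.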

Revolutionary 3 Layer View

Our 3-layer view provides a top-down view of your services, letting you drill down from one level of the hierarchy to the next to view relevant data at each level – application > K8s > infrastructure.

“The first thing I do every morning is to look at the OpsCruise 3-layer view to get a bird’s eye inventory view of my services -- my clusters, services, applications, namespaces and pods. From here, I can look at the memory or CPU usage at any level and drill down to see which container running in which pod is the root cause of such an issue. And I’m able to do it without configuring anything at all. I can then switch to the infrastructure view and find out what nodes my services are running on.” – TechOps Engineer, Consumer Tech

Anomaly Detection & Causal Analysis.

Prometheus has no notion of topology or behavior to help predict performance issues. OpsCruise’s patent-pending Cruise Control AI/ML engine ingests Prometheus metrics in combination with event, topology, K8s config and flow data to surface performance conditions and trace probable causes.

“We had a code change that, unbeknownst to us in ops, changed the cache hit ratio. That resulted in higher transaction response times for incoming demand. OpsCruise detected the K8s container change and was able to quickly flag that anomaly and recommend a fix. In Prometheus alone, it would have been impossible to associate the container change with network flows and I/O rates. We would have figured it out in a war room a few hours later.” – IT Ops Director, Pharmaceutical

Deep troubleshooting on specific containers.

Through Prometheus you can get application metrics such as JMX heap usage for a deployment and identify the pod it is running on. Traditionally, to further isolate the specific container in the pod that was using the most heap, you would next need to SSH into the node on which that pod is running and look through process names. Because OpsCruise has an object model tightly coupled with Prometheus time series, a few clicks in the user interface show all the processes running in the pod that contribute to high heap usage, getting you to the root cause significantly faster. Further, OpsCruise reports all network connections, including ingress and egress, for any process, container, service or namespace. This data is critical to troubleshooting and forensics workflows.


Scalability, long-term data retention, and multi-cloud visibility

An out-of-the-box Prometheus setup scales vertically, by adding resources such as disk and memory to a single instance. To scale beyond the capacity of a single Prometheus server and across clusters and clouds, users need to address certain issues.

OpsCruise, in conjunction with Thanos/Cortex, provides a high-performance metric store: a single, horizontally scaled data store that can grow to 100M+ metrics per second, managed as a federated instance. Further, it provides cross-cluster, cross-cloud visibility. You can build dashboards, compare and contrast metrics, and create PromQL expressions to analyze the collected data.
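As one example of a cross-cluster PromQL expression, the sketch below builds a query that sums a counter's per-second rate and groups it by cluster. The helper `per_cluster_rate` is hypothetical. The sketch also assumes that every series carries a `cluster` label, which in practice depends on how the individual Prometheus instances are configured.

```python
def per_cluster_rate(metric: str, window: str = "5m", group_label: str = "cluster") -> str:
    """Build a PromQL expression: per-second rate of a counter, summed per cluster.

    Assumes each series carries the label named by `group_label`.
    """
    return f"sum by ({group_label}) (rate({metric}[{window}]))"


# The metric name here is illustrative only.
query = per_cluster_rate("http_requests_total")
```

The resulting string, `sum by (cluster) (rate(http_requests_total[5m]))`, can be pasted into any PromQL-compatible query interface to compare request throughput across clusters.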

Enterprise capabilities

Scalable, manageable, and secure

As organizations look to establish an enterprise-wide monitoring solution, OpsCruise meets security and manageability needs that ease adoption at scale.