Originated by engineers at SoundCloud and Google, Prometheus is an open source, community-driven project for monitoring modern cloud-native applications and Kubernetes. Key drivers for the rapid adoption of Prometheus include the following:
Community adoption. Prometheus is a graduated CNCF project with a rich community of contributors and users, and more than 25,000 stars on GitHub.
Growing ecosystem. Mirroring the strong community adoption, many application and service developers have built native Prometheus exporters. Further, popular open source dashboarding tools such as Grafana have built-in support for Prometheus dashboards.
Rich data model and query language. Prometheus incorporates a multi-dimensional data model, with time series data identified by metric names and key/value pairs. Further, a flexible query language leverages this dimensionality.
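To illustrate that dimensionality, consider a hypothetical counter `http_requests_total` labeled by `service` and `status` (the metric and label names here are illustrative, not part of any specific setup). PromQL can filter on labels and aggregate across them in a single expression:

```promql
# Per-second rate of HTTP 5xx responses over the last 5 minutes,
# aggregated per service (metric and label names are illustrative)
sum by (service) (
  rate(http_requests_total{status=~"5.."}[5m])
)
```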
Enterprises looking to use Prometheus as their monitoring solution face some challenges, including:
Set-up and configuration: Prometheus requires that you select and maintain metrics, dashboards and thresholds, and the community offers many competing choices.
Lack of an object model. Like most metrics stores, Prometheus doesn’t maintain the notion of an object (e.g., container, pod, node, service) and leaves it to the user’s mental map and queries to stitch things together.
Troubleshooting: Troubleshooting needs more than metrics. Events, flow and configuration data are also needed for comprehensive troubleshooting - all of which is beyond the scope of Prometheus.
Scalability, HA and long-term data retention: The current storage model constrains Prometheus to short-term monitoring with a few weeks of data retention, making it unsuitable for long-term data retention and querying.
OpsCruise is a cloud and container-native platform, focused on extracting granular data that is enriched and automatically tagged to provide rich context to facilitate monitoring, detection, troubleshooting and forensics at scale. It automatically detects applications running within containers and extracts relevant metrics, maps service topologies with service golden signals, and goes beyond monitoring to incorporate deep troubleshooting workflows.
OpsCruise integrates natively with all orchestrators and container platforms, extracting service labels and constructing an optimized data model to aggregate data for services. OpsCruise service topology shows inter-service communication without any instrumentation or configuration. The topology map can be overlaid with Prometheus metrics that are instrumented by users such as latency and error counts.
“I was having an issue with latency and needed to isolate what was wrong. The application-level dashboard provided me the information that the URL related to one of the services had slowed. I opened the service topology for that service and there it was - an outbound connection to an external cloud service that was the bottleneck. Literally in 3 clicks.” - SRE, Fintech
Real-time signals such as response time (latency), request count and error count are key to monitoring the performance of your applications. Traditionally these metrics are gathered by code instrumentation or large distributed tracing projects. OpsCruise provides these signals in several different ways today: via custom metrics, application component metrics such as those for MongoDB, Elasticsearch, NGINX, HAProxy or Istio, or by decomposing protocols such as HTTP/S and JDBC on the fly. Users can observe trends on which URLs have high error counts or which SQL queries are slow, to isolate where to look next.
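As a sketch of how such golden signals are typically expressed in PromQL, a latency histogram (here an assumed metric named `http_request_duration_seconds`, exposed in the standard Prometheus histogram convention) yields per-URL percentiles:

```promql
# 95th-percentile request latency per URL path over the last 5 minutes,
# assuming a standard Prometheus histogram named
# http_request_duration_seconds with a "path" label (illustrative names)
histogram_quantile(
  0.95,
  sum by (path, le) (rate(http_request_duration_seconds_bucket[5m]))
)
```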
Our 3-layer view provides a top-down view of your services, letting you drill down from one level of the hierarchy to another to view relevant data at each level – application > K8s > infrastructure.
“The first thing I do every morning is to look at the OpsCruise 3-layer view to get a bird’s eye inventory view of my services -- my clusters, services, applications, namespaces and pods. From here, I can look at the memory or CPU usage at any level and drill down to see which container running in which pod is the root cause of such an issue. And I’m able to do it without configuring anything at all. I can then switch to the infrastructure view and find out what nodes my services are running on.” – TechOps Engineer, Consumer Tech
Prometheus has no notion of topology and behavior to help predict performance issues. OpsCruise’s patent-pending Cruise Control AI/ML Engine ingests Prometheus metrics in combination with event, topology, K8s config and flow data to surface performance conditions and trace probable causes.
“We had a code change that, unbeknownst to us in ops, changed the cache hit ratio. That resulted in higher transaction response times for incoming demand. OpsCruise detected the K8s container change and was able to quickly flag that anomaly and recommend a fix. In Prometheus alone, it would have been impossible to associate the container change with network flows and I/O rates. We would have figured it out in a war room a few hours later.” – IT Ops Director, Pharmaceutical
Through Prometheus you can get application metrics such as JMX heap usage for a deployment and identify the pod it is running on. Traditionally, to further isolate the specific container on the pod that was using the most heap, you would next need to SSH into the node on which that pod is running and look through process names. Since OpsCruise has an object model tightly coupled with Prometheus time series, with a few clicks in the user interface you can see all the processes running on the pod that lead to high heap usage, getting you to root cause significantly faster. Further, OpsCruise reports all network connections, including ingress and egress, for any process, container, service or namespace. This is a valuable data source that is critical to troubleshooting and forensics workflows.
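The Prometheus side of that workflow might look like the query below, using the JMX exporter’s `jvm_memory_bytes_used` metric; the `deployment` label is an assumption standing in for however pods are labeled in a given cluster:

```promql
# Top 5 pods by JVM heap usage for a hypothetical "orders" deployment;
# jvm_memory_bytes_used comes from the JMX exporter, the "deployment"
# label is illustrative
topk(5,
  sum by (pod) (jvm_memory_bytes_used{area="heap", deployment="orders"})
)
```

This identifies the pod, but not which process or container inside it is responsible - the gap the object model described above is meant to close.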
An out-of-the-box Prometheus setup scales an instance vertically by adding more resources such as disk and memory to a single instance. To scale beyond the ability of a single Prometheus server and across cluster/clouds, users need to address certain issues.
OpsCruise, in conjunction with Thanos/Cortex, provides a high-performance metric store: a single, horizontally scaled data store that can grow to 100M+ metrics per second, managed as a federated instance. Further, it provides cross-cluster, cross-cloud visibility. You can build dashboards, compare and contrast metrics, and create PromQL expressions to analyze the rich data collected.
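In a federated setup like this, each Prometheus instance is typically stamped with an external label identifying its cluster, so one query can compare clusters side by side. A sketch, assuming an external label named `cluster` and the standard cAdvisor CPU metric:

```promql
# Compare CPU usage across federated clusters; assumes each Prometheus
# instance carries an external label "cluster"
sum by (cluster) (rate(container_cpu_usage_seconds_total[5m]))
```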
As organizations look to establish an enterprise-wide monitoring solution, OpsCruise meets security and manageability needs that ease adoption at scale.
Prometheus users have come to rely on community contributions for exporters, dashboards and alerts, but at the same time run into challenges of choice. OpsCruise offers a highly curated repository of dashboards and alerts based on application best practices for the most popular Prometheus integrations.
Single sign-on: Integrates with an organization’s preferred LDAP, SAML or other authentication and authorization methods, easing adoption.
Teams and role-based access control: Helps isolate data and restrict permissions per user, per team.
Audit and compliance: To ensure a fully secure and compliant Prometheus environment, OpsCruise provides access to audit logs that show exactly which user did what on the system. Additionally, OpsCruise is embarking on the journey of becoming compliant with several standards, starting with SOC 2, GDPR and HIPAA.
Support: OpsCruise provides enterprise class phone and online technical support.