Anomaly detection has become a critical part of observability for modern cloud-native and micro-service applications. Unfortunately, existing legacy approaches generate large volumes of false alerts that distract DevOps/SRE teams and increase MTTR. We at OpsCruise believed a more effective, application-aware approach was needed, so we built one and put it to the test. A more detailed paper on our study is now available; here is a summary of our anomaly detection approach and field results on its efficacy.
Get the full eBook, Rethinking Anomaly Detection Here
Anomaly detection has been in use since long before cloud and micro-services, so it is not surprising that existing detection approaches fall short of the scale, complex dependencies, and dynamic nature of cloud-native applications.
Today, anomaly detection falls into two broad categories: manually setting thresholds on a metric, or using a statistical or ML-based technique to detect outliers. Unfortunately, both have significant drawbacks.
Take manual threshold setting, for example. One guesses an upper limit for an application's response time or CPU utilization based on past history, without knowing the maximum expected request rates. Because workloads in cloud applications are not known ahead of time, when a threshold is repeatedly breached, Ops typically raises the threshold to reduce false positives, but then risks false negatives.
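To make the pitfall concrete, here is a minimal, hypothetical sketch of static-threshold alerting (the metric name and threshold value are illustrative, not from our study): a legitimate traffic spike pushes latency past the guessed limit and every spike sample fires as a false positive.

```python
# Static-threshold alerting sketch (illustrative only; the threshold
# value and metric are hypothetical, not taken from the article).
LATENCY_THRESHOLD_MS = 500  # guessed from past history, not from known peak load

def check_latency(samples_ms):
    """Return the latency samples that breach the static threshold."""
    return [s for s in samples_ms if s > LATENCY_THRESHOLD_MS]

# A legitimate traffic spike raises latency and fires alerts even though
# the application is behaving correctly under the higher request rate:
normal = [120, 180, 240, 210]
spike = [620, 710, 540]  # valid load, not a fault
alerts = check_latency(normal + spike)  # every spike sample is a false positive
```

Raising `LATENCY_THRESHOLD_MS` to silence these alerts would then hide genuinely slow responses, which is exactly the false-negative risk described above.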
Using outlier detection on a set of metrics can reduce alert noise and manual tuning effort, but it faces other challenges. How do you know which metrics capture the correct baseline for the application across different load conditions? And a different metric value, such as higher response latency caused by a new, higher request rate, does not by itself mean the application is working incorrectly.
We realized that application-agnostic anomaly detection, which does not understand the application and how it is supposed to work, throws a very high volume of false alerts. More importantly, when the detection process does not relate the anomaly to the problem source, isolating the cause is much more difficult: remediation is pushed off to a later war room where skilled, expensive DevOps resources manually resolve the problem.
We believe a micro-services architecture requires an application-aware, model-based approach that combines Ops knowledge of the application stack with ML. This means:
Anomalies are detected as deviations from a model that has learned correct (normal) behavior from data collected continually on the service, with the model updated periodically, e.g., daily or over a shorter interval.
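As a rough illustration of the idea (a toy stand-in, not our actual model, which learns far richer behavior), a per-metric baseline might learn each metric's mean and spread from recently collected data and flag strong deviations, refitting on a daily or faster cadence:

```python
import statistics

class BehaviorModel:
    """Toy baseline model: learns the mean and standard deviation of each
    metric from recent 'normal' data and flags samples that deviate strongly.
    A simplified illustration only; metric names and the 3-sigma rule here
    are hypothetical, not OpsCruise's actual model."""

    def __init__(self, n_sigma=3.0):
        self.n_sigma = n_sigma
        self.baseline = {}

    def fit(self, history):
        # history: {metric_name: [recent values...]} collected continually
        self.baseline = {
            m: (statistics.mean(v), statistics.pstdev(v))
            for m, v in history.items()
        }

    def is_anomalous(self, sample):
        # sample: {metric_name: value}; anomalous if any metric deviates
        for m, x in sample.items():
            mean, std = self.baseline[m]
            if std > 0 and abs(x - mean) > self.n_sigma * std:
                return True
        return False

model = BehaviorModel()
model.fit({"latency_ms": [100, 110, 95, 105, 102],
           "cpu_pct": [40, 45, 42, 38, 41]})
# Re-run model.fit(...) periodically (e.g., daily) as new data arrives.
```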
Here are some key features of the model and how it is used:
To validate our approach, we collected empirical data from a number of deployed production environments where we had no control over how the application was designed or run. We did this because there is no benchmark for anomaly detection, even though there are open-source sample micro-services applications that can serve as monitoring sandboxes.
While there are more examples and details in the white paper, here we present a summary from a serverless micro-service application: detecting anomalies in a Kinesis-Lambda subsystem.
The model included 12 metrics, including Execution Time and Number of Invocations (from Lambda). For comparison, we chose 5 metrics for thresholding and applied the well-known Tukey 1.5 IQR rule to them. As Figure 2 shows, the model-based approach detected 89% fewer anomalies than the dynamic-threshold approach over an 8-day period, and the number of alerts dropped significantly after the 2nd day.
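For readers unfamiliar with Tukey's rule, a generic sketch of the 1.5 IQR test looks like the following. This is a simplified illustration, not the exact thresholding pipeline used in the study; note also that quartile conventions vary between tools (this uses Python's default "exclusive" method).

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default exclusive method
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

# A single extreme latency sample stands out against an otherwise steady series:
outliers = tukey_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```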
To generalize the efficacy results, we also tested a number of Kubernetes application containers. The number of model metrics was much larger, nearly 30, while we applied dynamic thresholds to just five metrics such as Response Time (Latency) and CPU utilization. The model-based approach generated 55% fewer alerts over the 6-day period than the threshold-based approach, and the number of alerts also decreased rapidly over time.
We mentioned how avoiding false-negative alerts is not easy. We found that to be especially true with previously unseen data ranges, a case where the model-based approach is more effective. For one container, the behavior model flagged 100 data points as anomalies. On closer inspection it became clear that over a 100-sample interval, request counts and response times were not being received and were recorded as 0. Yet there was incoming data, as shown by the byte- and packet-level metrics. Since response times of 0 crossed no high-latency threshold, the threshold-based approach detected no anomalies. The behavior model detected this inconsistency between demand and response metrics and correctly marked those points as valid anomalies, avoiding 100 false negatives.
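That cross-metric consistency check can be sketched as follows (the metric names are hypothetical, and this is a simplified illustration of the idea, not our production logic): if traffic-level metrics show incoming data while the request and response metrics read 0, the sample is flagged even though no latency threshold is crossed.

```python
def demand_response_consistent(sample):
    """Return False when demand and response metrics disagree: traffic is
    arriving (bytes/packets > 0) but requests and response times read 0.
    Metric names are hypothetical, for illustration only."""
    traffic = sample["bytes_in"] > 0 or sample["packets_in"] > 0
    served = sample["request_count"] > 0 or sample["response_time_ms"] > 0
    return not (traffic and not served)

# The stuck-telemetry case from the article: data is flowing in, but
# request counts and response times are recorded as 0.
stuck = {"bytes_in": 20480, "packets_in": 35,
         "request_count": 0, "response_time_ms": 0}
```

A pure latency threshold never fires on `stuck` (0 ms crosses no upper limit), while the consistency check flags it, which mirrors the 100 avoided false negatives described above.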
Modern cloud-native applications require a more application-aware context: a knowledge-augmented ML approach that learns the application's behavior profile to detect, and predict, better indicators of problems. This approach has been applied to real-time anomaly detection on micro-service metrics at scale, and in the field it has been shown to significantly reduce false alerts across different applications and to help with causal isolation and time to problem resolution.