By Shridhar Venkatraman, Chief Architect and Co-founder, OpsCruise
Applications in the brave new world of microservices are chatty. They are networks unto themselves. Studies show that the communication-to-computation ratio increases up to 40%. An earlier blog post discusses this.
This has led to the search for better observability and problem isolation methods and tools. Distributed tracing, which is useful for profiling and monitoring applications (apps), is one such method.
However, there are a few challenges that come with the benefits of distributed tracing:
Because of the above, operations (Ops) teams typically limit sampling. Not surprisingly, only a small fraction of traces are sampled to reduce the processing and storage overhead. For example, the default Jaeger client samples only 0.1% of traces.
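To make the sampling trade-off concrete, here is a minimal sketch of a probabilistic, per-trace sampling decision at the 0.1% rate mentioned above. It is not Jaeger's implementation; the function names and the hash-based bucketing are assumptions chosen for illustration, so that the same trace ID always gets the same decision.

```python
import hashlib

DEFAULT_RATE = 0.001  # 0.1% of traces, the figure quoted above


def should_sample(trace_id: str, rate: float = DEFAULT_RATE) -> bool:
    """Decide once per trace whether to keep it (hypothetical helper).

    The trace ID is hashed into a 64-bit bucket; the trace is kept when
    the bucket falls below rate * 2**64, so roughly `rate` of all traces
    are kept and the decision is stable for a given ID.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def sampled_fraction(n: int, rate: float = DEFAULT_RATE) -> float:
    """Fraction of n synthetic trace IDs that would be kept."""
    kept = sum(should_sample(f"trace-{i}", rate) for i in range(n))
    return kept / n
```

At this rate, roughly 999 of every 1,000 requests leave no trace at all, which is why a rare failure can easily go unrecorded.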
Pre-canned instrumentation, dynamic sampling and selective tracing are therefore employed to address these infrastructure limitations.
Operations requires a view of the entire estate, including the app, the orchestrator and the infrastructure elements, all interconnected and interacting, akin to a Google map of the area.
This view has to include availability and resource usage in addition to the performance metrics such as latency and throughput.
Using highway traffic as an example, individual cars are not tracked, but the number and speed of cars entering and exiting the highway are available.
Tracking individual cars does not convey the larger picture. To get the gestalt view of the traffic, we need to look at the aggregate traffic patterns, including the average speed of cars on the route rather than individual vehicle speeds.
The flow tracing idea is analogous to capturing and tracking the flow of traffic along the intricate connections of highways with multiple entry and exit points.
Ops teams are like traffic operations: they need a system-wide view of the ebb and flow of the interactions between the system entities.
A flow trace view can be constructed by the following
All of this information is stitched together into a graph that provides a real-time view of the entire estate. The approach requires no code instrumentation, is lightweight on resources and is easy to understand.
The real-time graph built from flow tracing is also used to build behavior models that recognize anomalies and help with problem isolation and root cause analysis.
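The stitching step can be pictured with a small sketch. It assumes, purely for illustration, that each observation arrives as a (source, destination, latency in ms, bytes) record; the record shape, `EdgeStats` and `build_flow_graph` are hypothetical names, not part of any product described here. Like the highway example, it keeps only aggregates per connection, never individual requests.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EdgeStats:
    """Aggregate counters for one source -> destination connection."""
    requests: int = 0
    total_latency_ms: float = 0.0
    total_bytes: int = 0

    @property
    def avg_latency_ms(self) -> float:
        # Average "speed" on this stretch of road, not per-car speed.
        return self.total_latency_ms / self.requests if self.requests else 0.0


def build_flow_graph(records):
    """Fold (src, dst, latency_ms, nbytes) records into per-edge aggregates."""
    graph = defaultdict(EdgeStats)
    for src, dst, latency_ms, nbytes in records:
        edge = graph[(src, dst)]
        edge.requests += 1
        edge.total_latency_ms += latency_ms
        edge.total_bytes += nbytes
    return dict(graph)


# Assumed sample observations between three hypothetical services.
records = [
    ("web", "cart", 10.0, 100),
    ("web", "cart", 30.0, 300),
    ("cart", "db", 5.0, 50),
]
flow_graph = build_flow_graph(records)
```

Because each edge stores only a few counters, memory grows with the number of connections rather than with the number of requests.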
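The post does not say how the behavior models work, so the following is only a stand-in: a simple z-score check of an edge's current average latency against its own recent history. The function name, the threshold of 3 standard deviations and the sample numbers are all assumptions.

```python
from statistics import mean, pstdev


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it deviates strongly from the edge's own baseline.

    `history` is a list of past per-interval values (for example, average
    latency in ms on one edge of the flow graph). This is a deliberately
    naive baseline, not a description of any real product's model.
    """
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return current != mu  # perfectly flat baseline: any change stands out
    return abs(current - mu) / sigma > threshold


# Assumed history of average latency (ms) for one edge.
baseline = [20, 21, 19, 20, 22, 18]
```

With this baseline, a reading of 20.5 ms stays within normal variation, while a reading of 60 ms would raise a flag on that edge.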
When red flags appear on the map, we need to zoom in. Sometimes, it is important to obtain the detail that distributed tracing provides in certain parts of the system.
When an area of interest has been identified using flow tracing, distributed tracing can be turned on for the part of the environment to which problems have been localized. By fencing the paths that need to be drilled into, the flow graph can identify specific spans to be traced for a period of time, until the problem manifests again.
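One way to picture fencing is as a bounded walk over the flow graph around the suspect service. The sketch below assumes the graph is available as a list of (source, destination) edges; `fence`, `tracing_plan`, the hop limit and the plan's dictionary layout are hypothetical, and actually switching tracing on is left to whatever tracing system is in use.

```python
from collections import deque


def fence(edges, suspect, max_hops=1):
    """Return the services within `max_hops` of `suspect`, in either direction."""
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)
    seen = {suspect}
    queue = deque([(suspect, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the fence
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return seen


def tracing_plan(edges, suspect, duration_s, max_hops=1):
    """Describe which services and paths to trace, and for how long."""
    services = fence(edges, suspect, max_hops)
    fenced_edges = sorted((s, d) for s, d in edges if s in services and d in services)
    return {"services": sorted(services), "edges": fenced_edges, "duration_s": duration_s}


# Assumed topology; "cart" is the service flagged by flow tracing.
edges = [("web", "cart"), ("cart", "db"), ("web", "search"), ("search", "index")]
plan = tracing_plan(edges, "cart", duration_s=600)
```

Here the plan covers only `web`, `cart` and `db` for ten minutes, leaving `search` and `index` untraced.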
Flow tracing addresses a need of the operations team: