Performance Matters
An application being up and running doesn’t necessarily mean that everything is fine. There’s always the possibility that key parts of the application flow are not performing at optimal levels. However you view the challenges of scaling an application, from the customer’s or end user’s perspective what matters is the performance of the app itself.
Why Is Observability Needed?
When working with a simple application that doesn’t have many parts, it’s relatively easy to keep it stable. You can place monitors for CPU, memory, networking, and databases, and many of these monitors will help you understand where to apply known solutions to known problems.
In other words, these applications are mature enough that their problems are well understood. But nowadays the story is different: almost everyone is working with distributed systems.
There are microservices, containers, cloud, serverless, and many combinations of these technologies. All of them increase the number of ways a system can fail, because so many parts are interacting. And because distributed systems are so diverse, it’s complicated to understand present problems and predict future ones.
As your system grows in usage and complexity, new and different problems will continue to emerge. Familiar problems from the past might not occur again, and you’ll have to deal with unknown problems regularly.
For instance, when there’s a problem in a production environment, sysadmins are usually the ones trying to find out what it is. They can make guesses based on the metrics they see: if CPU usage is high, traffic might have increased; if memory usage is high, the application might have a memory leak. But these are just guesses.
Systems need to emit telemetry that points directly at what the problem is. That’s difficult, because it’s impossible to cover every failure scenario. Even if you instrument your application with logs, at some point you’ll need more context; the system should be observable.
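For example, attaching request-scoped context to log lines makes them far more useful when those new questions arrive. A minimal sketch, assuming SLF4J is on the classpath; the requestId field name is illustrative:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class ContextualLogging {
    private static final Logger log = LoggerFactory.getLogger(ContextualLogging.class);

    public void handle(String requestId) {
        // Every log line emitted on this thread now carries the request id,
        // so you can later pull together all events for one failing request.
        MDC.put("requestId", requestId);
        try {
            log.info("request accepted");
            // ... processing; any failure logged here shares the same requestId ...
        } finally {
            MDC.remove("requestId"); // avoid leaking context to the next request
        }
    }
}
```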
Observability is what helps you get better at troubleshooting in production. You’ll need to zoom in and zoom out over and over again: take a different path, dive deep into logs, read stack traces, and do everything you can to answer new questions and find out what’s causing problems.
Here are some situations for which observability is required:
- No complex system is ever fully healthy.
- Distributed systems are pathologically unpredictable.
- It’s impossible to predict the myriad states of partial failure various parts of the system might end up in.
- Failure needs to be embraced at every phase, from system design to implementation, testing, deployment, and, finally, operation.
- Ease of debugging is a cornerstone for the maintenance and evolution of robust systems.
What Does Observability Mean?
"Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs."
Observability has three main "pillars" (a minimal sketch emitting all three follows the list):
- Logging - A record of events to help understand what changed in the system/application behavior when things went wrong. For example, using Grafana Loki to log certain events.
- Metrics - A value pertaining to your system/application at a point in time. For example, using Grafana to understand resource utilization, or app performance metrics like throughput and response time.
- Tracing - A representation of a single user’s journey through an application transaction. For example, using Jaeger to understand the call flows between services or how much time it takes a user to finish a transaction.
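To make the pillars concrete, here is a minimal sketch of a single request handler emitting all three signals. It assumes SLF4J (logs), Micrometer (metrics), and the OpenTelemetry API (traces) are on the classpath; the "checkout" names are illustrative:

```java
import io.micrometer.core.instrument.Metrics;
import io.micrometer.core.instrument.Timer;
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CheckoutHandler {
    private static final Logger log = LoggerFactory.getLogger(CheckoutHandler.class);
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("shop");
    private static final Timer latency = Metrics.timer("checkout.latency");

    public void handle(String orderId) {
        Span span = tracer.spanBuilder("checkout").startSpan(); // tracing: one step of the user's journey
        Timer.Sample sample = Timer.start();                    // metrics: start timing this request
        try {
            log.info("processing order {}", orderId);           // logging: a discrete event, with context
            // ... business logic ...
        } finally {
            sample.stop(latency); // metrics: record the response time
            span.end();           // tracing: close the span
        }
    }
}
```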
How Trace Observability Works
Tracing:
A trace is a representation of a series of causally related distributed events that encode the end-to-end request flow through a distributed system.
Traces are a representation of logs; the data structure of traces looks almost like that of an event log. A single trace can provide visibility into both the path traversed by a request as well as the structure of a request. The path of a request allows software engineers and SREs to understand the different services involved in the path of a request, and the structure of a request helps one understand the junctures and effects of asynchrony in the execution of a request.
Although discussions about tracing tend to pivot around its utility in a microservices environment, it’s fair to suggest that any sufficiently complex application that interacts with—or rather, contends for—resources such as the network, disk, or a mutex in a nontrivial manner can benefit from the advantages tracing provides.
The basic idea behind tracing is straightforward—identify specific points (function calls or RPC boundaries or segments of concurrency such as threads, continuations, or queues) in an application, proxy, framework, library, runtime, middleware, and anything else in the path of a request that represents the following:
- Forks in execution flow (OS thread or a green thread)
- A hop or a fan out across network or process boundaries
Traces are used to identify the amount of work done at each layer while preserving causality, using happens-before semantics. A trace is a directed acyclic graph (DAG) of spans, where the edges between spans are called references: each span records one unit of work, and its references point back to the spans that caused it.
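As a minimal sketch of that DAG, using the OpenTelemetry Java API (service and span names are illustrative): a span started while another span is current becomes its child, and that parent-child link is precisely a reference edge in the trace.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class TraceDag {
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("orders");

    public static void main(String[] args) {
        Span parent = tracer.spanBuilder("handle-request").startSpan(); // root span of the trace
        try (Scope ignored = parent.makeCurrent()) {
            // Started while "handle-request" is current, so it becomes a child:
            // an edge (reference) in the trace DAG that preserves happens-before order.
            Span child = tracer.spanBuilder("query-database").startSpan();
            try (Scope inner = child.makeCurrent()) {
                // ... the work whose duration this span measures ...
            } finally {
                child.end();
            }
        } finally {
            parent.end();
        }
    }
}
```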
Here are the metrics that are most important to keep an eye on when measuring app performance (a toy calculation follows the list):
- Throughput: The number of requests handled per second.
- Average Response Time: The average amount of time between a request and its response.
- 95th Percentile Response Time: The response time within which 95% of requests complete (the slowest 5% take longer).
- Errors: The ratio of errors to total requests.
- Memory/CPU: The amount of memory and CPU consumed on the host machine.
- Concurrent Users: The number of active users/sessions in the application.
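As promised above, here is a toy calculation of these metrics over one window of recorded requests. The sample latencies, window length, and error count are made up; in practice a metrics library does this bookkeeping for you:

```java
import java.util.Arrays;

public class WindowStats {
    public static void main(String[] args) {
        long[] latenciesMs = {12, 15, 18, 22, 25, 30, 35, 40, 120, 480}; // made-up sample
        long windowSeconds = 1; // assumed measurement window
        long errors = 1;        // assumed failed requests in the window

        double throughput = (double) latenciesMs.length / windowSeconds; // requests per second
        double avg = Arrays.stream(latenciesMs).average().orElse(0);     // average response time

        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        // Nearest-rank p95: 95% of requests completed at or below this latency.
        long p95 = sorted[(int) Math.ceil(0.95 * sorted.length) - 1];

        double errorRate = (double) errors / latenciesMs.length; // errors / requests

        System.out.printf("throughput=%.0f req/s, avg=%.1f ms, p95=%d ms, errors=%.1f%%%n",
                throughput, avg, p95, errorRate * 100);
    }
}
```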
On one education-based application, we found the problems below:
Problem 1: Microservices Error Rate Issues.
1. Find the error.
2. Find the trace where the error exists (see the sketch after this list).
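The write-up doesn’t say which tool surfaced the error, but as a hedged illustration: if each service records exceptions on its spans through the OpenTelemetry API, a tracing UI such as Jaeger can filter directly for errored traces. The service and operation names here are hypothetical:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;

public class ErrorTagging {
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("enrollment");

    public void enroll(String studentId) {
        Span span = tracer.spanBuilder("enroll-student").startSpan();
        try {
            // ... call the downstream service that is failing ...
        } catch (RuntimeException e) {
            span.recordException(e);                          // attach the exception to the span
            span.setStatus(StatusCode.ERROR, e.getMessage()); // mark the whole span as errored
            throw e;
        } finally {
            span.end();
        }
    }
}
```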
Problem 2: Microservices Latency Issues.
By looking at the slowest transactions and their distributed traces, we can see where the problem is.
Three steps will help you find a latency issue:
- Starting with an overview graph, such as Browser Page Load or Transaction Response Time, determine which element is taking the most time.
- Select the corresponding link or menu item to see more detail.
- Repeat as needed to drill down.
The transactions overview page also allows you to easily drill down to get detailed information so you can quickly resolve errors and performance issues. For example:
- Sort transactions based on factors such as “most time-consuming,” “slowest average response time,” “Apdex most dissatisfied,” and “highest throughput.”
- Select transactions to view app performance, historical performance, and transaction traces.
- Identify segments with high call counts or time.
When Is Tuning Most Effective?
For dramatically better results, tune during the design phase rather than waiting to tune after implementing your system.
- Proactive Tuning While Designing and Developing a System
- Reactive Tuning to Improve a Production System
Tools:
Google Stackdriver Trace, Spring Cloud Sleuth, SolarWinds TraceView, and Zipkin can all trace transactions end to end to show delays in individual components.
For many years, the most important metric for application servers was throughput: how many transactions or requests can we squeeze out of this application? The adoption of finer-grained architecture styles, such as microservices and functions, has broadened the performance characteristics we should care about to include memory footprint and startup time alongside throughput.
Where you once deployed a small number of runtime instances for your server and application, you may now be deploying tens or hundreds of microservices or functions. Each of those instances comes with its own server; even if it’s embedded in a runnable JAR or pulled in as a container image, there’s a server in there somewhere. While cold startup time is critically important for cloud functions (and Open Liberty’s time-to-first-response is approximately 1 second, so it’s no slouch), memory footprint and throughput matter far more for microservices architectures, where each running instance is likely to serve thousands of requests before it’s replaced.
Given the expected usage profile of tens to hundreds of services handling thousands of requests, reduced memory footprint and higher throughput directly influence cost, both for infrastructure and for licenses (if you pay for service and support):
- Memory consumption: This has a direct impact on the amount of infrastructure (e.g. cloud) you need to provision.
- Throughput: The more work an instance can handle, the less infrastructure you require.
OpenTelemetry
OpenTelemetry was born from two powerful observability standards whose only drawback was the fact that there were two of them. The CNCF’s OpenTracing and Google’s OpenCensus had been leading the way in tracing APIs and metrics gathering. While the two projects took different architectural routes, their functionality was very similar.
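As a minimal sketch of adopting it, the OpenTelemetry Java SDK can be wired to export spans over OTLP to a collector or a backend such as Jaeger; the endpoint below is an assumption for a local setup:

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class OtelSetup {
    public static void main(String[] args) {
        // Exporter pushing spans over OTLP/gRPC; 4317 is the conventional port (assumed here).
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://localhost:4317")
                .build();

        // Batch spans before export to keep overhead low.
        SdkTracerProvider provider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();

        // Register globally so GlobalOpenTelemetry.getTracer(...) works across the app.
        OpenTelemetrySdk.builder()
                .setTracerProvider(provider)
                .buildAndRegisterGlobal();
    }
}
```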
Jaeger
Primarily built to run in Kubernetes, Jaeger was originally developed by Uber and was later donated to the CNCF. Jaeger implements the OpenTracing standard to help organizations monitor microservice-based distributed systems with features like root cause analysis and distributed transaction monitoring.
Zipkin
Unlike Jaeger, which is made up of five primary components, Zipkin is a single process, which makes deployment a lot simpler. Originally developed by Twitter, Zipkin is implemented in Java, has an OpenTracing-compatible API, and supports a wide range of programming languages and development platforms.
Datadog
Datadog is a monitoring service for IT, Dev and Ops teams who write and run applications at scale, and want to turn the massive amounts of data produced by their apps, tools and services into actionable insight.
New Relic
New Relic is a SaaS-based web and mobile application performance management provider for the cloud and the datacenter. It provides code-level diagnostics and real-time monitoring for dedicated infrastructure, the cloud, or hybrid environments.
Benefits of Observability
Observable systems are easier to understand, easier to control, and easier to fix than those that are not.
As systems change, whether deliberately (deploying new software or configuration, scaling up or down) or because of some other, unknown action, observability enables developers to understand when the system starts changing its state.
Modern observability tools can automatically identify a number of issues and their causes, such as failures caused by routine changes, regressions that only affect specific customers, and downstream errors in services and third-party SaaS.