Just because an application is up and running doesn't necessarily mean that everything is fine. There's always the possibility that key parts of the application flow are not performing at optimal levels. However you view the challenges of scaling an application, what matters from the customer or end-user perspective is the performance of the app itself.
When working with simple applications that don't have many moving parts, it's relatively easy to keep them stable. You can place monitors on CPU, memory, networking, and databases, and many of these monitors will help you understand where to apply known solutions to known problems.
In other words, these types of applications are mature enough that their problems are well understood. But nowadays the story is different: almost everyone is working with distributed systems.
There are microservices, containers, cloud, serverless, and many combinations of these technologies. All of them increase the number of ways a system can fail, because there are so many parts interacting. And because of the diversity of distributed systems, it's complicated to understand present problems and predict future ones.
As your system grows in usage and complexity, new and different problems will continue to emerge. Familiar problems from the past might not occur again, and you'll have to deal with unknown problems regularly.
For instance, when there's a problem in a production environment, sysadmins are usually the ones trying to find out what it is. They can make guesses based on the metrics they see. If CPU usage is high, it might be because traffic has increased. If memory usage is high, it might be because there's a memory leak in the application. But these are just guesses.
Systems need to emit telemetry that points to what the problem is. That's difficult, because it's impossible to cover every failure scenario. Even if you instrument your application with logs, at some point you'll need more context; the system should be observable.
Observability is what will help you troubleshoot better in production. You'll need to zoom in and zoom out over and over again: take a different path, dive deep into logs, read stack traces, and do everything you can to answer new questions and find out what's causing problems.
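As a small illustration of adding that extra context to logs, here is a sketch (not from the original text) using SLF4J's MDC; the class name and fields such as orderId are made up for illustration:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class CheckoutService {
    private static final Logger log = LoggerFactory.getLogger(CheckoutService.class);

    void processOrder(String orderId, String userId) {
        // Attach request-scoped context so every log line can be correlated later.
        MDC.put("orderId", orderId); // hypothetical field names
        MDC.put("userId", userId);
        try {
            log.info("processing order");
            // ... business logic ...
        } catch (RuntimeException e) {
            // The stack trace plus the MDC context answers "which request failed?".
            log.error("order processing failed", e);
            throw e;
        } finally {
            MDC.clear();
        }
    }
}
```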
Here are some situations for which observability is required:
"Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs."
Observability has three main "pillars": logs, metrics, and traces.
Tracing:
A trace is a representation of a series of causally related distributed events that encode the end-to-end request flow through a distributed system.
Traces are a representation of logs; the data structure of traces looks almost like that of an event log. A single trace can provide visibility into both the path traversed by a request as well as the structure of a request. The path of a request allows software engineers and SREs to understand the different services involved in the path of a request, and the structure of a request helps one understand the junctures and effects of asynchrony in the execution of a request.
Although discussions about tracing tend to pivot around its utility in a microservices environment, it’s fair to suggest that any sufficiently complex application that interacts with—or rather, contends for—resources such as the network, disk, or a mutex in a nontrivial manner can benefit from the advantages tracing provides.
The basic idea behind tracing is straightforward: identify specific points (function calls, RPC boundaries, or segments of concurrency such as threads, continuations, or queues) in an application, proxy, framework, library, runtime, middleware, and anything else in the path of a request that represents a fork in execution flow or a hop across network or process boundaries.
Traces are used to identify the amount of work done at each layer while preserving causality, using happens-before semantics. A trace is a directed acyclic graph (DAG) of spans, where the edges between spans are called references.
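To make the span model concrete, here is a minimal sketch using the OpenTelemetry Java API (discussed later in this article); the service and operation names are illustrative. The child span is causally related to the parent through the current context, which is exactly the parent-child reference that forms the edges of the trace DAG:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class OrderHandler {
    // Instrumentation scope name is an assumption for this sketch.
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("order-service");

    void handleRequest() {
        // Parent span: represents the incoming request at this service boundary.
        Span parent = tracer.spanBuilder("GET /orders").startSpan();
        try (Scope ignored = parent.makeCurrent()) {
            loadOrders(); // the child span below links to this parent via the current context
        } finally {
            parent.end();
        }
    }

    void loadOrders() {
        // Child span: a causally related unit of work (e.g., a database call).
        Span child = tracer.spanBuilder("SELECT orders").startSpan();
        try (Scope ignored = child.makeCurrent()) {
            // ... query the database ...
        } finally {
            child.end();
        }
    }
}
```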
When measuring app performance, the most important metrics to keep an eye on are error rate, latency, and throughput.
On one of the education-based applications, we found the problems below:
Problem 1: Microservices Error Rate Issues
1. Find the error
2. Find the trace where the error exists (see the sketch below).
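An error only shows up in a trace if the instrumentation records it. Below is a minimal sketch using the OpenTelemetry Java API; the service name, endpoint, and exception handling are assumptions for illustration, not details from the application described above:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class EnrollmentClient {
    // Hypothetical instrumentation scope name.
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("enrollment-service");

    void enroll(String courseId) {
        Span span = tracer.spanBuilder("POST /enrollments").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            // ... call the downstream microservice ...
        } catch (RuntimeException e) {
            // Mark the span as failed and attach the exception, so the trace
            // can be found when filtering by errors in the tracing backend.
            span.recordException(e);
            span.setStatus(StatusCode.ERROR, e.getMessage());
            throw e;
        } finally {
            span.end();
        }
    }
}
```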
Problem 2: Resolve Microservices Latency Issues
By looking at the slowest transactions and their distributed traces, we can see where the problem is.
Three steps will help you find latency issues:
The transactions overview page also allows you to easily drill down into detailed information so you can quickly resolve errors and performance issues.
For dramatically better results, tune during the design phase rather than waiting to tune after implementing your system.
Tools such as Google Stackdriver Trace, Spring Cloud Sleuth, SolarWinds TraceView, or Zipkin can trace transactions end-to-end and show delays for individual components.
For many years, the most important metric for application servers was throughput: how many transactions or requests can we squeeze out of this application? The adoption of finer-grained architecture styles, such as microservices and functions, has broadened the performance characteristics we should care about to include memory footprint and startup time.
Where you once deployed a small number of runtime instances for your server and application, you may now be deploying tens or hundreds of microservices or functions. Each of those instances comes with its own server; whether it's embedded in a runnable jar or pulled in as a container image, there's a server in there somewhere. While cold startup time is critically important for cloud functions (and Open Liberty's time-to-first-response is approximately 1 second, so it's no slouch), memory footprint and throughput are far more important for microservices architectures, in which each running instance is likely to serve thousands of requests before it's replaced.
Given the expected usage profile of tens to hundreds of services handling thousands of requests, reduced memory footprint and higher throughput directly influence cost, both in terms of infrastructure and license costs (if you pay for service and support).
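As a rough sketch of how throughput and latency can be measured in process, the example below uses Micrometer's Timer (one common JVM metrics API; the registry choice and metric name are illustrative assumptions, not something prescribed here):

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.concurrent.TimeUnit;

public class ThroughputSketch {
    public static void main(String[] args) {
        // In a real service the registry would ship to a monitoring backend
        // (Prometheus, Datadog, etc.) instead of staying in memory.
        MeterRegistry registry = new SimpleMeterRegistry();
        Timer requests = Timer.builder("http.server.requests") // illustrative metric name
                .description("latency of handled requests")
                .register(registry);

        for (int i = 0; i < 10_000; i++) {
            requests.record(ThroughputSketch::handleRequest);
        }

        // Throughput is the count over the measurement window; mean latency comes for free.
        System.out.printf("requests handled: %d%n", requests.count());
        System.out.printf("mean latency: %.3f ms%n", requests.mean(TimeUnit.MILLISECONDS));
    }

    private static void handleRequest() {
        // Simulated work standing in for a real request handler.
    }
}
```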
OpenTelemetry was born from two powerful observability standards whose only drawback was the fact that there were two of them. The CNCF's OpenTracing and Google's OpenCensus have been leading the way in tracing and in gathering metrics through their APIs. While the two took different routes in terms of architecture, they have very similar functionality.
Primarily built to run in Kubernetes, Jaeger was originally developed by Uber and was later donated to the CNCF. Jaeger implements the OpenTracing standard to help organizations monitor microservice-based distributed systems with features like root cause analysis and distributed transaction monitoring.
Unlike Jaeger, that’s made up of five primary components, Zipkin is one single process, making deployment a lot simpler. Originally developed by Twitter, Zipkin is implemented in Java, has an OpenTracing compatible API, and supports almost every programming language and development platform available today.
Datadog is a monitoring service for IT, Dev and Ops teams who write and run applications at scale, and want to turn the massive amounts of data produced by their apps, tools and services into actionable insight.
New Relic is a SaaS-based web and mobile application performance management provider for the cloud and the datacenter. It provides code-level diagnostics for dedicated infrastructure, the cloud, or hybrid environments, as well as real-time monitoring.
Observable systems are easier to understand, easier to control, and easier to fix than those that are not.
As systems change — either on purpose because you’re deploying new software, new configuration, or scaling up or scaling down, or because of some other unknown action — observability enables developers to understand when the system starts changing its state.
Modern observability tools can automatically identify a number of issues and their causes, such as failures caused by routine changes, regressions that only affect specific customers, and downstream errors in services and third-party SaaS.