The words observability and monitoring are often used interchangeably, without regard for the difference between them.
Monitoring is alerting based on known, predefined metrics and log patterns.
Examples of events are:
- Trigger a minor event when CPU is above 70% for 5 minutes
- Trigger a major event when partition occupancy reaches 90%
The threshold in the last example could be improved by taking the partition size into account, as 10% remaining space is obviously not the same on a 1 GB partition (100 MB) as on a 1 TB partition (100 GB). Still, it will be predefined.
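The two event rules above, including the size-aware refinement, can be sketched in a few lines of Python. This is only an illustration: the function names, the one-sample-per-minute assumption and the 10 GB absolute floor are hypothetical choices, not taken from any monitoring tool.

```python
GB = 10 ** 9  # decimal gigabyte, matching the 1 GB / 1 TB figures in the text


def cpu_event(samples, threshold=70.0, window=5):
    """Return "minor" if the last `window` samples are all above `threshold`.

    Assumes one CPU sample (in percent) per minute, so window=5 means 5 minutes.
    """
    if len(samples) < window:
        return None  # not enough history to decide yet
    recent = samples[-window:]
    return "minor" if all(value > threshold for value in recent) else None


def partition_event(total_bytes, used_bytes, max_used_percent=90, min_free_bytes=10 * GB):
    """Return "major" when a partition is too full, else None.

    Size-aware variant (an assumption of this sketch): the percentage rule
    only fires if the remaining space is also below an absolute floor.
    """
    free_bytes = total_bytes - used_bytes
    too_full = used_bytes * 100 >= total_bytes * max_used_percent
    too_little_left = free_bytes < min_free_bytes
    return "major" if too_full and too_little_left else None
```

With these values, a 1 GB partition at 90% raises a major event (only 100 MB left), while a 1 TB partition at 90% does not, since 100 GB is still free. Both thresholds remain predefined, which is the point of the paragraph above.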
An example of a log pattern is “trigger a critical event when the OutOfMemory string is found in the WebLogic log”.
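As a rough illustration of the log-pattern rule above, the following Python sketch scans log lines for a predefined pattern and produces one event per match. The function name and the event dictionary layout are hypothetical, and the sample lines in the usage note are made up rather than real WebLogic output.

```python
import re

# Predefined pattern, as in the example above
OOM_PATTERN = re.compile(r"OutOfMemory")


def scan_log_lines(lines, pattern=OOM_PATTERN, severity="critical"):
    """Return one event per line that matches the predefined pattern."""
    events = []
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            events.append(
                {"severity": severity, "line": number, "message": line.rstrip("\n")}
            )
    return events
```

Calling `scan_log_lines(["server started", "java.lang.OutOfMemoryError: Java heap space"])` returns a single critical event pointing at line 2. A real agent would follow the file continuously instead of reading a list, but the rule itself stays a fixed pattern.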
Both Combined is Monitoring
The combination of events and logs might not be sufficient to troubleshoot a new issue that has never been seen before. It might give clues and directions for investigation, but expertise will still be required. Based on these new issues, metrics and patterns will evolve to cover new cases.
All of these are monitored by an agent that sends events to a central server, which displays them in dashboards. Nagios, Icinga and Oracle Enterprise Manager are typical tools.
In addition to existing metrics and log parsing, observability adds “tracing”. Tracing, also known as distributed tracing, is an evolved form of logging for software. It gives details on what the bug or performance issue is and where it occurs. The software developer needs to instrument the application for tracing to work. OpenTelemetry is a must for that need. Depending on the programming language, there are also “automatic” tracing possibilities.
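To give an idea of what instrumenting an application means, here is a toy tracer written with the Python standard library only. It is not the OpenTelemetry API: the `ToyTracer` class, its `span` method and the record fields are hypothetical names chosen for this sketch. It only shows the core idea that each unit of work becomes a timed span linked to its parent within one trace.

```python
import time
import uuid
from contextlib import contextmanager


class ToyTracer:
    """Records nested, timed spans that share a trace id (illustration only)."""

    def __init__(self):
        self.finished = []  # completed spans, in the order they finished
        self._stack = []    # spans that are currently open

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        record = {
            "name": name,
            # A child span reuses the trace id of its parent
            "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
            "span_id": uuid.uuid4().hex[:16],
            "parent_id": parent["span_id"] if parent else None,
        }
        self._stack.append(record)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_s"] = time.perf_counter() - start
            self._stack.pop()
            self.finished.append(record)


tracer = ToyTracer()
with tracer.span("handle_request") as root:
    with tracer.span("query_database") as child:
        time.sleep(0.01)  # stands in for the real work being measured
```

After this runs, `tracer.finished` holds two spans: `query_database`, whose `parent_id` is the span id of `handle_request`, and `handle_request` itself. A real tracing library additionally propagates the trace id across services and exports the spans to a backend.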
Logs Management Improvements
Logs have existed for as long as computers have. With the increase in complexity, such as micro-services architectures, and in system elasticity (scaling up and down), going to a server and tailing a log is not the most efficient way of working. The new way is to centralize and aggregate logs in one place, which makes troubleshooting easier. Examples of agents, also known as log exporters, are Filebeat and Fluentd. An example of a central store is Elasticsearch for storage, with Kibana for searching and displaying.
For metrics, you usually need an exporter like the one I described in my “monitoring with Prometheus” blog series. It exposes metrics over HTTP, and the Prometheus server pulls them periodically. It is also possible to use a push method for short-lived processes such as batch jobs.
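The pull model can be illustrated with a minimal exporter built on the Python standard library. This is a sketch under assumptions, not the exporter from the blog series: the metric name `demo_queue_length`, its value, the `read_metrics` helper and port 8000 are all hypothetical. The output follows the Prometheus text exposition format (`# HELP`, `# TYPE`, then the sample).

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(metrics):
    """Render {name: (help_text, value)} as Prometheus text-format gauges."""
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def read_metrics():
    # Hypothetical fixed value; a real exporter reads it from the monitored system
    return {"demo_queue_length": ("Number of items waiting in the demo queue.", 3)}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(read_metrics()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on /metrics."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint, and a Prometheus server configured to scrape that host and port would then collect the values periodically. The exporter itself never contacts Prometheus, which is what distinguishes pull from push.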
Prometheus has two main features:
- Storage within a time series database
It also offers dashboard possibilities, but Grafana is often preferred as it has a very wide variety of panels.
Digital Customer Experience
Customer experience monitoring takes observability a step further. In other words, it gathers metrics and information directly from users (for example, from their web browsers).
Not only does it give awareness of how users experience the application, but it can also help prioritize where performance optimization matters the most.
The Future: AI Ops
With all these metrics, logs and events, one may become overwhelmed. Machine learning will be able to help find patterns in this large amount of data, both during a burning incident and proactively, before an incident even happens.