While often used interchangeably, monitoring and observability have distinct meanings in the world of DevOps.
Monitoring is about knowing whether your system is working. It involves collecting predefined metrics and health checks to answer known questions. For example, “Is the server’s CPU usage high?” or “Is the application available?” Monitoring is well suited to alerting you to problems you anticipate.
Observability is about understanding why a system is not working. It involves collecting a wide variety of data (metrics, logs, and traces) so that engineers can ask questions about a system’s behavior that nobody anticipated in advance. This helps you debug complex, unfamiliar issues by providing the data needed to trace the root cause. It gives you insight into the internal state of your system based on its external outputs.
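The difference can be sketched in a few lines of Python. This is an illustration only: the event fields, the 0.25 threshold, and the function names are all hypothetical. The first check answers a question decided in advance; the second groups the same raw events by whichever field you happen to be curious about.

```python
from collections import Counter

# Hypothetical request events; the field names are assumptions for this sketch.
EVENTS = [
    {"service": "checkout", "status": 200, "region": "eu", "latency_ms": 120},
    {"service": "checkout", "status": 500, "region": "eu", "latency_ms": 920},
    {"service": "checkout", "status": 500, "region": "eu", "latency_ms": 880},
    {"service": "checkout", "status": 200, "region": "us", "latency_ms": 110},
    {"service": "search", "status": 200, "region": "eu", "latency_ms": 45},
    {"service": "search", "status": 200, "region": "us", "latency_ms": 50},
]


def error_rate(events):
    """Fraction of events with a 5xx status."""
    errors = sum(1 for e in events if e["status"] >= 500)
    return errors / len(events)


def error_rate_is_high(events, threshold=0.25):
    """Monitoring: a predefined check against a known threshold."""
    return error_rate(events) > threshold


def errors_by(events, field):
    """Observability: slice the failures by any field, chosen after the fact."""
    return Counter(e[field] for e in events if e["status"] >= 500)


if error_rate_is_high(EVENTS):
    # The alert says something is wrong; the breakdowns help explain where.
    print(errors_by(EVENTS, "service"))
    print(errors_by(EVENTS, "region"))
```

The alert only works because someone chose the threshold ahead of time. The breakdown works for any field that was recorded, which is why observability depends on keeping rich raw data.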
Observability relies on three key data types, often referred to as the “pillars.”
Metrics: These are numerical measurements collected at regular intervals. Metrics are best for time-series analysis and monitoring system health. Examples include CPU utilization, request count per second, and error rates. Metrics are efficient for storage and querying, making them ideal for dashboards and alerts.
Logs: Logs are timestamped, immutable records of events that occurred in a system. They are the “story” of what happened. For example, a log might say, “User ‘john.doe’ logged in at 13:00 UTC” or “Error: Database connection failed.” Logs are crucial for debugging and post-mortem analysis.
Traces: A trace is a record of a single request as it flows through a distributed system. In a microservices architecture, a single user request might touch dozens of different services. A trace captures the entire journey, showing the time spent in each service. This helps pinpoint performance bottlenecks and failures in complex systems.
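To make the three data types concrete, here is a minimal sketch of what one failing request might leave behind in each. Everything in it is hypothetical (the service names, the timings, the date in the timestamp, and the helper names), and real systems would use a metrics client, a logging library, and a tracing SDK rather than hand-rolled structures.

```python
import json
from collections import Counter
from dataclasses import dataclass

# Metric: a number that is cheap to aggregate, keyed here by (service, status).
request_count = Counter()


def log_event(timestamp, level, message, **fields):
    """Log: one timestamped record of an event, serialized as a JSON line."""
    record = {"ts": timestamp, "level": level, "msg": message, **fields}
    return json.dumps(record, sort_keys=True)


@dataclass(frozen=True)
class Span:
    """Trace: one timed step of a request, tied to the others by trace_id."""
    trace_id: str
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


# One hypothetical request that fails in the checkout service.
trace_id = "trace-001"
request_count[("checkout", 500)] += 1
log_line = log_event(
    "2024-01-01T13:00:00Z", "ERROR", "Database connection failed",
    trace_id=trace_id,
)
spans = [
    Span(trace_id, "gateway", 0, 500),
    Span(trace_id, "auth", 10, 40),
    Span(trace_id, "checkout", 50, 480),
]

# Time spent in each service for this request.
durations = {span.service: span.duration_ms for span in spans}
print(request_count)
print(log_line)
print(durations)
```

Putting the same trace_id in the log line and in the spans is what lets you move from an error message to the exact request that produced it.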
To collect and visualize these data types, DevOps teams use a variety of tools.
Prometheus: An open-source monitoring system and time-series database. It “pulls” (scrapes) metrics over HTTP from applications and services at regular intervals and stores them as time series. It also has a powerful query language (PromQL) for analysis.
Grafana: An open-source data visualization tool. While Prometheus collects the data, Grafana is what makes it readable. It can connect to many different data sources (including Prometheus) and allows you to build custom dashboards that display your metrics in near real time.
The ELK Stack: A common combination for logging. It consists of:
Elasticsearch: A powerful search and analytics engine for storing and querying logs.
Logstash: A data processing pipeline that ingests logs from various sources, transforms them, and sends them to a destination like Elasticsearch.
Kibana: A visualization and dashboarding tool that sits on top of Elasticsearch, allowing you to explore and analyze your log data.
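Two of the mechanisms described above are easy to sketch without installing anything. The first function renders counters in the plain-text format that Prometheus reads when it scrapes an application. The second mimics the transform step Logstash performs, turning a raw log line into a structured document ready to be sent to Elasticsearch. This is a rough illustration and not how either tool is implemented: the function names, the metric name, and the log line layout are assumptions, and the renderer skips label escaping.

```python
def render_exposition(name, help_text, samples):
    """Render counter samples in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs. Label values are not
    escaped here, which a real client library would handle.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


def transform_log_line(raw):
    """Split a '<timestamp> <level> <message>' line into named fields.

    The line layout is an assumption for this sketch; the '@timestamp' and
    'message' keys follow common Logstash conventions.
    """
    timestamp, level, message = raw.strip().split(" ", 2)
    return {"@timestamp": timestamp, "level": level, "message": message}


# What an application might expose for Prometheus to scrape.
metrics_text = render_exposition(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "GET", "status": "200"}, 3)],
)
print(metrics_text)

# What a pipeline stage might hand to Elasticsearch for indexing.
document = transform_log_line(
    "2024-01-01T13:00:00Z ERROR Database connection failed\n"
)
print(document)
```

In practice you would use an official Prometheus client library to expose metrics and a Logstash pipeline configuration to parse logs. The sketch is only meant to show the shape of the data each tool works with.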