Creating a Path for Prometheus Success

A look at the challenges that can easily disrupt smooth operations with Prometheus, and how to overcome them.

Feb 28th, 2024 6:21am by Arthur Sens

Featured image for: Creating a Path for Prometheus Success

Image from djgis on Shutterstock.

Prometheus is an easy-to-use, open source monitoring and alerting toolkit. Its popularity is no doubt due to its efficient time-series database, flexible querying language (PromQL) and general scalability. Furthermore, its support for dynamic service discovery, native integration with Kubernetes and alerting capabilities make it a great choice for monitoring in dynamic, cloud native environments. Prometheus also has an active, open source community that contributes to continuous improvements and growing adoption.

Yet despite all the benefits that Prometheus offers, many challenges can easily disrupt smooth operations. Let’s take a look at some of them.

A Tale of Cardinal Inexperience

It’s quite common for people who are inexperienced with Prometheus to encounter high-cardinality problems. These issues can lead to Prometheus instances growing much faster than expected, thereby creating scalability and performance problems.

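In Prometheus, cardinality refers to the number of unique time series, where each series is identified by a metric name plus a unique combination of label names and values. A high-cardinality situation occurs when a metric’s labels take on a large number of distinct values, because every new combination creates another series.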

This often arises from misuse or misunderstanding of labels. For example, adding highly dynamic labels (like timestamps, unique identifiers or user IDs) to metrics can rapidly increase the number of time series stored.
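To make the effect of a dynamic label concrete, here is a minimal sketch in Python. It assumes the worst case, where every combination of label values actually occurs, so the series count for a metric is the product of the distinct values per label. The label names (`method`, `status`, `user_id`) and the figure of 10,000 users are hypothetical, chosen only for illustration.

```python
def series_count(label_values):
    """Upper bound on the series for one metric: the product of the
    number of distinct values each label can take."""
    count = 1
    for values in label_values.values():
        count *= len(set(values))
    return count


# Hypothetical labels on a request counter: all values are bounded.
bounded = {
    "method": ["GET", "POST", "PUT"],
    "status": ["200", "404", "500"],
}

# The same metric with a hypothetical user_id label added.
with_user_id = dict(bounded, user_id=[f"user-{i}" for i in range(10_000)])

bounded_series = series_count(bounded)         # 3 * 3 = 9
unbounded_series = series_count(with_user_id)  # 3 * 3 * 10,000 = 90,000
print(bounded_series, unbounded_series)
```

A single unbounded label multiplies the series count of the metric by the number of values it takes, which is why one well-meaning label can grow an instance far faster than expected.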

This can result in a series of unfortunate events:

Increased Storage Requirements

High cardinality leads to a dramatic increase in the number of time series that Prometheus needs to store, which can quickly consume storage resources. Of course, this can get expensive.
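A rough back-of-the-envelope estimate shows how series count turns into disk usage. The sketch below multiplies ingested samples per second by the retention period and an average sample size. The default of 2 bytes per sample is an assumption based on the rough 1 to 2 bytes per sample figure given in the Prometheus storage documentation; the scrape interval, retention and series counts are hypothetical, and the estimate ignores index and memory overhead.

```python
def estimated_disk_bytes(active_series, scrape_interval_s, retention_days,
                         bytes_per_sample=2.0):
    """Rough on-disk size: samples/second * retention seconds * bytes/sample.

    bytes_per_sample is an assumed average, not a guaranteed value.
    """
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return samples_per_second * retention_seconds * bytes_per_sample


# Hypothetical setup: 15-second scrape interval, 15 days of retention.
small = estimated_disk_bytes(active_series=9, scrape_interval_s=15, retention_days=15)
large = estimated_disk_bytes(active_series=90_000, scrape_interval_s=15, retention_days=15)

print(f"{small / 1e6:.2f} MB vs {large / 1e9:.2f} GB")
```

Under these assumptions, storage grows linearly with the number of active series, so a 10,000-fold jump in cardinality means roughly a 10,000-fold jump in disk usage.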

Performance Degradation

Query performance can suffer significantly in high-cardinality scenarios. Prometheus has to process a larger number of time series, which can slow down query responses and increase CPU and memory usage.

Management Overhead

Managing and maintaining a Prometheus instance with high cardinality becomes more challenging. It requires more careful tuning and possibly more sophisticated infrastructure solutions.

Making Sure Your Storage Management Doesn’t Go A-WAL

The write-ahead log (WAL) in Prometheus is a mechanism used to ensure data integrity and prevent data loss in case of a crash or unexpected shutdown. Whenever Prometheus records new data, it first writes that data to the WAL, housed on the filesystem of the server where Prometheus is running, before it is written to the database.

This approach means that if Prometheus restarts for any reason, it can use the WAL to recover any data that was not yet written to the database. The WAL acts as a record of what should be in the database, ensuring that no data is lost if the system crashes.

However, one of the main challenges with the WAL is the time it takes to replay it after a crash or restart. When Prometheus restarts, it needs to process the WAL to reconstruct its in-memory state. This process can be time-consuming, especially if there’s a lot of data in the WAL.

In practical terms, this means that if the WAL replay takes a long time, Prometheus can experience significant downtime, during which monitoring and alerting are unavailable. That is not exactly ideal for systems that rely on real-time monitoring.
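The write-then-replay pattern described in this section can be sketched with a toy store in Python. This is not how Prometheus implements its WAL: the `TinyStore` class, the JSON-lines file format and the file name are all assumptions made for illustration. The point is only that every sample is logged before it is applied in memory, and that a restart has to read the whole log back, so replay time grows with the size of the log.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy in-memory store backed by a write-ahead log (illustrative only)."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.samples = {}   # series name -> list of (timestamp, value)
        self.replayed = 0   # number of WAL records read at startup
        self._replay()
        self._wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self):
        # Rebuild in-memory state by reading every record in the log.
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as wal:
            for line in wal:
                series, ts, value = json.loads(line)
                self.samples.setdefault(series, []).append((ts, value))
                self.replayed += 1

    def append(self, series, ts, value):
        # Log first and force it to disk, then apply the sample in memory.
        self._wal.write(json.dumps([series, ts, value]) + "\n")
        self._wal.flush()
        os.fsync(self._wal.fileno())
        self.samples.setdefault(series, []).append((ts, value))

    def close(self):
        self._wal.close()


wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")

store = TinyStore(wal_path)
store.append("up", 1, 1.0)
store.append("up", 2, 0.0)
store.close()  # stand-in for a crash: the in-memory state is discarded

recovered = TinyStore(wal_path)  # a "restart" replays the log
print(recovered.samples["up"], recovered.replayed)
recovered.close()
```

In this sketch the restarted store is unusable until `_replay` has read every record, which mirrors why a large WAL translates into a long startup.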

Scaling without Complexity? LOL!

Handling scalability in Prometheus, especially in large-scale and dynamic environments, often requires adopting additional strategies and tools. Although Prometheus is a monolithic application, it has many individual features: scraping and storing metrics, returning metrics through queries, evaluating alerting and recording rules, and more.

If a particular setup depends heavily on a single Prometheus feature, you may be forced to scale up the entire Prometheus instance even though you really only need to scale one part of it. This is where distributed setups and tools like Thanos and Cortex come into play.

Both of them help extend Prometheus by adding a global query view, supporting the Prometheus query API natively, and providing efficient storage and multicluster support. They also allow for long-term storage of Prometheus metrics in object storage (like AWS S3 or Google Cloud Storage), making it more cost-effective and scalable. However, while Thanos and Cortex components can be scaled separately, thereby solving the monolithic scaling issue of Prometheus, all of these additional components require some level of expertise and effort to maintain.

In short, while immensely helpful, both Thanos and Cortex introduce additional components into the monitoring architecture, which increases complexity in terms of deployment, management and troubleshooting.

Creating a Framework for Success

If you want to use Prometheus without encountering these storage and scalability woes, join our presentation on using Prometheus-Operator at the CNCF-hosted co-located Events Europe, as well as our hands-on workshop a few days later at KubeCon.

You’ll learn how to reap all the rewards of Prometheus without risking a thing. You’ll also get to meet me and my colleague Nicolas Takashi — we’re platform engineers at Coralogix — along with our esteemed co-presenters, Bartłomiej Płotka and Mahmoud Amin, senior software engineers at Google, and Jesus Vazquez from Grafana.

See you there!

To learn more about Kubernetes and the cloud native ecosystem, join us at KubeCon + CloudNativeCon Europe in Paris, from March 19–22.

Arthur Sens is a platform engineer at Coralogix, with a mixture of site reliability engineering and software engineering backgrounds. He actively contributes to the Prometheus ecosystem, maintaining Prometheus-Operator and Prometheus client_golang while mentoring new open source software contributors.