Mastering PromQL :- Advanced Query Techniques in Prometheus

Prometheus Query Language (PromQL) is a powerful tool for extracting and analyzing time-series data collected by Prometheus. While basic queries are relatively straightforward, mastering PromQL involves understanding its advanced syntax and capabilities. This blog provides an in-depth tutorial on advanced PromQL querying techniques, demonstrates how to write complex queries that extract meaningful insights, and offers tips on performance considerations and optimization.

Understanding PromQL Syntax and Capabilities

Basic PromQL Recap

PromQL queries consist of expressions that select and aggregate time-series data. The basic syntax involves :-

Selectors :- Identifying specific metrics and filtering by labels.
Operators :- Performing mathematical operations and transformations.
Functions :- Applying predefined functions to manipulate data.

Advanced Selectors

Selectors are the core of PromQL, allowing you to filter and refine the time-series data you need.

Label Matchers

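In addition to exact matches (=) and negative matches (!=), PromQL supports regular-expression label matching :-

Regex Match (=~) :- Selects time-series where the label matches the given regular expression.
Negative Regex Match (!~) :- Excludes time-series where the label matches the given regular expression.

Regular expressions are fully anchored, so "GET" matches only the exact value GET, while "GET.*" matches any value that starts with GET.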

Example :- Select all HTTP requests except those with the GET method :-

http_requests_total{method!~"GET"}
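To make the matcher semantics concrete, here is a minimal Python sketch of how the four matcher types could be evaluated against a single label value. The matches function and the sample label values are hypothetical, and Python's re module stands in for the RE2 syntax that Prometheus actually uses; the point is only to show the effect of full anchoring.

```python
import re

def matches(label_value: str, op: str, pattern: str) -> bool:
    """Evaluate one PromQL-style label matcher against one label value."""
    if op == "=":
        return label_value == pattern
    if op == "!=":
        return label_value != pattern
    if op == "=~":
        # fullmatch mimics the implicit ^...$ anchoring of PromQL regexes
        return re.fullmatch(pattern, label_value) is not None
    if op == "!~":
        return re.fullmatch(pattern, label_value) is None
    raise ValueError(f"unknown matcher operator: {op}")

# Hypothetical values of the method label
methods = ["GET", "POST", "PUT", "GETALL"]
kept = [m for m in methods if matches(m, "!~", "GET")]
print(kept)  # ['POST', 'PUT', 'GETALL']
```

Note that GETALL survives the filter because the pattern must match the whole value, not just a prefix.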

Range Vectors

A range vector selector returns, for each time-series, all samples recorded within a specified time window. This is useful for functions that operate on a window of data, such as rate().

Example :- Calculate the per-second rate of increase of HTTP requests, averaged over the last 5 minutes :-

rate(http_requests_total[5m])
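To build intuition for what rate() does with the samples in a range vector, here is a simplified Python sketch. It is an approximation under stated assumptions: simple_rate is a hypothetical helper, the samples are made up, and the real rate() additionally extrapolates to the edges of the window, which this sketch ignores. It does show the two key ideas: the result is a per-second value, and a drop in the counter is treated as a reset.

```python
def simple_rate(samples):
    """Approximate per-second rate for a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when there are fewer than two samples in the window.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A lower value means the counter was reset, so count from zero
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# Hypothetical counter scraped every 60 seconds, with a reset before t=180
window = [(0, 100), (60, 160), (120, 220), (180, 10), (240, 70)]
print(round(simple_rate(window), 4))  # 0.7917
```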

Advanced Aggregation

Aggregation operators enable you to summarize data across multiple dimensions.

Aggregation with Grouping

You can aggregate metrics by one or more labels using the by or without clauses.

Example :- Sum the total HTTP requests by method :-

sum by (method) (http_requests_total)

Example :- Average the per-core CPU time rates across all cores, keeping every other label (such as instance and mode) :-

avg without (cpu) (rate(node_cpu_seconds_total[5m]))
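The difference between by and without comes down to which labels form the grouping key. The following Python sketch illustrates this for a sum aggregation; aggregate_sum and the sample series are hypothetical, and label sets are modelled as plain dictionaries without the metric name.

```python
from collections import defaultdict

def aggregate_sum(series, by=None, without=None):
    """Sum series values, grouped by a label set.

    series: list of (labels_dict, value).
    by: keep only these labels in the grouping key.
    without: drop these labels and keep all others.
    """
    groups = defaultdict(float)
    for labels, value in series:
        if by is not None:
            kept = {k: v for k, v in labels.items() if k in by}
        else:
            dropped = set(without or ())
            kept = {k: v for k, v in labels.items() if k not in dropped}
        groups[tuple(sorted(kept.items()))] += value
    return dict(groups)

# Hypothetical http_requests_total series
series = [
    ({"instance": "a", "method": "GET"}, 10.0),
    ({"instance": "b", "method": "GET"}, 5.0),
    ({"instance": "a", "method": "POST"}, 3.0),
]
by_method = aggregate_sum(series, by=["method"])
without_instance = aggregate_sum(series, without=["instance"])
print(by_method == without_instance)  # True
```

The two results are equal here only because instance and method are the only labels. With more labels, without preserves the extra ones while by discards them.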

Topk and Bottomk

topk and bottomk are aggregation operators that select the k time-series with the highest or lowest values.

Example :- Select the top 3 HTTP methods by request count :-

topk(3, sum by (method) (http_requests_total))
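As a rough analogy, the selection step can be pictured with Python's heapq module. The per-method totals below are made-up numbers standing in for the result of the inner sum by (method) aggregation.

```python
import heapq

# Hypothetical per-method totals, as produced by the inner aggregation
totals = {"GET": 1200.0, "POST": 300.0, "PUT": 45.0, "DELETE": 12.0, "PATCH": 7.0}

top3 = heapq.nlargest(3, totals.items(), key=lambda item: item[1])
bottom2 = heapq.nsmallest(2, totals.items(), key=lambda item: item[1])

print(top3)     # [('GET', 1200.0), ('POST', 300.0), ('PUT', 45.0)]
print(bottom2)  # [('PATCH', 7.0), ('DELETE', 12.0)]
```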

Writing Advanced Queries to Extract Meaningful Insights

Advanced queries combine selectors, functions and aggregations to derive deeper insights from your metrics.

Use Case :- CPU Usage Analysis

Query 1 :- Average CPU Usage per Node

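To calculate the average CPU usage per node excluding idle time, subtract the average idle fraction across cores from 1. Averaging the non-idle series directly would average over every (cpu, mode) pair and understate the real usage :-

1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))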
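A quick way to sanity-check a CPU usage query is to work through the arithmetic on a handful of series. The Python sketch below uses made-up per-second rates of node_cpu_seconds_total for a single two-core instance and compares two ways of combining them: one minus the average idle fraction, and a plain average over the non-idle series. The second divides by the number of (cpu, mode) series, so the result shrinks as more modes are reported.

```python
# Hypothetical per-second rates for one instance, keyed by (cpu, mode).
# For each core, the rates across all modes add up to roughly 1.
rates = {
    ("0", "idle"): 0.40, ("0", "user"): 0.45, ("0", "system"): 0.15,
    ("1", "idle"): 0.80, ("1", "user"): 0.15, ("1", "system"): 0.05,
}

idle = [v for (cpu, mode), v in rates.items() if mode == "idle"]
busy = [v for (cpu, mode), v in rates.items() if mode != "idle"]

# 1 - avg(idle rate): the fraction of time the cores were busy
utilization = 1 - sum(idle) / len(idle)

# avg over non-idle series: averages across (cpu, mode) pairs instead
naive_avg = sum(busy) / len(busy)

print(round(utilization, 2))  # 0.4
print(round(naive_avg, 2))    # 0.2
```

The cores are 60% and 20% busy, so 40% is the figure that matches intuition.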

Query 2 :- High CPU Utilization

To find nodes with high CPU utilization (e.g. nodes where CPU usage, computed as 1 minus the average idle fraction, is above 80%) :-

(1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8

Use Case :- Memory Usage Monitoring

Query 3 :- Memory Usage Percentage

To calculate the percentage of used memory :-

(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

Query 4 :- Memory Usage Growth Rate

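To find the rate of increase in memory usage over the last hour, use deriv() on the used-memory expression. Memory metrics are gauges, so rate(), which is meant for counters, does not apply, and node_memory_MemTotal_bytes on its own is effectively constant. The result is in bytes per second :-

deriv((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[1h:1m])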

Use Case :- Network Traffic Analysis

Query 5 :- Network Traffic per Interface

To calculate the total incoming network traffic per interface :-

sum by (instance, device) (rate(node_network_receive_bytes_total[5m]))

Query 6 :- Detecting Network Spikes

To identify network interfaces with a sudden increase in traffic (e.g. more than 100 MiB received in 5 minutes), use increase(), which returns the total growth over the window rather than a per-second rate :-

increase(node_network_receive_bytes_total[5m]) > 100 * 1024 * 1024

Use Cases and Examples of Complex Queries

Use Case :- Service Latency Analysis

Query 7 :- 95th Percentile Latency

To calculate the 95th percentile of HTTP request latencies :-

histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
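It helps to know that histogram_quantile() estimates the quantile by linear interpolation inside the bucket where the target rank falls, so the result is only as precise as the bucket boundaries. The Python sketch below reproduces that idea under simplifying assumptions: estimate_quantile is a hypothetical helper, the bucket counts are made up, and edge cases handled by the real function (such as negative bucket bounds) are left out.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    q: quantile, strictly between 0 and 1.
    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
             with math.inf as the last upper bound (the +Inf bucket).
    """
    if not 0 < q < 1:
        raise ValueError("q must be strictly between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound
                return lower_bound
            # Linear interpolation between the bucket's lower and upper bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan

# Hypothetical request-duration buckets (le in seconds, cumulative counts)
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = estimate_quantile(0.95, buckets)
print(round(p95, 4))  # 0.8125
```

Here the 95th observation out of 100 lies in the 0.5 to 1.0 second bucket, five eighths of the way through it, which yields about 0.81 seconds.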

Query 8 :- High Latency Detection

To detect instances where the 95th percentile latency exceeds a threshold (e.g. 1 second) :-

histogram_quantile(0.95, sum by (instance, le) (rate(http_request_duration_seconds_bucket[5m]))) > 1

Use Case :- Disk I/O Analysis

Query 9 :- Disk Read/Write Rates

To calculate the read and write rates for disk operations :-

rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

Query 10 :- Disk Utilization

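To find disks with utilization above 90%, take the rate of node_disk_io_time_seconds_total. It is a counter of seconds spent doing I/O, so its per-second rate is the fraction of time the disk was busy :-

rate(node_disk_io_time_seconds_total[5m]) * 100 > 90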

Use Case :- Application Error Rate Monitoring

Query 11 :- Error Rate Calculation

To calculate the error rate of an application as the percentage of requests that returned a 5xx status :-

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

Query 12 :- High Error Rate Detection

To detect when the error rate exceeds 5% :-

(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100) > 5

Performance Considerations and Optimization Tips

Advanced queries can be computationally expensive and impact Prometheus' performance. Here are some tips to optimize your PromQL queries :-

Use Range Vectors Judiciously

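Avoid long time ranges :- Large range vectors (e.g. [1h] or more) can be resource-intensive because every sample in the window must be loaded. Use the smallest range that meets your needs.
Reduce resolution :- For queries over long time spans, use a larger query step. Prometheus itself does not downsample stored data, so downsampling requires recording rules or an external long-term storage system.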

Optimize Label Matching

Exact matches are faster :- Prefer exact matches (=) over regex matches (=~) for better performance.
Minimize label cardinality :- High cardinality (many unique label combinations) can slow down queries. Avoid putting unbounded values such as user IDs or email addresses into labels, and write selectors that match only the series you need.

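Apply rate() Before Aggregating

Order matters for correctness :- Always take rate() of the individual counters first and aggregate afterwards (e.g. sum(rate(...))), never the other way around. rate() needs each raw counter to detect resets, and rate(sum(...)) is not even valid PromQL without a subquery. To reduce the amount of data processed, aggregate away the labels you do not need as early as possible.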
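To see why the order of rate and sum matters, consider two counters where one of them resets after a restart. The Python sketch below uses a simplified rate calculation (simple_rate is a hypothetical helper that ignores the window extrapolation done by the real rate()) and made-up samples for two pods.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter, treating any drop as a reset."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    return increase / (samples[-1][0] - samples[0][0])

# Hypothetical counters as (timestamp_seconds, value); pod_b restarts before t=120
pod_a = [(0, 100), (60, 160), (120, 220)]
pod_b = [(0, 1000), (60, 1060), (120, 20)]

# Rate first, then sum: each reset is detected on its own series
sum_of_rates = simple_rate(pod_a) + simple_rate(pod_b)

# Sum first, then rate: the reset looks like a reset of the whole total
summed = [(t, a + b) for (t, a), (_, b) in zip(pod_a, pod_b)]
rate_of_sum = simple_rate(summed)

print(round(sum_of_rates, 4))  # 1.6667
print(rate_of_sum)             # 3.0
```

Both pods serve roughly one request per second or less, yet summing first reports 3 per second, because the drop in the combined total is misread as the whole counter starting again from zero.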

Use Recording Rules

Precompute frequently used queries :- Recording rules allow you to precompute and store the results of expensive queries, making them available as regular metrics.

Example Recording Rule :- To create a recording rule for average CPU usage :-

  groups:
  - name: cpu_usage
    rules:
    - record: node:cpu_usage:avg1m
      expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))

Shard and Scale Prometheus

Horizontal scaling :- Split scrape targets across multiple Prometheus servers to distribute the load, and use Prometheus federation to aggregate data from them.
Vertical scaling :- Ensure your Prometheus server has sufficient CPU, memory and disk I/O capacity to handle your query load.

Profiling and Monitoring

Monitor Prometheus performance :- Use Prometheus to monitor itself, paying attention to metrics like prometheus_engine_query_duration_seconds and prometheus_tsdb_compaction_duration_seconds.
Profile slow queries :- Enable the query log (the query_log_file setting in the global configuration) to identify slow queries, and use the Go profiling endpoints (localhost:9090/debug/pprof) to profile the Prometheus server process itself.

Conclusion

Mastering PromQL is essential for extracting meaningful insights from your Prometheus metrics. In this blog, we've explored advanced PromQL syntax and capabilities, demonstrated how to write complex queries, and provided tips for optimizing performance.

By leveraging these advanced techniques, you can gain deeper visibility into your systems, troubleshoot issues more effectively, and make data-driven decisions to improve performance and reliability. Stay tuned for more insights and practical tips to enhance your monitoring setup.