Prometheus Query Language (PromQL) is a powerful tool for extracting and analyzing time-series data collected by Prometheus. While basic queries are relatively straightforward, mastering PromQL involves understanding its advanced syntax and capabilities. This post provides an in-depth tutorial on advanced PromQL querying techniques, demonstrates how to write complex queries that extract meaningful insights, and offers tips on performance and optimization.
Understanding PromQL Syntax and Capabilities
Basic PromQL Recap
PromQL queries consist of expressions that select and aggregate time-series data. The basic syntax involves :-
Selectors :- Identifying specific metrics and filtering by labels.
Operators :- Performing mathematical operations and transformations.
Functions :- Applying predefined functions to manipulate data.
Advanced Selectors
Selectors are the core of PromQL, allowing you to filter and refine the time-series data you need.
Label Matchers
In addition to exact matches (=) and negated exact matches (!=), PromQL supports regular-expression label matching :-
Regex Match (=~) :- Selects time-series where the label matches the given regular expression.
Negative Regex Match (!~) :- Excludes time-series where the label matches the given regular expression.
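Example :- Select all HTTP requests except those with the GET or HEAD method. Regular expressions are fully anchored, so to exclude a single value such as GET the simpler != matcher is enough :-
http_requests_total{method!~"GET|HEAD"}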
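The anchoring behaviour of regex matchers is easy to get wrong, so here is a minimal sketch that models it in Python. It assumes Python's re.fullmatch is a close enough stand-in for PromQL's anchored matching (Prometheus itself uses RE2 syntax, which differs from Python's in some features), and the label values are hypothetical :-

```python
import re


def matches(label_value: str, pattern: str) -> bool:
    """Model of the =~ matcher: the pattern must match the whole label value."""
    return re.fullmatch(pattern, label_value) is not None


# Hypothetical values of the method label.
methods = ["GET", "HEAD", "POST", "PUT", "GETALL"]

# method!~"GET|HEAD" keeps every value that the anchored pattern rejects.
kept = [m for m in methods if not matches(m, "GET|HEAD")]
print(kept)
```

Note that GETALL is kept: because the pattern is anchored, GET does not match it as a prefix.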
Range Vectors
Range vectors allow you to select time-series over a specified time range. This is useful for functions that operate on a window of data.
Example :- Calculate the per-second rate of HTTP requests, averaged over the last 5 minutes :-
rate(http_requests_total[5m])
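To make it clearer what rate() does with the samples in a range vector, here is a simplified Python model. It is a sketch, not Prometheus's implementation: the real function also extrapolates to the window boundaries, which is omitted here, and the sample data is made up :-

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase across the window, treating any drop as a counter reset."""
    if len(samples) < 2:
        raise ValueError("need at least two samples in the window")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A decrease means the counter restarted from zero, so the whole
        # current value counts as new increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical scrapes every 60 s; the counter resets before the fourth sample.
window = [(0, 100.0), (60, 160.0), (120, 220.0), (180, 10.0), (240, 70.0)]
print(simple_rate(window))
```

The reset handling is the reason rate() should only be applied to counters.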
Advanced Aggregation
Aggregation operators enable you to summarize data across multiple dimensions.
Aggregation with Grouping
You can aggregate metrics by one or more labels using the by or without clauses.
Example :- Sum the total HTTP requests by method :-
sum by (method) (http_requests_total)
Example :- Average the per-second CPU time across cores, keeping every label except cpu (the result is one series per instance and mode) :-
avg without (cpu) (rate(node_cpu_seconds_total[5m]))
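The difference between by and without can be sketched in a few lines of Python. This is an illustrative model only: the series are hypothetical label sets with a single current value, and sum stands in for any aggregation operator :-

```python
from collections import defaultdict


def aggregate_sum(series, by=None, without=None):
    """Sum values after grouping on the labels kept by `by` or left over by `without`."""
    groups = defaultdict(float)
    for labels, value in series:
        if by is not None:
            kept = {k: v for k, v in labels.items() if k in by}
        else:
            dropped = without or ()
            kept = {k: v for k, v in labels.items() if k not in dropped}
        # Series with identical remaining labels fall into the same group.
        groups[tuple(sorted(kept.items()))] += value
    return dict(groups)


# Hypothetical series of http_requests_total.
data = [
    ({"method": "GET", "instance": "a"}, 10.0),
    ({"method": "GET", "instance": "b"}, 5.0),
    ({"method": "POST", "instance": "a"}, 2.0),
]

by_method = aggregate_sum(data, by=["method"])
without_instance = aggregate_sum(data, without=["instance"])
print(by_method)
print(without_instance)
```

With only two labels both calls give the same groups. They diverge as soon as a new label appears: by drops it, while without keeps it.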
Topk and Bottomk
The topk and bottomk aggregation operators select the k series with the highest or lowest values.
Example :- Select the top 3 HTTP methods by request count :-
topk(3, sum by (method) (http_requests_total))
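As a rough illustration of the selection step, the following Python sketch applies the same idea to a dictionary of per-method totals. The numbers are invented, and the helper names simply mirror the PromQL operators :-

```python
import heapq

# Hypothetical per-method totals, as sum by (method) (...) might produce.
totals = {"GET": 1200.0, "POST": 300.0, "PUT": 45.0, "DELETE": 12.0, "HEAD": 80.0}


def topk(k, values):
    """Return the k (label, value) pairs with the largest values, largest first."""
    return heapq.nlargest(k, values.items(), key=lambda item: item[1])


def bottomk(k, values):
    """Return the k (label, value) pairs with the smallest values, smallest first."""
    return heapq.nsmallest(k, values.items(), key=lambda item: item[1])


print(topk(3, totals))
print(bottomk(2, totals))
```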
Writing Advanced Queries to Extract Meaningful Insights
Advanced queries combine selectors, functions and aggregations to derive deeper insights from your metrics.
Use Case :- CPU Usage Analysis
Query 1 :- Average CPU Usage per Node
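To calculate the average CPU usage per node as a fraction between 0 and 1, subtract the idle fraction from 1. Averaging the non-idle modes directly would average across modes as well as cores and understate the usage :-
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))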
Query 2 :- CPU Saturation
To find nodes with high CPU saturation (e.g. nodes where CPU usage is above 80%) :-
(1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8
Use Case :- Memory Usage Monitoring
Query 3 :- Memory Usage Percentage
To calculate the percentage of used memory :-
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
Query 4 :- Memory Usage Growth Rate
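To find how fast used memory has been growing over the last hour (in bytes per second), apply deriv() to the used-memory expression through a subquery. rate() is only meaningful for counters, and node_memory_MemTotal_bytes is a gauge that does not change :-
deriv((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)[1h:])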
Use Case :- Network Traffic Analysis
Query 5 :- Network Traffic per Interface
To calculate the total incoming network traffic per interface :-
sum by (instance, device) (rate(node_network_receive_bytes_total[5m]))
Query 6 :- Detecting Network Spikes
To identify network interfaces with a sudden increase in traffic (e.g. more than 100 MiB received in 5 minutes), use increase(), which returns the total growth over the window rather than a per-second rate :-
increase(node_network_receive_bytes_total[5m]) > 100 * 1024 * 1024
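A threshold on total bytes over a window and a threshold on a per-second rate differ by a factor of the window length, since increase() over a range equals rate() over the same range multiplied by the range in seconds. The short Python sketch below converts between the two for the 100 MiB / 5 minute figure used above. The constant and function names are illustrative only :-

```python
WINDOW_SECONDS = 5 * 60
THRESHOLD_BYTES = 100 * 1024 * 1024  # 100 MiB


def increase_from_rate(rate_per_second: float, window_seconds: float) -> float:
    """Total growth over the window implied by an average per-second rate."""
    return rate_per_second * window_seconds


# The per-second rate that corresponds to 100 MiB spread over 5 minutes.
equivalent_rate = THRESHOLD_BYTES / WINDOW_SECONDS
print(round(equivalent_rate))  # bytes per second
```

So a comparison of the per-second rate against 100 * 1024 * 1024 would only fire at 100 MiB per second, which is 300 times the intended threshold.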
Use Cases and Examples of Complex Queries
Use Case :- Service Latency Analysis
Query 7 :- 95th Percentile Latency
To calculate the 95th percentile of HTTP request latencies :-
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
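It helps to know how the estimate is produced: the quantile is located in a cumulative bucket and then linearly interpolated between that bucket's bounds. The Python sketch below is a simplified model of that idea, not the actual Prometheus code. It assumes 0 < q <= 1 and cumulative counts, and the bucket values are invented :-

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound `le`, cumulative count)


def histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile by linear interpolation inside the matching bucket."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical latency buckets in seconds.
buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]
print(histogram_quantile(0.95, buckets))  # approximately 0.8125
```

Because of the interpolation, the accuracy of the result depends on how closely the bucket boundaries bracket the latencies you care about.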
Query 8 :- High Latency Detection
To detect instances where the 95th percentile latency exceeds a threshold (e.g. 1 second) :-
histogram_quantile(0.95, sum by (instance, le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
Use Case :- Disk I/O Analysis
Query 9 :- Disk Read/Write Rates
To calculate the read and write rates for disk operations :-
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
Query 10 :- Disk Utilization
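To find disks with utilization above 90%, take the rate of node_disk_io_time_seconds_total. The metric is a counter of seconds spent doing I/O, so its per-second rate is the fraction of time the disk was busy :-
rate(node_disk_io_time_seconds_total[5m]) * 100 > 90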
Use Case :- Application Error Rate Monitoring
Query 11 :- Error Rate Calculation
To calculate the error rate of an application :-
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
Query 12 :- High Error Rate Detection
To detect when the error rate exceeds 5% :-
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100) > 5
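The arithmetic behind the ratio, including the no-traffic edge case, can be sketched in Python. This is an illustration with made-up rates. One assumption worth stating: in PromQL a float division of 0 by 0 yields NaN, and a comparison against NaN does not fire, which the sketch mirrors :-

```python
import math


def error_percentage(error_rate: float, total_rate: float) -> float:
    """Share of failing requests in percent; NaN when there is no traffic."""
    if total_rate == 0:
        # Mirrors PromQL, where 0 / 0 yields NaN rather than an error.
        return math.nan
    return error_rate / total_rate * 100


# Hypothetical per-second rates: 3 failing requests out of 120 in total.
pct = error_percentage(3.0, 120.0)
print(pct, pct > 5)

# With no traffic the result is NaN, and NaN > 5 is False, so nothing fires.
idle = error_percentage(0.0, 0.0)
print(idle, idle > 5)
```

A related case that the sketch does not model: if no series match status=~"5.." at all, the numerator is empty and the whole expression returns no result rather than 0.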
Performance Considerations and Optimization Tips
Advanced queries can be computationally expensive and impact Prometheus' performance. Here are some tips to optimize your PromQL queries :-
Use Range Vectors Judiciously
Avoid long time ranges :- Large range vectors (e.g. [1h] or more) can be resource-intensive. Use the smallest possible range that meets your needs.
Reduce resolution :- Prometheus does not downsample data itself, so for long-range queries consider a larger query step or downsampling in long-term storage to reduce query load.
Optimize Label Matching
Exact matches are faster :- Prefer exact matches (=) over regex matches (=~) for better performance.
Minimize label cardinality :- High cardinality (many unique label combinations) can slow down queries. Avoid putting high-cardinality values into labels, and write selectors that match as few series as possible.
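Apply rate() Before Aggregating
Order matters for counters :- Write sum(rate(...)) rather than rate(sum(...)). rate() has to see each individual series so that it can detect counter resets, and summing first hides those resets (rate(sum(...)) is not even valid PromQL without a subquery, because sum() returns an instant vector). Aggregating the per-series rates afterwards still reduces the amount of data that later steps have to process.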
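Why sum(rate(...)) is the right order becomes visible as soon as one instance restarts. The Python sketch below uses a simplified increase calculation with reset detection, and the two counter series are hypothetical :-

```python
def increase(values):
    """Total counter growth across consecutive samples, treating drops as resets."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


# Two hypothetical instances; instance_b restarts before its third sample.
instance_a = [100.0, 110.0, 120.0, 130.0]
instance_b = [500.0, 510.0, 5.0, 15.0]

# Correct order: per-series increase first, then sum (the sum(rate(...)) shape).
rate_then_sum = increase(instance_a) + increase(instance_b)

# Wrong order: sum the raw counters, then look for growth in the total.
summed = [a + b for a, b in zip(instance_a, instance_b)]
sum_then_rate = increase(summed)

print(rate_then_sum, sum_then_rate)
```

The real growth is 55 requests, but summing first makes the restart look like a reset of the whole total and reports 165.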
Use Recording Rules
Precompute frequently used queries :- Recording rules allow you to precompute and store the results of expensive queries, making them available as regular metrics.
Example Recording Rule :- To create a recording rule for average CPU usage :-
groups:
  - name: cpu_usage
    rules:
      - record: node:cpu_usage:avg1m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))
Shard and Scale Prometheus
Horizontal scaling :- Split scrape targets across multiple Prometheus servers to distribute the load, and use Prometheus federation to aggregate their data.
Vertical scaling :- Ensure your Prometheus server has sufficient CPU, memory and disk I/O capacity to handle your query load.
Profiling and Monitoring
Monitor Prometheus performance :- Use Prometheus to monitor itself, paying attention to metrics like prometheus_engine_query_duration_seconds and prometheus_tsdb_compaction_duration_seconds.
Profile slow queries :- Identify slow queries and optimize them. The Go pprof endpoints (localhost:9090/debug/pprof) profile the Prometheus server process as a whole rather than individual queries, so use them to see where CPU and memory go while the slow queries run.
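The self-monitoring metrics can also be pulled programmatically through the instant-query endpoint of the HTTP API (/api/v1/query with a query parameter). The sketch below only builds the request URL, so it runs without a server. The base URL and the helper name are assumptions, and the query computes an average evaluation time from the _sum and _count series of the summary :-

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the instant-query endpoint, URL-encoding the expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Average query evaluation time over the last 5 minutes.
query = (
    "rate(prometheus_engine_query_duration_seconds_sum[5m]) / "
    "rate(prometheus_engine_query_duration_seconds_count[5m])"
)

# Assumed address of a local Prometheus server.
url = instant_query_url("http://localhost:9090/", query)
print(url)

# Decoding the query string again returns the original expression unchanged.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

Fetching the URL with any HTTP client returns a JSON document whose data field holds the result.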
Conclusion
Mastering PromQL is essential for extracting meaningful insights from your Prometheus metrics. In this post we've explored advanced PromQL syntax and capabilities, demonstrated how to write complex queries, and provided tips for optimizing performance.
By leveraging these advanced techniques, you can gain deeper visibility into your systems, troubleshoot issues more effectively and make data-driven decisions to improve performance and reliability. Stay tuned for more insights and practical tips to enhance your monitoring setup.