In today's fast-paced tech environment, staying ahead of issues is critical to maintaining system reliability and performance. That means not just monitoring, but also building robust alerting and notification pipelines. Prometheus, with its powerful querying capabilities, and Grafana, with its intuitive visualization tools, make a formidable pair for this purpose. In this blog we will explore how to set up alerting rules in Prometheus, integrate these alerts with Grafana for visual monitoring, configure notifications via channels such as email and Slack, and walk through real-world examples of effective alerting strategies.
Setting Up Alerting Rules in Prometheus
Prometheus is a powerful open-source monitoring and alerting toolkit designed for reliability and scalability. Setting up alerting rules in Prometheus involves defining conditions under which alerts should be triggered.
Step 1: Define Alerting Rules
Alerting rules are defined in Prometheus configuration files. These rules specify the conditions that should trigger alerts.
Example: Basic CPU Usage Alert
Create or edit your Prometheus alerting rules file (e.g. alert.rules):
groups:
  - name: dev-alerts
    rules:
      - alert: HighCPUUsage
        expr: avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) < 0.2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected on instance {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 2 minutes."
alert: The name of the alert.
expr: The PromQL expression that defines the alert condition.
for: How long the condition must hold continuously before the alert fires.
labels: Additional metadata, such as severity, used for routing and grouping.
annotations: Human-readable details about the alert, such as a summary and description.
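The interaction between expr and for is easy to misread: the alert fires only once the condition has been continuously true for the whole duration, and any single evaluation where the condition breaks resets the timer. A small sketch of that pending-to-firing state machine (the evaluation samples are hypothetical, and this simplifies Prometheus's actual evaluation loop):

```python
# Sketch of Prometheus's pending -> firing behaviour for `for: 2m`.
# Each sample is (seconds_since_start, idle_cpu_fraction); the alert
# condition mirrors the rule above: idle < 0.2.
FOR_SECONDS = 120  # mirrors `for: 2m`

def alert_state(samples, threshold=0.2, for_seconds=FOR_SECONDS):
    """Return the final state: 'inactive', 'pending', or 'firing'."""
    pending_since = None
    state = "inactive"
    for t, idle in samples:
        if idle < threshold:          # condition is true
            if pending_since is None:
                pending_since = t     # enter pending
            state = "firing" if t - pending_since >= for_seconds else "pending"
        else:                         # condition broke: reset the timer
            pending_since = None
            state = "inactive"
    return state

# Idle CPU dips below 20% but recovers within 2 minutes: never fires.
print(alert_state([(0, 0.5), (60, 0.1), (120, 0.1), (180, 0.6)]))  # inactive
# Idle CPU stays below 20% for at least 2 minutes: fires.
print(alert_state([(0, 0.1), (60, 0.1), (120, 0.1)]))              # firing
```

This is why brief CPU spikes do not page anyone: the condition must hold for the full two minutes.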
Step 2: Load Alerting Rules
Make sure Prometheus loads the alerting rules by specifying them in the Prometheus configuration file (prometheus.yml):
rule_files:
  - "alert.rules"
Restart Prometheus to apply the changes.
systemctl restart prometheus
Step 3: Test Alerting Rules
Prometheus provides a web UI to test and view alerting rules. Access it at http://<prometheus-server>:9090/rules.
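The same information is exposed over HTTP at /api/v1/rules, which is handy for scripted checks. The sketch below parses a trimmed, illustrative example of the JSON that endpoint returns (the exact payload has more fields; fetching it over the network is left to curl or your HTTP client of choice):

```python
import json

# Trimmed, illustrative example of the JSON shape returned by
# GET http://<prometheus-server>:9090/api/v1/rules
payload = json.loads("""
{
  "status": "success",
  "data": {
    "groups": [
      {
        "name": "dev-alerts",
        "rules": [
          {"type": "alerting", "name": "HighCPUUsage", "state": "firing"},
          {"type": "alerting", "name": "HighDiskUsage", "state": "inactive"}
        ]
      }
    ]
  }
}
""")

def firing_alerts(payload):
    """Collect the names of alerting rules currently in the 'firing' state."""
    names = []
    for group in payload["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("type") == "alerting" and rule.get("state") == "firing":
                names.append(rule["name"])
    return names

print(firing_alerts(payload))  # ['HighCPUUsage']
```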
Integrating Alerting with Grafana for Visual Alerts
While Prometheus handles the backend alerting logic, Grafana provides a powerful and user-friendly interface to visualize and manage these alerts.
Step 1: Add Prometheus as a Data Source
Navigate to Grafana: Go to your Grafana instance.
Add Data Source: Go to Configuration > Data Sources > Add data source.
Select Prometheus: Choose Prometheus from the list of available data sources.
Configure Prometheus: Enter your Prometheus server URL (e.g. http://<prometheus-server>:9090) and save the configuration.
Step 2: Create a Dashboard Panel
Create Dashboard: Create a new dashboard or edit an existing one.
Add Panel: Add a new panel and configure it to use Prometheus as the data source.
Query Data: Use PromQL queries to fetch the data you want to monitor. For instance, to monitor CPU usage:
avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
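Note that this query returns the average idle CPU fraction per instance, so usage is its complement: an idle fraction below 0.2 corresponds to usage above 80%. The arithmetic, spelled out:

```python
# The PromQL query returns the average *idle* CPU fraction per instance.
# Usage is simply the complement, so the alert threshold idle < 0.2
# corresponds to usage > 80%.
def cpu_usage_percent(idle_fraction):
    return round((1.0 - idle_fraction) * 100.0, 1)

print(cpu_usage_percent(0.2))   # 80.0 -- right at the alert threshold
print(cpu_usage_percent(0.05))  # 95.0 -- well past it
```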
Step 3: Configure Grafana Alerts
Alert Tab: In the panel settings, go to the "Alert" tab.
Create Alert: Click "Create Alert" and define the alert conditions.
Set Conditions: Specify the alert conditions using PromQL queries. For example:
avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) < 0.2
Notification Channels: Configure where the alerts should be sent (email, Slack, etc.).
Save Panel: Save the panel to apply the alert configuration.
Example: CPU Usage Alert in Grafana
Here's an example of how to configure a CPU usage alert in Grafana:
Create a Panel: Add a new graph panel.
Prometheus Query: Use the query:
avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
Alert Configuration:
Create Alert: In the Alert tab, create a new alert.
Conditions: Set the condition to trigger an alert when the CPU idle time is below 20% for 2 minutes.
Notifications: Add notification channels like email or Slack.
Configuring Notifications via Email, Slack and Other Channels
Once alerts are set up, configuring notifications ensures that the right people are notified promptly.
Step 1: Set Up Alertmanager
Prometheus uses Alertmanager to handle notifications. Install and configure Alertmanager to manage alerts and notifications.
Install Alertmanager
Download and extract Alertmanager:
wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
tar xvfz alertmanager-0.21.0.linux-amd64.tar.gz
cd alertmanager-0.21.0.linux-amd64
Configure Alertmanager
Create a configuration file (alertmanager.yml):
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email-receiver'

receivers:
  - name: 'email-receiver'
    email_configs:
      - to: 'alerts@company.com'
        from: 'alertmanager@company.com'
        smarthost: 'smtp.company.com:587'
        auth_username: 'username'
        auth_password: 'securepassword'
Start Alertmanager:
./alertmanager --config.file=alertmanager.yml
Step 2: Configure Prometheus to Use Alertmanager
Update the Prometheus configuration (prometheus.yml) to point at Alertmanager:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'localhost:9093'
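With Prometheus pointing at Alertmanager, you can sanity-check the notification pipeline end to end by posting a synthetic alert directly to Alertmanager's v2 API. The sketch below only builds and prints the JSON body; the endpoint path and the list-of-alerts payload shape are from Alertmanager's v2 API, but treat the field set here as a minimal illustration rather than the full schema:

```python
import json
from datetime import datetime, timezone

def make_test_alert(alertname="TestAlert", severity="warning"):
    """Build the JSON body Alertmanager's v2 API expects: a list of
    alerts, each with labels, annotations and a start timestamp."""
    return [{
        "labels": {"alertname": alertname, "severity": severity},
        "annotations": {"summary": "Synthetic alert to verify routing"},
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }]

body = json.dumps(make_test_alert())
print(body)
# POST this body to http://localhost:9093/api/v2/alerts with
# Content-Type: application/json (e.g. via curl -d) and the configured
# receiver should get a notification.
```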
Step 3: Configure Notification Channels
Email Notifications
Ensure your Alertmanager configuration includes email settings as shown above.
Slack Notifications
To configure Slack notifications, modify alertmanager.yml:
receivers:
  - name: 'slack-receiver'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/your/slack/hook'
        channel: '#alerts'
        send_resolved: true
Restart Alertmanager to apply the changes.
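Alertmanager formats the Slack message itself, but it can be useful to verify the incoming webhook independently first. Slack's incoming webhooks accept a minimal JSON payload with a text field; the sketch below builds one (the hook URL in the config above is a placeholder, so substitute your own before sending):

```python
import json

def slack_message(text):
    """Build the minimal payload a Slack incoming webhook accepts."""
    return json.dumps({"text": text})

print(slack_message("Alertmanager Slack webhook test"))
# Send it with, e.g.:
#   curl -X POST -H 'Content-Type: application/json' \
#        -d '{"text": "Alertmanager Slack webhook test"}' \
#        https://hooks.slack.com/services/your/slack/hook
```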
Step 4: Add Notification Channels in Grafana
In Grafana, set up notification channels to integrate with Alertmanager, email, Slack and other services.
Adding Notification Channels
Navigate to Notification Channels: Go to Alerting > Notification Channels.
Add Channel: Click "New Channel" and select the type (Email, Slack, etc.).
Configure: Enter the necessary details like email addresses, Slack webhook URLs, etc.
Test: Test the notification channel to ensure it is working correctly.
Real-World Examples of Alerting Strategies
Effective alerting strategies help minimize downtime and ensure quick resolution of issues. Here are some real-world examples:
Example 1: E-commerce Website Monitoring
For an e-commerce website, monitoring the performance and availability of critical services is crucial.
Alert: High Error Rate on Login Service
Prometheus Rule:
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{job="login_service", status=~"5.."}[1m])) / sum(rate(http_requests_total{job="login_service"}[1m])) > 0.05
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "High error rate on login service"
    description: "More than 5% of login requests have returned 5xx errors for over 1 minute."
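An error-rate alert is easiest to reason about as a ratio: the per-second rate of failing requests divided by the per-second rate of all requests, compared against the 5% threshold. A quick arithmetic sketch with illustrative numbers:

```python
# Error rate as a ratio of per-second rates, matching a 5% threshold.
# The request rates below are illustrative numbers, not real metrics.
THRESHOLD = 0.05

def error_ratio(errors_per_sec, total_per_sec):
    """Fraction of requests that are failing."""
    return errors_per_sec / total_per_sec

print(error_ratio(6.0, 100.0) > THRESHOLD)  # True  -- 6% errors, alert fires
print(error_ratio(2.0, 100.0) > THRESHOLD)  # False -- 2% errors, no alert
```

Alerting on the ratio rather than the raw 5xx rate keeps the threshold meaningful as traffic grows or shrinks.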
Alertmanager Configuration:
receivers:
  - name: 'slack-receiver'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/your/slack/hook'
        channel: '#alerts'
        send_resolved: true
Example 2: Database Performance Monitoring
For a database system, monitoring query performance and resource utilization is essential.
Alert: High Query Latency
Prometheus Rule:
- alert: HighQueryLatency
  expr: histogram_quantile(0.95, sum(rate(query_duration_seconds_bucket[5m])) by (le)) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High query latency"
    description: "95th percentile query latency is above 1 second for more than 5 minutes."
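histogram_quantile estimates the quantile from cumulative le buckets: it finds the bucket containing the target rank, then interpolates linearly inside it. A simplified re-implementation of that idea (a sketch of the algorithm with made-up bucket counts, not the Prometheus source, and it skips Prometheus's special handling of the lowest bucket and NaN cases):

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets the way
    PromQL's histogram_quantile() does: locate the bucket containing the
    target rank and interpolate linearly within it.
    `buckets` is a sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            # Interpolate within this bucket.
            width = upper_bound - lower_bound
            frac = (rank - lower_count) / (count - lower_count)
            return lower_bound + width * frac
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]

# Made-up buckets for query_duration_seconds: 90 queries finished within
# 0.5s, 94 within 1s, all 100 within 2s (cumulative, as in Prometheus).
p95 = histogram_quantile(0.95, [(0.5, 90), (1.0, 94), (2.0, 100)])
print(p95)  # ~1.17 seconds: above 1s, so this alert would fire
```

The interpolation is why the estimate depends on how well your bucket boundaries bracket the latencies you care about.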
Alertmanager Configuration:
receivers:
  - name: 'email-receiver'
    email_configs:
      - to: 'dba-team@company.com'
        from: 'alertmanager@company.com'
        smarthost: 'smtp.company.com:587'
        auth_username: 'username'
        auth_password: 'securepassword'
Example 3: Infrastructure Monitoring
Monitoring the health of infrastructure components like servers and network devices is fundamental.
Alert: Disk Space Usage
Prometheus Rule:
- alert: HighDiskUsage
  expr: (node_filesystem_size_bytes{job="node", mountpoint="/"} - node_filesystem_free_bytes{job="node", mountpoint="/"}) / node_filesystem_size_bytes{job="node", mountpoint="/"} > 0.8
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "High disk space usage"
    description: "Disk usage is above 80% for more than 10 minutes on instance {{ $labels.instance }}."
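The expression is just (size - free) / size. You can compute the same ratio locally with Python's standard library to cross-check what the exporter reports (numbers may differ slightly, since node_filesystem_free_bytes includes blocks reserved for root):

```python
import shutil

def disk_usage_ratio(path="/"):
    """Mirror the rule's (size - free) / size ratio for a local mount."""
    usage = shutil.disk_usage(path)
    return (usage.total - usage.free) / usage.total

ratio = disk_usage_ratio("/")
print(f"root filesystem is {ratio:.1%} full; alert threshold is 80%")
print(ratio > 0.8)  # would this alert fire for the local machine?
```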
Alertmanager Configuration:
receivers:
  - name: 'email-receiver'
    email_configs:
      - to: 'sysadmin@company.com'
        from: 'alertmanager@company.com'
        smarthost: 'smtp.company.com:587'
        auth_username: 'username'
        auth_password: 'securepassword'
Conclusion
Setting up alerts with Prometheus and Grafana enables you to stay ahead of issues by proactively monitoring your systems and notifying the right people when problems arise. By defining clear alerting rules, integrating with Grafana for visualization, and configuring notifications via various channels, you can ensure that your team is always informed and ready to act. The examples provided demonstrate how to apply these concepts to real-world scenarios, helping you design effective alerting strategies tailored to your specific needs.
By leveraging the capabilities of Prometheus and Grafana, you can enhance the reliability and performance of your systems, minimizing downtime and improving user satisfaction. Start implementing these strategies today to take your monitoring and alerting to the next level.