A Guide to Unit Testing Prometheus Alerts
Alerting systems are an indispensable component of any robust monitoring setup. They act as the first line of defense, promptly notifying you of system anomalies that need immediate attention, and Prometheus alerting is the most widely used alerting system in the Kubernetes ecosystem. Making sure your alerts work as expected is therefore critical to the health of your monitoring system.
In this article, we will go through the basics of unit testing Prometheus alerts and look at a few caveats of unit testing along the way.
Prerequisites
- promtool (bundled with Prometheus)
Unit Testing of Prometheus Alerts
Let’s create the alert rule below in a file called grafana-alert.yaml. It fires when the Grafana target has been reported as down (up == 0) for 15 minutes:
groups:
  - name: Grafana
    rules:
      - alert: GrafanaDown
        annotations:
          summary: 'Grafana is missing from service discovery'
          description: 'Grafana in {{ $labels.cluster }} is down'
        expr: up{job="grafana",cluster="lab"} == 0
        for: 15m
        labels:
          severity: warning
          team: devops
Now, we will write a unit test for the above alert in a file called grafana-alert-test.yaml:
rule_files:
  - grafana-alert.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="grafana",cluster="lab"}'
        values: '0x15 1 1'
    alert_rule_test:
      - alertname: GrafanaDown
        eval_time: 14m
        exp_alerts: []
      - alertname: GrafanaDown
        eval_time: 15m
        exp_alerts:
          - exp_labels:
              severity: warning
              team: devops
              job: grafana
              cluster: lab
            exp_annotations:
              description: 'Grafana in lab is down'
              summary: 'Grafana is missing from service discovery'
Understanding the Unit Test Config
- interval: The interval at which the input samples are spaced and the rules are evaluated for this test group. The default is 1m.
- input_series: The time series used as input for the test. In our case, we use the up metric with the job label set to grafana and the cluster label set to lab.
- values: The values of the input series. Here we set them to '0x15 1 1', which means the series is 0 at every sample from 0m through 15m (16 samples at a 1m interval) and then 1 at 16m and 17m.
- alert_rule_test: The actual test cases. Each case has the following fields:
  - alertname: The name of the alert being tested.
  - eval_time: The time at which the alert is evaluated. In our case, we evaluate the alert at 14m and at 15m.
  - exp_alerts: The alerts expected to be firing at eval_time. Since the alert's for field is set to 15m, the alert should not yet fire at 14m, so exp_alerts is set to [].
  - exp_labels and exp_annotations: The labels and annotations expected on the firing alert. At 15m the alert should fire, so exp_alerts lists the expected labels and annotations.
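To make the behaviour after recovery explicit, you could also assert that the alert clears once the series returns to 1. A minimal sketch of a third test case, reusing the input series above:

alert_rule_test:
  # ... the two cases shown above ...
  - alertname: GrafanaDown
    eval_time: 17m
    # By 17m the series has been back at 1 for two samples (16m and 17m),
    # so up == 0 no longer matches and no alerts are expected.
    exp_alerts: []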
Specifying the Input Series Values
Using Shorthand
In our example, we specified the values as '0x15 1 1'. Here, 0x15 is shorthand for 0+0x15, which expands to sixteen zeros (the initial value plus 15 further samples), so the full series is sixteen zeros followed by 1 1.
The syntax is initial_value+increment x count: the series starts at initial_value and is followed by count further samples, each changed by increment.
For instance, to specify values starting at 10 and going up to 50 with an increment of 5, use 10+5x8.
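As an illustration, here is how that shorthand could sit inside a test's input_series; the metric name is made up for this example:

input_series:
  # Hypothetical metric, used only to illustrate the shorthand.
  - series: 'queue_depth{job="worker",cluster="lab"}'
    # '10+5x8' expands to '10 15 20 25 30 35 40 45 50':
    # the initial value 10 plus 8 further samples, each incremented by 5.
    values: '10+5x8'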
Stale and Missing Samples
To represent missing and stale samples, use _ for a missing sample and stale for a stale sample (see the sketch after the table below).
Examples:
| Shorthand | Expanded |
|---------------|------------------|
| 1+2x3 | 1 3 5 7 |
| 0+10x100 | 0 10 20 30 … 1000 |
| 1x10 | 1 1 1 … (11 values) |
| 3+0x3 2+0x3 | 3 3 3 3 2 2 2 2 |
| 1 _x3 | 1 _ _ _ |
| 2 stale 5+2x3 | 2 stale 5 7 9 11 |
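A minimal sketch of how these markers can be mixed with regular values in input_series; the metric name is hypothetical:

input_series:
  # Hypothetical metric: two scrapes with no sample ('_') followed by an
  # explicit stale marker, then normal samples again.
  - series: 'http_requests_total{job="api",cluster="lab"}'
    values: '5 10 15 _ _ stale 20 25'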
Running the Test
Execute the test using promtool:

promtool test rules grafana-alert-test.yaml

Unit Testing: grafana-alert.yaml
  SUCCESS
Debugging Failed Tests
If a test fails, the output may look like:
FAILED:
    alertname: GrafanaDown, time: 15m,
        exp:[
            0:
              Labels:{alertname="GrafanaDown", cluster="lab", severity="warning", team="devops"}
              Annotations:{description="Grafana in lab is down", summary="Grafana is missing from service discovery"}
            ],
        got:[
            0:
              Labels:{alertname="GrafanaDown", cluster="lab", job="grafana", severity="warning", team="devops"}
              Annotations:{description="Grafana in lab is down", summary="Grafana is missing from service discovery"}
            ]
A failure occurs when the alert that actually fires carries labels or annotations that differ from the expected ones, or when expected labels are missing. In this case, the failure is due to the job label being present on the fired alert but not listed in the exp_labels section of the test.
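The fix here is to make exp_labels list every label the fired alert actually carries. A minimal sketch of the corrected expectation:

exp_alerts:
  - exp_labels:
      severity: warning
      team: devops
      cluster: lab
      # The job label comes from the input series and must be listed too,
      # otherwise promtool reports the mismatch shown above.
      job: grafana
    exp_annotations:
      description: 'Grafana in lab is down'
      summary: 'Grafana is missing from service discovery'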
Types of Alerts to Test
- Critical Alerts (P1): Alerts that directly impact business operations or point to a potential service outage should be tested end to end. For instance, critical-severity alerts triggered by low disk space on critical servers, high error rates on web services, critical certificate rotations, or database connection limits should undergo unit testing to ensure the condition is detected correctly. A simple rule of thumb: every alert that is sent to PagerDuty should have unit tests.
- Complex Logic Alerts: Alerts with complex logic or dependencies on multiple metrics are prime candidates. For example, imagine you run an e-commerce app and have an alert that triggers when a surge in abandoned carts coincides with a rise in the payment failure rate. Testing this alert well is important because the expression joins metrics from different services (see the sketch after this list).
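As a sketch of what such an alert might look like, the rule below joins a cart-abandonment ratio with a payment-failure ratio. The metric names and thresholds are made-up assumptions for illustration, not part of the original example:

groups:
  - name: Checkout
    rules:
      - alert: CheckoutFunnelDegraded
        # Hypothetical metrics: fire only when both ratios are elevated at once.
        expr: |
          (
            sum(rate(cart_abandoned_total[15m])) / sum(rate(cart_created_total[15m])) > 0.30
          )
          and
          (
            sum(rate(payment_failures_total[15m])) / sum(rate(payment_attempts_total[15m])) > 0.05
          )
        for: 10m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: 'Abandoned-cart surge coincides with payment failures'

A unit test for such a rule would supply input_series for all four metrics and assert that the alert fires only when both ratios are above their thresholds.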
Alerts Not Ideal for Unit Testing
- Low Impact Alerts: Alerts that have minimal business impact or serve a purely informational purpose, but are still valuable enough to keep, may not justify the overhead of unit testing.
- Dynamic Environment Alerts: Alerts in dynamically changing environments, where infrastructure and metrics fluctuate frequently, are hard to unit test. Such setups are already discouraged because of metric cardinality issues, and keeping unit tests for these alerts reliable becomes cumbersome and may not yield much benefit.
Notable Points
- While unit testing is valuable, debugging failed tests can be challenging, requiring manual intervention, especially at scale.
- Updates to alerts, even minor label changes, necessitate corresponding updates to unit tests.
- If you have hundreds of alerts, it will be difficult to write unit tests for all of them and maintain them.
- SaaS providers like Grafana Cloud, Coralogix, New Relic, etc. offer alerting as a service. They run their own alerting engines, which are compatible with and based on Prometheus but are not Prometheus itself. So, if you use one of these services and want unit tests for your alerts, you need to keep the alert definitions in Git and run the unit tests yourself via CI or other means (see the sketch after this list).
- Unit testing is not widely adopted in the Prometheus community, so finding help or examples when you are stuck can be challenging.
- Unit tests are only as realistic as their inputs: a test might pass while the alert never fires in production because one of the labels it relies on, such as cluster, is not actually present in Prometheus. Running tests against real metrics can therefore be a more dependable complement.
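For the CI route, here is a minimal sketch of a GitHub Actions job that runs promtool against the test files in a repository. The workflow layout, file paths, and Prometheus version are illustrative assumptions; any CI system that can download and run promtool works the same way:

name: alert-unit-tests
on: [pull_request]
jobs:
  promtool:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install promtool
        # The release version and archive layout are assumptions; pin the
        # version you actually use.
        run: |
          curl -sL https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz \
            | tar xzf - --strip-components=1 prometheus-2.53.0.linux-amd64/promtool
      - name: Run alert unit tests
        run: ./promtool test rules alerts/*-test.yaml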