Kernel error metrics have overlapping definitions
Closed, Resolved · Public

Description

The Prometheus metrics introduced in T376719 ("alerting: detect if a kernel had a panic") to detect kernel panics and other kernel errors can overlap. For example, a kernel error logged with priority=err and message=taint will increment both kernel_dmesg_err_priority and kernel_dmesg_taint.

Similarly, a message can have priority=err and also contain the word warning, incrementing both kernel_dmesg_err_priority and kernel_dmesg_warning.

The overlap also makes it harder to define alerting rules in Alertmanager without triggering two alerts for a single error.

We should define metrics that each identify only one specific type of error; then we can tune the alerts based on those metrics.

We can also use Prometheus labels to categorize the different types of messages.
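To illustrate the idea, here is a minimal Python sketch of label-based counting, not the actual exporter script. It assigns every kernel message exactly one category, so a single message increments a single time series. The metric name `kernel_messages_total`, the label names `priority` and `category`, the keyword list, and the first-match-wins ordering are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical keyword order: the first match wins, so each message
# ends up in exactly one category.
CATEGORIES = ("panic", "taint", "warning")


def classify(message):
    """Return a single category for a kernel message."""
    text = message.lower()
    for keyword in CATEGORIES:
        if keyword in text:
            return keyword
    return "other"


def count_messages(records):
    """Count (priority, message) records by (priority, category)."""
    counts = Counter()
    for priority, message in records:
        counts[(priority, classify(message))] += 1
    return counts


def render(counts, metric="kernel_messages_total"):
    """Render the counts in the Prometheus text exposition format."""
    lines = [f"# TYPE {metric} counter"]
    for (priority, category), value in sorted(counts.items()):
        lines.append(
            f'{metric}{{priority="{priority}",category="{category}"}} {value}'
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        ("err", "Disabling lock debugging due to kernel taint"),
        ("err", "WARNING: CPU: 0 PID: 1 at example"),
        ("warning", "some other message"),
    ]
    print(render(count_messages(sample)))
```

With this layout an alert can select on one label combination, for example `priority="err"` together with `category="taint"`, and a message that has priority=err and mentions a warning is counted once, under `category="warning"`.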

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change #1088602 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] prometheus-node-kernel-panic: use prom labels

https://gerrit.wikimedia.org/r/1088602

Change #1108091 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] prometheus-node-kernel-panic: rename to "messages"

https://gerrit.wikimedia.org/r/1108091

fnegri triaged this task as Low priority. · Jan 3 2025, 6:17 PM
fnegri changed the task status from Open to In Progress. · Jan 3 2025, 6:32 PM

Change #1113498 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] prometheus-node-kernel-panic: remove "absent" lines

https://gerrit.wikimedia.org/r/1113498

Change #1113508 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/alerts@master] wmcs: update kernel alerts

https://gerrit.wikimedia.org/r/1113508

Change #1088602 merged by FNegri:

[operations/puppet@production] prometheus-node-kernel-panic: use prom labels

https://gerrit.wikimedia.org/r/1088602

Change #1108091 merged by FNegri:

[operations/puppet@production] prometheus-node-kernel-panic: rename to "messages"

https://gerrit.wikimedia.org/r/1108091

Change #1113814 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] base::cloud_production: fix dep name

https://gerrit.wikimedia.org/r/1113814

Change #1113814 merged by FNegri:

[operations/puppet@production] prometheus::node_kernel_messages: fix timer params

https://gerrit.wikimedia.org/r/1113814

Change #1113508 merged by jenkins-bot:

[operations/alerts@master] wmcs: update kernel alerts

https://gerrit.wikimedia.org/r/1113508

Change #1113498 merged by FNegri:

[operations/puppet@production] prometheus-node-kernel-panic: remove "absent" lines

https://gerrit.wikimedia.org/r/1113498

Mentioned in SAL (#wikimedia-cloud) [2025-01-23T15:49:32Z] <dhinus> cumin 'P:base::cloud_production' 'rm /var/lib/prometheus/node.d/kernel-panic.prom' T382961

fnegri moved this task from In progress to Done on the cloud-services-team (FY2024/2025-Q3-Q4) board.

This is all done.