Kernel alerts disappear too quickly
Closed, Resolved, Public

Description

Currently the "Kernel panic", "Kernel warning", etc. alerts introduced in T376719 fire for 30 minutes after the kernel log message is detected, then they disappear.

It would be useful to show them in the alert dashboard for longer, given that they could be low-frequency alerts that only happen, e.g., once a week, and we don't want to miss them.

See also the discussion in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1088539

Event Timeline

fnegri changed the task status from Open to In Progress. Nov 8 2024, 3:21 PM
fnegri claimed this task.
fnegri triaged this task as Medium priority.

Change #1088585 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/alerts@master] team-wmcs: aggregate kernel alerts over 24h

https://gerrit.wikimedia.org/r/1088585

Change #1088602 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] prometheus-node-kernel-panic: refactor and improve

https://gerrit.wikimedia.org/r/1088602

A common pattern for this type of check (e.g. counting the number of errors in a log) is to also track the date of each error and keep a 'timestamp' of acknowledgement: errors older than that ack are not counted, so only "new" ones trigger the alert.

This would mean that we have to manually do some kind of ack on the nodes (a touch of a file might be enough, probably in a small shell wrapper).

This is similar to the ceph crash entries, where you have to ack them for them to stop being counted as 'new' (and putting the cluster in warning status).
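To make the idea concrete, here is a rough sketch of the counting logic, assuming the ack is just the mtime of a file that gets touched on the node. The function names and the ack file path are hypothetical, nothing like this exists in the current check, and parsing the timestamps out of the kernel log is left out.

```python
import os
from typing import Iterable, Optional


def read_ack_time(ack_path: str) -> Optional[float]:
    """Return the mtime of the ack file, or None if nothing was ever acked.

    The ack file is a hypothetical marker: `touch`ing it acknowledges
    every error logged up to that moment.
    """
    try:
        return os.path.getmtime(ack_path)
    except FileNotFoundError:
        return None


def count_unacked(error_times: Iterable[float], ack_time: Optional[float]) -> int:
    """Count errors logged strictly after the last acknowledgement.

    error_times are Unix timestamps of the detected log messages.
    With no ack at all, every error counts as new.
    """
    if ack_time is None:
        return sum(1 for _ in error_times)
    return sum(1 for t in error_times if t > ack_time)
```

For example, `count_unacked(timestamps, read_ack_time("/var/lib/kernel-errors.ack"))` (path made up for illustration) would be the value exported as the metric, and touching that file would bring it back to zero until the next error is logged.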

@dcaro I like this idea. How does it work for the ceph crash entries? Did you write some custom script to check if errors have been "acknowledged"?

I think relying on the auto-created phab task could also work: resolving the phab task would be equivalent to "acknowledging" the alert, and if a new error is logged, the alert would trigger again and a new phab task would be created. Note: at the moment no phab task is created, but one would be created if we merge my patch above.

Change #1088585 merged by jenkins-bot:

[operations/alerts@master] team-wmcs: aggregate kernel alerts over 24h

https://gerrit.wikimedia.org/r/1088585

Change #1108091 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] promtheus-node-kernel-panic: rename to "messages"

https://gerrit.wikimedia.org/r/1108091

fnegri moved this task from In progress to Done on the cloud-services-team (FY2024/2025-Q1-Q2) board.

I'm closing this as Resolved, as the problem of alerts disappearing too quickly was solved by https://gerrit.wikimedia.org/r/1088585.

I created T382961: Kernel error metrics have overlapping definitions to track further improvements to these metrics.

There's also T380960: kernel error detector: have a way to ignore certain messages to track the idea of ignoring harmless errors.