
how to deal with cumin alias alerts
Open, Medium, Public

Description

We have the cumin aliases file (profile/templates/cumin/aliases.yaml.erb), and there is a cumin-check-aliases script that checks it for inconsistencies.

This used to run as a cron job and emailed us when it failed, including details of which alias was problematic.

Recently this was switched from a cron job to a systemd timer (part of general work to replace all crons).

So now there is a cumin-check-aliases.service unit on the cumin servers, and if it finds an issue with an alias, the unit ends up in the "failed" state.

This then triggers an Icinga alert that the entire systemd state is CRIT because there is a failed unit. For example: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cumin1001&service=Check+systemd+state

So I have a few questions:

  • Should this send email as it did before when it was cron?
  • Or is it totally fine that it doesn't, but shows up as a systemd alert on Icinga? (We would then check with "systemctl list-units --state=failed", see that it is the alias check, and get the failing alias from the status of that unit.)
  • Or should finding a broken alias not mean that the script exits with a non-zero exit code?
  • How much do we need these checks in the first place?
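
To make the exit-code question more concrete, here is a minimal sketch of separating "a problem was found" from "the unit fails". It is not taken from the actual cumin-check-aliases script; the function names and the `strict` switch are hypothetical and only illustrate the two options.

```python
import sys


def exit_code_for(problems, strict=True):
    """Return the exit code the check would use for a list of problems.

    strict=True: any problem gives a non-zero exit code, so the systemd
    unit is marked "failed" (the behaviour described in this task).
    strict=False: problems are still reported, but the exit code is 0
    and the unit stays OK.
    """
    if problems and strict:
        return 1
    return 0


def report(problems, stream=sys.stderr):
    """Print each problem on its own line so it shows up in the unit's log."""
    for problem in problems:
        print(problem, file=stream)


# Hypothetical usage at the end of the check script:
#   report(problems)
#   sys.exit(exit_code_for(problems, strict=False))
```

With strict=False the information would only be in the journal, so this option only makes sense together with some other notification (email or a dedicated alert).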

And also, the current example that fails is:

Nov 20 09:58:17 cumin1001 check-cumin-aliases[30663]: DC aliases do not cover all hosts: mwlog2002.codfw.wmnet

mwlog2002 is a new host that was created today, so it doesn't have the production role yet and is still on the "insetup" role.

That's why "DC aliases do not cover all hosts".

That in itself is a temporary situation and kind of a false positive in the first place.

So the ticket is maybe also about how to avoid those.
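
One possible way to avoid this kind of false positive is to leave hosts that are still on the "insetup" role out of the coverage check. A rough sketch, assuming the check can get the three host sets from somewhere; the function name, its parameters, and the example1001 host are made up for illustration and are not how the real script works.

```python
def uncovered_hosts(all_hosts, dc_alias_hosts, insetup_hosts=frozenset()):
    """Return the sorted hosts that no DC alias covers.

    Hosts listed in insetup_hosts are ignored, on the assumption that
    they are still being set up and will get their real role later.
    """
    return sorted(set(all_hosts) - set(dc_alias_hosts) - set(insetup_hosts))


# Illustrative data only; example1001 is a placeholder host name.
all_hosts = {"example1001.eqiad.wmnet", "mwlog2002.codfw.wmnet"}
dc_alias_hosts = {"example1001.eqiad.wmnet"}

# Without the exclusion the new host is reported, as in the log line above.
missing = uncovered_hosts(all_hosts, dc_alias_hosts)

# With the exclusion the check has nothing to report.
missing_ignoring_insetup = uncovered_hosts(
    all_hosts, dc_alias_hosts, insetup_hosts={"mwlog2002.codfw.wmnet"}
)
```

The downside is that a host forgotten in "insetup" would never be flagged by this check, so that would need to be caught some other way.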

cumin1001 - CRITICAL - degraded: The system is operational but one or more units failed.

^ Worded this way, it looks much worse than it actually is.

Event Timeline

Dzahn created this task. Fri, Nov 20, 10:25 PM
Restricted Application added a subscriber: Aklapper. Fri, Nov 20, 10:25 PM
Dzahn triaged this task as Medium priority. Fri, Nov 20, 10:25 PM
Dzahn updated the task description.

Mentioned in SAL (#wikimedia-operations) [2020-11-20T22:30:19Z] <mutante> cumin1001 - sudo systemctl start cumin-check-aliases -> <+icinga-wm> RECOVERY - Check systemd state on cumin1001 is OK T268369