We have the cumin aliases file (profile/templates/cumin/aliases.yaml.erb), and there is a cumin-check-aliases script that checks it for inconsistencies.
This used to run as a cron job and send us email when it failed, along with details about which alias was problematic.
Recently this was switched from a cron job to a systemd timer (part of general work to replace all crons).
So now there is a cumin-check-aliases.service unit on the cumin servers, and if the check finds an issue with an alias, that unit ends up in the "failed" state.
This in turn triggers Icinga alerting that the entire systemd state is CRIT because there is a failed unit. For example: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cumin1001&service=Check+systemd+state
So I have a few questions:
- Should this send email as it did before, when it was a cron job?
- Or is it totally fine that it doesn't, and that it just shows up as a systemd alert on Icinga? (That makes us check with "systemctl list-units --state=failed", where we see it's the alias check, and the status of that unit then tells us which alias failed.)
- Or... should finding a broken alias not mean that the script exits with a non-zero exit code? (See the sketch after this list.)
- How much do we need these checks in the first place?
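On the exit-code question, here is a rough sketch of one possible split: only hard alias errors make the script exit non-zero (and thus fail the unit), while coverage gaps are merely reported. This is hypothetical example code, not the current check-cumin-aliases implementation, and the toy data is made up:

```
#!/usr/bin/env python3
"""Hypothetical sketch: only hard alias errors produce a non-zero exit."""
import sys


def check_aliases(aliases, all_hosts):
    """Return (errors, warnings) for an alias -> set-of-hosts mapping."""
    errors = []
    warnings = []

    # Hard error: an alias that matches no hosts at all is almost certainly broken.
    for name, hosts in aliases.items():
        if not hosts:
            errors.append(f"alias '{name}' matches no hosts")

    # Soft issue: hosts not covered by any alias (e.g. brand-new "insetup" hosts);
    # report them, but don't necessarily fail the unit over it.
    covered = set().union(*aliases.values())
    uncovered = sorted(all_hosts - covered)
    if uncovered:
        warnings.append("DC aliases do not cover all hosts: " + ",".join(uncovered))

    return errors, warnings


if __name__ == "__main__":
    # Toy data standing in for the parsed aliases.yaml and the full host list.
    aliases = {"eqiad": {"mwlog1002.eqiad.wmnet"}, "codfw": {"mw2001.codfw.wmnet"}}
    all_hosts = {"mwlog1002.eqiad.wmnet", "mw2001.codfw.wmnet", "mwlog2002.codfw.wmnet"}

    errors, warnings = check_aliases(aliases, all_hosts)
    for line in warnings:
        print(f"WARNING: {line}")
    for line in errors:
        print(f"ERROR: {line}", file=sys.stderr)

    # Exit non-zero (failing the systemd unit, and thus alerting) only for hard errors.
    sys.exit(1 if errors else 0)
```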
And also, the current example that fails is:
Nov 20 09:58:17 cumin1001 check-cumin-aliases[30663]: DC aliases do not cover all hosts: mwlog2002.codfw.wmnet
mwlog2002 is a new host that was created today; it doesn't have its production role yet and is still on the "insetup" role.
That's why "DC aliases do not cover all hosts".
That in itself is a temporary situation and kind of a false positive in the first place.
So the ticket is maybe also about how to avoid those false positives.
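One possible way to avoid this class of false positive would be for the coverage check to ignore hosts that are still on the "insetup" role. A minimal sketch of the idea, assuming the check could somehow obtain the set of insetup hosts (how exactly, e.g. via PuppetDB, is not decided here):

```
def uncovered_hosts(all_hosts, dc_alias_hosts, insetup_hosts):
    """Hosts not matched by any DC alias, ignoring hosts still being set up.

    all_hosts:      every host known to cumin
    dc_alias_hosts: union of hosts matched by the per-DC aliases
    insetup_hosts:  hosts still on the "insetup" role (hypothetical input; the
                    check would need some way to look these up)
    """
    return sorted((all_hosts - dc_alias_hosts) - insetup_hosts)


# The case from this task: mwlog2002 is brand new and still "insetup",
# so it would no longer show up as a coverage gap.
all_hosts = {"mwlog1002.eqiad.wmnet", "mwlog2002.codfw.wmnet"}
dc_alias_hosts = {"mwlog1002.eqiad.wmnet"}
insetup_hosts = {"mwlog2002.codfw.wmnet"}
print(uncovered_hosts(all_hosts, dc_alias_hosts, insetup_hosts))  # -> []
```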
cumin1001 - CRITICAL - degraded: The system is operational but one or more units failed.
^ Presented this way, it looks much worse than it actually is.