
how to deal with cumin alias alerts
Closed, Declined · Public

Description

We have the cumin aliases (profile/templates/cumin/aliases.yaml.erb) file and then there is a cumin-check-aliases script which checks it for inconsistencies.
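For context, the check can also be run by hand on a cumin host. A minimal sketch, assuming the script is on the PATH and takes no arguments (the script name is taken from the journal output quoted further down; the exact path and options are assumptions):

$ sudo check-cumin-aliases
$ echo $?   # non-zero when an alias is inconsistent, which is what fails the systemd unit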

This used to run as a cron job that emailed us when it failed, including details about which alias was problematic.

Recently this was switched from a cron job to a systemd timer (as part of the general work to replace all cron jobs).

So now there is a cumin-check-aliases.service unit on the cumin servers, and if the check finds an issue with an alias, the unit ends up in the "failed" state.

This then triggers an Icinga alert that the entire systemd state is CRIT because there is a failed unit. For example: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cumin1001&service=Check+systemd+state
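That happens because the Icinga check looks at the overall systemd manager state, which drops to "degraded" as soon as any single unit has failed. Roughly (the unit description in the output is illustrative):

$ systemctl is-system-running
degraded

$ systemctl --failed --no-legend
cumin-check-aliases.service loaded failed failed Check Cumin aliases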

So I have a few questions:

  • Should this send email as it did before when it was cron?
  • Or is it totally fine that it doesn't, and only shows up as a systemd alert in Icinga? (This makes us check with "systemctl list-units --state=failed", where we see it's the alias check, and the status of that unit tells us which alias failed; see the sketch after this list.)
  • Or... should finding a broken alias not mean that the script exits with a non-zero exit code?
  • How much do we need these checks in the first place?
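For illustration, the triage mentioned in the second bullet would look roughly like this (the unit name is from this task; the output shape is an assumption):

$ systemctl list-units --state=failed
UNIT                         LOAD   ACTIVE SUB    DESCRIPTION
cumin-check-aliases.service  loaded failed failed Check Cumin aliases

$ systemctl status cumin-check-aliases.service
# the journal lines in the status output name the broken alias,
# e.g. the "DC aliases do not cover all hosts" message quoted below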

And also, the current example that fails is:

Nov 20 09:58:17 cumin1001 check-cumin-aliases[30663]: DC aliases do not cover all hosts: mwlog2002.codfw.wmnet

mwlog2002 is a new host that was created today; it doesn't have its production role yet and is still on the "insetup" role.

That's why "DC aliases do not cover all hosts".

That in itself is a temporary situation and kind of a false positive in the first place.

So the ticket is maybe also about how to avoid those.

cumin1001 - CRITICAL - degraded: The system is operational but one or more units failed.

^ Presented like this, it looks much worse than it actually is.

Event Timeline

Dzahn triaged this task as Medium priority. Nov 20 2020, 10:25 PM
Dzahn updated the task description.

Mentioned in SAL (#wikimedia-operations) [2020-11-20T22:30:19Z] <mutante> cumin1001 - sudo systemctl start cumin-check-aliases -> <+icinga-wm> RECOVERY - Check systemd state on cumin1001 is OK T268369
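For the record, there are two ways to clear such an alert by hand: "start" re-runs the check (and recovers only if the aliases are actually consistent again), while "reset-failed" merely clears the failed state without re-running anything:

$ sudo systemctl start cumin-check-aliases
$ sudo systemctl reset-failed cumin-check-aliases.service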

@Volans any idea on how we could potentially reduce the "false positives" of this alert? We got 7 occurrences in the last 30 days that apparently weren't actionable.

@Vgutierrez could you please elaborate on the non-actionable part?

The original statement about the insetup role is not correct; insetup hosts are managed by puppet and covered by the DC aliases like any other host:

$ sudo cumin 'A:insetup'
94 hosts will be targeted:
...

$ sudo cumin 'A:insetup and A:codfw'
52 hosts will be targeted:
...

From the last email:

Alias cloudbackup matched 0 hosts
Alias durum-esams matched 0 hosts
DC aliases do not cover all hosts: flink-zk2001.codfw.wmnet,kubernetes2026.codfw.wmnet
  1. cloudbackup: the role has been renamed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/948228/6/manifests/site.pp and the reference in the aliases file in puppet was not updated accordingly; a git grep would have prevented skipping it (see the sketch after this list).
  2. durum-esams: there are currently no durum hosts in esams; either they were not re-created after the knams migration and should have been, or it was decided not to have them there. Until they are created, the alias should be dropped from puppet.
  3. flink-zk2001.codfw.wmnet is a host that is partially installed. It's reachable via install_console but not fully puppetized (see https://puppetboard.wikimedia.org/node/flink-zk2001.codfw.wmnet and T341792) and it should be fixed.
  4. kubernetes2026.codfw.wmnet: same issue as above, the host is not fully installed (see https://puppetboard.wikimedia.org/node/kubernetes2026.codfw.wmnet and T342534) and it should be fixed.
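As a sketch of that kind of pre-rename check (the search term is illustrative and the file paths are the ones mentioned in this task):

$ cd operations/puppet
$ git grep -n 'cloudbackup' -- manifests/site.pp profile/templates/cumin/aliases.yaml.erb
# any remaining hit in aliases.yaml.erb after a role rename means the
# alias definition needs to be updated as well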

All of the above seems actionable to me.

yeah.. clearly I didn't phrase that properly, I was saying it from the PoV of Clinic Duty.

Considering your feedback about the task itself, I'm gonna be bold and close it; feel free to re-open it if needed.

> yeah.. clearly I didn't phrase that properly, I was saying it from the PoV of Clinic Duty.

From the PoV of Clinic Duty I think that the action should be to ping the host/alias owner and ask them to fix it, unless it is so trivial that it takes less time to fix it than to ping other people ;)

> durum-esams

The hosts were not provisioned in esams, but I am fixing that by provisioning them, so the durum alerts should go away. Thanks for the task update!

> From the PoV of Clinic Duty I think that the action should be to ping the host/alias owner and ask them to fix it, unless it is so trivial that it takes less time to fix it than to ping other people ;)

Agreed.