
Puppetmaster volatile data not synced to all puppet frontends for a month and a half (2024-04-27 to 2024-06-10)
Open, Medium, Public

Description

@kostajh asked on #wikimedia-sre-foundations IRC to confirm that the new GeoLite2 files were available, as part of work on T366272. cdanis began investigating and discovered that the files were missing on most hosts in codfw where they were expected to exist: P64540

Full writeup at https://wikitech.wikimedia.org/wiki/Incidents/2024-06-10_puppet_volatile_data_broken_sync

Event Timeline

Change #1041217 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] enable monitoring+logging for puppetmaster syncs

https://gerrit.wikimedia.org/r/1041217

Change #1041217 merged by CDanis:

[operations/puppet@production] enable monitoring+logging for puppetmaster syncs

https://gerrit.wikimedia.org/r/1041217

Change #1041760 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] puppetserver syncs: also add monitoring + timeout

https://gerrit.wikimedia.org/r/1041760

Change #1041760 merged by CDanis:

[operations/puppet@production] puppetserver syncs: also add monitoring + timeout

https://gerrit.wikimedia.org/r/1041760

joanna_borun triaged this task as Medium priority.

I think the last step here is to validate that any rsync failures get reported on IRC. Then we can consider all the immediate follow-ups of this incident done, and continue more slowly with the larger work at T367119: Install a default timeout for systemd::timer::jobs.
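For illustration only, here is a minimal Python sketch of what "a timeout plus a reportable failure" means for a sync job. It is not how the patches above or T367119 implement it; `run_with_timeout` is a hypothetical helper, and the exit status 124 is borrowed from the coreutils `timeout` convention. The point is that a hung sync turns into a non-zero exit status that monitoring can see, instead of a job that silently never finishes.

```python
import subprocess
import sys


def run_with_timeout(command: list[str], timeout_seconds: float) -> int:
    """Run a command and return its exit status, or 124 if it times out.

    Hypothetical helper for illustration; 124 mirrors coreutils `timeout`.
    """
    try:
        completed = subprocess.run(command, timeout=timeout_seconds, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging.
        return 124
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for a sync command: a child process that sleeps too long.
    status = run_with_timeout(
        [sys.executable, "-c", "import time; time.sleep(5)"], 0.5
    )
    print(f"exit status: {status}")
```

Any non-zero status, whether from the command itself or from the timeout, is then what the alerting needs to pick up.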

We could consider adding some mechanism to detect skew between the directories, but that's another moving part, and relying on rsync success should be enough.
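To make the skew-detection idea concrete, here is a rough Python sketch, assuming each host could produce a manifest of its volatile directory and that the manifests could be compared somewhere central. The function names `manifest` and `skew` are hypothetical, and nothing like this exists in the patches above. It hashes every file under a directory and reports files that are missing, extra, or different on the replica.

```python
import hashlib
from pathlib import Path


def manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            result[path.relative_to(root).as_posix()] = digest
    return result


def skew(source: dict[str, str], replica: dict[str, str]) -> dict[str, list[str]]:
    """Compare two manifests and list missing, extra, and changed files."""
    return {
        # Present on the source but absent on the replica.
        "missing": sorted(source.keys() - replica.keys()),
        # Present on the replica but absent on the source.
        "extra": sorted(replica.keys() - source.keys()),
        # Present on both, but with different contents.
        "changed": sorted(
            name
            for name in source.keys() & replica.keys()
            if source[name] != replica[name]
        ),
    }
```

A report with all three lists empty would mean the two trees match. As noted above, this is another moving part, which is the argument for relying on rsync exit status instead.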

How about adding a MAILTO to the timer and mailing a specific list, team, or group? I think that alerting via IRC is becoming less reliable, and direct email (or even automatic ticket creation) would be more effective.