Puppet failures on many canary machines
Closed, Resolved (Public)

Description

This weekend a batch of alert emails reported Puppet failures on the canary VMs:

  1 [N +] 02/22 09:33 root [Cloud VPS alert] Puppet failure on canary-wdqs1001-01.cloudvirt-canary.eqiad1.wikimedia.cloud
...
  6 [N +] 02/22 09:25 root [Cloud VPS alert] Puppet failure on canary-wdqs1002-01.cloudvirt-canary.eqiad1.wikimedia.cloud
  7 [N +] 02/22 09:23 root [Cloud VPS alert] Puppet failure on canary1017-01.cloudvirt-canary.eqiad1.wikimedia.cloud
  8 [N +] 02/22 09:15 root [Cloud VPS alert] Puppet failure on canary1026-01.cloudvirt-canary.eqiad1.wikimedia.cloud
  9 [N +] 02/22 09:15 root [Cloud VPS alert] Puppet failure on canary1033-01.cloudvirt-canary.eqiad1.wikimedia.cloud
 10 [N +] 02/22 09:15 root [Cloud VPS alert] Puppet failure on canary1021-01.cloudvirt-canary.eqiad1.wikimedia.cloud
 11 [N +] 02/22 09:15 root [Cloud VPS alert] Puppet failure on canary1014-01.cloudvirt-canary.eqiad1.wikimedia.cloud

Event Timeline

It seems that they are out of memory too: a daemon process that does not seem to be needed is using most of the
memory. Looking into how to get rid of it (it might just take rebuilding the VMs).

Same as T275111

It seems that diamond is used in cloud instances to gather puppet freshness (instead of icinga, as on the bare-metal hosts).

There's supposed to be a systemd timer restarting the diamond process, which should have freed the memory...

Tue 2021-02-23 05:50:00 UTC  20h left      n/a                          n/a          wmf_auto_restart_diamond.timer                  wmf_auto_restart_diamond.service

Looking at another of the instances (one that I have not rebooted yet xd)...

Went to canary1017-01.cloudvirt-canary.eqiad1.wikimedia.cloud: the timer was due to trigger again in a bit more than 1h,
and it had last been triggered two days ago:

Mon 2021-02-22 10:58:00 UTC  1h 38min left Fri 2021-02-19 10:58:06 UTC  2 days ago   wmf_auto_restart_diamond.timer                  wmf_auto_restart_diamond.service

Manually triggered it to see if it would free any memory:

dcaro@canary1017-01:~$ sudo systemctl start wmf_auto_restart_diamond.service

But the memory usage stays at the same level... looking into it.

Change 665995 had a related patch set uploaded (by David Caro; owner: David Caro):
[operations/puppet@production] wmf-auto-restart: Added some help to the script

https://gerrit.wikimedia.org/r/665995

Hmm... I think that the restart is not really happening; the wmf-auto-restart script seems to think it's not needed:

root@canary1033-01:~# wmf-auto-restart --dry-run --debug --servicename diamond
All references to deleted files: COMMAND PID        USER  FD   TYPE DEVICE SIZE/OFF   NODE NAME
sssd    524        root DEL    REG  254,2          791578 /var/lib/sss/mc/initgroups
sssd_be 551        root DEL    REG  254,2          791578 /var/lib/sss/mc/initgroups
nrpe    553      nagios DEL    REG  254,2          791578 /var/lib/sss/mc/initgroups
exim4   577 Debian-exim DEL    REG  254,2          791578 /var/lib/sss/mc/initgroups

systemctl is-active check returned 0
PID query for MainPID returned MainPID=527

Service pids: ['527']
No restart necessary for service diamond

It turns out the script only restarts a service when some of its library dependencies have changed (it is not an unconditional restart).
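
To illustrate why the dry run above reports that no restart is needed, here is a minimal sketch of that kind of check. It is not the actual wmf-auto-restart code: the function names are hypothetical, and it assumes the decision is simply "restart if any of the service's PIDs appears in the lsof-style list of deleted-file references".

```python
from __future__ import annotations


def pids_with_deleted_files(lsof_output: str) -> set[str]:
    """Collect the PIDs (second column) from lsof-style output."""
    pids = set()
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row ("COMMAND PID USER ...").
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        pids.add(fields[1])
    return pids


def needs_restart(service_pids: list[str], lsof_output: str) -> bool:
    """Assumed rule: restart only if a service PID still holds a deleted file."""
    return bool(set(service_pids) & pids_with_deleted_files(lsof_output))


# Deleted-file references copied from the dry-run output in this task.
LSOF_SAMPLE = """\
COMMAND PID        USER  FD   TYPE DEVICE SIZE/OFF   NODE NAME
sssd    524        root DEL    REG  254,2          791578 /var/lib/sss/mc/initgroups
sssd_be 551        root DEL    REG  254,2          791578 /var/lib/sss/mc/initgroups
nrpe    553      nagios DEL    REG  254,2          791578 /var/lib/sss/mc/initgroups
exim4   577 Debian-exim DEL    REG  254,2          791578 /var/lib/sss/mc/initgroups
"""

# diamond's MainPID (527) is not in the list, so no restart is triggered.
print(needs_restart(["527"], LSOF_SAMPLE))  # False
```

Under this assumed rule, a process that only grows in memory but holds no deleted files would never be restarted, which matches what the timer did here.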

It seems to be already in a not-so-good state, and the restart script is not working as expected. Looking:

root@canary-wdqs1001-01:~# free -m
              total        used        free      shared  buff/cache   available
Mem:            481         389           4           5          87          75
Swap:             0           0           0
root@canary-wdqs1001-01:~# systemctl restart diamond

root@canary-wdqs1001-01:~#
root@canary-wdqs1001-01:~# free -m
              total        used        free      shared  buff/cache   available
Mem:            481         385           4           5          90          78
Swap:             0           0           0
root@canary-wdqs1001-01:~# htop
root@canary-wdqs1001-01:~# systemctl stop diamond
Failed to stop diamond.service: Connection timed out
See system logs and 'systemctl status diamond.service' for details.

Ok, it seems that it's not just diamond that uses memory; it's a general increase (probably from normal uptime), so
restarting one specific service will not suffice.
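
As a rough way to quantify this, the two `free -m` snapshots above (before and after `systemctl restart diamond`) can be compared. The sketch below is only illustrative: `parse_free_mem` is a hypothetical helper, and it assumes the column layout shown in the transcript.

```python
from __future__ import annotations


def parse_free_mem(output: str) -> dict[str, int]:
    """Map the column names of `free -m` output to the values of the Mem: row."""
    lines = output.splitlines()
    header = lines[0].split()
    for line in lines[1:]:
        if line.startswith("Mem:"):
            values = [int(v) for v in line.split()[1:]]
            return dict(zip(header, values))
    raise ValueError("no 'Mem:' row found")


# Snapshots copied from canary-wdqs1001-01 in this task (values in MiB).
BEFORE = """\
              total        used        free      shared  buff/cache   available
Mem:            481         389           4           5          87          75
Swap:             0           0           0
"""
AFTER = """\
              total        used        free      shared  buff/cache   available
Mem:            481         385           4           5          90          78
Swap:             0           0           0
"""

before, after = parse_free_mem(BEFORE), parse_free_mem(AFTER)
freed = before["used"] - after["used"]
print(f"used memory freed by the restart: {freed} MiB")  # 4 MiB
print(f"available after restart: {after['available']} of {after['total']} MiB")
```

Restarting diamond freed only about 4 MiB of the 481 MiB total, which is consistent with the memory pressure not coming from that one service.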

Will have to restart all the canary VMs periodically (or increase their RAM, so that they fail less often).

Mentioned in SAL (#wikimedia-cloud) [2021-02-22T11:11:57Z] <dcaro> Refreshing all the canary instances (T275354)

Change 665995 merged by David Caro:
[operations/puppet@production] wmf-auto-restart: Added some help to the script

https://gerrit.wikimedia.org/r/665995

Fixed all the failed canaries, opened a task to investigate the root cause, closing this one.