Puppet failures on many canary machines
Closed, Resolved · Public

Assigned To

Authored By

	dcaro
	Feb 22 2021, 8:44 AM

Description

This weekend there were a bunch of emails saying that puppet failed.

  1 [N +] 02/22 09:33 root [Cloud VPS alert] Puppet failure on canary-wdqs1001-01.cloudvirt-canary.eqiad1.wikimedia.cloud
...
  6 [N +] 02/22 09:25 root [Cloud VPS alert] Puppet failure on canary-wdqs1002-01.cloudvirt-canary.eqiad1.wikimedia.cloud
  7 [N +] 02/22 09:23 root [Cloud VPS alert] Puppet failure on canary1017-01.cloudvirt-canary.eqiad1.wikimedia.cloud
  8 [N +] 02/22 09:15 root [Cloud VPS alert] Puppet failure on canary1026-01.cloudvirt-canary.eqiad1.wikimedia.cloud
  9 [N +] 02/22 09:15 root [Cloud VPS alert] Puppet failure on canary1033-01.cloudvirt-canary.eqiad1.wikimedia.cloud
 10 [N +] 02/22 09:15 root [Cloud VPS alert] Puppet failure on canary1021-01.cloudvirt-canary.eqiad1.wikimedia.cloud
 11 [N +] 02/22 09:15 root [Cloud VPS alert] Puppet failure on canary1014-01.cloudvirt-canary.eqiad1.wikimedia.cloud

Details

	Subject	Repo	Branch	Lines +/-
	wmf-auto-restart: Added some help to the script	operations/puppet	production	+5 -1


Related Objects

Status	Assigned	Task
Resolved	dcaro	T275354 Puppet failures on many canary machines
Resolved	dcaro	T275376 Cloudvirt instances failing to start
Resolved	dcaro	T275377 Cloudvirt instances failing to start: cloudvirt1019/1020
Resolved	dcaro	T275378 Cloudvirt instances failing to start: Image has no associated data
Resolved	dcaro	T275407 Investigate why debian 10.0 image got corrupted
Resolved	• Bstorm	T275430 Large images cloned to /var/lib/nova/instances/_base filling up disk on hypervisors
Resolved	• Bstorm	T275585 Clean-up hard-coded references to /dev/vda and friends after changing qemu block drivers
Resolved	Andrew	T275586 Update prepare_cinder_volume to handle attached devices under

Event Timeline

dcaro created this task. Feb 22 2021, 8:44 AM

Restricted Application added a subscriber: Aklapper. Feb 22 2021, 8:44 AM

RhinosF1 subscribed. Feb 22 2021, 8:52 AM

They also seem to be out of memory: a daemon process that does not seem to be needed is using most of it.
Looking into how to get rid of it (it might just take rebuilding the VMs).

Same as T275111

It seems that diamond is used on cloud instances to gather puppet freshness (instead of icinga, as on the bare-metal hosts).

There's supposed to be a systemd timer restarting the diamond process, which should have freed the memory...

Tue 2021-02-23 05:50:00 UTC  20h left      n/a                          n/a          wmf_auto_restart_diamond.timer                  wmf_auto_restart_diamond.service

Looking at another of the instances (one that I have not rebooted yet xd)...

On canary1017-01.cloudvirt-canary.eqiad1.wikimedia.cloud, the timer was due to trigger again in a bit more than 1h,
and it had last triggered two days ago:

Mon 2021-02-22 10:58:00 UTC  1h 38min left Fri 2021-02-19 10:58:06 UTC  2 days ago   wmf_auto_restart_diamond.timer                  wmf_auto_restart_diamond.service

Manually triggered it to see if it would free any memory:

dcaro@canary1017-01:~$ sudo systemctl start wmf_auto_restart_diamond.service

But the memory usage stays at the same level... looking

Change 665995 had a related patch set uploaded (by David Caro; owner: David Caro):
[operations/puppet@production] wmf-auto-restart: Added some help to the script

https://gerrit.wikimedia.org/r/665995

gerritbot added a project: Patch-For-Review. Feb 22 2021, 9:38 AM

Hmm... I think that the restart is not really happening; the wmf-auto-restart script seems to think it's not needed:

root@canary1033-01:~# wmf-auto-restart --dry-run --debug --servicename diamond
All references to deleted files: COMMAND PID        USER  FD   TYPE DEVICE SIZE/OFF   NODE NAME
sssd    524        root DEL    REG  254,2          791578 /var/lib/sss/mc/initgroups
sssd_be 551        root DEL    REG  254,2          791578 /var/lib/sss/mc/initgroups
nrpe    553      nagios DEL    REG  254,2          791578 /var/lib/sss/mc/initgroups
exim4   577 Debian-exim DEL    REG  254,2          791578 /var/lib/sss/mc/initgroups

systemctl is-active check returned 0
PID query for MainPID returned MainPID=527

Service pids: ['527']
No restart necessary for service diamond

It turns out the script will only restart services for which some library deps have changed (it is not a plain restart).
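To illustrate the behaviour seen in the debug output above, here is a minimal Python sketch of that kind of check. It is not the actual wmf-auto-restart code: the function names `pids_with_deleted_files` and `needs_restart` are hypothetical, and the sketch assumes the decision is simply "does any of the service's PIDs appear in the list of processes holding deleted files". The sample data is copied from the canary1033-01 output.

```python
from __future__ import annotations


def pids_with_deleted_files(lsof_output: str) -> set[str]:
    """Collect the PIDs listed in lsof-style output of deleted-file references."""
    pids = set()
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, whose second field is not a number.
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        pids.add(fields[1])
    return pids


def needs_restart(service_pids: list[str], lsof_output: str) -> bool:
    """Hypothetical rule: restart only if a service PID holds a deleted file."""
    return bool(set(service_pids) & pids_with_deleted_files(lsof_output))


# Sample copied from the debug output on canary1033-01.
LSOF_OUTPUT = """\
COMMAND PID        USER  FD   TYPE DEVICE SIZE/OFF   NODE NAME
sssd    524        root DEL    REG  254,2          791578 /var/lib/sss/mc/initgroups
sssd_be 551        root DEL    REG  254,2          791578 /var/lib/sss/mc/initgroups
nrpe    553      nagios DEL    REG  254,2          791578 /var/lib/sss/mc/initgroups
exim4   577 Debian-exim DEL    REG  254,2          791578 /var/lib/sss/mc/initgroups
"""

# diamond's MainPID (527) is not in the list, so no restart is triggered.
print(needs_restart(["527"], LSOF_OUTPUT))  # False
```

Under this assumption a process that is merely using a lot of memory never qualifies, which matches the "No restart necessary for service diamond" result.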

It already seems to be in a not-so-good state, and the restart script is not working as expected. Looking:

root@canary-wdqs1001-01:~# free -m
              total        used        free      shared  buff/cache   available
Mem:            481         389           4           5          87          75
Swap:             0           0           0
root@canary-wdqs1001-01:~# systemctl restart diamond

root@canary-wdqs1001-01:~#
root@canary-wdqs1001-01:~# free -m
              total        used        free      shared  buff/cache   available
Mem:            481         385           4           5          90          78
Swap:             0           0           0
root@canary-wdqs1001-01:~# htop
root@canary-wdqs1001-01:~# systemctl stop diamond
Failed to stop diamond.service: Connection timed out
See system logs and 'systemctl status diamond.service' for details.

Ok, it seems that it's not just diamond that uses the memory; it's a general increase (probably from normal uptime), so
restarting one specific service will not suffice.

Will have to restart all the canary VMs periodically (or increase their RAM so that they fail less often).
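As a rough way to quantify how little the diamond restart helped, the two `free -m` outputs from canary-wdqs1001-01 can be compared with a small Python sketch. The helper name `parse_free_mem` is hypothetical, and the parsing assumes the column layout shown in the transcript above.

```python
def parse_free_mem(free_output: str) -> dict:
    """Parse the 'Mem:' row of `free -m` output into a dict keyed by column name."""
    lines = free_output.strip().splitlines()
    headers = lines[0].split()
    for line in lines[1:]:
        if line.strip().startswith("Mem:"):
            # Drop the 'Mem:' label and convert the remaining columns to integers.
            values = [int(v) for v in line.split()[1:]]
            return dict(zip(headers, values))
    raise ValueError("no 'Mem:' row found")


# Outputs copied from canary-wdqs1001-01, before and after `systemctl restart diamond`.
BEFORE = """\
              total        used        free      shared  buff/cache   available
Mem:            481         389           4           5          87          75
Swap:             0           0           0
"""
AFTER = """\
              total        used        free      shared  buff/cache   available
Mem:            481         385           4           5          90          78
Swap:             0           0           0
"""

before = parse_free_mem(BEFORE)
after = parse_free_mem(AFTER)

# Difference in the 'available' column, in the units reported by `free -m`.
print(after["available"] - before["available"])  # 3
```

Restarting diamond changed the available memory by only 3 out of 481, which is consistent with the conclusion that no single service is responsible.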

Mentioned in SAL (#wikimedia-cloud) [2021-02-22T11:11:57Z] <dcaro> Refreshing all the canary instances (T275354)

dcaro added a subtask: T275376: Cloudvirt instances failing to start. Feb 22 2021, 11:34 AM

Change 665995 merged by David Caro:
[operations/puppet@production] wmf-auto-restart: Added some help to the script

https://gerrit.wikimedia.org/r/665995

Fixed all the failed canaries, opened a task to investigate the root cause, closing this one.

dcaro closed this task as Resolved. Feb 22 2021, 5:02 PM

dcaro closed subtask T275376: Cloudvirt instances failing to start as Resolved.

• Bstorm reopened subtask T275376: Cloudvirt instances failing to start as Open. Feb 22 2021, 7:29 PM

• Bstorm closed subtask T275376: Cloudvirt instances failing to start as Resolved. Feb 23 2021, 11:47 PM
