
puppet breakage on Jessie tools nodes (and probably on Jessie VMs everywhere)
Closed, Resolved (Public)

Description

unattended-upgrades is trying to update initramfs-tools, but there's a race in the package that sometimes causes dpkg to hang.

This has broken dpkg, apt, and puppet in many places.

Event Timeline

Andrew triaged this task as High priority. Mar 17 2019, 3:33 PM

I've seen some hosts stuck in unattended-upgrade because of NFS. At some point dpkg calls sync to flush filesystems and everything stalls. The only option I've found is to hard reboot them.
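A process blocked in sync against a hung NFS mount normally sits in uninterruptible sleep (state D in ps). As a rough sketch for spotting that on a host, the hypothetical helper below filters ps output for D-state processes. The function name and the choice of ps columns are assumptions for illustration, not something used in this task.

```shell
# Hypothetical helper: list processes in uninterruptible sleep (state "D").
# Reads "PID STAT COMMAND" lines on stdin, for example from:
#   ps -eo pid=,stat=,comm=
# Prints "PID COMMAND" for each line whose STAT field starts with D.
blocked_procs() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}
```

On a suspect host this could be run as `ps -eo pid=,stat=,comm= | blocked_procs`. A dpkg or sync process that stays in the list across several runs would be consistent with the stall described above.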

Ran cumin 'P{F:lsbdistcodename = jessie}' 'ps auxwf | grep -v grep | grep dpkg' on deployment-cumin; no dpkg processes are stuck on any of deployment-prep's 34 jessie instances.

tools-worker-1018:~$ sudo dpkg --configure -a
Setting up initramfs-tools (0.120+deb8u3) ...
update-initramfs: deferring update (trigger activated)
Processing triggers for initramfs-tools (0.120+deb8u3) ...
update-initramfs: Generating /boot/initrd.img-4.9.0-0.bpo.6-amd64

It's sure taking its time on the actual generating there.

After a hard reboot, I was able to get it to run puppet, but I was surprised at how many files it thought needed changes (for the most part, the changes aren't to actual content). P8212

The DNS server is an actual change.

Sometime when it's not the weekend let's audit all instances for stuck dpkg processes. This might be happening all over the place.
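For the audit, a plain grep for dpkg also matches short-lived, healthy runs. One possible refinement, sketched here under assumptions, is to report only dpkg processes that have been running longer than some threshold. The helper name and the one-hour default are hypothetical, and the sketch relies on the etimes (elapsed seconds) column of procps ps.

```shell
# Hypothetical helper: print dpkg processes older than a threshold.
# Reads "PID ELAPSED_SECONDS COMMAND" lines on stdin, for example from:
#   ps -eo pid=,etimes=,comm=
# The first argument is the minimum age in seconds. The default of 3600 is
# an arbitrary assumption, not a value taken from this task.
stuck_dpkg() {
    awk -v min_age="${1:-3600}" '
        # Keep only commands starting with "dpkg" that are at least min_age old.
        $3 ~ /^dpkg/ && $2 + 0 >= min_age { print $1, $2, $3 }
    '
}
```

On a single host this would be `ps -eo pid=,etimes=,comm= | stuck_dpkg 3600`. Hosts that print nothing have no long-running dpkg process by this measure.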

Mentioned in SAL (#wikimedia-cloud) [2019-03-17T17:46:12Z] <bstorm_> depooling tools-worker-1009 and tools-worker-1012 for T218514

Mentioned in SAL (#wikimedia-cloud) [2019-03-17T17:48:10Z] <bstorm_> T218514 rebooting tools-worker-1009 and 1012

Andrew claimed this task.

This seems to be fine now. I double-checked the state of apt and dpkg, and although a few things are still stuck because of race conditions, nothing widespread or serious is going on.