Resolve kernel hang on wcqs* instances
Open, Medium, Public

Description

While setting up the new wcqs service on wcqs[12]00[123] we started an import, the same import process that we have run many times on the wcqs-beta instance and once before on these instances. After two days, half of the instances had become unresponsive; the remaining half appear to have stopped importing but have not completely hung (yet).

Attached is the dmesg output for each instance that was still accessible. In all cases there is a single moment where the kernel starts complaining about hung tasks related to disk IO; it doesn't seem to complain about anything before or after that ~2 minute failure period.

wcqs1001: P17667
wcqs1003: P17668
wcqs2002: P17669
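
For reference, the "hung task" messages in those pastes come from the kernel's hung-task watchdog; its thresholds can be inspected with sysctl (a quick sketch for context, the values mentioned are the usual defaults rather than anything confirmed on these hosts):

# how long a task must sit in uninterruptible (D) state before the kernel
# logs a "blocked for more than N seconds" warning (commonly 120s)
sysctl kernel.hung_task_timeout_secs
# how many such warnings the kernel emits before going quiet
sysctl kernel.hung_task_warnings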

Event Timeline

Some random info I looked up:

  • Grafana reports free memory of at least 90G across instances. This is typical; the application leans heavily on the Linux disk cache to keep its data in memory.
  • Instances were all generally doing 4-6k write iops with a throughput of ~50MB/s for 24hr+. Minimal reads.
  • Disk utilization reports around 15%, similar across all instances.
  • Filesystem for application data is mounted at /srv, ~15% used.
  • RAID reports all drives healthy (U):
ebernhardson@wcqs1003:~$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md0 : active raid10 sda2[0] sdb2[1] sdd2[3] sdc2[2]
      3749898240 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 1/28 pages [4KB], 65536KB chunk

unused devices: <none>
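
A few more low-level checks that could be worth capturing while an instance is in this state (a sketch; device names assumed to match the mdstat output above):

# per-device detail for the array: state, sync status, any failed members
sudo mdadm --detail /dev/md0
# SMART health summary for each member disk
for d in sda sdb sdc sdd; do sudo smartctl -H /dev/$d; done
# live per-device latency and utilization, sampled every 5 seconds
iostat -x 5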

Reading from md0 works as expected:

ebernhardson@wcqs1003:~$ sudo dd if=/dev/md0 of=/dev/null bs=1048576 count=100000 status=progress
104048099328 bytes (104 GB, 97 GiB) copied, 80 s, 1.3 GB/s
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB, 98 GiB) copied, 80.7095 s, 1.3 GB/s

Attempting to sync a file to disk hangs:

root@wcqs1003:~# echo 'foo' > test
root@wcqs1003:~# sync test
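
If this happens again, it might help to capture which tasks are actually stuck before power-cycling. A sketch of what could be run (as root; assumes sysrq is enabled via kernel.sysrq):

# list processes stuck in uninterruptible sleep and what they are waiting on
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'
# dump kernel stack traces of all blocked tasks into dmesg
echo w > /proc/sysrq-trigger
dmesg | tail -n 200
# kernel stack of a specific stuck process, e.g. the hung sync
cat /proc/<pid>/stack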

Without better information, I would guess we are triggering a deadlock somewhere in the kernel related to disk writes? But no particular locking is mentioned in the traces, so perhaps not. As it affected all 6 instances it seems fairly reproducible, although perhaps we should take the application out of the picture and try to reproduce with some simple tools that generate disk writes (a rough sketch follows below)? This might be fairly tedious: the error took more than a day to trigger, and the only idea I have for reproducing it externally is to generate random write loads of similar size.
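
As a concrete starting point for an application-free reproduction, something like fio could approximate the observed load. This is only a rough sketch; the parameters below are guesses based on the numbers above, not a tested configuration:

# sustain ~50MB/s of small random writes against /srv for a day
mkdir -p /srv/fio-test
fio --name=wcqs-repro --directory=/srv/fio-test \
    --rw=randwrite --bs=8k --size=100G \
    --ioengine=libaio --iodepth=16 --direct=0 \
    --rate=50m --time_based --runtime=86400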

@MoritzMuehlenhoff as someone who deals with the kernel often, any suggestions for where to investigate?

Mentioned in SAL (#wikimedia-operations) [2021-11-03T19:35:49Z] <mutante> depooled wcqs2003 (pooled=inactive) because Icinga alerts that servers are down but pooled. not in production yet but issues (T294961)

Change 736564 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wcqs: state change production->lvs_setup

https://gerrit.wikimedia.org/r/736564

Change 736585 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/dns@master] Revert "wcqs: add discovery record"

https://gerrit.wikimedia.org/r/736585

Change 736585 merged by Ryan Kemper:

[operations/dns@master] Revert "wcqs: add discovery record"

https://gerrit.wikimedia.org/r/736585

Mentioned in SAL (#wikimedia-operations) [2021-11-03T21:45:46Z] <ryankemper> T294961 [WCQS] Merged https://gerrit.wikimedia.org/r/c/operations/dns/+/736585, running ryankemper@authdns1001:~$ sudo -i authdns-update

Mentioned in SAL (#wikimedia-operations) [2021-11-03T21:47:45Z] <ryankemper> T294961 [WCQS] DNS changes rolled out, proceeding to the lvs_setup step: https://gerrit.wikimedia.org/r/c/operations/puppet/+/736564

Change 736564 merged by Ryan Kemper:

[operations/puppet@production] wcqs: state change production->lvs_setup

https://gerrit.wikimedia.org/r/736564

Mentioned in SAL (#wikimedia-operations) [2021-11-03T21:53:32Z] <ryankemper> T294961 [WCQS] Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/736564 and successfully ran ryankemper@cumin1001:~$ sudo cumin 'A:icinga or A:dns-auth' run-puppet-agent

Mentioned in SAL (#wikimedia-operations) [2021-11-03T21:56:00Z] <ryankemper> T294961 [WCQS] Forcing recheck of PyBal IPVS diff check and PyBal backends health check

It looks to be deadlocking somewhere deep in the I/O layer. There are some sysctls and kernel settings that we could fine-tune, but given that this only happens after a full day's run, that'll be a slow-going process.

I see two next steps that we should try:

  1. Looking at Netbox these were purchased in March, but that doesn't necessarily mean that the system firmware is up to date. These are often delivered with the firmware version from when the specific server model originally shipped. We could ask DC ops to upgrade one of the servers to the latest versions and re-test.
  2. These are Buster systems, but we can try the 5.10 kernel available from Debian backports (installable with "apt-get install linux-image-5.10.0-0.bpo.9"; a rough sketch of the commands follows below). We have a handful of services which already run the 5.10 kernel on Buster (e.g. the Hadoop and stat* hosts with the AMD GPUs).
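
For reference, the backports route in option 2 would look roughly like the following (a sketch; it assumes deb.debian.org, whereas in practice the hosts' standard mirror/apt configuration would apply):

# enable buster-backports if it isn't already in the apt sources
echo 'deb http://deb.debian.org/debian buster-backports main' > /etc/apt/sources.list.d/backports.list
apt-get update
# install the 5.10 backports kernel, then reboot into it
apt-get install -t buster-backports linux-image-5.10.0-0.bpo.9-amd64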

Mentioned in SAL (#wikimedia-operations) [2021-11-04T16:48:45Z] <ryankemper> T294961 [WCQS] Power cycled all 6 wcqs* hosts via the mgmt console (racadm serveraction powercycle)

Mentioned in SAL (#wikimedia-operations) [2021-11-04T17:23:57Z] <ryankemper> T294961 [WCQS] Installed kernel version Linux 5.10.0-0.bpo.9-amd64 on all wcqs* hosts

Let's make sure the kernel version is pinned somewhere in our puppet code! Then we can wait to see if the problem is reproduced or not.

<moritzm> ryankemper, ebernhardson: yeah, let's install the kernel package manually for testing and if it fixes the issue, then we can apply the profile::base::linux510

So specifically we'll apply profile::base::linux510
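
Once the profile is applied and puppet has run everywhere, a quick fleet-wide check could confirm every host is on the new kernel (a sketch; the cumin host selector is a guess, not a verified alias):

# from a cumin host, confirm each wcqs node reports the backports kernel
sudo cumin 'wcqs*' 'uname -r'
# expected per host: 5.10.0-0.bpo.9-amd64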

The import that caused everything to fall over last time has completed. I'm not sure that's enough to declare this fixed (it ran once before as well), but with the puppet patch in place we can probably wait on this one to see if it recurs.

colewhite triaged this task as Medium priority. Mon, Nov 8, 10:33 PM

Started another round of imports today to see how it goes. If it doesn't fall over, we might as well call this done for now.

Another round of import tests completed, nothing fell over. Calling this done for now.

We still need to add profile::base::linux419 to the WCQS roles, otherwise with the next reimage they'd get installed with Linux 4.19 again.

Change 742729 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add profile::base::linux419 to the WCQS role

https://gerrit.wikimedia.org/r/742729

Change 742729 merged by Ryan Kemper:

[operations/puppet@production] Add profile::base::linux419 to the WCQS role

https://gerrit.wikimedia.org/r/742729

Change 743223 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] Switch WCQS to profile::base::linux510

https://gerrit.wikimedia.org/r/743223

Change 743223 merged by Ryan Kemper:

[operations/puppet@production] Switch WCQS to profile::base::linux510

https://gerrit.wikimedia.org/r/743223