
cloudmetrics1003 seizes up under load
Closed, ResolvedPublic

Description

root@cloudmetrics1003:~# uptime
 17:44:41 up 13:58,  3 users,  load average: 230.34, 229.37, 226.33

This happens reliably anytime I make 1003 the primary cloudmetrics host. It seems not to happen when I make 1003 the secondary.
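For future incidents, a tiny watchdog that samples /proc/loadavg could catch the runaway load before the host becomes unreachable. This is only a sketch; the threshold of 100 is arbitrary, not a tuned value:

```shell
#!/bin/sh
# Minimal load-average watchdog sketch. Threshold is illustrative; a
# healthy host this size would normally sit well under its core count.

# Extract the 1-minute load average from a /proc/loadavg-style line,
# e.g. "230.34 229.37 226.33 3/512 12345" -> "230.34".
loadavg_1min() {
    echo "$1" | awk '{ print $1 }'
}

# Succeed (exit 0) when the load value exceeds the threshold.
load_exceeds() {
    awk -v l="$1" -v t="$2" 'BEGIN { exit !(l > t) }'
}

# Example use on a live host (not run here):
#   load=$(loadavg_1min "$(cat /proc/loadavg)")
#   load_exceeds "$load" 100 && logger "runaway load: $load"
```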

Here's an approximate graph of the problem appearing:

https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudmetrics1003&var-datasource=thanos&var-cluster=misc&from=1639677600000&to=1639785599000

Note that those metrics are way off because the system quickly becomes too sick to reliably update metrics.

It's especially mysterious that this happens 'under load' since the main thing this host is doing is aggregating Prometheus metrics, which it also does as the secondary.

There's a periodic rsync from the primary to the secondary which seems to trigger the problem, but that would mean that disk /reads/ cause the issue, which seems very unlikely.
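If the rsync really is the trigger, one low-risk experiment is throttling it with ionice and --bwlimit. The sketch below only builds the command line so the flags can be sanity-checked; the paths and the 50 MB/s limit are assumptions, not the actual puppetized job:

```shell
#!/bin/sh
# Build a throttled rsync invocation without running it. Source,
# destination, and bandwidth limit are hypothetical.
build_throttled_rsync() {
    # $1: bwlimit (e.g. 50m), $2: source, $3: destination
    echo "ionice -c3 rsync -a --bwlimit=$1 $2 $3"
}

# Example (not executed here):
#   eval "$(build_throttled_rsync 50m /srv/data/ cloudmetrics1004.eqiad.wmnet:/srv/data/)"
```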

Event Timeline

Change 747555 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/dns@master] make cloudmetrics1004 the primary cloudmetrics endpoint

https://gerrit.wikimedia.org/r/747555

Change 747554 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Cloudmetrics/statsd: exchange cloudmetrics1003 and 1004

https://gerrit.wikimedia.org/r/747554

Change 747554 merged by Andrew Bogott:

[operations/puppet@production] Cloudmetrics/statsd: exchange cloudmetrics1003 and 1004

https://gerrit.wikimedia.org/r/747554

Change 747555 merged by Andrew Bogott:

[operations/dns@master] make cloudmetrics1004 the primary cloudmetrics endpoint

https://gerrit.wikimedia.org/r/747555

Just in case it's not a hardware problem, I'm swapping roles between the two hosts to see if the bad behavior follows the role.

Change 747667 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Revert "Cloudmetrics/statsd: exchange cloudmetrics1003 and 1004"

https://gerrit.wikimedia.org/r/747667

Change 747668 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/dns@master] Revert "make cloudmetrics1004 the primary cloudmetrics endpoint"

https://gerrit.wikimedia.org/r/747668

Change 747668 merged by Andrew Bogott:

[operations/dns@master] Revert "make cloudmetrics1004 the primary cloudmetrics endpoint"

https://gerrit.wikimedia.org/r/747668

Change 747667 merged by Andrew Bogott:

[operations/puppet@production] Revert "Cloudmetrics/statsd: exchange cloudmetrics1003 and 1004"

https://gerrit.wikimedia.org/r/747667

Change 747994 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Cloudmetrics/statsd: exchange cloudmetrics1003 and 1004

https://gerrit.wikimedia.org/r/747994

Change 747995 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/dns@master] make cloudmetrics1004 the primary cloudmetrics endpoint

https://gerrit.wikimedia.org/r/747995

Change 747994 merged by Andrew Bogott:

[operations/puppet@production] Cloudmetrics/statsd: exchange cloudmetrics1003 and 1004

https://gerrit.wikimedia.org/r/747994

Change 747995 merged by Andrew Bogott:

[operations/dns@master] make cloudmetrics1004 the primary cloudmetrics endpoint

https://gerrit.wikimedia.org/r/747995

1004:

root@cloudmetrics1004:/srv# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=200G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=319MiB/s,w=107MiB/s][r=81.6k,w=27.3k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=22480: Fri Dec 17 04:30:13 2021
  read: IOPS=85.0k, BW=332MiB/s (348MB/s)(150GiB/462536msec)
   bw (  KiB/s): min=73744, max=415664, per=99.99%, avg=340008.05, stdev=69227.29, samples=925
   iops        : min=18436, max=103916, avg=85002.01, stdev=17306.82, samples=925
  write: IOPS=28.3k, BW=111MiB/s (116MB/s)(50.0GiB/462536msec); 0 zone resets
   bw (  KiB/s): min=24200, max=140160, per=99.99%, avg=113349.05, stdev=23122.43, samples=925
   iops        : min= 6050, max=35040, avg=28337.24, stdev=5780.61, samples=925
  cpu          : usr=8.78%, sys=89.47%, ctx=201079, majf=0, minf=5418
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=39320441,13108359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=332MiB/s (348MB/s), 332MiB/s-332MiB/s (348MB/s-348MB/s), io=150GiB (161GB), run=462536-462536msec
  WRITE: bw=111MiB/s (116MB/s), 111MiB/s-111MiB/s (116MB/s-116MB/s), io=50.0GiB (53.7GB), run=462536-462536msec

Disk stats (read/write):
    dm-2: ios=39309308/16667231, merge=0/0, ticks=4398400/18446744070581924184, in_queue=18446744070596440428, util=100.00%, aggrios=39324832/16674361, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
    md0: ios=39324832/16674361, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=9831183/8181807, aggrmerge=25/169394, aggrticks=1116064/397702, aggrin_queue=1424308, aggrutil=96.32%
  sdd: ios=7990082/8181500, merge=30/179137, ticks=891270/428880, in_queue=1205788, util=92.93%
  sdc: ios=11672103/8183716, merge=49/176921, ticks=1367926/387594, in_queue=1659728, util=96.32%
  sdb: ios=8078391/8179921, merge=10/161846, ticks=882794/410194, in_queue=1204624, util=92.75%
  sda: ios=11584157/8182093, merge=14/159674, ticks=1322266/364141, in_queue=1627092, util=96.28%

1003:

root@cloudmetrics1003:/srv# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=200G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=363MiB/s,w=121MiB/s][r=93.0k,w=31.0k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=12232: Fri Dec 17 04:29:51 2021
  read: IOPS=89.6k, BW=350MiB/s (367MB/s)(150GiB/438894msec)
   bw (  KiB/s): min=204376, max=427616, per=99.99%, avg=358310.84, stdev=31381.41, samples=877
   iops        : min=51094, max=106904, avg=89577.69, stdev=7845.37, samples=877
  write: IOPS=29.9k, BW=117MiB/s (122MB/s)(50.0GiB/438894msec); 0 zone resets
   bw (  KiB/s): min=68024, max=142800, per=99.99%, avg=119451.39, stdev=10488.85, samples=877
   iops        : min=17006, max=35700, avg=29862.83, stdev=2622.22, samples=877
  cpu          : usr=9.85%, sys=87.80%, ctx=136595, majf=0, minf=95
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=39320441,13108359,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=350MiB/s (367MB/s), 350MiB/s-350MiB/s (367MB/s-367MB/s), io=150GiB (161GB), run=438894-438894msec
  WRITE: bw=117MiB/s (122MB/s), 117MiB/s-117MiB/s (122MB/s-122MB/s), io=50.0GiB (53.7GB), run=438894-438894msec

Disk stats (read/write):
    dm-2: ios=39318409/13129596, merge=0/0, ticks=4215088/812388, in_queue=5050668, util=100.00%, aggrios=39321021/13131854, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
    md0: ios=39321021/13131854, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=9837475/6568754, aggrmerge=1/777, aggrticks=1053698/227892, aggrin_queue=1247141, aggrutil=96.12%
  sdd: ios=8085892/6568859, merge=0/850, ticks=852263/241039, in_queue=1039804, util=92.09%
  sdc: ios=11588852/6568860, merge=2/849, ticks=1267173/223769, in_queue=1448300, util=96.01%
  sdb: ios=8164384/6568646, merge=2/707, ticks=849460/232673, in_queue=1049988, util=92.28%
  sda: ios=11510775/6568652, merge=2/703, ticks=1245897/214089, in_queue=1450472, util=96.12%
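To compare runs like the two above mechanically rather than by eye, the headline IOPS figure can be scraped out of fio's summary lines. A small sketch, matching the fio 3.12 line format shown here:

```shell
#!/bin/sh
# Pull the IOPS value out of a fio summary line such as
# "  read: IOPS=85.0k, BW=332MiB/s (348MB/s)(150GiB/462536msec)".
fio_iops() {
    echo "$1" | sed -n 's/.*IOPS=\([0-9.]*k\{0,1\}\),.*/\1/p'
}

# Example: run fio_iops over each host's saved output and diff the
# resulting numbers.
```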
Andrew renamed this task from "cloudmetrics1003 so slow as to be useless" to "cloudmetrics1003 seizes up under load". Dec 20 2021, 3:27 PM
Andrew updated the task description. (Show Details)
Andrew added a subscriber: wiki_willy.

@wiki_willy, I'm unable to learn much about what's causing this issue, but I strongly suspect a hardware problem, since the same workload on cloudmetrics1004 (purchased at the same time) does not cause the failure.

Cloudmetrics1003 (the affected host) is a failover host and can be rebooted at any time.

No errors in the system event log (so no large-scale hardware fault detected by the mainboard):

racadm>>getsel
Record:      1
Date/Time:   09/19/2021 15:55:22
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   09/19/2021 16:00:57
Source:      system
Severity:    Ok
Description: C: boot completed.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   09/19/2021 16:00:57
Source:      system
Severity:    Ok
Description: OEM software event.
-------------------------------------------------------------------------------
racadm>>
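A small helper makes it easy to scan a saved getsel dump for anything worse than "Ok" without reading the whole log; this is just a sketch that counts Severity lines:

```shell
#!/bin/sh
# Count SEL records whose Severity is anything other than "Ok" in a
# saved `racadm getsel` dump. A result of 0 means no hardware events.
sel_bad_count() {
    # $1: file containing the getsel output
    grep '^Severity:' "$1" | grep -cv 'Ok[[:space:]]*$'
}
```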

Next steps would likely be a reboot into the Dell test suite to see if it detects anything.

Also a quick comparison of the firmware revisions between the two hosts to ensure they are matched and we are comparing apples to apples.
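That comparison can be scripted: dump each host's component/version list to a file and diff them. A hedged sketch; the two-column "component version" input format is an assumption about how the racadm output would be saved:

```shell
#!/bin/sh
# Print components whose firmware versions differ between two hosts.
# Inputs are sorted "component version" files, one per host.
diff_firmware() {
    join "$1" "$2" | awk '$2 != $3 { print $1 ": " $2 " vs " $3 }'
}
```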

Moritz says:

"yeah, I think having DC ops upgrade the firmware to the latest version is a good next step. Generally, current kernels seem to depend on current firmware: more than half the Ganeti servers in codfw that I tried to reimage to buster needed a firmware update to PXE boot; without it, the host simply fell back to the installed OS. Same for the kafka/main reimages by Luca."

RobH changed the task status from Open to In Progress. Dec 20 2021, 6:27 PM
RobH claimed this task.

Stealing this task to update the firmware and compare to cloudmetrics1004.

firmware is identical across both hosts

Screen Shot 2021-12-20 at 10.29.14 AM.png (1×2 px, 389 KB)

Screen Shot 2021-12-20 at 10.29.07 AM.png (1×2 px, 367 KB)

Firmware check:

BIOS is already at the newest revision, 2.12.2
Network card is not newest; updating 21.81.1 to 21.81.3
iDRAC is 5.00.10.00, updating to 5.10.00.00
Backplane is already newest, 4.35 A00
Power supply is already newest, 00.31.50

Updating the network and iDRAC firmware.

These were set up via T289888, for reference.

Updating the iDRAC has broken the HTTPS interface, so we'll need an onsite tech to use the lifecycle crash-cart routine to roll back the iDRAC version.

I'm rebooting this now into the Dell ePSA test via SSH/serial redirection.

No errors in the fast test; continuing with the full test.

Screen Shot 2021-12-20 at 11.13.13 AM.png (1×2 px, 756 KB)

RobH changed the task status from In Progress to Open. Dec 20 2021, 8:09 PM
RobH removed RobH as the assignee of this task.

Screen Shot 2021-12-20 at 11.50.34 AM.png (692×980 px, 217 KB)

All tests passed.

At this point, we should have the onsite tech roll back the iDRAC so it's accessible via HTTPS.

Additionally, I suppose we need to figure out a good/approved way to test full load on systems, other than their hosted service. Perhaps something like bonnie++ for disk I/O, plus a dedicated CPU stress tool?

I'm not sure what to do next other than either push this under load with monitoring to see what is causing the crash, or run a hardware test suite that is more intensive than the Dell ePSA. The ePSA suite doesn't appear to do much CPU testing and focuses more on memory.
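For a more CPU-heavy synthetic load than ePSA, something like stress-ng could work. The parameters below are illustrative, not a vetted burn-in profile; the helper just derives a memory figure from /proc/meminfo:

```shell
#!/bin/sh
# Derive 75% of MemTotal (in kB) from a /proc/meminfo "MemTotal" line,
# for use as a stress-ng --vm-bytes value.
vm_bytes_75pct() {
    echo "$1" | awk '{ printf "%dk", int($2 * 0.75) }'
}

# Illustrative invocation (stress-ng assumed installed; not run here):
#   stress-ng --cpu 0 --vm 2 \
#     --vm-bytes "$(vm_bytes_75pct "$(grep MemTotal /proc/meminfo)")" \
#     --hdd 2 --timeout 30m --metrics-brief
```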

Reedy renamed this task from "cloudmetrics1003 seizes up under load" to "cloudmetrics1003 seizes up under load". Dec 20 2021, 8:39 PM

The iDRAC is now accessible again.

I just brought up the issue on my regular call with our Dell account reps today. If things still don't work after Rob updated all the firmware, we can open a Dell Tech Direct ticket and let them know what we've done to troubleshoot. If they don't find any known bugs with any of the components or firmware versions, they'll most likely expedite a motherboard or CPU replacement. It might still make sense to have a sync-up meeting between WMCS and either I/F or the Dell team, just to ensure the hardware is sufficient for the type of services these servers run at full load. Thanks, Willy

Thanks for your ongoing attention on this! I'm frustrated by how easily I can produce this issue in production but can't find a non-downtime-causing test case that produces the same issue. If we fully run out of other troubleshooting ideas I can spend a few days trying to simulate the workload and see if we can make it fail reliably.

Yup, for sure. I definitely hear ya on that @Andrew. @Cmjohnson - maybe we can open a Dell Tech Direct ticket simultaneously, tell them we've updated the firmware, that we're suspecting a CPU or motherboard issue, and see if they're aware of any known bugs with these components or firmware versions when under load. I can then escalate through our reps to push things along, so that it's a little less painful. It might fix it and it might not, but at least we'll be able to rule out that possible issue. Thanks, Willy

I've created a ticket with Dell Self-Dispatch for a new CPU. I am doubtful they will send me one since I cannot show any HW logs that show there is an issue.
You have successfully submitted request SR1082611015.

Thanks @Cmjohnson - let's see how things go with T299744 first, before I escalate that Dell ticket up to the Dell account rep. I want to make sure we've covered things internally first. Thanks, Willy

The ticket was rejected; I am not sure how I can troubleshoot this for them. @wiki_willy, you may want to escalate this with our reps. I will need something that shows there is an issue with the CPU.

1. Troubleshooting/System Failure information provided is insufficient for Processor. The TechDirect Program Agreement requires troubleshooting to the failed component. You can refer to support.dell.com for helpful troubleshooting assistance and resubmit your request.
2. As we checked in the TSR report, there is no failure reported for the Processor. However, in the troubleshooting note you have mentioned a Processor issue. Kindly confirm the same and resubmit the request. Thank you

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1001 for host cloudmetrics1003.eqiad.wmnet with OS bullseye

I'm going to reimage this host with Bullseye and see if I can get any different behavior out of it.

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1001 for host cloudmetrics1003.eqiad.wmnet with OS bullseye executed with errors:

  • cloudmetrics1003 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201242118_root_23613_cloudmetrics1003.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1001 for host cloudmetrics1003.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1001 for host cloudmetrics1003.eqiad.wmnet with OS buster completed:

  • cloudmetrics1003 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201242157_root_30199_cloudmetrics1003.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-cloud) [2022-01-25T10:49:46Z] <arturo> made cloudmetrics1001/1002 primary/backup respectively (T299744, T297814, T300011)

Change 757102 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/dns@master] Make cloudmetrics1003 the primary cloudmetrics host, again.

https://gerrit.wikimedia.org/r/757102

Change 757102 merged by Andrew Bogott:

[operations/dns@master] Make cloudmetrics1003 the primary cloudmetrics host, again.

https://gerrit.wikimedia.org/r/757102

Resolving task since the kernel upgrade seems to have fixed this. Thanks, Willy

I'm reopening because we need to decide if this kernel can be used long-term and, if so, the same should be applied to 1004. No tasks here for dcops though; this is all stuff for me and Moritz.

via IRC chat: We'll move cloudmetrics1004 to the same kernel as 1003 and call this done.

1003 and 1004 are both running kernel 5.10 and seem happy.
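One quick way to keep the two hosts honest going forward is comparing their kernel series after upgrades. A sketch that extracts major.minor from a `uname -r` string:

```shell
#!/bin/sh
# Extract the major.minor kernel series from a `uname -r` string,
# e.g. "5.10.0-11-amd64" -> "5.10", so two hosts can be compared:
#   [ "$(ssh host1 uname -r | cut -d. -f1-2)" = "$(ssh host2 uname -r | cut -d. -f1-2)" ]
kernel_series() {
    echo "$1" | cut -d. -f1-2
}
```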