Description

stat1008 locked up again today with extremely high load. More details are in this Slack thread (non-public, but I'll post the highlights of @BTullis's discoveries here).
Status | Assigned | Task
---|---|---
Open | None | T362922 Audit/consider enabling CPU performance governor on DPE SRE-owned hosts
Resolved | bking | T373446 Improve developer experience on stat hosts (SRE-scoped)
Resolved | bking | T372941 Review I/O setup on stat1008
Event Timeline
Observations from the Slack thread:
- The load is seemingly caused by I/O. It shows up on the graph as CPU system% rather than wait%, but I've seen that happen before in similar situations.
- iotop showed lots of processes reading from the local disk at a combined 375 MB/s (which sounds suspiciously like the 3 Gbps limit of SATA II, but could be a red herring; see the bandwidth arithmetic below).
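For reference, the back-of-the-envelope conversion between link rate and throughput (a minimal sketch; the ~80% payload efficiency of SATA's 8b/10b encoding is the only figure here not taken from the thread):

```python
# Rough sanity check: how does 375 MB/s relate to SATA link rates?
def link_rate_mb_per_s(gbps, encoding_efficiency=0.8):
    """Convert a raw link rate in Gbit/s to MB/s.

    SATA uses 8b/10b encoding, so only ~80% of the raw bit rate
    carries payload data.
    """
    raw = gbps * 1000 / 8                 # naive bits-to-bytes conversion
    usable = raw * encoding_efficiency    # after 8b/10b overhead
    return raw, usable

for name, gbps in [("SATA II", 3), ("SATA III", 6)]:
    raw, usable = link_rate_mb_per_s(gbps)
    print(f"{name}: {raw:.0f} MB/s raw, ~{usable:.0f} MB/s usable")
# SATA II: 375 MB/s raw, ~300 MB/s usable
# SATA III: 750 MB/s raw, ~600 MB/s usable
```

So the observed 375 MB/s matches the naive 3 Gbps ÷ 8 conversion, though the usable SATA II limit after encoding overhead is closer to 300 MB/s.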
Further observations:
- stat1008 has a hardware RAID controller, but is configured with software RAID. Ideally we'd:
- reconfigure with hardware RAID (as we apparently do with the Hadoop workers). This would save CPU, as the dedicated hardware controller would manage RAID operations (striping/mirroring/recovery).
- set the I/O scheduler to none. This delegates I/O scheduling to the RAID controller, which should save CPU as well (the controller makes the final I/O decisions anyway; this just prevents the host from wasting cycles scheduling IOPS that will be rescheduled by the controller). See the scheduler check sketched after this list.
- Procurement ticket T242149 shows the disks as 2TB 7.2K RPM SATA 6Gbps 512n 2.5in Hot-plug Hard Drive 400-ASHT, which should give us some idea of what access speeds to expect.
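As a companion to the scheduler point above, a minimal sketch of how the active scheduler could be checked (and switched) from the host via the standard Linux sysfs interface; the sd* device pattern is an assumption, and the actual block devices on stat1008 would need to be confirmed:

```python
#!/usr/bin/env python3
"""Report (and optionally set) the I/O scheduler for each SATA/SAS block device.

The active scheduler in /sys/block/<dev>/queue/scheduler is shown in
square brackets, e.g. "mq-deadline kyber [none]".
"""
import glob
import os
import sys

def read_scheduler(dev_path):
    with open(os.path.join(dev_path, "queue", "scheduler")) as f:
        line = f.read().strip()
    active = line.split("[")[1].split("]")[0] if "[" in line else line
    return line, active

def set_scheduler(dev_path, scheduler="none"):
    # Requires root; writing the name switches the scheduler at runtime.
    with open(os.path.join(dev_path, "queue", "scheduler"), "w") as f:
        f.write(scheduler)

if __name__ == "__main__":
    for dev_path in sorted(glob.glob("/sys/block/sd*")):
        available, active = read_scheduler(dev_path)
        print(f"{os.path.basename(dev_path)}: active={active} (available: {available})")
        if "--set-none" in sys.argv:
            set_scheduler(dev_path, "none")
```

A runtime change like this doesn't survive a reboot, so in practice it would presumably be persisted via a udev rule or configuration management rather than a one-off script.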
I don't know that the changes above would have prevented the lockup we saw today, but they should get us closer to peak performance on this hardware.
Per conversation with @BTullis, this host is due for a refresh in February 2025. At that point, its GPU will be moved to a replacement host.
Unless we start to see these lockups a lot more often, I don't think it's worth the effort to reimage the host, as it has ~5.6 TB of data and we'd need to back up and restore over its 1Gbps NIC (which would take a full 24 hours under ideal circumstances).
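For the record, the rough arithmetic behind that estimate (a sketch; it assumes full line-rate 1 Gbps with no protocol overhead and counts both the backup copy and the restore copy):

```python
# Back-of-the-envelope: moving ~5.6 TB over a 1 Gbps NIC.
data_tb = 5.6
nic_gbps = 1.0

data_bits = data_tb * 1e12 * 8                          # TB -> bits (decimal units)
one_copy_hours = data_bits / (nic_gbps * 1e9) / 3600
print(f"one copy:         {one_copy_hours:.1f} h")      # ~12.4 h
print(f"backup + restore: {2 * one_copy_hours:.1f} h")  # ~24.9 h
```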
Since we're not going to reimage the host, I'm closing out this ticket. Work to stabilize the host via cgroups continues in T372416.