Description

stat1008 locked up again today with extremely high load. More details are in this Slack thread (non-public, but I'll post the highlights of @BTullis's discoveries here).
Status | Assigned | Task
---|---|---
Open | None | T362922 Audit/consider enabling CPU performance governor on DPE SRE-owned hosts
Resolved | bking | T373446 Improve developer experience on stat hosts (SRE-scoped)
Resolved | bking | T372941 Review I/O setup on stat1008
Event Timeline
Observations from the Slack thread:
- The load is seemingly caused by I/O. It shows up on the graph as CPU system% rather than wait%, but I've seen that happen before in similar situations.
- iotop showed lots of processes reading from the local disk at a combined 375 MB/s (which sounds suspiciously like the 3 Gbps limit of SATA II, but could be a red herring; see the bandwidth arithmetic below).
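For reference, the back-of-the-envelope conversion between link rate and throughput (a minimal sketch; the ~80% payload efficiency of SATA's 8b/10b encoding is the only figure here not taken from the thread):

```python
# Rough sanity check: how does 375 MB/s relate to SATA link rates?
def link_rate_mb_per_s(gbps, encoding_efficiency=0.8):
    """Convert a raw link rate in Gbit/s to MB/s.

    SATA uses 8b/10b encoding, so only ~80% of the raw bit rate
    carries payload data.
    """
    raw = gbps * 1000 / 8                 # naive bits-to-bytes conversion
    usable = raw * encoding_efficiency    # after 8b/10b overhead
    return raw, usable

for name, gbps in [("SATA II", 3), ("SATA III", 6)]:
    raw, usable = link_rate_mb_per_s(gbps)
    print(f"{name}: {raw:.0f} MB/s raw, ~{usable:.0f} MB/s usable")
# SATA II: 375 MB/s raw, ~300 MB/s usable
# SATA III: 750 MB/s raw, ~600 MB/s usable
```

So the observed 375 MB/s matches the naive 3 Gbps ÷ 8 conversion, though the usable SATA II limit after encoding overhead is closer to 300 MB/s.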
Further observations:
- stat1008 has a hardware RAID controller, but is configured with software RAID. Ideally we'd:
- reconfigure with hardware RAID (as we apparently do with the Hadoop workers). This would save CPU, as the dedicated hardware controller would manage RAID operations (striping/mirroring/recovery).
- set the I/O scheduler to none. This delegates I/O scheduling to the RAID controller, which should save CPU as well (the controller makes the final I/O decisions anyway; this just prevents the host from wasting cycles scheduling IOPS that will be rescheduled by the controller). See the scheduler check sketched after this list.
- Procurement ticket T242149 shows the disks as 2TB 7.2K RPM SATA 6Gbps 512n 2.5in Hot-plug Hard Drive 400-ASHT, which should give us some idea of what access speeds to expect.
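As a companion to the scheduler point above, a minimal sketch of how the active scheduler could be checked (and switched) from the host via the standard Linux sysfs interface; the sd* device pattern is an assumption, and the actual block devices on stat1008 would need to be confirmed:

```python
#!/usr/bin/env python3
"""Report (and optionally set) the I/O scheduler for each SATA/SAS block device.

The active scheduler in /sys/block/<dev>/queue/scheduler is shown in
square brackets, e.g. "mq-deadline kyber [none]".
"""
import glob
import os
import sys

def read_scheduler(dev_path):
    with open(os.path.join(dev_path, "queue", "scheduler")) as f:
        line = f.read().strip()
    active = line.split("[")[1].split("]")[0] if "[" in line else line
    return line, active

def set_scheduler(dev_path, scheduler="none"):
    # Requires root; writing the name switches the scheduler at runtime.
    with open(os.path.join(dev_path, "queue", "scheduler"), "w") as f:
        f.write(scheduler)

if __name__ == "__main__":
    for dev_path in sorted(glob.glob("/sys/block/sd*")):
        available, active = read_scheduler(dev_path)
        print(f"{os.path.basename(dev_path)}: active={active} (available: {available})")
        if "--set-none" in sys.argv:
            set_scheduler(dev_path, "none")
```

A runtime change like this doesn't survive a reboot, so in practice it would presumably be persisted via a udev rule or configuration management rather than a one-off script.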
I don't know that the changes above would have prevented the lockup we saw today, but they should get us closer to peak performance on this hardware.
Per conversation with @BTullis, this host is due for a refresh in February 2025. At that point, its GPU will be moved to a replacement host.
Unless we start to see these lockups a lot more often, I don't think it's worth the effort to reimage the host, as it has ~5.6 TB of data and we'd need to back up and restore over its 1Gbps NIC (which would take a full 24 hours under ideal circumstances).
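For the record, the rough arithmetic behind that estimate (a sketch; it assumes full line-rate 1 Gbps with no protocol overhead and counts both the backup copy and the restore copy):

```python
# Back-of-the-envelope: moving ~5.6 TB over a 1 Gbps NIC.
data_tb = 5.6
nic_gbps = 1.0

data_bits = data_tb * 1e12 * 8                          # TB -> bits (decimal units)
one_copy_hours = data_bits / (nic_gbps * 1e9) / 3600
print(f"one copy:         {one_copy_hours:.1f} h")      # ~12.4 h
print(f"backup + restore: {2 * one_copy_hours:.1f} h")  # ~24.9 h
```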
Since we're not going to reimage the host, I'm closing out this ticket. Work to stabilize the host via cgroups continues in T372416.