
Review I/O setup on stat1008
Closed, ResolvedPublic

Description

stat1008 locked up again today with extremely high load. More details are in this Slack thread (non-public, but I'll post the highlights of @BTullis's discoveries here).

Event Timeline

Observations from the Slack thread:

  • The load is seemingly caused by I/O. It shows up on the graph as CPU system% rather than wait%, but I've seen that happen in the past in similar situations.
  • iotop showed lots of processes reading from the local disk at a combined 375 MB/s (which sounds suspiciously like the 3 Gbps limit of SATA II, but could be a red herring).
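For what it's worth, 375 MB/s is exactly a naive bits-to-bytes conversion of SATA II's 3 Gbps line rate. A quick sketch of the arithmetic (the encoding caveat is my addition, not from the thread):

```python
# Back-of-the-envelope check: does the observed 375 MB/s line up with SATA II?
# SATA II signals at 3 Gbps on the wire.
raw_bits_per_sec = 3_000_000_000

# Naive conversion (ignoring line encoding): bits -> megabytes.
naive_mb_per_sec = raw_bits_per_sec / 8 / 1_000_000
print(naive_mb_per_sec)  # 375.0 -- exactly the combined rate iotop reported

# SATA uses 8b/10b line encoding, so the usable payload rate is lower
# (~300 MB/s); the exact 375 match may point at the raw link rate, or be
# a coincidence entirely.
effective_mb_per_sec = raw_bits_per_sec / 10 / 1_000_000
print(effective_mb_per_sec)  # 300.0
```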

Further observations:

  • stat1008 has a hardware RAID controller, but is configured with software RAID. Ideally we'd:
    • reconfigure with hardware RAID (as we apparently do with the Hadoop workers). This would save CPU, as the dedicated hardware controller would manage RAID operations (striping/mirroring/recovery).
    • set the I/O scheduler to none. This delegates I/O scheduling to the RAID controller, which should save CPU as well (the controller makes the final I/O decisions anyway; this just prevents the host from wasting cycles scheduling IOPS that the controller will reschedule).
  • Procurement ticket T242149 lists the disks as "2TB 7.2K RPM SATA 6Gbps 512n 2.5in Hot-plug Hard Drive 400-ASHT", which should give us some idea of what access speeds to expect.
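For reference, the active scheduler is visible in sysfs, where the kernel brackets the one in use. A minimal sketch of checking and switching it (the device name sda is an assumption; check lsblk on stat1008 for the real names):

```python
# Sketch: find which I/O scheduler is active for a block device, and how one
# would switch it to "none" so the RAID controller does the scheduling.

def active_scheduler(sysfs_contents: str) -> str:
    """Parse the contents of /sys/block/<dev>/queue/scheduler.

    The kernel brackets the active scheduler, e.g. "[mq-deadline] kyber none".
    """
    for token in sysfs_contents.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active scheduler found")

print(active_scheduler("[mq-deadline] kyber bfq none"))  # -> mq-deadline

# To actually switch (requires root), write the name back:
#   echo none > /sys/block/sda/queue/scheduler
# To persist it across reboots, a udev rule is the usual route (the old
# elevator= kernel parameter is gone on modern multi-queue kernels).
```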

I don't know that the changes above would have prevented the lockup we saw today, but they should get us closer to peak performance on this hardware.

bking closed this task as Resolved. Edited Aug 27 2024, 9:55 PM

Per conversation with @BTullis, this host is due for refresh in February 2025. At that point, its GPU will be moved to a replacement host.

Unless we start to see these lockups a lot more often, I don't think it's worth the effort to reimage the host: it has ~5.6 TB of data, and we'd need to back up and restore over its 1 Gbps NIC (which would take a full 24 hours even under ideal circumstances).
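The 24-hour figure checks out once you count both directions, since a backup plus a restore means moving the data over the NIC twice. A back-of-the-envelope sketch, assuming full line rate:

```python
# Rough transfer-time estimate for backing up and restoring ~5.6 TB over a
# 1 Gbps NIC, assuming the link runs at full line rate the whole time.
data_bytes = 5.6e12            # ~5.6 TB of data on the host
nic_bytes_per_sec = 1e9 / 8    # 1 Gbps ~= 125 MB/s at line rate

one_way_hours = data_bytes / nic_bytes_per_sec / 3600
print(round(one_way_hours, 1))     # 12.4 hours for a single copy

# Backup *and* restore means moving the data twice:
round_trip_hours = 2 * one_way_hours
print(round(round_trip_hours, 1))  # 24.9 hours -- the "full 24 hours" above
```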

Since we're not going to reimage the host, I'm closing out this ticket. Work to stabilize the host via cgroups continues in T372416.