Implement non-cgroups-related performance optimizations on stat hosts
Closed, Resolved (Public)

Description

As I look more closely at the stat hosts' configurations, it becomes clear that these hosts would benefit from some simple performance optimizations. However, the differences in hardware mean there will be some variation between hosts.

Creating this ticket to research and implement non-cgroups-related performance optimizations on the stat hosts. See T376653 for the cgroups-specific work.

Event Timeline

bking changed the task status from Open to In Progress. Oct 9 2024, 3:14 PM
bking triaged this task as Medium priority.
bking created this task.

Change #1078973 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] stat hosts: enable zRAM-based swap

https://gerrit.wikimedia.org/r/1078973

Change #1078973 merged by Bking:

[operations/puppet@production] stat hosts: enable zRAM-based swap

https://gerrit.wikimedia.org/r/1078973

Per the above patch, we've enabled zRAM, which should give the hosts a bit of protection under extreme memory pressure. I had planned on exploring more I/O-related optimizations, but as mentioned in T376653, it's likely these hosts will use Ceph mounts for their homedirs instead of local disks. As such, I don't think it's worth investing much more time in this issue. We can always revisit if need be. Closing...
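For readers unfamiliar with zRAM-based swap, here is a minimal sketch of the manual steps such a setup usually boils down to. It is not the content of the Puppet patch above: the helper name `zram_plan`, the 25% sizing, and the swap priority of 100 are all assumptions for illustration. The function only prints the commands (a dry run) so they can be reviewed before anyone runs them as root.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the commands that would create a zRAM
# swap device sized at PERCENT of total RAM. Nothing is executed here.
zram_plan() {
    mem_kb=$1    # total RAM in kB, e.g. the MemTotal value from /proc/meminfo
    percent=$2   # share of RAM used as the zRAM disk size (assumed value)
    size_bytes=$(( mem_kb * 1024 * percent / 100 ))
    echo "modprobe zram"
    echo "echo ${size_bytes} > /sys/block/zram0/disksize"
    echo "mkswap /dev/zram0"
    # A high priority makes the kernel prefer zRAM over any disk-backed swap.
    echo "swapon --priority 100 /dev/zram0"
}

# Example: plan for a host with 8 GiB of RAM, using 25% of it for zRAM.
zram_plan 8388608 25
```

Printing rather than executing keeps the sketch safe to run anywhere; on a real host the sizing would come from whatever the Puppet change actually configures.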

Change #1080769 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] stat hosts: Permit zRAM swapping

https://gerrit.wikimedia.org/r/1080769

Change #1080769 merged by Bking:

[operations/puppet@production] stat hosts: Permit zRAM swapping

https://gerrit.wikimedia.org/r/1080769

Mentioned in SAL (#wikimedia-operations) [2024-10-16T19:14:36Z] <inflatador> bking@stat1011 racadm>>racadm set BIOS.MemSettings.NodeInterleave Enabled T376813

Mentioned in SAL (#wikimedia-operations) [2024-10-16T19:16:01Z] <inflatador> bking@stat1011 racadm>>racadm jobqueue create BIOS.Setup.1-1 Commit JID = JID_291241139935 T376813

Reopening, as enabling node interleaving did improve stability on stat1011. We should apply this to the other stat hosts.

For future hosts, it's a BIOS-level setting, so we'll probably need to ask DC Ops to flip this switch as part of the provisioning process.
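As a rough sketch of what applying this to the remaining hosts involves, the helper below prints the two racadm commands from the SAL entries above for each host given. The function name `interleave_plan` is hypothetical, and it only prints the commands; running them still requires a session on each host's management interface, and the BIOS job takes effect at the next reboot.

```shell
#!/bin/sh
# Hypothetical dry-run helper: for each host, print the racadm commands
# (taken from the SAL entries in this ticket) that stage the BIOS change.
interleave_plan() {
    for host in "$@"; do
        echo "# on ${host}.mgmt"
        echo "racadm set BIOS.MemSettings.NodeInterleave Enabled"
        echo "racadm jobqueue create BIOS.Setup.1-1"
    done
}

# Example: the stat hosts that have not been changed yet.
interleave_plan stat1008 stat1009 stat1010
```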

Mentioned in SAL (#wikimedia-operations) [2024-10-21T13:24:57Z] <inflatador> bking@stat1008.mgmt racadm>>racadm set BIOS.MemSettings.NodeInterleave Enabled T376813

Mentioned in SAL (#wikimedia-operations) [2024-10-21T13:33:11Z] <inflatador> bking@stat1009,stat1010.mgmt racadm>>racadm set BIOS.MemSettings.NodeInterleave Enabled && racadm jobqueue create BIOS.Setup.1-1 T376813

Change #1083187 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] kafka-stretch2001: grant access to analytics-research-admins

https://gerrit.wikimedia.org/r/1083187

Change #1083187 merged by Bking:

[operations/puppet@production] kafka-stretch2001: grant access to analytics-research-admins

https://gerrit.wikimedia.org/r/1083187

Mentioned in SAL (#wikimedia-operations) [2024-10-30T18:39:43Z] <inflatador> bking@stat1008,stat1009,stat1010.mgmt racadm jobqueue delete -i $job T376813

Contrary to my prior statement, I no longer believe that disabling NUMA is necessary (see this comment for more details).

As such, I've cancelled all the jobs I scheduled above, so node interleaving will no longer be enabled (that is, NUMA will no longer be disabled) on these hosts the next time they reboot. NUMA remains disabled on stat1011, but I did schedule a job to revert that at its next reboot. I don't believe we need to schedule this reboot, as the NUMA status of the host does not affect performance enough to make a difference.
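To confirm from the OS side which state a host is actually in after a reboot, one option is to count the NUMA nodes the kernel exposes in sysfs. The sketch below assumes the standard `/sys/devices/system/node` layout; the function name `numa_node_count` is hypothetical. With node interleaving enabled, the kernel would typically see a single node, while a multi-socket host with NUMA enabled would show one node per memory domain.

```shell
#!/bin/sh
# Hypothetical helper: count the NUMA nodes visible under a sysfs-style
# directory. The directory is a parameter so the logic can be tested
# against a scratch directory instead of the live system.
numa_node_count() {
    base=${1:-/sys/devices/system/node}
    count=0
    for d in "$base"/node[0-9]*; do
        # If the glob matches nothing, $d is the literal pattern and is skipped.
        [ -d "$d" ] && count=$((count + 1))
    done
    echo "$count"
}

# Example: report the node count on the current host.
echo "NUMA nodes visible: $(numa_node_count)"
```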

That takes care of all the immediately actionable steps in this ticket, so I'm closing it out. @fkaelin, I'm still open to working together on optimizations if/when we both have time in the future. Feel free to ping me in Slack if you have any feedback on this.