The problem
As discovered by @faidon in T210723 (Address recurrent service check time out for "HP RAID" on swift backend hosts), the ondemand scaling governor performs poorly on at least the ms-be HP hosts: load average is high, and so is the reported CPU utilization %.
Further research showed that the default BIOS power control setting on HP Gen9 ("dynamic") leads Linux to load the pcc-cpufreq driver, which doesn't scale beyond 4 CPUs when used with the ondemand governor. Setting power control to "os control" lets Linux fully control frequency scaling; the end result is that the intel_pstate driver is loaded and powersave is the default governor. This configuration also matches what happens on both Dell and HP Gen10 across the rest of the fleet (see below for a full audit).
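To tell which of these situations a single host is in, the driver and governor can be read from sysfs. The following is a minimal sketch, not part of the original investigation: the function name classify_cpufreq, the labels it prints (baseline, needs-bios-fix, review, no-cpufreq) and the CPUFREQ_DIR override are hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# Sketch: classify a host by its cpufreq driver and governor.
# classify_cpufreq and its output labels are illustrative names only.
classify_cpufreq() {
  driver=$1
  governor=$2
  if [ "$driver" = "intel_pstate" ] && [ "$governor" = "powersave" ]; then
    echo "baseline"          # the desired fleet default
  elif [ "$driver" = "pcc-cpufreq" ]; then
    echo "needs-bios-fix"    # HP Gen9 with "dynamic" power control
  else
    echo "review"            # e.g. performance set on purpose, or acpi-cpufreq
  fi
}

# CPUFREQ_DIR is a hypothetical override, handy for trying this off-host.
cpufreq_dir=${CPUFREQ_DIR:-/sys/devices/system/cpu/cpu0/cpufreq}
if [ -r "$cpufreq_dir/scaling_driver" ]; then
  classify_cpufreq "$(cat "$cpufreq_dir/scaling_driver" 2>/dev/null)" \
                   "$(cat "$cpufreq_dir/scaling_governor" 2>/dev/null)"
else
  echo "no-cpufreq"          # no cpufreq directory exposed for cpu0
fi
```

A host printing review is not necessarily misconfigured; some classes of hosts run the performance governor deliberately.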
The fix
Issuing set /system1/oemhp_power1 oemhp_powerreg=os from the iLO over ssh on HP Gen9 hosts and then rebooting switches the host to the intel_pstate driver + powersave governor.
When a reboot is invasive or time consuming (e.g. database hosts), a temporary fix is to change the iLO setting and set the governor to performance. Setting powersave isn't possible at this stage: the governors available without a reboot are ondemand, performance and schedutil. On the next reboot powersave will get loaded. While temporary, this fix should give a reasonably close preview of what CPU utilization will look like after the next reboot.
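The temporary governor change can be sketched as a small helper that only writes a governor the running driver actually offers. This is an illustration under stated assumptions rather than the procedure used on the fleet: the function name set_governor and its optional second argument (an alternative sysfs root, useful for dry runs) are hypothetical.

```shell
#!/bin/sh
# Sketch: set a governor on every CPU, skipping CPUs whose driver
# does not list it in scaling_available_governors.
# set_governor and the cpu_root argument are illustrative, not an existing tool.
set_governor() {
  want=$1
  cpu_root=${2:-/sys/devices/system/cpu}
  for dir in "$cpu_root"/cpu[0-9]*/cpufreq; do
    [ -d "$dir" ] || continue
    if grep -qw -- "$want" "$dir/scaling_available_governors" 2>/dev/null; then
      echo "$want" > "$dir/scaling_governor"
    else
      echo "skipping $dir: $want not available" >&2
    fi
  done
  return 0
}

# Usage (as root, on the host itself):
#   set_governor performance
```

Asking for powersave before the reboot is skipped with a message on stderr, since that governor is not among the available ones.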
performance vs powersave
We force some hosts (e.g. lvs/cp) to use the performance governor via the puppet class cpufrequtils. Choosing between performance and powersave for a particular class of hosts is outside the scope of this task; the goal here is to get the fleet to a standard baseline (i.e. intel_pstate + powersave).
Audit
Fleetwide audit below (Dell + powersave + intel_pstate skipped, since that's the desired/default state already)
Dell
cumin -b100 'F:virtual ~ physical and F:manufacturer ~ Dell' 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor || true'
ondemand
eeden.wikimedia.org: old host in esams, unused
labsdb[1006-1007].eqiad.wmnet: the acpi_cpufreq module has been loaded, presumably because of BIOS settings. The hosts are being decom'd in T220144, so we can let them be.
No governor
bast3002.wikimedia.org,cp1008.wikimedia.org,db2114.codfw.wmnet,db1138.eqiad.wmnet,dbproxy2001.codfw.wmnet,dbproxy[1001-1011].eqiad.wmnet,dns1002.wikimedia.org,es[2001-2004].codfw.wmnet,helium.eqiad.wmnet,iron.wikimedia.org,labstore[2001-2004].codfw.wmnet,lvs[1001-1006].wikimedia.org,maerlant.wikimedia.org,multatuli.wikimedia.org,nescio.wikimedia.org,rhenium.wikimedia.org,rhodium.eqiad.wmnet
perhaps disabled via bios settings, will need to be audited
performance
cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1075-1090].eqiad.wmnet,cp[5001-5012].eqsin.wmnet,cp[3030,3032-3036,3038-3047,3049].esams.wmnet,cp[4021-4032].ulsfo.wmnet,lvs[1013-1016].eqiad.wmnet,lvs[5001-5003].eqsin.wmnet,lvs[3001-3004].esams.wmnet,lvs[4005-4007].ulsfo.wmnet
expected
analytics1070.eqiad.wmnet,kafka-main[2001-2003].codfw.wmnet,labstore[1004-1005].eqiad.wmnet
manually set for tests or due to bios settings
HP
cumin -b100 'F:virtual ~ physical and F:manufacturer ~ HP' 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor || true'
powersave
db[2097-2102].codfw.wmnet,db[1139-1140].eqiad.wmnet: DL360 Gen10; this generation appears to work out of the box (i.e. intel_pstate is the driver) even when power control is set to dynamic in the BIOS.
labsdb1012.eqiad.wmnet: same as above; the host is Gen10 but a DL380 rather than a DL360 (with default settings, i.e. power control is dynamic).
ms-be2037.codfw.wmnet: DL380 Gen9, but its BIOS settings were changed to "os control" as part of this task.
No governor
mc[1022,1031].eqiad.wmnet likely due to bios settings?
performance
lvs[2001-2006].codfw.wmnet expected
ms-be[2016,2031,2033,2034-2035,2038].codfw.wmnet,ms-be1036.eqiad.wmnet due to tests, will be fixed with bios settings + reboot
ondemand
These hosts will need to be fixed via BIOS settings (i.e. set /system1/oemhp_power1 oemhp_powerreg=os from the iLO over ssh) and a reboot.
If a reboot is problematic or requires coordination (e.g. databases), then setting the governor to performance via for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > "$i" ; done will give similar (higher) performance until the next reboot, after which powersave will be used instead.
- aqs[1004-1006].eqiad.wmnet
- cloudcontrol2003-dev.wikimedia.org,cloudcontrol1004.wikimedia.org,cloudweb2001-dev.wikimedia.org
- cloudcontrol1003.wikimedia.org
- clouddb2001-dev.codfw.wmnet
- cloudnet2002-dev.codfw.wmnet,cloudnet[1003-1004].eqiad.wmnet
- cloudservices2002-dev.wikimedia.org
- cloudservices1003.wikimedia.org
- cloudvirt[1001-1009,1012-1013,1019-1020].eqiad.wmnet
- cloudvirt1014.eqiad.wmnet
- conf[1004-1006].eqiad.wmnet
- db[1074-1095].eqiad.wmnet,db[2043-2063,2065-2070].codfw.wmnet (db[2034-2038,2040-2042].codfw.wmnet T221533 and dbstore[2001-2002].codfw.wmnet T220002 are due to be decommissioned, so decommission them instead of fixing them)
- druid[1001-1003].eqiad.wmnet
- elastic1041.eqiad.wmnet,elastic[1032-1040,1042-1052].eqiad.wmnet,elastic[2025-2036].codfw.wmnet
- labmon[1001-1002].eqiad.wmnet
- labpuppetmaster[1001-1002].wikimedia.org
- labsdb[1009-1011].eqiad.wmnet
- labstore[1006-1007].wikimedia.org
- labtestpuppetmaster2001.wikimedia.org,labtestservices2003.wikimedia.org,labtestvirt2003.codfw.wmnet
- maps2002.codfw.wmnet,maps[1001-1004].eqiad.wmnet,maps[2001,2003-2004].codfw.wmnet
- mc[1019-1021,1023-1030,1032-1036].eqiad.wmnet,mc[2019-2036].codfw.wmnet
- ms-be[1016-1035,1037-1039].eqiad.wmnet
- mwmaint2001.codfw.wmnet
- netmon2001.wikimedia.org
- oresrdb2002.codfw.wmnet
- rdb[2005-2006].codfw.wmnet
- relforge[1001-1002].eqiad.wmnet
- restbase2009.codfw.wmnet,restbase[1010-1015].eqiad.wmnet
- restbase-dev[1004-1006].eqiad.wmnet
- snapshot[1005-1007].eqiad.wmnet
- stat1006.eqiad.wmnet
- wdqs2003.codfw.wmnet,wdqs1003.eqiad.wmnet
- wezen.codfw.wmnet