CPU scaling governor audit
Closed, Declined (Public)

Description

The problem

As discovered by @faidon in T210723 (Address recurrent service check time out for "HP RAID" on swift backend hosts), the ondemand scaling governor isn't performing very well, at least on the ms-be HP hosts: load average is high and so is the reported cpu utilization %.

On further research, the problem is that the default bios setting for power control on HP Gen9 ("dynamic") leads Linux to load the pcc-cpufreq driver, which doesn't scale with more than 4 CPUs and the ondemand governor. Using "os control" for power settings lets Linux take full control of scaling; the end result is that the intel_pstate driver is loaded and powersave is the default governor. This configuration also matches what happens on both Dell and HP Gen10 for the rest of the fleet (see below for a full audit).
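To see which driver and governor a given host ended up with, the two sysfs files can be read directly. This is a minimal sketch: the CPU_SYSFS variable is a hypothetical override (not part of any existing tooling) so the function can be pointed at a test directory, and it only looks at cpu0, as the audit below does.

```shell
#!/bin/sh
# Print the cpufreq scaling driver and governor for cpu0.
# CPU_SYSFS is a hypothetical override for testing; on a real host the
# default sysfs location is used.
cpufreq_state() {
    dir="${CPU_SYSFS:-/sys/devices/system/cpu}/cpu0/cpufreq"
    # Either file may be missing (e.g. no cpufreq driver loaded).
    driver=$(cat "$dir/scaling_driver" 2>/dev/null || echo none)
    governor=$(cat "$dir/scaling_governor" 2>/dev/null || echo none)
    echo "driver=$driver governor=$governor"
}

cpufreq_state
```

On an unfixed HP Gen9 host this would be expected to report pcc-cpufreq and ondemand, and intel_pstate and powersave after the bios change and a reboot.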

The fix

Issuing set /system1/oemhp_power1 oemhp_powerreg=os from ilo ssh on HP Gen9 hosts and rebooting will switch to intel_pstate driver + powersave governor.

When a reboot is invasive or time consuming (e.g. database hosts), a temporary fix is to set the governor to performance and change the ilo settings. Setting powersave isn't possible without a reboot: the governors available are ondemand, performance and schedutil. On the next reboot powersave will then get loaded. While temporary, the fix should give a pretty close preview of what's going to happen to cpu utilization after the next reboot.
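The temporary fix can be sketched as a small function that first checks that the requested governor is actually offered by the running driver, then writes it for every CPU. This is only an illustration: set_governor and the CPU_SYSFS override are hypothetical names, and it has to run as root on a real host.

```shell
#!/bin/sh
# Switch every CPU to the requested governor, but only if the running
# cpufreq driver lists it as available.
# CPU_SYSFS is a hypothetical override used for testing.
set_governor() {
    want="$1"
    root="${CPU_SYSFS:-/sys/devices/system/cpu}"
    avail="$root/cpu0/cpufreq/scaling_available_governors"
    # Under pcc-cpufreq this file reads e.g. "ondemand performance schedutil".
    if ! grep -qw "$want" "$avail" 2>/dev/null; then
        echo "governor '$want' not available" >&2
        return 1
    fi
    for f in "$root"/cpu*/cpufreq/scaling_governor; do
        echo "$want" > "$f"
    done
}
```

Calling set_governor performance applies the temporary fix; set_governor powersave would be refused on these hosts until the reboot brings in intel_pstate.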

performance vs powersave

We are forcing some hosts to use the performance governor via the puppet class cpufrequtils (e.g. lvs/cp). Choosing between performance and powersave for a particular class of hosts is outside the scope of this task, though; the goal here is to get the fleet to a standard baseline (i.e. intel_pstate + powersave).

Audit

Fleetwide audit below (Dell + powersave + intel_pstate skipped, since that's the desired/default state already)
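The buckets used in the audit can be expressed as a small classifier over a (driver, governor) pair. This is a sketch only; classify_cpufreq and the bucket labels are illustrative and not taken from any existing tool.

```shell
#!/bin/sh
# Map a (driver, governor) pair to the buckets used in the audit.
# Pass empty strings when the sysfs files are missing.
classify_cpufreq() {
    driver="$1"
    governor="$2"
    case "$driver/$governor" in
        intel_pstate/powersave) echo "baseline" ;;     # desired/default state
        */performance)          echo "performance" ;;  # expected, or set for tests
        */ondemand)             echo "ondemand" ;;     # needs bios fix + reboot
        */)                     echo "no-governor" ;;  # nothing reported
        *)                      echo "other" ;;
    esac
}
```

For example, classify_cpufreq pcc-cpufreq ondemand prints ondemand, which is the group that needs the bios change.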

Dell

cumin -b100 'F:virtual ~ physical and F:manufacturer ~ Dell' 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor || true'

ondemand

eeden.wikimedia.org: old host in esams, unused
labsdb[1006-1007].eqiad.wmnet: the acpi_cpufreq module has been loaded, I'm guessing depending on bios settings. Hosts are being decom'd in T220144 so we can let them be.

No governor

bast3002.wikimedia.org,cp1008.wikimedia.org,db2114.codfw.wmnet,db1138.eqiad.wmnet,dbproxy2001.codfw.wmnet,dbproxy[1001-1011].eqiad.wmnet,dns1002.wikimedia.org,es[2001-2004].codfw.wmnet,helium.eqiad.wmnet,iron.wikimedia.org,labstore[2001-2004].codfw.wmnet,lvs[1001-1006].wikimedia.org,maerlant.wikimedia.org,multatuli.wikimedia.org,nescio.wikimedia.org,rhenium.wikimedia.org,rhodium.eqiad.wmnet

perhaps disabled via bios settings, will need to be audited

performance

cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1075-1090].eqiad.wmnet,cp[5001-5012].eqsin.wmnet,cp[3030,3032-3036,3038-3047,3049].esams.wmnet,cp[4021-4032].ulsfo.wmnet
lvs[1013-1016].eqiad.wmnet,lvs[5001-5003].eqsin.wmnet,lvs[3001-3004].esams.wmnet,lvs[4005-4007].ulsfo.wmnet

expected

analytics1070.eqiad.wmnet,kafka-main[2001-2003].codfw.wmnet,labstore[1004-1005].eqiad.wmnet
manually set for tests or due to bios settings

HP

cumin -b100 'F:virtual ~ physical and F:manufacturer ~ HP' 'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor || true'

powersave

db[2097-2102].codfw.wmnet,db[1139-1140].eqiad.wmnet DL360 Gen 10, looks like this generation already works out of the box (i.e. intel_pstate is the driver) even when power control is set to dynamic in the bios.

labsdb1012.eqiad.wmnet ditto as above, host is Gen 10 but DL380 not DL360 (and default settings, i.e. power control is dynamic)

ms-be2037.codfw.wmnet DL380 Gen9 but fixed bios settings as part of this task to be "os control"

No governor

mc[1022,1031].eqiad.wmnet likely due to bios settings?

performance

lvs[2001-2006].codfw.wmnet expected

ms-be[2016,2031,2033,2034-2035,2038].codfw.wmnet,ms-be1036.eqiad.wmnet due to tests, will be fixed with bios settings + reboot

ondemand

Will need to be fixed via bios settings (i.e. set /system1/oemhp_power1 oemhp_powerreg=os from ilo over ssh) and reboot.

If a reboot is problematic or requires coordination (e.g. databases), then setting the governor to performance via for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > $i ; done will give similar (somewhat higher) performance until the next reboot, when powersave will be used instead.

  • aqs[1004-1006].eqiad.wmnet
  • cloudcontrol2003-dev.wikimedia.org,cloudcontrol1004.wikimedia.org,cloudweb2001-dev.wikimedia.org
  • cloudcontrol1003.wikimedia.org
  • clouddb2001-dev.codfw.wmnet
  • cloudnet2002-dev.codfw.wmnet,cloudnet[1003-1004].eqiad.wmnet
  • cloudservices2002-dev.wikimedia.org
  • cloudservices1003.wikimedia.org
  • cloudvirt[1001-1009,1012-1013,1019-1020].eqiad.wmnet
  • cloudvirt1014.eqiad.wmnet
  • conf[1004-1006].eqiad.wmnet
  • db[1074-1095].eqiad.wmnet,db[2043-2063,2065-2070].codfw.wmnet (db[2034-2038,2040-2042].codfw.wmnet T221533, dbstore[2001-2002].codfw.wmnet T220002 are to be decom., do that instead)
  • druid[1001-1003].eqiad.wmnet
  • elastic1041.eqiad.wmnet,elastic[1032-1040,1042-1052].eqiad.wmnet,elastic[2025-2036].codfw.wmnet
  • labmon[1001-1002].eqiad.wmnet
  • labpuppetmaster[1001-1002].wikimedia.org
  • labsdb[1009-1011].eqiad.wmnet
  • labstore[1006-1007].wikimedia.org
  • labtestpuppetmaster2001.wikimedia.org,labtestservices2003.wikimedia.org,labtestvirt2003.codfw.wmnet
  • maps2002.codfw.wmnet,maps[1001-1004].eqiad.wmnet,maps[2001,2003-2004].codfw.wmnet
  • mc[1019-1021,1023-1030,1032-1036].eqiad.wmnet,mc[2019-2036].codfw.wmnet
  • ms-be[1016-1035,1037-1039].eqiad.wmnet
  • mwmaint2001.codfw.wmnet
  • netmon2001.wikimedia.org
  • oresrdb2002.codfw.wmnet
  • rdb[2005-2006].codfw.wmnet
  • relforge[1001-1002].eqiad.wmnet
  • restbase2009.codfw.wmnet,restbase[1010-1015].eqiad.wmnet
  • restbase-dev[1004-1006].eqiad.wmnet
  • snapshot[1005-1007].eqiad.wmnet
  • stat1006.eqiad.wmnet
  • wdqs2003.codfw.wmnet,wdqs1003.eqiad.wmnet
  • wezen.codfw.wmnet

Event Timeline

Change 520885 merged by Jbond:
[operations/puppet@production] facter - cpu_details: add governor and scaling_driver facts

https://gerrit.wikimedia.org/r/520885

Mentioned in SAL (#wikimedia-operations) [2019-07-09T15:22:41Z] <godog> reboot ms-be2023 with oemhp_powerreg=os - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-09T15:27:47Z] <godog> reboot ms-be2024 with oemhp_powerreg=os - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-09T16:29:28Z] <godog> reboot ms-be2025 with oemhp_powerreg=os - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-09T16:42:37Z] <godog> reboot ms-be2026 with oemhp_powerreg=os - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-09T16:54:03Z] <godog> reboot ms-be2027 with oemhp_powerreg=os - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-09T16:59:58Z] <godog> reboot ms-be2039 with oemhp_powerreg=os - T225713

For some reason I don't seem to be able to set oemhp_powerreg on ms-be2022; I'll try rebooting.

</>hpiLO-> show /system1/oemhp_power1
                                     
status=0
status_tag=COMMAND COMPLETED
Wed Jul 10 12:45:14 2019
                        


/system1/oemhp_power1
  Targets
  Properties
    oemhp_powerreg=unavailable
    iLO 4 license is required.
    oemhp_PresentPower=257 Watts
    oemhp_power_micro_ver=1.0.9
    oemhp_auto_pwr=ON (Minimum delay)
  Verbs
    cd version exit show set

OK, all ms-be hosts in codfw row D are now running with powersave. I'll leave it like that for a little while; no adverse effects observed so far. If the trend continues I'll do all ms-be hosts in codfw.

I've set oemhp_power1 for all ms-be hosts in codfw now, will start a rolling reboot of those:

ms-be2016 ms-be2017 ms-be2018 ms-be2019 ms-be2020 ms-be2021 ms-be2028 ms-be2029 ms-be2030 ms-be2031 ms-be2032 ms-be2033 ms-be2034 ms-be2035 ms-be2036

Mentioned in SAL (#wikimedia-operations) [2019-07-11T13:40:49Z] <godog> roll restart ms-be2016 ms-be2017 ms-be2018 ms-be2019 ms-be2020 ms-be2021 ms-be2028 ms-be2029 ms-be2030 ms-be2031 ms-be2032 ms-be2033 ms-be2034 ms-be2035 ms-be2036 - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-11T14:55:45Z] <gehel> setting CPU governor to performance for elastic1052 - T225713

This is just setting the governor to performance via /sys/... Once testing shows this works well, I'll go through the full operation (bios + restart).

I observe a pretty significant drop in CPU usage on elastic1052 (>50% to ~25%), so that looks good. I'll wait until Monday to apply to the whole cluster.

Mentioned in SAL (#wikimedia-operations) [2019-07-11T15:28:35Z] <gehel> setting CPU governor to performance for wdqs1004 - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-12T18:49:43Z] <gehel> setting CPU governor to performance for wdqs1010 - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-15T08:22:34Z] <godog> set oemhp_powerreg=os on ms-be10[16-39] - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-15T08:48:50Z] <gehel> set oemhp_powerreg=os + reboot for elastic1054 - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-15T08:49:42Z] <gehel> correction: set oemhp_powerreg=os + reboot for elastic1052 (NOT elastic1054) - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-15T12:55:05Z] <gehel> shutting down tilerator on maps eqiad to free some CPU - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-15T12:59:19Z] <gehel> re-enabling kartotherian codfw - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-15T12:59:48Z] <gehel> depooling kartotherian eqiad - T225713

Have you planned the cloudvirt yet? I guess that is a bit more challenging since instances would have to be moved ahead of time, but I am genuinely interested in seeing whether that improves the bad CPU experience I have noticed.

Oops, the 3 logs above about maps should have been on T218097.

Mentioned in SAL (#wikimedia-operations) [2019-07-15T16:58:20Z] <jynus> setting labsdb1009/10/11 to performance scaling_governor T225713

FYI, after applying the above change I expected a huge shift in reported load (even if performance didn't change) or in temperatures, given that these hosts (wikireplicas on labs) are our busiest databases in terms of cpu resources due to long-running queries. However, unlike other reporters, I didn't see much difference: the temperatures of labsdb1011 changed, but nothing changed in the load or in the temperatures of the others. Maybe CPU was already a problem in scaling for database load, or something else? https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=1563087571551&to=1563260371552&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&var-instance=labsdb1009&var-instance=labsdb1010&var-instance=labsdb1011

Mentioned in SAL (#wikimedia-operations) [2019-07-16T18:02:51Z] <andrewbogott> rebooting cloudcontrol2003-dev, cloudweb2001-dev, cloudcontrol1004 for T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-17T10:30:55Z] <godog> start rolling reboot of ms-be eqiad hosts - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-18T09:09:08Z] <godog> resume swift ms-be rolling restarts - T225713

It is indeed possible that the CPU was already heavily utilized. I can see a small decrease in system CPU %, but other than that things seem unchanged. I'm curious to see what powersave will do at the next reboot!

2019-07-19-113720_514x283_scrot.png (283×514 px, 30 KB)

Mentioned in SAL (#wikimedia-operations) [2019-07-26T13:41:43Z] <jeh> updated labstore100[67].wikimedia.org performance scaling_governor T225713

I did the same test during the offsite in Dublin with labsdb1009, and also didn't see any major changes.

elastic[1032-1052].eqiad.wmnet,elastic[2025-2036].codfw.wmnet have been configured with set /system1/oemhp_power1 oemhp_powerreg=os. This will take effect after next rolling restart.

+ cloud-services-team for the hosts: cloudvirt[1001-1009,1012-1013,1019-1020].eqiad.wmnet

cloudvirt1014 has already been updated and cloudvirt1013 has the same CPU. From T223971, testing with a busy loop: time $(i=1; while (( i < 2000000 )); do (( i ++ )); done):

| Host | mw2139 | cloudvirt1006 | cloudvirt1013 | cloudvirt1014 (updated) |
| CPU | E5-2450 | E5-2697 v2 | E5-2697 v3 | E5-2697 v3 |
| Speed | 2.10 GHz | 2.70 GHz | 2.60 GHz | 2.60 GHz |
| Turbo | 2.9 GHz | 3.5 GHz | 3.6 GHz | 3.6 GHz |
| Time | 10s | 17s | 12s | 7s |

If this task's audit holds true, cloudvirt1013 should get faster once the change is applied to it, and the setting should then be applied to all the other affected cloudvirt hosts.

@Andrew @bd808 @aborrero can you look at updating the bios setting for some of the affected cloudvirt?

Based on the benchmark in my previous comment, that seems to dramatically improve the CPU performance (12 seconds down to 7 seconds?!)

Not actively working on this; the respective service/hw owners should assess the need for and feasibility of this change.

Dzahn subscribed.

tungsten has been decom'ed today and removed from the list.

I was doing comparative benchmarks of eqiad and codfw. @ori suggested that I look at CPU scaling as a possible reason for the discrepancy. The performance impact of setting scaling_governor to performance is indeed significant. The median service time for a page view served by mw2377 dropped from 177ms to 126ms, measured by ab with a concurrency of 1.

I looked at power usage with ipmi-oem dell get-instantaneous-power-consumption-data, every 10 seconds while 1000 requests were served. After an initial ramp up, power usage averaged 97W in powersave mode, and 127W in performance mode. The time to serve 1000 requests was 178s in powersave mode and 127s in performance mode.

The idle power consumption was similar.
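As a rough sanity check on these figures, multiplying average power by run time gives the energy used to serve the 1000 requests in each mode. This assumes the post-ramp-up averages hold for the whole run, which is a simplification; energy_joules is just an illustrative helper.

```shell
#!/bin/sh
# Energy (joules) = average power (watts) x duration (seconds).
energy_joules() {
    awk -v w="$1" -v s="$2" 'BEGIN { printf "%d\n", w * s }'
}

energy_joules 97 178    # powersave:   prints 17266
energy_joules 127 127   # performance: prints 16129
```

Under that assumption the performance run used slightly less energy in total (about 16.1 kJ against 17.3 kJ), because the higher draw is offset by the shorter run time.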

powersave is also used in eqiad. It seems to me that for any kind of bursty latency-sensitive workload, powersave is an inappropriate choice.

That's very interesting! The effect that you're seeing with ab with a concurrency of 1 on a depooled appserver could be the result of the hardware boosting individual cores to a turbo frequency that exceeds the level the processor could sustain on multiple cores or over a prolonged period of time, so it may not be very predictive of the kind of gains we'd see in production, but definitely worth investigating further.

Can we segment performance metrics by appserver? The simplest thing might be to turn this on on 5% of machines and see what it does to power and performance.

The hardware in this task has been replaced, so I'm closing the task. I've opened T338944 for a more generic followup.