⚓ T225713 CPU scaling governor audit

	Subject	Repo	Branch	Lines +/-
	facter - cpu_details: add governor and scaling_driver facts	operations/puppet	production	+31 -4

Status	Assigned	Task
Open	pmiazga	T302623 FY2022-2023: Improve Backend Pageview Timing
Declined	None	T225713 CPU scaling governor audit
Resolved	fgiunchedi	T227667 ms-be2022 misbehaving / error on boot
Resolved	tstarling	T315398 Set MW appserver scaling_governor to performance

gerritbot added a project: Patch-For-Review. Jul 5 2019, 1:47 PM

Change 520885 merged by Jbond:
[operations/puppet@production] facter - cpu_details: add governor and scaling_driver facts

https://gerrit.wikimedia.org/r/520885

Maintenance_bot removed a project: Patch-For-Review. Jul 5 2019, 4:12 PM

CDanis subscribed. Jul 6 2019, 12:22 AM

fgiunchedi updated the task description. Jul 8 2019, 8:57 AM

Mentioned in SAL (#wikimedia-operations) [2019-07-09T15:22:41Z] <godog> reboot ms-be2023 with oemhp_powerreg=os - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-09T15:27:47Z] <godog> reboot ms-be2024 with oemhp_powerreg=os - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-09T16:29:28Z] <godog> reboot ms-be2025 with oemhp_powerreg=os - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-09T16:42:37Z] <godog> reboot ms-be2026 with oemhp_powerreg=os - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-09T16:54:03Z] <godog> reboot ms-be2027 with oemhp_powerreg=os - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-09T16:59:58Z] <godog> reboot ms-be2039 with oemhp_powerreg=os - T225713

For some reason I don't seem to be able to set oemhp_powerreg on ms-be2022, so I'll try rebooting.

</>hpiLO-> show /system1/oemhp_power1
                                     
status=0
status_tag=COMMAND COMPLETED
Wed Jul 10 12:45:14 2019
                        


/system1/oemhp_power1
  Targets
  Properties
    oemhp_powerreg=unavailable
    iLO 4 license is required.
    oemhp_PresentPower=257 Watts
    oemhp_power_micro_ver=1.0.9
    oemhp_auto_pwr=ON (Minimum delay)
  Verbs
    cd version exit show set
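The output above reports oemhp_powerreg=unavailable. As a rough illustration only, here is a minimal Python sketch that pulls the key=value pairs out of the Properties section of such output. The function name parse_ilo_properties is hypothetical, and the parsing assumes the layout is exactly as pasted above (Targets / Properties / Verbs section names on their own lines).

```python
from __future__ import annotations


def parse_ilo_properties(output: str) -> dict[str, str]:
    """Collect key=value lines found under the 'Properties' section.

    Assumes the section names appear alone on a line, as in the paste above.
    Lines without '=' (e.g. the license notice) are skipped.
    """
    props: dict[str, str] = {}
    in_properties = False
    for raw in output.splitlines():
        line = raw.strip()
        if line == "Properties":
            in_properties = True
            continue
        if line in ("Targets", "Verbs"):
            in_properties = False
            continue
        if in_properties and "=" in line:
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


# Sample text copied from the comment above.
SAMPLE = """\
status=0
status_tag=COMMAND COMPLETED

/system1/oemhp_power1
  Targets
  Properties
    oemhp_powerreg=unavailable
    iLO 4 license is required.
    oemhp_PresentPower=257 Watts
    oemhp_power_micro_ver=1.0.9
    oemhp_auto_pwr=ON (Minimum delay)
  Verbs
    cd version exit show set
"""

props = parse_ilo_properties(SAMPLE)
print(props["oemhp_powerreg"])
```

A check like props["oemhp_powerreg"] != "os" could then flag hosts where the setting did not take.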

jcrespo subscribed. Jul 10 2019, 12:48 PM

Mentioned in SAL (#wikimedia-operations) [2019-07-10T12:49:27Z] <godog> reboot ms-be2022 - T225713

fgiunchedi mentioned this in T227667: ms-be2022 misbehaving / error on boot. Jul 10 2019, 1:50 PM

OK, all ms-be hosts in codfw row D are now running with powersave. I will leave it like that for a little while; no adverse effects observed so far. If the trend continues I'll do all ms-be hosts in codfw.

I've set oemhp_power1 for all ms-be hosts in codfw now, will start a rolling reboot of those:

ms-be2016 ms-be2017 ms-be2018 ms-be2019 ms-be2020 ms-be2021 ms-be2028 ms-be2029 ms-be2030 ms-be2031 ms-be2032 ms-be2033 ms-be2034 ms-be2035 ms-be2036

Mentioned in SAL (#wikimedia-operations) [2019-07-11T13:40:49Z] <godog> roll restart ms-be2016 ms-be2017 ms-be2018 ms-be2019 ms-be2020 ms-be2021 ms-be2028 ms-be2029 ms-be2030 ms-be2031 ms-be2032 ms-be2033 ms-be2034 ms-be2035 ms-be2036 - T225713

fgiunchedi updated the task description. Jul 11 2019, 2:45 PM

fgiunchedi updated the task description. Jul 11 2019, 2:54 PM

Mentioned in SAL (#wikimedia-operations) [2019-07-11T14:55:45Z] <gehel> setting CPU governor to performance for elastic1052 - T225713

In T225713#5324717, @Stashbot wrote:

Mentioned in SAL (#wikimedia-operations) [2019-07-11T14:55:45Z] <gehel> setting CPU governor to performance for elastic1052 - T225713

This is just setting the governor to performance via /sys/... Once testing shows this works well, I'll go through the full operation (BIOS + restart).
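For reference, a minimal Python sketch of what a runtime change through sysfs can look like. It assumes the standard Linux cpufreq layout (/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor); the helper names are hypothetical and this is not the exact command that was run here. Writing requires root, and the change does not survive a reboot.

```python
from __future__ import annotations

from pathlib import Path

# Standard Linux cpufreq sysfs location. `root` is a parameter so the
# sketch can be exercised against a scratch directory instead of /sys.
SYSFS_CPU = Path("/sys/devices/system/cpu")


def governor_files(root: Path = SYSFS_CPU) -> list[Path]:
    """Return the per-CPU scaling_governor files, sorted by path."""
    return sorted(root.glob("cpu[0-9]*/cpufreq/scaling_governor"))


def read_governors(root: Path = SYSFS_CPU) -> dict[str, str]:
    """Map each CPU directory name (e.g. 'cpu0') to its current governor."""
    return {
        f.parent.parent.name: f.read_text().strip()
        for f in governor_files(root)
    }


def set_governor(governor: str, root: Path = SYSFS_CPU) -> int:
    """Write `governor` to every CPU and return how many files were written.

    Needs root on a real host; the setting is lost at the next reboot.
    """
    files = governor_files(root)
    for f in files:
        f.write_text(governor + "\n")
    return len(files)
```

On a real host, set_governor("performance") followed by read_governors() would show whether every core picked up the new value.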

fgiunchedi updated the task description. Jul 11 2019, 3:06 PM

I observe a pretty significant drop in CPU usage on elastic1052 (>50% to ~25%), so that looks good. I'll wait until Monday to apply to the whole cluster.

Mentioned in SAL (#wikimedia-operations) [2019-07-11T15:28:35Z] <gehel> setting CPU governor to performance for wdqs1004 - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-12T18:49:43Z] <gehel> setting CPU governor to performance for wdqs1010 - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-15T08:22:34Z] <godog> set oemhp_powerreg=os on ms-be10[16-39] - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-15T08:48:50Z] <gehel> set oemhp_powerreg=os + reboot for elastic1054 - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-15T08:49:42Z] <gehel> correction: set oemhp_powerreg=os + reboot for elastic1052 (NOT elastic1054) - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-15T12:55:05Z] <gehel> shutting down tilerator on maps eqiad to free some CPU - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-15T12:59:19Z] <gehel> re-enabling kartotherian codfw - T225713

Mentioned in SAL (#wikimedia-operations) [2019-07-15T12:59:48Z] <gehel> depooling kartotherian eqiad - T225713

fgiunchedi updated the task description. Jul 15 2019, 3:01 PM

Have you planned the cloudvirt yet? I guess that is a bit more challenging since instances would have to be moved ahead of time, but I am genuinely interested in seeing whether that improves the bad CPU experience I have noticed.

Oops, the 3 log entries above about maps should have been on T218097.

jcrespo updated the task description. Jul 15 2019, 4:44 PM

Mentioned in SAL (#wikimedia-operations) [2019-07-15T16:58:20Z] <jynus> setting labsdb1009/10/11 to performance scaling_governor T225713

FYI, after applying the above change, I expected a huge shift on reported load (even if performance didn't change) or on temperatures, given this (wikireplicas on labs) are our busiest databases on cpu resources due to long-running queries, however, I didn't see much difference, unlike other reporters, except on the temperatures of labsdb1011, none on the load or the temperatures of the others. Maybe CPU was already a problem in scaling for database load or something else? https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=1563087571551&to=1563260371552&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&var-instance=labsdb1009&var-instance=labsdb1010&var-instance=labsdb1011

Andrew updated the task description. Jul 16 2019, 3:44 PM

Mentioned in SAL (#wikimedia-operations) [2019-07-16T18:02:51Z] <andrewbogott> rebooting cloudcontrol2003-dev, cloudweb2001-dev, cloudcontrol1004 for T225713

Andrew updated the task description. Jul 16 2019, 6:04 PM

Andrew updated the task description. Jul 16 2019, 6:19 PM

Mentioned in SAL (#wikimedia-operations) [2019-07-17T10:30:55Z] <godog> start rolling reboot of ms-be eqiad hosts - T225713

fgiunchedi closed subtask T227667: ms-be2022 misbehaving / error on boot as Resolved. Jul 17 2019, 3:37 PM

Mentioned in SAL (#wikimedia-operations) [2019-07-18T09:09:08Z] <godog> resume swift ms-be rolling restarts - T225713

fgiunchedi updated the task description. Jul 18 2019, 10:26 AM

Mentioned in SAL (#wikimedia-operations) [2019-07-18T10:29:12Z] <godog> reboot wezen.codfw.wmnet - T225713

fgiunchedi updated the task description. Jul 18 2019, 10:33 AM

In T225713#5335975, @jcrespo wrote:

FYI, after applying the above change, I expected a huge shift on reported load (even if performance didn't change) or on temperatures, given this (wikireplicas on labs) are our busiest databases on cpu resources due to long-running queries, however, I didn't see much difference, unlike other reporters, except on the temperatures of labsdb1011, none on the load or the temperatures of the others. Maybe CPU was already a problem in scaling for database load or something else? https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=1563087571551&to=1563260371552&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&var-instance=labsdb1009&var-instance=labsdb1010&var-instance=labsdb1011

It is indeed possible that the CPU was already heavily utilized. I can see a small decrease in system CPU %, but other than that things seem unchanged. I'm curious to see what powersave will do at the next reboot!

2019-07-19-113720_514x283_scrot.png (283×514 px, 30 KB)

JHedden subscribed. Jul 25 2019, 1:32 PM

Mentioned in SAL (#wikimedia-operations) [2019-07-26T13:41:43Z] <jeh> updated labstore100[67].wikimedia.org performance scaling_governor T225713

JHedden updated the task description. Jul 26 2019, 1:42 PM

In T225713#5335975, @jcrespo wrote:

FYI, after applying the above change, I expected a huge shift on reported load (even if performance didn't change) or on temperatures, given this (wikireplicas on labs) are our busiest databases on cpu resources due to long-running queries, however, I didn't see much difference, unlike other reporters, except on the temperatures of labsdb1011, none on the load or the temperatures of the others. Maybe CPU was already a problem in scaling for database load or something else? https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=1563087571551&to=1563260371552&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&var-instance=labsdb1009&var-instance=labsdb1010&var-instance=labsdb1011

I did the same test during the offsite in Dublin with labsdb1009, and also didn't see any major changes.

elastic[1032-1052].eqiad.wmnet,elastic[2025-2036].codfw.wmnet have been configured with set /system1/oemhp_power1 oemhp_powerreg=os. This will take effect after next rolling restart.

fgiunchedi moved this task from Doing to Radar on the User-fgiunchedi board. Aug 13 2019, 1:16 PM

Gehel updated the task description. Aug 15 2019, 2:18 PM

Gehel updated the task description. Aug 15 2019, 7:30 PM

hashar mentioned this in T188375: castor rsync's taking 3-5 minutes for mwgate-npm jobs. Sep 11 2019, 6:20 PM

+ cloud-services-team for the hosts: cloudvirt[1001-1009,1012-1013,1019-1020].eqiad.wmnet

cloudvirt1014 has already been updated and cloudvirt1013 has the same CPU. From T223971, testing with a busy loop: time $(i=1; while (( i < 2000000 )); do (( i ++ )); done):

Host	mw2139	cloudvirt1006	cloudvirt1013	cloudvirt1014 (updated)
CPU	`E5-2450`	`E5-2697 v2`	`E5-2697 v3`	`E5-2697 v3`
Speed	2.10 GHz	2.70 GHz	2.60 GHz	2.60 GHz
Turbo	2.9 GHz	3.5 GHz	3.6 GHz	3.6 GHz
Time	10s	17s	12s	7s

If this audit's findings hold true, cloudvirt1013 should become faster once the change is applied there, and the setting should then be applied to all the other affected cloudvirt hosts.
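To put a number on the cloudvirt1013 vs cloudvirt1014 comparison (same CPU model, only one of them updated), here is a small Python sketch of the arithmetic. The times are copied from the table above; the helper names are hypothetical, and a single busy-loop run per host is of course only indicative.

```python
# Busy-loop wall-clock times in seconds, copied from the table above.
times = {
    "mw2139": 10,
    "cloudvirt1006": 17,
    "cloudvirt1013": 12,
    "cloudvirt1014": 7,
}


def speedup(before_s: float, after_s: float) -> float:
    """How many times faster the 'after' run is than the 'before' run."""
    return before_s / after_s


def reduction_pct(before_s: float, after_s: float) -> float:
    """Percentage of wall-clock time saved going from 'before' to 'after'."""
    return 100.0 * (before_s - after_s) / before_s


s = speedup(times["cloudvirt1013"], times["cloudvirt1014"])
r = reduction_pct(times["cloudvirt1013"], times["cloudvirt1014"])
print(f"{s:.2f}x faster, {r:.0f}% less wall-clock time")
# prints: 1.71x faster, 42% less wall-clock time
```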

Andrew edited projects, added cloud-services-team (Kanban); removed cloud-services-team. Sep 27 2019, 8:25 PM

bd808 moved this task from Inbox to Watching on the cloud-services-team (Kanban) board. Oct 7 2019, 11:13 PM

Jdforrester-WMF subscribed. Oct 9 2019, 5:40 PM

@Andrew @bd808 @aborrero can you look at updating the BIOS setting for some of the affected cloudvirt hosts?

Based on the benchmark in my previous comment, that seems to dramatically improve the CPU performance (12 seconds down to 7 seconds?!)

MoritzMuehlenhoff mentioned this in T240177: backup2001 crashed 2019-12-08. Dec 11 2019, 3:03 PM

hashar mentioned this in T249726: operations-puppet-tests-buster-docker times out after 5 minutes. Apr 8 2020, 3:18 PM

hashar mentioned this in T249727: Migrate integration-agent-puppet-docker-1001 to a different cloudvirt machine. Apr 8 2020, 3:26 PM

I'm not actively working on this; it is up to the respective service/hw owners to assess the need for and feasibility of this change.

hashar unsubscribed. May 2 2020, 8:56 PM

tungsten has been decom'ed today; removed from the list.

JMeybohm subscribed. Sep 18 2020, 7:58 AM

fgiunchedi removed a project: User-fgiunchedi. Jan 25 2021, 3:28 PM

fgiunchedi updated the task description.

I was doing comparative benchmarks of eqiad and codfw. @ori suggested that I look at CPU scaling as a possible reason for the discrepancy. The performance impact of setting scaling_governor to performance is indeed significant. The median service time for a page view served by mw2377 dropped from 177ms to 126ms, measured by ab with a concurrency of 1.

I looked at power usage with ipmi-oem dell get-instantaneous-power-consumption-data, every 10 seconds while 1000 requests were served. After an initial ramp up, power usage averaged 97W in powersave mode, and 127W in performance mode. The time to serve 1000 requests was 178s in powersave mode and 127s in performance mode.

The idle power consumption was similar.
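Multiplying the figures above gives a rough energy cost for the 1000-request run. The Python sketch below uses only the averages quoted in this comment; it treats each average as constant over the whole run and ignores the initial ramp-up, so it is an approximation rather than a measurement.

```python
def energy_joules(avg_power_w: float, duration_s: float) -> float:
    """Energy = average power x time (1 W for 1 s is 1 J)."""
    return avg_power_w * duration_s


REQUESTS = 1000

# Averages quoted above: (watts, seconds to serve 1000 requests).
powersave_j = energy_joules(97, 178)
performance_j = energy_joules(127, 127)

saving_pct = 100.0 * (powersave_j - performance_j) / powersave_j

print(f"powersave:   {powersave_j} J total, {powersave_j / REQUESTS:.2f} J/request")
print(f"performance: {performance_j} J total, {performance_j / REQUESTS:.2f} J/request")
print(f"performance mode uses about {saving_pct:.1f}% less energy for the same work")
```

Under those assumptions the run costs 17266 J in powersave mode and 16129 J in performance mode: the higher power draw is more than offset by the shorter run time.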

powersave is also used in eqiad. It seems to me that for any kind of bursty latency-sensitive workload, powersave is an inappropriate choice.

Krinkle added a project: Performance-Team (Radar). Aug 10 2022, 12:36 AM

Krinkle moved this task from Limbo to Perf recommendation on the Performance-Team (Radar) board.

In T225713#8130251, @tstarling wrote:

The performance impact of setting scaling_governor to performance is indeed significant. The median service time for a page view served by mw2377 dropped from 177ms to 126ms, measured by ab with a concurrency of 1.

That's very interesting! The effect that you're seeing with ab with a concurrency of 1 on a depooled appserver could be the result of the hardware boosting individual cores to a turbo frequency that exceeds the level the processor could sustain on multiple cores or over a prolonged period of time, so it may not be very predictive of the kind of gains we'd see in production, but definitely worth investigating further.

Can we segment performance metrics by appserver? The simplest thing might be to turn this on on 5% of machines and see what it does to power and performance.

Krinkle added a parent task: T302623: FY2022-2023: Improve Backend Pageview Timing. Aug 17 2022, 12:26 AM

tstarling closed subtask T315398: Set MW appserver scaling_governor to performance as Resolved. Aug 30 2022, 2:39 AM

jbond added a project: Puppet. Nov 4 2022, 1:31 PM

Restricted Application added a project: Infrastructure-Foundations. Nov 4 2022, 1:31 PM

fnegri edited projects, added cloud-services-team; removed cloud-services-team (Kanban). Jan 18 2023, 6:41 PM

fnegri moved this task from Kanban to Watching on the cloud-services-team board.

bking mentioned this in T336443: Investigate performance differences between wdqs2022 and older hosts. May 10 2023, 9:39 PM

joanna_borun removed a project: Puppet. Jun 12 2023, 2:55 PM

The hardware in this task has been replaced, so I'm closing the task. I've opened T338944 for a more generic followup.

bking mentioned this in T340554: Determine whether or not to change CPU frequency governor on Search Platform-owned hosts. Jun 27 2023, 3:06 PM

bking mentioned this in T362922: Audit/consider enabling CPU performance governor on DPE SRE-owned hosts. Apr 18 2024, 6:30 PM

CPU scaling governor audit
Closed, Declined · Public

Description

The problem

The fix

performance vs powersave

Audit

Dell

ondemand

No governor

performance

HP

powersave

No governor

performance

ondemand

Details

Related Objects

Event Timeline

	F29812006: 2019-07-19-113720_514x283_scrot.png
	Jul 19 2019, 9:39 AM

	F29681442: 2019-07-05-103059_1194x333_scrot.png
	Jul 5 2019, 8:33 AM

	F29681441: 2019-07-05-103109_598x354_scrot.png
	Jul 5 2019, 8:33 AM

	fgiunchedi
	Jun 13 2019, 11:01 AM
