
Set MW appserver scaling_governor to performance
Closed, ResolvedPublic

Assigned To
Authored By
tstarling
Aug 17 2022, 12:44 AM
Referenced Files
F35496065: scaling_governor rollout.png
Aug 30 2022, 2:39 AM
F35484662: wtp performance.png
Aug 23 2022, 4:20 AM
F35484606: balance_performance.png
Aug 23 2022, 2:01 AM
F35470930: T315398-3.png
Aug 18 2022, 5:36 AM
F35470912: T315398-2.png
Aug 18 2022, 5:10 AM
F35470849: T315398-1.png
Aug 18 2022, 2:53 AM

Description

My experiment on a single depooled server indicated that setting scaling_governor to performance on MW appservers might yield a significant end-user latency benefit.

I propose making this change on all eqiad appservers in soft state, with cumin. Our latency metrics are noisy so changing it everywhere at once will give us the best chance of measuring a benefit.

We don't collect per-server power usage for these servers, but we can look at power usage at higher levels of aggregation. We can use the power usage data to calculate the cost of this change.

If a cost/benefit analysis supports the change, we can puppetize it.
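For reference, a minimal sketch of what the proposed soft-state change might look like (using the A:mw-eqiad host alias that appears below; a sysfs write like this does not survive a reboot):

$ sudo cumin 'A:mw-eqiad' 'echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'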

Event Timeline

> I propose making this change on all eqiad appservers in soft state, with cumin. Our latency metrics are noisy so changing it everywhere at once will give us the best chance of measuring a benefit.

Changing the voltage, frequency, power consumption, and thermal production of all application server CPUs all at once is needlessly risky. I think this has to be done incrementally, for safety reasons.

After it is proven to be safe, perhaps then you can toggle it on and off for the whole fleet, to measure the impact on latency. But I am skeptical that there could be a significant effect that you could measure on the entire fleet that you couldn't also measure on, say, half of the fleet, with the other half serving as the control.

There are many interesting performance experiments that could be run on the application servers, so putting in some additional work to make running this kind of experiment easier would pay big dividends. I'll file a separate task with some additional thoughts on this.

Based on https://www.kernel.org/doc/html/v5.6/admin-guide/pm/intel_pstate.html#operation-modes the scaling behavior will differ depending on whether hardware-managed P-states (HWP) support is available and enabled. It looks like it is not available on 56 out of 265 app servers: P32411.
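For anyone re-running that survey, one quick check is the hwp flag in /proc/cpuinfo; a sketch along the lines of the cumin query below (grep -w avoids matching the related hwp_epp and hwp_notify flags):

$ sudo cumin 'A:mw-eqiad' 'grep -qw hwp /proc/cpuinfo && echo HWP || echo no-HWP'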

cpuinfo on eqiad appservers:

$ sudo cumin 'A:mw-eqiad' 'grep '\''model name'\'' /proc/cpuinfo | head -n1'

===== NODE GROUP =====                                                                                                                                                    
(2) mwdebug[1001-1002].eqiad.wmnet                                                                                                                                        
----- OUTPUT of 'grep 'model name...uinfo | head -n1' -----                                                                                                               
model name      : Intel Xeon E3-12xx v2 (Ivy Bridge)                                                                                                                      
===== NODE GROUP =====                                                                                                                                                    
(18) mw[1349-1355,1364-1373,1384].eqiad.wmnet                                                                                                                             
----- OUTPUT of 'grep 'model name...uinfo | head -n1' -----                                                                                                               
model name      : Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz                                                                                                              
===== NODE GROUP =====                                                                                                                                                    
(15) mw[1319-1333].eqiad.wmnet                                                                                                                                            
----- OUTPUT of 'grep 'model name...uinfo | head -n1' -----                                                                                                               
model name      : Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz                                                                                                               
===== NODE GROUP =====                                                                                                                                                    
(37) mw[1385,1387,1389,1391,1393,1395,1397,1399,1401,1403,1405,1407,1409,1411,1413-1420,1429-1436,1441-1442,1451-1454,1456].eqiad.wmnet                                   
----- OUTPUT of 'grep 'model name...uinfo | head -n1' -----                                                                                                               
model name      : Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz                                                                                                              
===== NODE GROUP =====                                                                                                                                                    
(1) mw1455.eqiad.wmnet                                                                                                                                                    
----- OUTPUT of 'grep 'model name...uinfo | head -n1' -----                                                                                                               
Warning: Permanently added the ECDSA host key for IP address '2620:0:861:101:10:64:0:62' to the list of known hosts.                                                      
model name      : Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz                                                                                                              
================

I chose the following groups of hosts:

  • Intervention group: mw1411, mw1413, mw1419, mw1429, mw1431, mw1433
  • Control group: mw1414, mw1416, mw1418, mw1430, mw1432, mw1436

I excluded mw1415 since it was an outlier for power usage, averaging 229W while the others were using around 165W.

I excluded mw1417 and mw1434 since they were outliers on latency, averaging 237ms while the others were around 220ms.

Mentioned in SAL (#wikimedia-operations) [2022-08-18T02:15:56Z] <TimStarling> on mw1411, mw1413, mw1419, mw1429, mw1431, mw1433: set scaling_governor to performance T315398

We're seeing no effect.

T315398-1.png (648×774 px, 93 KB)

From custom dashboard

Note that this is the same model of processor that I used for the codfw benchmarks, where I saw a ~30% drop in latency.

I didn't apply the policy correctly. I only managed to set it on cpu0. On the codfw benchmark I did it correctly.
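For context, the pitfall is that each logical CPU exposes its own policy in sysfs, so writing only to cpu0's scaling_governor leaves the other cores on powersave. A sketch of a correct per-CPU application on one host:

# Write the governor to every CPU's policy, not just cpu0:
$ for f in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
>   echo performance | sudo tee "$f" >/dev/null
> done

(cpupower frequency-set -g performance from the linux-cpupower package should be equivalent, applying to all CPUs by default.)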

Mentioned in SAL (#wikimedia-operations) [2022-08-18T04:30:02Z] <TimStarling> on mw1411, mw1413, mw1419, mw1429, mw1431, mw1433: set scaling_governor to performance, attempt 2, T315398

That did the trick. I applied a 20-minute moving average, so the ramp rate seen here is not real.

T315398-2.png (597×773 px, 50 KB)

Mean latency fell from 220ms to 185ms, a 16% drop. Power consumption rose from 959W to 1040W, an 8% rise.

The amount of time cores spend with a clock speed over 2GHz increased from 31% to 87%. That seems excessive given that CPU utilization is only ~16%. But it's hard to argue with those latency numbers.
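For anyone wanting to spot-check frequency residency against utilization on a single host, one option might be turbostat (shipped in Debian's linux-cpupower package alongside x86_energy_perf_policy):

# One system-wide summary line every 5 seconds; Busy% is utilization,
# Bzy_MHz the average clock while busy.
$ sudo turbostat --Summary --interval 5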

T315398-3.png (609×778 px, 76 KB)

Congratulations, this is a huge win! I think we should dig deeper to see if we can get the same or similar performance benefit, but waste less power.

The intel_pstate docs state that in HWP+performance mode, "the range of P-states available [...] is always restricted to the upper boundary" (ref). I checked this on a depooled app server in codfw (mw2307) and found that with 'powersave' the frequency scaled all the way down to 1GHz, whereas with 'performance' it never dropped below 2.5GHz.
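A minimal way to watch the frequency floor on a host (scaling_cur_freq is in kHz, so 1000000 = 1GHz):

$ watch -n1 'sort -n /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq | head -n1'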

On CPUs that feature EPP (Energy-Performance Preference), the performance governor sets EPP to 0, and blocks any attempts to change it via the sysfs interface. With 'powersave' EPP defaults to 128.

Interestingly, it is possible to set EPP to 0 with the 'powersave' governor, and the result is not the same as 'performance'. With 'powersave' and EPP=0 the idle frequency goes all the way down to 1GHz. You can see why by looking at the effect of the governor selection on the processor's frequency-scaling MSRs:

# With performance:
$ x86_energy_perf_policy --cpu 0
cpu0: HWP_REQ: min 32 max 32 des 0 epp 0 window 0x0 (0*10^0us) use_pkg 0

# With powersave:
$ x86_energy_perf_policy --cpu 0
cpu0: HWP_REQ: min 10 max 32 des 0 epp 128 window 0x0 (0*10^0us) use_pkg 0

# powersave + EPP 0 -- note the different min:
$ x86_energy_perf_policy --hwp-epp 0
$ x86_energy_perf_policy --cpu 0
cpu0: HWP_REQ: min 10 max 32 des 0 epp 0 window 0x0 (0*10^0us) use_pkg 0

(x86_energy_perf_policy is part of the linux-cpupower Debian package)

So 'powersave' with EPP=0 gives a broader range of operating frequencies than 'performance'. We should see if in this mode the frequency scaling is still responsive enough for the workload.

We can also try EPP values that fall between 0 and 128. 'powersave' with EPP=64 would be interesting.

AFAICT the sysfs interface (/sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference) can only be used to select one of the four named values (performance/0, balance_performance/128, balance_power/192 and power/255). You can select other values using x86_energy_perf_policy --hwp-epp <0-255>.
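For completeness, the accepted strings can be listed and set via sysfs, e.g.:

$ cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_available_preferences
default performance balance_performance balance_power power
$ echo balance_performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference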

Note that if you switch the governor to performance and then back to powersave, EPP is set to its previous value (i.e. not necessarily the initial value of 128).

Ladsgroup triaged this task as Medium priority. Aug 22 2022, 5:29 AM

(don't mind me, SRE clinic duty)

> AFAICT the sysfs interface (/sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference) can only be used to select one of the four named values (performance/0, balance_performance/128, balance_power/192 and power/255).

I checked the source, and it does look that way, yes.

I will try scaling_governor=powersave with energy_performance_preference=performance and balance_performance on the same set of servers that I previously used. Then I will move on to different processor models. @Joe asked me to try it on the Parsoid servers, which are all Xeon E5-2640. There's plenty to do without getting into fine-grained testing of EPP values.

Something (puppet?) is randomly setting energy_performance_preference back to balance_performance after I set it to performance.

Probably not puppet or anything in userspace. I did a manual puppet run on mw1411, and there was no change. Some time between 01:27 and 01:35, it changed to balance_performance, but the modification timestamp on the sysfs file was still 01:20:11, which is when I manually changed it to performance.
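For anyone debugging something similar, the mtime-versus-contents mismatch is easy to check directly; an unchanged mtime alongside changed contents points at firmware rewriting the MSR rather than anything writing through sysfs:

$ stat -c '%y' /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference
$ cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference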

I will set them manually to balance_performance.

Mentioned in SAL (#wikimedia-operations) [2022-08-23T01:41:19Z] <TimStarling> on mw1411, mw1413, mw1419, mw1429, mw1431, mw1433: set energy_performance_preference to balance_performance T315398

balance_performance is in use in the control group of servers, so it shouldn't be surprising that the metrics are converging.

balance_performance.png (340×763 px, 30 KB)

Mentioned in SAL (#wikimedia-operations) [2022-08-23T03:00:25Z] <TimStarling> on wtp1025,wtp1027,wtp1029,wtp1031,wtp1033,wtp1035: set scaling_governor to performance T315398

Parsoid results

Since there is more variation in the control group this time, I added graphs of the differences between the two groups.

wtp performance.png (1×1 px, 324 KB)

The hosts1 group was 22ms slower on average than hosts2 over the hour before the change. After the change was applied to hosts1, it was 77ms faster than hosts2. So the latency reduction was 99ms, or 9% (implying a pre-change mean of roughly 1.1s).

Power consumption data was noisy, but probably increased by about 8%, from 922W to 994W.

These aggregate numbers came from CSV exports.

Cost/benefit analysis

From the 2021 sustainability report we can derive an emissions intensity of 0.308 kgCO2e/kWh at eqiad and 0.396 kgCO2e/kWh at codfw. Applying a PUE of 1.46 to derive emissions due to server energy usage, and supposing that the power usage of the appserver, api_appserver and parsoid clusters will increase by 8% due to this change, we may estimate additional CO2 emissions of 13,200 kg p.a., or 1.2% of datacenter-related emissions.
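As a rough consistency check, the estimate has the form emissions ≈ ΔP × PUE × hours/year × intensity; inverting the reported figure suggests on the order of 3 kW of additional average server draw. The 3 kW and the blended 0.35 kgCO2e/kWh below are illustrative round numbers, not measurements:

$ echo '3 * 1.46 * 8760 * 0.35' | bc -l    # kW * PUE * h/yr * kgCO2e/kWh
13429.0800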

Performance is expected to improve by more than 8%, and so performance per watt is expected to improve. CPU utilisation will be reduced, informing future purchasing decisions. If we assume that the appserver and API appserver performance is improved by 16%, and the parsoid cluster performance is improved by 9%, then turning off that percentage of servers would give a net reduction in CO2 emissions of 1,300 kg p.a.

Change 826405 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Apply scaling_governor=performance to MediaWiki appservers

https://gerrit.wikimedia.org/r/826405

> So 'powersave' with EPP=0 gives a broader range of operating frequencies than 'performance'. We should see if in this mode the frequency scaling is still responsive enough for the workload.

Tim, this is the experiment I was suggesting — could we please try this one?

Since it's Friday for you, if I don't hear back, I'll set it on the group of servers you used in T315398#8176582.

Actually, let me not step on your toes. But if you can tolerate a short extension of this task, I would very much like to see this setting tested. I think there is a good chance it will give the same or very similar performance increase with less waste of power. Just to be fully explicit, the setting is:

  1. Scaling governor set to 'powersave'
  2. EPP set to 0 via x86_energy_perf_policy --hwp-epp 0

The two settings have to be applied in that order.
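Concretely, a sketch of those two steps on a single host (assuming the linux-cpupower package is installed):

# 1. Governor first (switching governors rewrites EPP, which would clobber step 2):
$ echo powersave | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# 2. Then set EPP to 0 by writing the MSR directly:
$ sudo x86_energy_perf_policy --hwp-epp 0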

> So 'powersave' with EPP=0 gives a broader range of operating frequencies than 'performance'. We should see if in this mode the frequency scaling is still responsive enough for the workload.
>
> Tim, this is the experiment I was suggesting — could we please try this one?

As I said at T315398#8176574, when I used sysfs to set scaling_governor to powersave and energy_performance_preference to performance, something (the BIOS?) kept setting it back to balance_performance (128) every 10 minutes or so. I'm skeptical anyway; I think performance is the right answer. I think the increased power usage is tolerable and well-justified. It's a conventional configuration, commonly used at WMF and elsewhere. My next step here is to roll it out everywhere with puppet; I'm not planning on doing any more experiments. But feel free to try it yourself.

Mentioned in SAL (#wikimedia-operations) [2022-08-28T20:18:56Z] <ori> mw1411, mw1413, mw1419, mw1429, mw1431, mw1433: set energy-performance preference to 0 via 'x86_energy_perf_policy --hwp-epp 0' T315398

I tried setting EPP to 0 using x86_energy_perf_policy, thinking that bypassing the sysfs interface and writing directly to the MSR would make the setting sticky. Unfortunately this does not seem to be the case -- the EPP is gradually reset to 128, same as when you tried changing it via sysfs. At this point I also don't see value in further experimentation with the EPP knob and agree that performance is the way to go.

Change 826405 merged by Tim Starling:

[operations/puppet@production] Apply scaling_governor=performance to MediaWiki servers

https://gerrit.wikimedia.org/r/826405

Change 829040 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] O:mediawiki::common: Exclude VM from cpufrequtils

https://gerrit.wikimedia.org/r/829040

Change 829040 merged by Clément Goubert:

[operations/puppet@production] C:cpufrequtils: Exclude VM from cpufrequtils

https://gerrit.wikimedia.org/r/829040