Page MenuHomePhabricator

Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade
Closed, ResolvedPublic

Description

It appears that appservers, especially API servers, have their performance negatively affected by upgrading from stretch to buster.

This ticket is to investigate why that is.

some graph screenshots here: https://wikitech.wikimedia.org/wiki/Appserver-buster-upgrade-2021#Is_it_actually_getting_slower?

numbers provided by joe:

Taking mw1403 (stretch) vs mw1405 - appservers. Values in milliseconds. Average values over one day as measured by envoy:
              p50    p75  p90 p99
stretch   144   218   383 1879
buster   149    224  415  1996
 diff  %  +3.5  +2.7  +8.3  +6

Now the same for API servers, specifically mw1404 (buster) vs mw1402 (stretch)
              p50    p75  p90 p99
stretch   73       95   207  1028
buster   81       129  239 1533
 diff  %  +11     +36  +11  +49

spreadsheet with data which server is buster and which is stretch, when they have been purchased, what the reimage progress is:

https://docs.google.com/spreadsheets/d/1Ris18-joRFfd3OHjGJIraVUk-bpmIRORsPoms9D7BcM/edit?usp=sharing

Related Objects

Event Timeline

jijiki renamed this task from performance degradation on appservers after buster upgrade? to Investigate possible performance degradation mediawiki servers after Debian Buster upgrade.Jan 29 2021, 8:00 PM
jijiki renamed this task from Investigate possible performance degradation mediawiki servers after Debian Buster upgrade to Investigate possible performance degradation on mediawiki servers after Debian Buster upgrade.
This comment was removed by jijiki.

Here are overall, whole-cluster mean latency milliseconds, summed across all machines and then broken down by kernel version.

appservers: https://w.wiki/war
apiservers: https://w.wiki/wa$

4.19.0-13-amd64, the kernel version corresponding to Buster, does indeed seem to be reporting higher mean latency than the others.

(This could be prettier/easier to read with https://gerrit.wikimedia.org/r/c/operations/puppet/+/659991 in place.)

Something I haven't looked at yet, but which someone definitely should, is if the machines moved to Buster predominantly correspond to older/slower hardware.

Something I haven't looked at yet, but which someone definitely should, is if the machines moved to Buster predominantly correspond to older/slower hardware.

All mw14xx servers were purchased last year.

From #wikimedia-perf:

04:45:33 <legoktm> how hard would it be to split the flamegraphs by stretch vs buster (on a temporary basis)?
04:45:43 <legoktm> the excimer flamegraphs (https://performance.wikimedia.org/php-profiling/)
08:56:24 <+Krinkle> legoktm: not very, we've done it before. Why?
08:58:29 <+Krinkle> We could emit from wmf config with extra virtual frame, and generate extra graphs from it like for some other factors we do already.
09:22:13 <+Krinkle> excimer + excimer-stretch  + excimer-buster, as channels
09:22:35 <+Krinkle> Would be simplest probably especially with new headroom we have so don't have to squeeze it in a single log file
...
13:11:20 <+Krinkle> legoktm: ok, I understand now :)
14:24:28 <legoktm> Krinkle: yep, I was going to do a bit of CLI profling with parse.php but I think the excimer flamegraphs would make it pretty obvious where the difference
14:25:32 <+Krinkle> legoktm: do we have an mwdebug server of each flavour?
14:25:41 <+Krinkle> could do some xhgui profiles ad-hoc
14:25:43 <legoktm> yeah, let me check
14:25:46 <+Krinkle> might be quicker if it's obvious
14:26:00 <+Krinkle> otherwise yeah we can set up some new profiling but probably not by Tue/Wed next week
14:26:04 <legoktm> mwdebug1003.eqiad.mwnet is buster
14:26:10 <legoktm> 1001 and 1002 are still stretch
14:26:28 <+Krinkle> couple of warm ups first and then profile with xhgui wikimeidadebug :)

Something I haven't looked at yet, but which someone definitely should, is if the machines moved to Buster predominantly correspond to older/slower hardware.

I added the purchase date to our spreadsheet and for example:

mw2291 through mw2376 are all from the same purchase date / hardware order, 2019-12-17.

23 of them are on stretch, 73 are on buster.

Will add more detailed data.

I started by profiling api.php?action=parse&page=Australia:

I was kind of surprised how much overhead ApiResult has, I'll look into that later. Also I realized we've been sitting on a potential perf improvement in utfnormal for a year now, filed T273338: Release utfnormal 3.0.0 for that.

Then I switched to index.php?action=render&title=Australia which should be the same without the ApiResult interference

Not sure what to make of these yet, though it did surprise me that FormatJson::parse showed up in the tree, because that's intended for user input.

I think the excimer flamegraphs will definitely be useful.

Something I haven't looked at yet, but which someone definitely should, is if the machines moved to Buster predominantly correspond to older/slower hardware.

I still have to merge purchase dates into groups of the same hardware type and am somehow off by 4 servers but this for now.

Here is the data which server is on what and which groups have been done to what percentage. For anyone to view:

https://docs.google.com/spreadsheets/d/1Ris18-joRFfd3OHjGJIraVUk-bpmIRORsPoms9D7BcM/edit?usp=sharing

A couple thoughts:

  • There is a patch up for splitting latency data by endpoint, that should help narrowing down where the problem is, but at the same time...
  • It's possible that there is a systematic effect (thus, adding a constant distribution of additional latency on buster), and that would be more evident on endpoints with lower average latencies like API

Change 660863 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[operations/puppet@production] arclamp: Add excimer-{stretch,buster} pipelines

https://gerrit.wikimedia.org/r/660863

Change 660865 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[operations/mediawiki-config@master] profiler: Send data to excimer-{stretch,buster} pipelines

https://gerrit.wikimedia.org/r/660865

Here's the list of most of the stuff that PHP links to that differs: curl, openssl, zlib, gd, freetype, libpng, geoip, iconv, libsodium, libxslt.

km@cashew ~/tmp [1]> diff stretch buster -u
--- stretch	2021-02-01 09:58:43.780917745 -0800
+++ buster	2021-02-01 09:59:21.153210661 -0800
@@ -1,8 +1,8 @@
 phpinfo()
-PHP Version => 7.2.31-1+0~20200514.41+debian9~1.gbpe2a56b+wmf1+icu63
+PHP Version => 7.2.31-1+0~20200514.41+debian9~1.gbpe2a56b+wmf1+buster1
 
-System => Linux mwdebug1001 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1+deb9u1 (2020-06-07) x86_64
-Build Date => Oct 12 2020 09:43:06
+System => Linux mwdebug1003 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64
+Build Date => Oct 27 2020 12:53:01
 Server API => Command Line Interface
 Virtual Directory Support => disabled
 Configuration File (php.ini) Path => /etc/php/7.2/cli
@@ -129,7 +129,7 @@
 
 Core
 
-PHP Version => 7.2.31-1+0~20200514.41+debian9~1.gbpe2a56b+wmf1+icu63
+PHP Version => 7.2.31-1+0~20200514.41+debian9~1.gbpe2a56b+wmf1+buster1
 
 Directive => Local Value => Master Value
 allow_url_fopen => On => On
@@ -226,8 +226,8 @@
 curl
 
 cURL support => enabled
-cURL Information => 7.52.1
-Age => 3
+cURL Information => 7.64.0
+Age => 4
 Features
 AsynchDNS => Yes
 CharConv => No
@@ -251,9 +251,9 @@
 PSL => Yes
 Protocols => dict, file, ftp, ftps, gopher, http, https, imap, imaps, ldap, ldaps, pop3, pop3s, rtmp, rtsp, scp, sftp, smb, smbs, smtp, smtps, telnet, tftp
 Host => x86_64-pc-linux-gnu
-SSL Version => OpenSSL/1.0.2u
-ZLib Version => 1.2.8
-libSSH Version => libssh2/1.7.0
+SSL Version => OpenSSL/1.1.1d
+ZLib Version => 1.2.11
+libSSH Version => libssh2/1.8.0
 
 date
 
@@ -298,7 +298,7 @@
 exif
 
 EXIF Support => enabled
-EXIF Version => 7.2.31-1+0~20200514.41+debian9~1.gbpe2a56b+wmf1+icu63
+EXIF Version => 7.2.31-1+0~20200514.41+debian9~1.gbpe2a56b+wmf1+buster1
 Supported EXIF Version => 0220
 Supported filetypes => JPEG, TIFF
 Multibyte decoding support using mbstring => enabled
@@ -335,17 +335,17 @@
 gd
 
 GD Support => enabled
-GD headers Version => 2.2.4
-GD library Version => 2.2.4
+GD headers Version => 2.2.5
+GD library Version => 2.2.5
 FreeType Support => enabled
 FreeType Linkage => with freetype
-FreeType Version => 2.6.3
+FreeType Version => 2.9.1
 GIF Read Support => enabled
 GIF Create Support => enabled
 JPEG Support => enabled
 libJPEG Version => 6b
 PNG Support => enabled
-libPNG Version => 1.6.28
+libPNG Version => 1.6.36
 WBMP Support => enabled
 XPM Support => enabled
 libXpm Version => 30411
@@ -359,7 +359,7 @@
 
 geoip support => enabled
 geoip extension version => 1.1.1
-geoip library version => 1006009
+geoip library version => 1006012
 
 Directive => Local Value => Master Value
 geoip.custom_directory => no value => no value
@@ -385,7 +385,7 @@
 
 iconv support => enabled
 iconv implementation => glibc
-iconv library version => 2.24
+iconv library version => 2.28
 
 Directive => Local Value => Master Value
 iconv.input_encoding => no value => no value
@@ -715,8 +715,8 @@
 openssl
 
 OpenSSL support => enabled
-OpenSSL Library Version => OpenSSL 1.1.0l  10 Sep 2019
-OpenSSL Header Version => OpenSSL 1.1.0l  10 Sep 2019
+OpenSSL Library Version => OpenSSL 1.1.1d  10 Sep 2019
+OpenSSL Header Version => OpenSSL 1.1.1d  10 Sep 2019
 Openssl default config => /usr/lib/ssl/openssl.cnf
 
 Directive => Local Value => Master Value
@@ -873,8 +873,8 @@
 sodium
 
 sodium support => enabled
-libsodium headers version => 1.0.11
-libsodium library version => 1.0.11
+libsodium headers version => 1.0.17
+libsodium library version => 1.0.17
 
 SPL
 
@@ -960,10 +960,10 @@
 xsl
 
 XSL => enabled
-libxslt Version => 1.1.29
+libxslt Version => 1.1.32
 libxslt compiled against libxml Version => 2.9.4
 EXSLT => enabled
-libexslt Version => 1.1.29
+libexslt Version => 1.1.32
 
 Zend OPcache
 
@@ -1013,8 +1013,8 @@
 ZLib Support => enabled
 Stream Wrapper => compress.zlib://
 Stream Filter => zlib.inflate, zlib.deflate
-Compiled Version => 1.2.8
-Linked Version => 1.2.8
+Compiled Version => 1.2.11
+Linked Version => 1.2.11
 
 Directive => Local Value => Master Value
 zlib.output_compression => Off => Off

I started by profiling api.php?action=parse&page=Australia:

After looking at Joe's scripts, turns out I was doing this wrong. The current request to profile was api.php?action=parse&text={{:Australia}}.

The profiles are basically identical.

Change 660863 merged by Legoktm:
[operations/puppet@production] arclamp: Add excimer-buster pipeline

https://gerrit.wikimedia.org/r/660863

Change 660865 merged by jenkins-bot:
[operations/mediawiki-config@master] profiler: Send data to excimer-buster pipeline

https://gerrit.wikimedia.org/r/660865

Mentioned in SAL (#wikimedia-operations) [2021-02-01T23:54:45Z] <legoktm@deploy1001> Synchronized wmf-config/profiler.php: profiler: Send data to excimer-buster pipeline (T273312) (duration: 00m 57s)

@Legoktm ran some tests, analogous to the one we ran when evaluating php-fpm on hosts with buster and stretch, the results are here:

https://people.wikimedia.org/~legoktm/T273312/data/clean/images/

my inital takeaways are:

  • Stretch hosts consistently outperform buster ones, but the effect is much more severe on older hardware (2016 hosts, that we will likely replace this year) than on recent one. Actually on the latest hardware the difference is not as clear-cut in some cases.
  • The penalty is much more clear-cut for the load.php calls than for anything else. There the penalty is almost 40% on older hardware, and similar on newer one

Both these facts make me suspect that the thing at play here could be better mitigations for various CPU vulnerabilities in buster's kernel than in stretch, causing a higher penalty for syscall-heavy code like load.php.

I propose we re-run a few of the above benchmarks after disabling CPU mitigations on one stretch and one buster host.

One thing I feel we can exclude at this point is definitely a difference in curl versions.

  • The penalty is much more clear-cut for the load.php calls than for anything else. There the penalty is almost 40% on older hardware, and similar on newer one

Comparing https://performance.wikimedia.org/arclamp/svgs/hourly/2021-02-02_06.excimer.load.svgz and https://performance.wikimedia.org/arclamp/svgs/hourly/2021-02-02_06.excimer-buster.load.svgz quickly the main thing that stands out to me is that MediaWikiServices::getResourceLoader() goes from ~7% on stretch to ~15% on buster.

All architectual mitigations are applied to Stretch as they are to Buster, but one notable exception is retpoline annotations, most of those don't get backported to older LTS kernel branches. OTOH there are also some performance optimisations for the CPU mitigations, which didn't get backported, it's hard to tell.

I propose we re-run a few of the above benchmarks after disabling CPU mitigations on one stretch and one buster host.

This disables all CPU mitigations via the kernel config:

"noibrs noibpb nopti nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off tsx=on tsx_async_abort=off mitigations=off"

One thing I feel we can exclude at this point is definitely a difference in curl versions.

Or alternatively let's install the 4.19 kernels from backports on a Stretch host. The differences between 4.19/stretch and 4.19/buster are just the compiler in use and the performance impact of that should be hardly measurable.

In fact, we still have an API server running 4.19 (mw1381, back from debugging the cgroups kernel mem leak), so that one can be used for comparison?

Mentioned in SAL (#wikimedia-operations) [2021-02-02T08:02:27Z] <legoktm> depooled mw1381.eqiad.wmnet for perf testing (T273312)

In fact, we still have an API server running 4.19 (mw1381, back from debugging the cgroups kernel mem leak), so that one can be used for comparison?

Thanks for the pointer. I benchmarked it and it seems like its the 4.19 kernel that's at fault for the slowdown (a few outliers, like the main_page request):

Images in https://people.wikimedia.org/~legoktm/T273312/data/clean/images/ see the stretch19_ ones.

This kind-of seals the deal. Upgrading the kernel and including the mitigations for CPU vulnerabilities is non-negotiable, and we'll have to take the hit. It might be interesting to do a further, detailed analysis of the effects by looking at excimer-produced flamegraphs to find out which code might need some specific tuning.

Maybe mw devs (including yours truly) can spend a bit a time to do some improvements on performance (specially reducing syscalls of load.php maybe?) to offset the performance hit? Obviously won't be a blocker.

Maybe mw devs (including yours truly) can spend a bit a time to do some improvements on performance (specially reducing syscalls of load.php maybe?) to offset the performance hit? Obviously won't be a blocker.

That was the idea that I was floating in my previous comment. Excimer flamegraphs should be able to hint us at what specific functions have degraded performance with the retpoline mitigations, and we could work on optimizing those.

  • The penalty is much more clear-cut for the load.php calls than for anything else. There the penalty is almost 40% on older hardware, and similar on newer one

Comparing https://performance.wikimedia.org/arclamp/svgs/hourly/2021-02-02_06.excimer.load.svgz and https://performance.wikimedia.org/arclamp/svgs/hourly/2021-02-02_06.excimer-buster.load.svgz quickly the main thing that stands out to me is that MediaWikiServices::getResourceLoader() goes from ~7% on stretch to ~15% on buster.

There seem to be differences in the two flamegraphs. In particular, for MediaWikiGadgetsDefinitionRepo::loadGadgets I'm seeing Memcached and WANCache queries on the stretch flamegraph that are completely absent on buster (which seems to retrieve data from APCU only). I think we might need a bit more samples on buster to account for cache misses etc.

Comparing https://performance.wikimedia.org/arclamp/svgs/hourly/2021-02-02_06.excimer.load.svgz and https://performance.wikimedia.org/arclamp/svgs/hourly/2021-02-02_06.excimer-buster.load.svgz quickly the main thing that stands out to me is that MediaWikiServices::getResourceLoader() goes from ~7% on stretch to ~15% on buster.

There seem to be differences in the two flamegraphs. In particular, for MediaWikiGadgetsDefinitionRepo::loadGadgets I'm seeing Memcached and WANCache queries on the stretch flamegraph that are completely absent on buster (which seems to retrieve data from APCU only). I think we might need a bit more samples on buster to account for cache misses etc.

Yeah, the daily SVGs should have better data and will be available shortly after midnight UTC. I have some old scripts for parsing the logs that I'll look into reviving.

This kind-of seals the deal. Upgrading the kernel and including the mitigations for CPU vulnerabilities is non-negotiable, and we'll have to take the hit. It might be interesting to do a further, detailed analysis of the effects by looking at excimer-produced flamegraphs to find out which code might need some specific tuning.

Not sure if it's on the table, but assuming there are performance improvements across the board coming in PHP 7.3/7.4/8.0 we might want to at least benchmark it and consider re-prioritizing that migration.

Not sure if it's on the table, but assuming there are performance improvements across the board coming in PHP 7.3/7.4/8.0 we might want to at least benchmark it and consider re-prioritizing that migration.

We need to finish the transition to buster first, but yes, we might want to consider switching to php 7.3 earlier than scheduled (which was after we transitioned to kubernetes).

I think that we should proceed as follows:

  • Try to extract some useful information from flamegraphs to pass on to the MediaWiki developers.
  • Once we've done that, finish the reimage
  • Ramp up our work on having a functional helm chart, targeting php 7.3 (buster's default) directly
  • Start testing php 7.3 in puppet

I guess we will need to re-evaluate the work we have ahead of us for this quarter.

I generated some differential flame graphs that show the comparison of stretch vs buster: https://people.wikimedia.org/~legoktm/T273312/data/clean/images/flamegraphs/

Based on that might be worth relooking at AutoLoader performance again. When I added the PSR-4 implementation back in 2017 I used https://github.com/legoktm/autoload-bench to test various optimizations/strategies/shortcuts.

In load.php, a lot of the regression seems to be in getVersionHash, which eventually hits syscalls via filemtime and on cache miss, file_get_contents. Also in the JS minifier...

Legoktm claimed this task.

Conclusion: Upgrading to Debian Buster is going to come with a performance regression.

Comparing the envoy telemetry for the past day on two appservers:

p50p75p95p99
stretch (mw1403)142ms216ms364ms1.743s
buster (mw1405)151ms225ms415ms1.947s
increase6%4%14%12%

Looking at two API servers:

p50p75p95p99
stretch (mw1402)73ms96ms231ms1.274s
buster (mw1404)80ms117ms263ms1.696s
increase10%22%14%33%

After benchmarking a stretch server running the basically same 4.19 kernel that buster uses, we came to the conclusion this is caused by the additional CPU vulnerability security measures, specifically the retpoline stuff that wasn't in the normal stretch kernels (see T273312#6795042).

(full benchmarking results and data can be found in https://people.wikimedia.org/~legoktm/T273312/data/clean/. we'll move the scripts to a git repository soon)

We also created a separate buster-only pipeline for the excimer sampling profiling and created some differential flame graphs that identify where there are regressions compared to the overall pipeline that's 90% stretch.

These mostly confirmed that the slowdowns are in functions that invoke syscalls, usually by stat-ing and reading files. Interactive versions can be found at https://people.wikimedia.org/~legoktm/T273312/data/clean/images/flamegraphs/

Thank you very much for the flamegraphs and full data, awesome. Also thanks a lot to everybody else who contributed to resolving this ticket.