
Jobrunner on Buster occasional timeout on codfw file upload
Open, Medium, Public

Description

This is a follow-up investigation of T274589: No atomic section is open (got LocalFile::lockingTransaction), focused on the specific "slow PUTs" problem.

The problem: MediaWiki times out on a database transaction, e.g. while writing stashed files to swift. It turns out that files are being written from eqiad to codfw at roughly 2 MB/s, so a ~600MB file can hit the 300s MediaWiki timeout (more context at https://phabricator.wikimedia.org/T274589#6850912).
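As a quick sanity check of those figures (using the byte count and elapsed time from the codfw log line quoted below):

awk 'BEGIN { printf "%.1f MB/s\n", 624059195 / 1e6 / 323.87 }'   # throughput of the slow codfw PUT: ~1.9 MB/s
awk 'BEGIN { printf "%.0f s\n", 600 / 1.9 }'                     # a 600 MB file at that rate: ~316 s, past the 300 s limit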

One example of this problem is the following upload from mw1305 (a jobrunner), which takes roughly 8s in eqiad but over 300s in codfw:

Feb 23 10:52:40 ms-fe1007 proxy-server: 10.64.16.105 10.64.32.220 23/Feb/2021/10/52/40 PUT /v1/AUTH_mw/wikipedia-commons-local-public.9d/9/9d/TheLostWorld1925.webm HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tk6fb0dd3ce... 624059195 - 3a6f10dc474ed8113e1498a5751bb075 tx4a56556fd5e44e01bab13-006034de71 - 7.8468 - - 1614077553.124433041 1614077560.971237898 0

Feb 23 10:58:13 ms-fe2007 proxy-server: 10.64.16.105 10.192.32.155 23/Feb/2021/10/58/13 PUT /v1/AUTH_mw/wikipedia-commons-local-public.9d/9/9d/TheLostWorld1925.webm HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tk5c39f2dc9... 624059195 - 3a6f10dc474ed8113e1498a5751bb075 tx3a392c4173db430aa9e05-006034de82 - 323.8699 - - 1614077570.121001959 1614077893.990885019 0
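(For reference, in these proxy-server log lines the 6th whitespace-separated field is the client IP, the 16th is the object size in bytes, and the 21st is the request duration in seconds; the awk filters below rely on exactly those positions. A minimal sketch that prints per-request throughput for large commons PUTs, assuming that field layout and the centrallog file naming used below:)

zcat ms-fe2*/swift.log-20210223.gz \
  | awk '/proxy-server:.*PUT \/v1\/AUTH_mw\/wikipedia-commons-local-public/ && $16 > 300000000 {
      printf "%s  %.0f MB  %7.1f s  %5.1f MB/s\n", $6, $16 / 1e6, $21, $16 / 1e6 / $21
    }'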

To get a better idea of the time frames involved, I've tallied on the centrallog hosts the average transfer time in codfw for commons PUTs larger than 300MB (note that the day boundaries are not exact; log files are rotated at ~06:00 UTC). A sketch of the averaging step follows the table.

zcat ms-fe2*/swift.log-2021${date}.gz | awk '$16 > 300000000 && /PUT \/v1\/AUTH_mw\/wikipedia-commons-local-public/ {print $21 }'
jan 01 7.72208
jan 02 8.49384333333333
jan 03 7.75212173913044
...
jan 23 7.75591935483871
jan 24 6.15061379310345
jan 25 9.28589795918367
jan 26 99.3627666666667
jan 27 23.3935756097561
jan 28 9.05734482758621
jan 29 28.5930095238095
jan 30 149.271148
jan 31 70.0038066666667
feb 01 15.1718
feb 02 156.918878571429
feb 03 24.7632038461538
feb 04 52.9359363636364
feb 05 40.7238714285714
feb 06 125.034576923077
feb 07 120.699621428571
feb 08 171.219108333333
feb 09 198.86314
feb 10 320.42824
feb 11 234.13601875
feb 12 349.262411504425
feb 13 384.73485648855
feb 14 390.028958823529
feb 15 409.759209090909
feb 16 346.23785
feb 17 346.446857142857
feb 18 313.29496
feb 19 335.622576190476
feb 20 233.52412
feb 21 355.615288
feb 22 270.534185365854
feb 23 270.328245454545
feb 24 229.021636842105
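(The command above only prints the raw elapsed times; one possible way to reduce them to the per-day averages shown, as a sketch rather than necessarily how this table was produced:)

for date in 0101 0102 0103; do   # extend the date list as needed
  zcat ms-fe2*/swift.log-2021${date}.gz \
    | awk -v day="2021${date}" '$16 > 300000000 && /PUT \/v1\/AUTH_mw\/wikipedia-commons-local-public/ { sum += $21; n++ }
        END { if (n) printf "%s %.2f\n", day, sum / n }'
done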

During the Feb 11th time window we were, among other things, in the process of decommissioning swift codfw hosts (T272837) and thus pushing more weight onto the remaining hosts. The codfw swift cluster is now in a steady state, in the sense that there are no planned commissions or decommissions.

The timeframe also coincided with Buster upgrades for mw hosts, and indeed the reported slow jobrunners are all the Buster ones. See also https://phabricator.wikimedia.org/T275752#6864889 for further context.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Feb 25 2021, 11:20 AM
jbond triaged this task as Medium priority. · Feb 25 2021, 1:22 PM
jbond added a subscriber: jbond.

Looking at triaging this task: do we have a gut feel for where the issue may be, e.g. mw server, general networking, swift, something else?

> Looking at triaging this task: do we have a gut feel for where the issue may be, e.g. mw server, general networking, swift, something else?

Good question, I'm in the process of ruling out candidates, starting with swift, by testing manual uploads.

>> Looking at triaging this task: do we have a gut feel for where the issue may be, e.g. mw server, general networking, swift, something else?

> Good question, I'm in the process of ruling out candidates, starting with swift, by testing manual uploads.

Ack, thanks, will leave as is for now.

Testing up/downloads with the https://commons.wikimedia.org/wiki/File:The_Lost_World_(1925).webm (300MB) file from mw1305 (using a different swift account, not mediawiki's, to protect against accidental damage):

4s (!) upload time eqiad -> codfw

time swift  upload test_T275752_$(date -Is) The_Lost_World_\(1925\).webm 
The_Lost_World_(1925).webm

real	0m4.108s
user	0m1.902s
sys	0m0.353s

And ~8s for download

swift download --output /dev/null 'test_T275752_2021-02-25T14:45:49+00:00' 'The_Lost_World_(1925).webm'
The_Lost_World_(1925).webm [auth 0.153s, headers 0.360s, total 8.572s, 35.742 MB/s]

Status update: I've been going through the object-server PUT logs on the codfw backend to investigate performance differences related to e.g. filesystem aging. So far nothing stands out as obviously wrong; both new and old hosts, for example, can occasionally see transfer times of 900s.

Inspecting the client side, all requests that take > 80s are from eqiad jobrunners; it is not yet clear to me why:

$ zcat ms-fe*/swift.log-20210224.gz  | awk '/proxy-server:.*PUT/ && $21 > 80 {print $6}' | bar_chart.py 
# each ∎ represents a count of 1. total 124
 10.64.0.169 [     5] ∎∎∎∎∎
 10.64.0.170 [     8] ∎∎∎∎∎∎∎∎
 10.64.0.171 [     2] ∎∎
 10.64.0.172 [     5] ∎∎∎∎∎
 10.64.0.173 [     9] ∎∎∎∎∎∎∎∎∎
10.64.16.105 [     4] ∎∎∎∎
10.64.16.106 [     6] ∎∎∎∎∎∎
10.64.16.199 [     7] ∎∎∎∎∎∎∎
 10.64.16.59 [     5] ∎∎∎∎∎
 10.64.16.60 [     3] ∎∎∎
 10.64.16.61 [     4] ∎∎∎∎
 10.64.16.63 [     3] ∎∎∎
 10.64.16.64 [     5] ∎∎∎∎∎
 10.64.16.65 [     5] ∎∎∎∎∎
 10.64.16.66 [     6] ∎∎∎∎∎∎
 10.64.16.67 [     3] ∎∎∎
 10.64.16.68 [     7] ∎∎∎∎∎∎∎
 10.64.16.69 [     6] ∎∎∎∎∎∎
 10.64.32.36 [     2] ∎∎
 10.64.32.37 [    11] ∎∎∎∎∎∎∎∎∎∎∎
 10.64.32.38 [     3] ∎∎∎
 10.64.32.39 [     5] ∎∎∎∎∎
 10.64.32.50 [    10] ∎∎∎∎∎∎∎∎∎∎

These four jobrunners also made the majority of requests on e.g. Feb 24th; again, it is not yet clear to me why (a sketch of the counting command follows the chart):

# each ∎ represents a count of 6259. total 1754045
  10.192.16.58 [413007] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  10.192.16.59 [413030] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  10.192.32.36 [413030] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  10.192.32.37 [412957] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
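(The chart above is shown without its command; something along the following lines would produce a per-client PUT count from the same frontend logs. This is only a sketch based on the earlier invocations, and the output above is presumably trimmed to the top senders.)

zcat ms-fe*/swift.log-20210224.gz | awk '/proxy-server:.*PUT/ {print $6}' | bar_chart.py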

Looking back a few days, e.g. to Feb 4-5th, the list of hosts that take > 80s is still all eqiad jobrunners, and suspiciously all had been running buster by that date:

$ zcat ms-fe2*/swift.log-20210205.gz  | awk '/proxy-server:.*PUT/ && $21 > 80 {print $6}' | bar_chart.py 
# each ∎ represents a count of 1. total 25
 10.64.0.172 [     2] ∎∎
10.64.16.199 [     6] ∎∎∎∎∎∎
 10.64.32.36 [     4] ∎∎∎∎
 10.64.32.37 [     4] ∎∎∎∎
 10.64.32.38 [     4] ∎∎∎∎
 10.64.32.39 [     1] ∎
 10.64.32.50 [     4] ∎∎∎∎

The IP addresses are (respectively): mw1310.eqiad.wmnet, mw1318.eqiad.wmnet, mw1334.eqiad.wmnet, mw1335.eqiad.wmnet, mw1336.eqiad.wmnet, mw1337.eqiad.wmnet, mw1338.eqiad.wmnet.
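(If needed, the IP-to-hostname mapping can be double-checked with reverse DNS, e.g.:)

for ip in 10.64.0.172 10.64.16.199 10.64.32.36 10.64.32.37 10.64.32.38 10.64.32.39 10.64.32.50; do
  printf '%s -> %s\n' "$ip" "$(dig +short -x "$ip")"
done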

All running buster by Feb 4-5th:

Debian GNU/Linux 10 auto-installed on Wed Feb 3 23:58:44 UTC 2021.
Debian GNU/Linux 10 auto-installed on Wed Feb 3 23:52:49 UTC 2021.
Debian GNU/Linux 10 auto-installed on Wed Feb 3 19:05:43 UTC 2021.
Debian GNU/Linux 10 auto-installed on Tue Feb 2 21:02:31 UTC 2021.
Debian GNU/Linux 10 auto-installed on Tue Feb 2 21:01:25 UTC 2021.
Debian GNU/Linux 10 auto-installed on Tue Jan 26 20:39:55 UTC 2021.
Debian GNU/Linux 10 auto-installed on Tue Jan 26 20:40:12 UTC 2021.

Similarly on Feb 8th:

$ zcat ms-fe2*/swift.log-20210208.gz | awk '/proxy-server:.*PUT/ && $21 > 80 {print $6}' | bar_chart.py
# each ∎ represents a count of 1. total 40
 10.64.0.169 [     2] ∎∎
 10.64.0.170 [     3] ∎∎∎
 10.64.0.171 [     2] ∎∎
 10.64.0.172 [     2] ∎∎
 10.64.0.173 [     3] ∎∎∎
10.64.16.106 [     2] ∎∎
10.64.16.194 [     1] ∎
10.64.16.199 [     6] ∎∎∎∎∎∎
 10.64.32.36 [     3] ∎∎∎
 10.64.32.37 [     3] ∎∎∎
 10.64.32.38 [     3] ∎∎∎
 10.64.32.39 [     9] ∎∎∎∎∎∎∎∎∎
 10.64.32.50 [     1] ∎

And a similar list: mw1307.eqiad.wmnet, mw1308.eqiad.wmnet, mw1309.eqiad.wmnet, mw1310.eqiad.wmnet, mw1311.eqiad.wmnet, mw1306.eqiad.wmnet, mw1313.eqiad.wmnet, mw1318.eqiad.wmnet, mw1334.eqiad.wmnet, mw1335.eqiad.wmnet, mw1336.eqiad.wmnet, mw1337.eqiad.wmnet, mw1338.eqiad.wmnet; also all running buster by that time.

It is hard to say conclusively because swift was rebalancing at the time, but the list of slow jobrunners on Feb 5th makes me think that buster has something to do with this behaviour.

fgiunchedi renamed this task from "Mediawiki Swift PUTs from eqiad to codfw reported slow" to "Jobrunner on Buster occasional timeout on codfw file upload". · Mar 1 2021, 9:21 AM
fgiunchedi updated the task description.
fgiunchedi added a project: serviceops.

Renamed the task to be jobrunner+buster specific and looped in serviceops.

I can't imagine a single valid reason why a distro upgrade would make data transfer slow down so much.

My suggestion is that we re-image one jobrunner back to stretch and check whether that changes things dramatically. @Legoktm / @Dzahn, can you take care of this?

I'll also note that the behavior is generally quite rare compared to the number of PUTs from jobrunners; e.g. I haven't been able to reproduce it using the swift python client. +1 to re-imaging a jobrunner to stretch.

> Looking back a few days, e.g. to Feb 4-5th, the list of hosts that take > 80s is still all eqiad jobrunners, and suspiciously all had been running buster by that date:

AFAICS, T245757 says we're almost fully on buster, so it isn't a surprise that all of them are running buster. mwdebug1003 is still on stretch, as are mw1402 (api server) and mw1403 (app server); all of them are special cases kept exactly for the purpose of ruling out/confirming that a distro upgrade broke something.

If it's possible to test this on a given mw server, maybe you can try with those? Or do we need real traffic there (at which point we'd need to re-image a random jobrunner back to stretch)?

>> Looking back a few days, e.g. to Feb 4-5th, the list of hosts that take > 80s is still all eqiad jobrunners, and suspiciously all had been running buster by that date:

> AFAICS, T245757 says we're almost fully on buster, so it isn't a surprise that all of them are running buster. mwdebug1003 is still on stretch, as are mw1402 (api server) and mw1403 (app server); all of them are special cases kept exactly for the purpose of ruling out/confirming that a distro upgrade broke something.

> If it's possible to test this on a given mw server, maybe you can try with those? Or do we need real traffic there (at which point we'd need to re-image a random jobrunner back to stretch)?

From my findings in the swift logs, the slow uploads happen occasionally only on buster jobrunners, not on other mw hosts. It does indeed look like a jobrunner with traffic is needed.

Change 667631 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[operations/puppet@production] install_server: Swtich mw1307 (jobrunner) back to stretch

https://gerrit.wikimedia.org/r/667631

Change 667631 merged by Dzahn:
[operations/puppet@production] install_server: Swtich mw1307 (jobrunner) back to stretch

https://gerrit.wikimedia.org/r/667631

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

mw1307.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202103011718_dzahn_13229_mw1307_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1307.eqiad.wmnet']

and were ALL successful.

> My suggestion is that we re-image one jobrunner back to stretch and check whether that changes things dramatically. @Legoktm / @Dzahn, can you take care of this?

Done! mw1307.eqiad.wmnet is back on stretch and running as a jobrunner.

@fgiunchedi could you re-run your analysis to see if mw1307 (10.64.0.169) is still exhibiting the issue?

> @fgiunchedi could you re-run your analysis to see if mw1307 (10.64.0.169) is still exhibiting the issue?

For sure! I just did, on centrallog1001 (bar_chart.py is part of https://github.com/bitly/data_hacks and lives in my home directory). See below; it doesn't look like mw1307 is in the list at the 80s threshold:

centrallog1001:/srv/syslog$ zcat ms-fe2*/swift.log-20210325.gz  | awk '/proxy-server:.*PUT/ && $21 > 80 {print $6}' | ~/bar_chart.py 
# each ∎ represents a count of 1. total 54
 10.64.0.170 [     1] ∎
 10.64.0.171 [     2] ∎∎
 10.64.0.172 [     2] ∎∎
 10.64.0.173 [     5] ∎∎∎∎∎
10.64.16.105 [     5] ∎∎∎∎∎
10.64.16.106 [     3] ∎∎∎
10.64.16.199 [     2] ∎∎
 10.64.16.58 [     3] ∎∎∎
 10.64.16.59 [     2] ∎∎
 10.64.16.60 [     1] ∎
 10.64.16.61 [     1] ∎
 10.64.16.63 [     4] ∎∎∎∎
 10.64.16.65 [     3] ∎∎∎
 10.64.16.66 [     2] ∎∎
 10.64.16.67 [     3] ∎∎∎
 10.64.16.68 [     2] ∎∎
 10.64.16.69 [     2] ∎∎
 10.64.32.36 [     1] ∎
 10.64.32.37 [     3] ∎∎∎
 10.64.32.38 [     3] ∎∎∎
 10.64.32.39 [     2] ∎∎
 10.64.32.53 [     1] ∎
10.64.48.222 [     1] ∎
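(For a more direct check of mw1307's absence, one can filter on its IP explicitly; a sketch assuming the same log layout as above:)

# slow (>80s) PUTs from mw1307 (10.64.0.169) on that day; expected to be 0
zcat ms-fe2*/swift.log-20210325.gz | awk '/proxy-server:.*PUT/ && $6 == "10.64.0.169" && $21 > 80' | wc -l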

>> @fgiunchedi could you re-run your analysis to see if mw1307 (10.64.0.169) is still exhibiting the issue?

> For sure! I just did, on centrallog1001 (bar_chart.py is part of https://github.com/bitly/data_hacks and lives in my home directory). See below; it doesn't look like mw1307 is in the list at the 80s threshold.

Thanks, I re-ran it just now and can confirm the same. I note that the overall occurrence of this has shot up a lot: 496 on 2021-04-03, 194 on 2021-04-04, and 168 on 2021-04-05. But I think that could easily be explained by the increased usage of video2commons, which normally results in large uploads.