This is a follow-up investigation of {T274589}, focused on the specific "slow PUTs" problem.
The problem: mw times out on a database transaction, e.g. when writing stashed files to swift. It turns out that files are being written from eqiad to codfw at ~2MB/s, so a ~600MB file can hit the 300s mw timeout (more context at https://phabricator.wikimedia.org/T274589#6850912).
One example of this problem is the following upload from mw1305 (a jobrunner), which takes ~8s in eqiad and ~324s in codfw:
```
Feb 23 10:52:40 ms-fe1007 proxy-server: 10.64.16.105 10.64.32.220 23/Feb/2021/10/52/40 PUT /v1/AUTH_mw/wikipedia-commons-local-public.9d/9/9d/TheLostWorld1925.webm HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tk6fb0dd3ce... 624059195 - 3a6f10dc474ed8113e1498a5751bb075 tx4a56556fd5e44e01bab13-006034de71 - 7.8468 - - 1614077553.124433041 1614077560.971237898 0
Feb 23 10:58:13 ms-fe2007 proxy-server: 10.64.16.105 10.192.32.155 23/Feb/2021/10/58/13 PUT /v1/AUTH_mw/wikipedia-commons-local-public.9d/9/9d/TheLostWorld1925.webm HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 AUTH_tk5c39f2dc9... 624059195 - 3a6f10dc474ed8113e1498a5751bb075 tx3a392c4173db430aa9e05-006034de82 - 323.8699 - - 1614077570.121001959 1614077893.990885019 0
```
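As a rough cross-check of the throughput figures, the two log lines above can be turned into a transfer rate. This is a minimal sketch: the field positions are an assumption read off the sample lines (byte count in whitespace-separated field 16, transfer time in seconds in field 21), and the `throughput` helper is a hypothetical name, not part of any existing tooling.

```python
# Field positions (0-based) are an assumption taken from the sample lines above:
# awk's $16 is the request size in bytes, $21 the transfer time in seconds.
BYTES_FIELD = 15
DURATION_FIELD = 20


def throughput(log_line):
    """Return (megabytes per second, megabits per second) for one proxy-server line."""
    fields = log_line.split()
    size_bytes = int(fields[BYTES_FIELD])
    seconds = float(fields[DURATION_FIELD])
    mbytes_per_s = size_bytes / seconds / 1e6
    return mbytes_per_s, mbytes_per_s * 8


# The two lines quoted above, verbatim (the auth token is elided in the source).
EQIAD_LINE = (
    "Feb 23 10:52:40 ms-fe1007 proxy-server: 10.64.16.105 10.64.32.220 "
    "23/Feb/2021/10/52/40 PUT /v1/AUTH_mw/wikipedia-commons-local-public.9d/9/9d/"
    "TheLostWorld1925.webm HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 "
    "AUTH_tk6fb0dd3ce... 624059195 - 3a6f10dc474ed8113e1498a5751bb075 "
    "tx4a56556fd5e44e01bab13-006034de71 - 7.8468 - - "
    "1614077553.124433041 1614077560.971237898 0"
)
CODFW_LINE = (
    "Feb 23 10:58:13 ms-fe2007 proxy-server: 10.64.16.105 10.192.32.155 "
    "23/Feb/2021/10/58/13 PUT /v1/AUTH_mw/wikipedia-commons-local-public.9d/9/9d/"
    "TheLostWorld1925.webm HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 "
    "AUTH_tk5c39f2dc9... 624059195 - 3a6f10dc474ed8113e1498a5751bb075 "
    "tx3a392c4173db430aa9e05-006034de82 - 323.8699 - - "
    "1614077570.121001959 1614077893.990885019 0"
)

eqiad_mbytes, eqiad_mbits = throughput(EQIAD_LINE)
codfw_mbytes, codfw_mbits = throughput(CODFW_LINE)

print(f"eqiad: {eqiad_mbytes:.1f} MB/s ({eqiad_mbits:.1f} Mbit/s)")
print(f"codfw: {codfw_mbytes:.1f} MB/s ({codfw_mbits:.1f} Mbit/s)")
```

For this one file that works out to roughly 80 MB/s into eqiad versus roughly 2 MB/s into codfw.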
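I've tallied on the centrallog hosts the average transfer time in codfw for commons PUTs larger than 300MB, to get a better idea of the time frames involved (note that the day boundaries are not exact, since log files are rotated at ~06:00 UTC). The command below extracts the per-request transfer times; the figures that follow are the per-day averages, in seconds: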
```
zcat ms-fe2*/swift.log-2021${date}.gz | awk '$16 > 300000000 && /PUT \/v1\/AUTH_mw\/wikipedia-commons-local-public/ {print $21 }'
```
```lines=10
jan 01 7.72208
jan 02 8.49384333333333
jan 03 7.75212173913044
...
jan 23 7.75591935483871
jan 24 6.15061379310345
jan 25 9.28589795918367
jan 26 99.3627666666667
jan 27 23.3935756097561
jan 28 9.05734482758621
jan 29 28.5930095238095
jan 30 149.271148
jan 31 70.0038066666667
feb 01 15.1718
feb 02 156.918878571429
feb 03 24.7632038461538
feb 04 52.9359363636364
feb 05 40.7238714285714
feb 06 125.034576923077
feb 07 120.699621428571
feb 08 171.219108333333
feb 09 198.86314
feb 10 320.42824
feb 11 234.13601875
feb 12 349.262411504425
feb 13 384.73485648855
feb 14 390.028958823529
feb 15 409.759209090909
feb 16 346.23785
feb 17 346.446857142857
feb 18 313.29496
feb 19 335.622576190476
feb 20 233.52412
feb 21 355.615288
feb 22 270.534185365854
feb 23 270.328245454545
feb 24 229.021636842105
```
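For reference, the same filter plus the averaging step can be sketched in Python. This is an illustrative equivalent under stated assumptions, not the command that produced the table: the field positions mirror the awk one-liner (`$16` bytes, `$21` seconds), and `mean_transfer_time` is a hypothetical helper name.

```python
import re

# Same selection as the awk one-liner: commons PUTs larger than 300MB.
COMMONS_PUT = re.compile(r"PUT /v1/AUTH_mw/wikipedia-commons-local-public")
MIN_BYTES = 300_000_000


def mean_transfer_time(lines):
    """Average transfer time in seconds over matching lines, or None if none match."""
    total = 0.0
    count = 0
    for line in lines:
        if not COMMONS_PUT.search(line):
            continue
        fields = line.split()
        if len(fields) < 21:
            continue  # truncated or unexpected line
        try:
            size_bytes = int(fields[15])   # awk $16 (assumed: request size)
            seconds = float(fields[20])    # awk $21 (assumed: transfer time)
        except ValueError:
            continue  # placeholder values such as "-"
        if size_bytes > MIN_BYTES:
            total += seconds
            count += 1
    return total / count if count else None


# The codfw sample line quoted earlier in this task, verbatim.
sample = [
    "Feb 23 10:58:13 ms-fe2007 proxy-server: 10.64.16.105 10.192.32.155 "
    "23/Feb/2021/10/58/13 PUT /v1/AUTH_mw/wikipedia-commons-local-public.9d/9/9d/"
    "TheLostWorld1925.webm HTTP/1.0 201 - wikimedia/multi-http-client%20v1.0 "
    "AUTH_tk5c39f2dc9... 624059195 - 3a6f10dc474ed8113e1498a5751bb075 "
    "tx3a392c4173db430aa9e05-006034de82 - 323.8699 - - "
    "1614077570.121001959 1614077893.990885019 0"
]

average = mean_transfer_time(sample)
print(average)
```

To run it over a rotated log, the lines could be read with `gzip.open(path, "rt")` for each ms-fe2* file of a given day and passed to the same function.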
During the Feb 11th time window we were, among other things, in the process of decommissioning swift codfw hosts (T272837) and thus pushing more weight to existing hosts. The codfw swift cluster is now in steady state, in the sense that no hosts are planned to be commissioned or decommissioned.
The timeframe also coincided with Buster upgrades for mw hosts, and indeed the reported slow jobrunners are all the Buster ones. See also https://phabricator.wikimedia.org/T275752#6864889 for further context.