Page MenuHomePhabricator

Swift proxy server misbehaviour (no longer calling `accept`?)
Open, HighPublic

Description

We have over time sometimes seen rises in 5xx error codes reported by ATS from swift. Since moving TLS termination from nginx to envoy (T317616) this has happened much less frequently; and a restart of the affected frontend(s) has been sufficient to resume normal service.

One frontend, ms-fe2010 is in this failed state and depooled, which allowed a little more debugging to take place; it will need to be restored to service at some point, though.

The effect of briefly repooling the server a couple of times can be seen on grafana - connection timeouts and failures.

We can eliminate envoy from blame by using the usual test process of trying to download a kitten picture using curl to port 80:

mvernon@ms-fe2011:~$ curl -4 -o /tmp/foo -v -H "Host: upload.wikimedia.org" http://ms-fe2010.codfw.wmnet/wikipedia/commons/thumb/1/1d/1-month-old_kittens_32.jpg/800px-1-month-old_kittens_32.jpg
* Uses proxy env variable no_proxy == 'wikipedia.org,wikimedia.org,wikibooks.org,wikinews.org,wikiquote.org,wikisource.org,wikiversity.org,wikivoyage.org,wikidata.org,wikiworkshop.org,wikifunctions.org,wiktionary.org,mediawiki.org,wmfusercontent.org,w.wiki,wmnet,127.0.0.1,::1'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 10.192.16.76:80...
  0     0    0     0    0     0      0      0 --:--:--  0:00:07 --:--:--     0* connect to 10.192.16.76 port 80 failed: Connection timed out
* Failed to connect to ms-fe2010.codfw.wmnet port 80: Connection timed out
  0     0    0     0    0     0      0      0 --:--:--  0:00:07 --:--:--     0
* Closing connection 0
curl: (28) Failed to connect to ms-fe2010.codfw.wmnet port 80: Connection timed out

likewise from the affected host:

mvernon@ms-fe2010:~$ curl -o /tmp/foo -v -H "Host: upload.wikimedia.org" http://$(hostname -f)/wikipedia/commons/thumb/1/1d/1-month-old_kittens_32.jpg/800px-1-month-old_kittens_32.jpg
* Uses proxy env variable no_proxy == 'wikipedia.org,wikimedia.org,wikibooks.org,wikinews.org,wikiquote.org,wikisource.org,wikiversity.org,wikivoyage.org,wikidata.org,wikiworkshop.org,wikifunctions.org,wiktionary.org,mediawiki.org,wmfusercontent.org,w.wiki,wmnet,127.0.0.1,::1'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 10.192.16.76:80...
  0     0    0     0    0     0      0      0 --:--:--  0:00:07 --:--:--     0* connect to 10.192.16.76 port 80 failed: Connection timed out
* Failed to connect to ms-fe2010.codfw.wmnet port 80: Connection timed out
  0     0    0     0    0     0      0      0 --:--:--  0:00:07 --:--:--     0
* Closing connection 0
curl: (28) Failed to connect to ms-fe2010.codfw.wmnet port 80: Connection timed out

By running tcpdump -vvvn and then grepping for 10.192.16.76.80 (ms-fe2010 port 80) we can see the connection attempts from ms-fe2011 (10.192.32.36):

mvernon@ms-fe2010:~$ grep 10.192.16.76.80 /tmp/tcpdumpwithmulti 
14:07:04.328210 IP6 (flowlabel 0x849dd, hlim 63, next-header TCP (6) payload length: 40) 2620:0:860:103:10:192:32:36.30820 > 2620:0:860:102:10:192:16:76.80: Flags [S], cksum 0xb15b (correct), seq 1466290441, win 43200, options [mss 1440,sackOK,TS val 3747940912 ecr 0,nop,wscale 9], length 0
14:07:04.328255 IP6 (flowlabel 0x30ebf, hlim 64, next-header TCP (6) payload length: 20) 2620:0:860:102:10:192:16:76.80 > 2620:0:860:103:10:192:32:36.30820: Flags [R.], cksum 0x6357 (incorrect -> 0xab6f), seq 0, ack 1466290442, win 0, length 0
    10.192.32.36.64898 > 10.192.16.76.80: Flags [S], cksum 0xeb4b (correct), seq 4191233004, win 42340, options [mss 1460,sackOK,TS val 69620135 ecr 0,nop,wscale 9], length 0
    10.192.17.9.41056 > 10.192.16.76.80: Flags [S], cksum 0x7206 (correct), seq 2405708278, win 42340, options [mss 1460,sackOK,TS val 3504064215 ecr 0,nop,wscale 9], length 0
    10.192.32.36.64898 > 10.192.16.76.80: Flags [S], cksum 0xe754 (correct), seq 4191233004, win 42340, options [mss 1460,sackOK,TS val 69621150 ecr 0,nop,wscale 9], length 0
    10.192.32.36.64898 > 10.192.16.76.80: Flags [S], cksum 0xdf78 (correct), seq 4191233004, win 42340, options [mss 1460,sackOK,TS val 69623162 ecr 0,nop,wscale 9], length 0
    10.192.17.9.35362 > 10.192.16.76.80: Flags [.], cksum 0xb4e9 (correct), seq 308995049, ack 4026536304, win 83, options [nop,nop,TS val 3504067031 ecr 425282664], length 0
    10.192.16.76.80 > 10.192.17.9.35362: Flags [.], cksum 0x36fb (incorrect -> 0x0223), seq 1, ack 1, win 85, options [nop,nop,TS val 425313380 ecr 3504016543], length 0

[I forgot -4 when generating that packet dump, so you see the initial expected refusal on IPv6 first]

Simultaneously, we can also strace all the swift-proxy processes that have a socket in LISTEN state on port 80:

sudo lsof -i :80 | grep LISTEN | cut -f 2 -d ' ' | sort | uniq | sed -e 's/^/-p /' | xargs sudo strace -o /tmp/multitrace -tt

There is no accept call in that strace (if there were processes stuck in accept that might indicate a kernel issue), so it seems that for some reason the swift-proxy server has largely stopped accepting connections.

It's not clear why this is - the server is very lightly loaded since it is depooled (and an inspection of the various straces doesn't show anything obvious).

Event Timeline

MatthewVernon renamed this task from Swift server misbehaviour (no longer calling `accept`?) to Swift proxy server misbehaviour (no longer calling `accept`?).Mar 25 2024, 3:59 PM
MatthewVernon triaged this task as Medium priority.

Mentioned in SAL (#wikimedia-operations) [2024-03-27T16:31:56Z] <Emperor> depool and restart swift-proxy on moss-fe2001 then repool T360913

Mentioned in SAL (#wikimedia-operations) [2024-03-27T16:34:48Z] <Emperor> restart swift-proxy on ms-fe2010 then repool T360913

Mentioned in SAL (#wikimedia-operations) [2024-03-27T16:37:03Z] <Emperor> depool and restart swift-proxy on ms-fe2011 then repool T360913

Mentioned in SAL (#wikimedia-operations) [2024-03-27T16:37:36Z] <Emperor> depool and restart swift-proxy on ms-fe2012 then repool T360913

Mentioned in SAL (#wikimedia-operations) [2024-03-27T16:38:15Z] <Emperor> depool and restart swift-proxy on ms-fe2013 then repool T360913

Mentioned in SAL (#wikimedia-operations) [2024-05-16T14:03:33Z] <Emperor> depool, restart swift-proxy, repool ms-fe1010 as ~12% connection failures reported by envoy since late 14th May T360913

Mentioned in SAL (#wikimedia-operations) [2024-05-20T13:04:40Z] <Emperor> depool, restart swift-proxy, repool moss-fe1001 as ~12% connection failures reported by envoy since late 14th May T360913

Mentioned in SAL (#wikimedia-operations) [2024-06-03T12:55:39Z] <Emperor> depool/restart swift-proxy/repool ms-fe10{09,11,12,14} due to rising connection failures T360913

Mentioned in SAL (#wikimedia-operations) [2024-06-07T14:02:35Z] <Emperor> restart swift-proxy on ms-fe1009 ms-fe1011 ms-fe1012 ms-fe1014 T360913

Mentioned in SAL (#wikimedia-operations) [2024-06-12T12:17:19Z] <Emperor> restart swift-proxy on ms-fe2011 ms-fe2014 T360913

Mentioned in SAL (#wikimedia-operations) [2024-06-12T12:18:55Z] <Emperor> restart swift-proxy on ms-fe1013 T360913

Mentioned in SAL (#wikimedia-operations) [2024-06-14T21:33:33Z] <Emperor> restart swift-proxy on ms-fe1010 T360913

Mentioned in SAL (#wikimedia-operations) [2024-06-21T10:56:11Z] <Emperor> restart swift-proxy on ms-fe1010 T360913

Mentioned in SAL (#wikimedia-operations) [2024-06-21T10:57:38Z] <Emperor> restart swift-proxy on ms-fe2011 ms-fe2012 T360913

We saw a spike of 504s on ms-fe1012 today that resulted in paging.

Mentioned in SAL (#wikimedia-operations) [2024-06-24T18:16:56Z] <mutante> ms-fe1012:~] $ sudo systemctl restart swift-proxy T360913

Mentioned in SAL (#wikimedia-operations) [2024-06-24T18:50:32Z] <mutante> ms-fe1010,ms-fe1013 - restart swift-proxy - T360913

Dzahn raised the priority of this task from Medium to High.Jun 24 2024, 6:51 PM

Mentioned in SAL (#wikimedia-operations) [2024-06-24T19:02:39Z] <sukhe> ms-fe1009: restart swift-proxy: T360913

Mentioned in SAL (#wikimedia-operations) [2024-06-25T20:11:06Z] <Emperor> restart swift-proxy on ms-fe2010 ms-fe1011 T360913

Mentioned in SAL (#wikimedia-operations) [2024-07-29T16:30:11Z] <Emperor> restart swift-proxy on ms-fe2011 T360913

Mentioned in SAL (#wikimedia-operations) [2024-08-20T20:00:48Z] <Emperor> depool/restart/repool ms-fe2009 T360913

Mentioned in SAL (#wikimedia-operations) [2024-08-20T20:01:37Z] <Emperor> depool/restart/repool ms-fe2011 T360913

Mentioned in SAL (#wikimedia-operations) [2024-08-20T20:02:04Z] <Emperor> depool/restart/repool ms-fe2012 T360913

Mentioned in SAL (#wikimedia-operations) [2024-08-20T20:02:33Z] <Emperor> depool/restart/repool ms-fe2014 T360913

Mentioned in SAL (#wikimedia-operations) [2024-08-30T10:44:32Z] <Emperor> restart swift-proxy on ms-fe2009 and ms-fe2014 T360913

Mentioned in SAL (#wikimedia-operations) [2024-09-02T07:51:17Z] <Emperor> restart swift-proxy on ms-fe2012 T360913

Mentioned in SAL (#wikimedia-operations) [2024-09-05T15:28:47Z] <Emperor> restart swift-proxy on ms-fe1014 T360913

Mentioned in SAL (#wikimedia-operations) [2025-01-17T09:08:18Z] <Emperor> depool / restart / repool ms-fe2010 T360913

Mentioned in SAL (#wikimedia-operations) [2025-03-20T08:32:11Z] <Emperor> restart swift-proxy on ms-fe2010 T360913

Mentioned in SAL (#wikimedia-operations) [2025-03-20T08:55:11Z] <Emperor> restart swift-proxy on ms-fe1010 T360913

Jelto subscribed.

I'm adding Sustainability (Incident Followup) and SRE-OnFire tags here because this task was mentioned during one of the last swift incidents (brief incident report here: 2025-03-29_Upload_cache_unavailability). See also T390385.

MatthewVernon closed subtask Restricted Task as Resolved.Jul 29 2025, 12:16 PM

Mentioned in SAL (#wikimedia-operations) [2025-08-28T13:08:40Z] <Emperor> restart swift on ms-fe2011 T360913

Mentioned in SAL (#wikimedia-operations) [2025-08-28T13:10:58Z] <Emperor> restart swift on ms-fe2012 T360913

Mentioned in SAL (#wikimedia-operations) [2025-08-28T14:22:32Z] <Emperor> install python3.9-dbg for temporary debugging ms-fe2010 T360913

Mentioned in SAL (#wikimedia-operations) [2025-08-28T14:32:09Z] <Emperor> remove dbg packages & repool ms-fe2010 T360913

Mentioned in SAL (#wikimedia-operations) [2025-09-05T08:48:59Z] <Emperor> remove dbg packages & repool ms-fe2016 T360913

Mentioned in SAL (#wikimedia-operations) [2025-09-23T15:28:23Z] <Emperor> restart swift-proxy ms-fe1010 ms-fe2010 ms-fe2011 ms-fe2015 T360913

Mentioned in SAL (#wikimedia-operations) [2025-10-01T08:10:27Z] <Emperor> restart swift on ms-fe2012 T360913