Swift proxy server misbehaviour (no longer calling `accept`?)
Open, MediumPublic
Actions

Assigned To

None

Authored By

	MatthewVernon
	Mar 25 2024, 3:53 PM

Description

We have over time sometimes seen rises in 5xx error codes reported by ATS from swift. Since moving TLS termination from nginx to envoy (T317616) this has happened much less frequently; and a restart of the affected frontend(s) has been sufficient to resume normal service.

One frontend, ms-fe2010 is in this failed state and depooled, which allowed a little more debugging to take place; it will need to be restored to service at some point, though.

The effect of briefly repooling the server a couple of times can be seen on grafana - connection timeouts and failures.

We can eliminate envoy from blame by using the usual test process of trying to download a kitten picture using curl to port 80:

mvernon@ms-fe2011:~$ curl -4 -o /tmp/foo -v -H "Host: upload.wikimedia.org" http://ms-fe2010.codfw.wmnet/wikipedia/commons/thumb/1/1d/1-month-old_kittens_32.jpg/800px-1-month-old_kittens_32.jpg
* Uses proxy env variable no_proxy == 'wikipedia.org,wikimedia.org,wikibooks.org,wikinews.org,wikiquote.org,wikisource.org,wikiversity.org,wikivoyage.org,wikidata.org,wikiworkshop.org,wikifunctions.org,wiktionary.org,mediawiki.org,wmfusercontent.org,w.wiki,wmnet,127.0.0.1,::1'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 10.192.16.76:80...
  0     0    0     0    0     0      0      0 --:--:--  0:00:07 --:--:--     0* connect to 10.192.16.76 port 80 failed: Connection timed out
* Failed to connect to ms-fe2010.codfw.wmnet port 80: Connection timed out
  0     0    0     0    0     0      0      0 --:--:--  0:00:07 --:--:--     0
* Closing connection 0
curl: (28) Failed to connect to ms-fe2010.codfw.wmnet port 80: Connection timed out

likewise from the affected host:

mvernon@ms-fe2010:~$ curl -o /tmp/foo -v -H "Host: upload.wikimedia.org" http://$(hostname -f)/wikipedia/commons/thumb/1/1d/1-month-old_kittens_32.jpg/800px-1-month-old_kittens_32.jpg
* Uses proxy env variable no_proxy == 'wikipedia.org,wikimedia.org,wikibooks.org,wikinews.org,wikiquote.org,wikisource.org,wikiversity.org,wikivoyage.org,wikidata.org,wikiworkshop.org,wikifunctions.org,wiktionary.org,mediawiki.org,wmfusercontent.org,w.wiki,wmnet,127.0.0.1,::1'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 10.192.16.76:80...
  0     0    0     0    0     0      0      0 --:--:--  0:00:07 --:--:--     0* connect to 10.192.16.76 port 80 failed: Connection timed out
* Failed to connect to ms-fe2010.codfw.wmnet port 80: Connection timed out
  0     0    0     0    0     0      0      0 --:--:--  0:00:07 --:--:--     0
* Closing connection 0
curl: (28) Failed to connect to ms-fe2010.codfw.wmnet port 80: Connection timed out

By running tcpdump -vvvn and then grepping for 10.192.16.76.80 (ms-fe2010 port 80) we can see the connection attempts from ms-fe2011 (10.192.32.36):

mvernon@ms-fe2010:~$ grep 10.192.16.76.80 /tmp/tcpdumpwithmulti 
14:07:04.328210 IP6 (flowlabel 0x849dd, hlim 63, next-header TCP (6) payload length: 40) 2620:0:860:103:10:192:32:36.30820 > 2620:0:860:102:10:192:16:76.80: Flags [S], cksum 0xb15b (correct), seq 1466290441, win 43200, options [mss 1440,sackOK,TS val 3747940912 ecr 0,nop,wscale 9], length 0
14:07:04.328255 IP6 (flowlabel 0x30ebf, hlim 64, next-header TCP (6) payload length: 20) 2620:0:860:102:10:192:16:76.80 > 2620:0:860:103:10:192:32:36.30820: Flags [R.], cksum 0x6357 (incorrect -> 0xab6f), seq 0, ack 1466290442, win 0, length 0
    10.192.32.36.64898 > 10.192.16.76.80: Flags [S], cksum 0xeb4b (correct), seq 4191233004, win 42340, options [mss 1460,sackOK,TS val 69620135 ecr 0,nop,wscale 9], length 0
    10.192.17.9.41056 > 10.192.16.76.80: Flags [S], cksum 0x7206 (correct), seq 2405708278, win 42340, options [mss 1460,sackOK,TS val 3504064215 ecr 0,nop,wscale 9], length 0
    10.192.32.36.64898 > 10.192.16.76.80: Flags [S], cksum 0xe754 (correct), seq 4191233004, win 42340, options [mss 1460,sackOK,TS val 69621150 ecr 0,nop,wscale 9], length 0
    10.192.32.36.64898 > 10.192.16.76.80: Flags [S], cksum 0xdf78 (correct), seq 4191233004, win 42340, options [mss 1460,sackOK,TS val 69623162 ecr 0,nop,wscale 9], length 0
    10.192.17.9.35362 > 10.192.16.76.80: Flags [.], cksum 0xb4e9 (correct), seq 308995049, ack 4026536304, win 83, options [nop,nop,TS val 3504067031 ecr 425282664], length 0
    10.192.16.76.80 > 10.192.17.9.35362: Flags [.], cksum 0x36fb (incorrect -> 0x0223), seq 1, ack 1, win 85, options [nop,nop,TS val 425313380 ecr 3504016543], length 0

[I forgot -4 when generating that packet dump, so you see the initial expected refusal on IPv6 first]

Simultaneously, we can also strace all the swift-proxy processes that have a socket in LISTEN state on port 80:

sudo lsof -i :80 | grep LISTEN | cut -f 2 -d ' ' | sort | uniq | sed -e 's/^/-p /' | xargs sudo strace -o /tmp/multitrace -tt

There is no accept call in that strace (if there were processes stuck in accept that might indicate a kernel issue), so it seems that for some reason the swift-proxy server has largely stopped accepting connections.

It's not clear why this is - the server is very lightly loaded since it is depooled (and an inspection of the various straces doesn't show anything obvious).

Related Objects

Mentioned Here: T317616: Revisit CDN<-->Swift communication

Event Timeline

MatthewVernon created this task.Mar 25 2024, 3:53 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 25 2024, 3:53 PM

MatthewVernon renamed this task from Swift server misbehaviour (no longer calling `accept`?) to Swift proxy server misbehaviour (no longer calling `accept`?).Mar 25 2024, 3:59 PM

MatthewVernon triaged this task as Medium priority.

Reported upstream as Bug #2058945.

Mentioned in SAL (#wikimedia-operations) [2024-03-27T16:31:56Z] <Emperor> depool and restart swift-proxy on moss-fe2001 then repool T360913

Mentioned in SAL (#wikimedia-operations) [2024-03-27T16:34:48Z] <Emperor> restart swift-proxy on ms-fe2010 then repool T360913

Mentioned in SAL (#wikimedia-operations) [2024-03-27T16:37:03Z] <Emperor> depool and restart swift-proxy on ms-fe2011 then repool T360913

Mentioned in SAL (#wikimedia-operations) [2024-03-27T16:37:36Z] <Emperor> depool and restart swift-proxy on ms-fe2012 then repool T360913

Mentioned in SAL (#wikimedia-operations) [2024-03-27T16:38:15Z] <Emperor> depool and restart swift-proxy on ms-fe2013 then repool T360913

Swift proxy server misbehaviour (no longer calling `accept`?)Open, MediumPublicActions

Description

Related Objects

Event Timeline

Swift proxy server misbehaviour (no longer calling `accept`?)
Open, MediumPublic
Actions