We have over time sometimes seen rises in 5xx error codes reported by ATS from swift. Since moving TLS termination from nginx to envoy (T317616) this has happened much less frequently; and a restart of the affected frontend(s) has been sufficient to resume normal service.
One frontend, ms-fe2010 is in this failed state and depooled, which allowed a little more debugging to take place; it will need to be restored to service at some point, though.
The effect of briefly repooling the server a couple of times can be seen on grafana - connection timeouts and failures.
We can eliminate envoy from blame by using the usual test process of trying to download a kitten picture using curl to port 80:
mvernon@ms-fe2011:~$ curl -4 -o /tmp/foo -v -H "Host: upload.wikimedia.org" http://ms-fe2010.codfw.wmnet/wikipedia/commons/thumb/1/1d/1-month-old_kittens_32.jpg/800px-1-month-old_kittens_32.jpg * Uses proxy env variable no_proxy == 'wikipedia.org,wikimedia.org,wikibooks.org,wikinews.org,wikiquote.org,wikisource.org,wikiversity.org,wikivoyage.org,wikidata.org,wikiworkshop.org,wikifunctions.org,wiktionary.org,mediawiki.org,wmfusercontent.org,w.wiki,wmnet,127.0.0.1,::1' % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 10.192.16.76:80... 0 0 0 0 0 0 0 0 --:--:-- 0:00:07 --:--:-- 0* connect to 10.192.16.76 port 80 failed: Connection timed out * Failed to connect to ms-fe2010.codfw.wmnet port 80: Connection timed out 0 0 0 0 0 0 0 0 --:--:-- 0:00:07 --:--:-- 0 * Closing connection 0 curl: (28) Failed to connect to ms-fe2010.codfw.wmnet port 80: Connection timed out
likewise from the affected host:
mvernon@ms-fe2010:~$ curl -o /tmp/foo -v -H "Host: upload.wikimedia.org" http://$(hostname -f)/wikipedia/commons/thumb/1/1d/1-month-old_kittens_32.jpg/800px-1-month-old_kittens_32.jpg * Uses proxy env variable no_proxy == 'wikipedia.org,wikimedia.org,wikibooks.org,wikinews.org,wikiquote.org,wikisource.org,wikiversity.org,wikivoyage.org,wikidata.org,wikiworkshop.org,wikifunctions.org,wiktionary.org,mediawiki.org,wmfusercontent.org,w.wiki,wmnet,127.0.0.1,::1' % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 10.192.16.76:80... 0 0 0 0 0 0 0 0 --:--:-- 0:00:07 --:--:-- 0* connect to 10.192.16.76 port 80 failed: Connection timed out * Failed to connect to ms-fe2010.codfw.wmnet port 80: Connection timed out 0 0 0 0 0 0 0 0 --:--:-- 0:00:07 --:--:-- 0 * Closing connection 0 curl: (28) Failed to connect to ms-fe2010.codfw.wmnet port 80: Connection timed out
By running tcpdump -vvvn and then grepping for 10.192.16.76.80 (ms-fe2010 port 80) we can see the connection attempts from ms-fe2011 (10.192.32.36):
mvernon@ms-fe2010:~$ grep 10.192.16.76.80 /tmp/tcpdumpwithmulti 14:07:04.328210 IP6 (flowlabel 0x849dd, hlim 63, next-header TCP (6) payload length: 40) 2620:0:860:103:10:192:32:36.30820 > 2620:0:860:102:10:192:16:76.80: Flags [S], cksum 0xb15b (correct), seq 1466290441, win 43200, options [mss 1440,sackOK,TS val 3747940912 ecr 0,nop,wscale 9], length 0 14:07:04.328255 IP6 (flowlabel 0x30ebf, hlim 64, next-header TCP (6) payload length: 20) 2620:0:860:102:10:192:16:76.80 > 2620:0:860:103:10:192:32:36.30820: Flags [R.], cksum 0x6357 (incorrect -> 0xab6f), seq 0, ack 1466290442, win 0, length 0 10.192.32.36.64898 > 10.192.16.76.80: Flags [S], cksum 0xeb4b (correct), seq 4191233004, win 42340, options [mss 1460,sackOK,TS val 69620135 ecr 0,nop,wscale 9], length 0 10.192.17.9.41056 > 10.192.16.76.80: Flags [S], cksum 0x7206 (correct), seq 2405708278, win 42340, options [mss 1460,sackOK,TS val 3504064215 ecr 0,nop,wscale 9], length 0 10.192.32.36.64898 > 10.192.16.76.80: Flags [S], cksum 0xe754 (correct), seq 4191233004, win 42340, options [mss 1460,sackOK,TS val 69621150 ecr 0,nop,wscale 9], length 0 10.192.32.36.64898 > 10.192.16.76.80: Flags [S], cksum 0xdf78 (correct), seq 4191233004, win 42340, options [mss 1460,sackOK,TS val 69623162 ecr 0,nop,wscale 9], length 0 10.192.17.9.35362 > 10.192.16.76.80: Flags [.], cksum 0xb4e9 (correct), seq 308995049, ack 4026536304, win 83, options [nop,nop,TS val 3504067031 ecr 425282664], length 0 10.192.16.76.80 > 10.192.17.9.35362: Flags [.], cksum 0x36fb (incorrect -> 0x0223), seq 1, ack 1, win 85, options [nop,nop,TS val 425313380 ecr 3504016543], length 0
[I forgot -4 when generating that packet dump, so you see the initial expected refusal on IPv6 first]
Simultaneously, we can also strace all the swift-proxy processes that have a socket in LISTEN state on port 80:
sudo lsof -i :80 | grep LISTEN | cut -f 2 -d ' ' | sort | uniq | sed -e 's/^/-p /' | xargs sudo strace -o /tmp/multitrace -tt
There is no accept call in that strace (if there were processes stuck in accept that might indicate a kernel issue), so it seems that for some reason the swift-proxy server has largely stopped accepting connections.
It's not clear why this is - the server is very lightly loaded since it is depooled (and an inspection of the various straces doesn't show anything obvious).