Page MenuHomePhabricator

mcrouter codfw proxies sometimes lead to TKOs
Open, NormalPublic0 Story Points

Description

After cleaning up all the noise related to high spikes of TKOs, I noticed that sometimes we still register little spikes (10~100 requests affected max) related to codfw proxies. For example:

Jul  4 00:28:58 mw1235 mcrouter[13474]: I0704 00:28:58.840715 13476 AsyncSSLSocket.cpp:119] TCP connect failed: AsyncSocketException: connect timed out after 1000ms, type = Timed out
Jul  4 00:28:58 mw1235 mcrouter[13474]: I0704 00:28:58.840767 13476 ProxyDestination.cpp:453] 10.192.0.61:11214 marked hard TKO. Total hard TKOs: 1; soft TKOs: 0. Reply: mc_res_connect_error
Jul  4 00:29:02 mw1235 mcrouter[13474]: I0704 00:29:02.877496 13476 ProxyDestination.cpp:453] 10.192.0.61:11214 unmarked TKO. Total hard TKOs: 0; soft TKOs: 0. Reply: mc_res_ok

(Note the 11214 port, that is one one used by the codfw proxies)

It is not a big deal at the moment, but we should figure out if we are currently reaching some limit (like too many requests at the same time, etc..). It might be the case that more codfw proxies are needed, or simply the proxy's available connections slots increased.

Event Timeline

elukey triaged this task as Normal priority.Jul 4 2019, 10:36 AM
elukey created this task.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 4 2019, 10:36 AM
jijiki added a subscriber: jijiki.
elukey moved this task from Backlog to Mcrouter/Memcached on the User-Elukey board.
kchapman moved this task from Inbox to Radar on the Performance-Team board.Jul 8 2019, 7:52 PM
kchapman edited projects, added Performance-Team (Radar); removed Performance-Team.
elukey added a comment.Mon, Aug 5, 7:10 AM

We deployed all the changes for T225642, so async settings for codfw replication was not the culprit.

After T225059 we have per-shard metrics for mcrouter so it is easier to track down issues. I analyzed one of the last TKO event, it was related to mw1280:

https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&from=1564930491689&to=1564932592601&panelId=9&fullscreen&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All

If I set the instance to mw1280 in the grafana's variables, I get a nice rise in connect timeouts matching the TKOs, only for 10.192.16.54:11214 (mw2255.codfw.wmnet):

https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&from=1564930491689&to=1564932592601&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=mw1280&var-memcached_server=All&panelId=30&fullscreen

I I set also the target shard variable in grafana (mw2255) I don't get much:

https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&from=1564930491689&to=1564932592601&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=mw1280&var-memcached_server=10.192.16.54:11214:ascii:ssl:notcompressed-1000

codfw metrics for the host are not telling much:

https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&from=1564930491689&to=1564932592601&var-source=codfw%20prometheus%2Fops&var-cluster=All&var-instance=mw2255&var-memcached_server=All

Error log on mw1280 (none found on mw2250):

elukey@mw1280:~$ cat /var/log/mcrouter.log.1
Aug  4 15:12:26 mw1280 mcrouter[945]: I0804 15:12:26.124703  1044 AsyncSSLSocket.cpp:119] TCP connect failed: AsyncSocketException: connect timed out after 1000ms, type = Timed out
Aug  4 15:12:26 mw1280 mcrouter[945]: I0804 15:12:26.124752  1044 ProxyDestination.cpp:453] 10.192.16.54:11214 marked hard TKO. Total hard TKOs: 1; soft TKOs: 0. Reply: mc_res_connect_error
Aug  4 15:12:29 mw1280 mcrouter[945]: I0804 15:12:29.767894  1044 ProxyDestination.cpp:453] 10.192.16.54:11214 unmarked TKO. Total hard TKOs: 0; soft TKOs: 0. Reply: mc_res_ok

The only thing noticeable to report is that some minutes before the connect timeout there was a rise in FOUND responses from eqiad shards for mw1280:

https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&from=1564930491689&to=1564932592601&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=mw1280&var-memcached_server=All&panelId=27&fullscreen

But in theory I'd have expected a rise in traffic for mw2255, that didn't happen.