
mcrouter codfw proxies sometimes lead to TKOs
Closed, Declined · Public · 0 Estimated Story Points

Description

After cleaning up all the noise related to the high spikes of TKOs, I noticed that we still occasionally register small spikes (at most 10~100 requests affected) related to the codfw proxies. For example:

Jul  4 00:28:58 mw1235 mcrouter[13474]: I0704 00:28:58.840715 13476 AsyncSSLSocket.cpp:119] TCP connect failed: AsyncSocketException: connect timed out after 1000ms, type = Timed out
Jul  4 00:28:58 mw1235 mcrouter[13474]: I0704 00:28:58.840767 13476 ProxyDestination.cpp:453] 10.192.0.61:11214 marked hard TKO. Total hard TKOs: 1; soft TKOs: 0. Reply: mc_res_connect_error
Jul  4 00:29:02 mw1235 mcrouter[13474]: I0704 00:29:02.877496 13476 ProxyDestination.cpp:453] 10.192.0.61:11214 unmarked TKO. Total hard TKOs: 0; soft TKOs: 0. Reply: mc_res_ok

(Note port 11214, which is the one used by the codfw proxies.)

It is not a big deal at the moment, but we should figure out whether we are reaching some limit (for example, too many concurrent requests). It might be that more codfw proxies are needed, or simply that the proxies' available connection slots need to be increased.
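
As a quick way to quantify these spikes, something like the following could be run on an affected appserver to count TKO mark/unmark events per destination from the mcrouter log. This is only a minimal sketch based on the log excerpt above; the log path and line layout are assumptions.

#!/usr/bin/env python3
# Minimal sketch: count mcrouter TKO mark/unmark events per destination from a
# syslog-style mcrouter log. The default path and the line layout are assumed
# to match the excerpts in this task; adjust the regex if the format differs.
import re
import sys
from collections import Counter

# Matches e.g. "... ProxyDestination.cpp:453] 10.192.0.61:11214 marked hard TKO. ..."
TKO_RE = re.compile(
    r'\]\s+(?P<dest>\d{1,3}(?:\.\d{1,3}){3}:\d+)\s+'
    r'(?P<event>marked hard TKO|marked soft TKO|unmarked TKO)')

def count_tko_events(path):
    counts = Counter()
    with open(path, errors='replace') as log:
        for line in log:
            match = TKO_RE.search(line)
            if match:
                counts[(match.group('dest'), match.group('event'))] += 1
    return counts

if __name__ == '__main__':
    logfile = sys.argv[1] if len(sys.argv) > 1 else '/var/log/mcrouter.log.1'
    for (dest, event), total in sorted(count_tko_events(logfile).items()):
        print(f'{dest:<22} {event:<16} {total}')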

Event Timeline

elukey triaged this task as Medium priority. Jul 4 2019, 10:36 AM
elukey created this task.

We deployed all the changes for T225642, so the async settings for codfw replication were not the culprit.

After T225059 we have per-shard metrics for mcrouter, so it is easier to track down issues. I analyzed one of the latest TKO events; it was related to mw1280:

https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&from=1564930491689&to=1564932592601&panelId=9&fullscreen&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=All&var-memcached_server=All

If I set the instance variable to mw1280 in Grafana, I get a clear rise in connect timeouts matching the TKOs, only for 10.192.16.54:11214 (mw2255.codfw.wmnet):

https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&from=1564930491689&to=1564932592601&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=mw1280&var-memcached_server=All&panelId=30&fullscreen

If I also set the target shard variable in Grafana (mw2255), I don't get much:

https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&from=1564930491689&to=1564932592601&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=mw1280&var-memcached_server=10.192.16.54:11214:ascii:ssl:notcompressed-1000

The codfw metrics for the host don't show much either:

https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&from=1564930491689&to=1564932592601&var-source=codfw%20prometheus%2Fops&var-cluster=All&var-instance=mw2255&var-memcached_server=All
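
As a side note for future digging, the per-destination numbers behind these panels could also be pulled directly from the Prometheus HTTP API. The sketch below is only illustrative: the Prometheus endpoint and the metric/label names are placeholders, not necessarily what the mcrouter exporter actually exposes (the real query can be copied from the panel definitions).

#!/usr/bin/env python3
# Sketch of pulling per-destination connect-timeout rates straight from the
# Prometheus HTTP API instead of Grafana. The Prometheus URL and the metric /
# label names below are placeholders, not the exporter's real names.
import json
import urllib.parse
import urllib.request

PROM_URL = 'http://prometheus.example.org/ops/api/v1/query_range'  # placeholder
QUERY = ('rate(mcrouter_server_connect_timeouts_total'  # hypothetical metric name
         '{instance="mw1280",memcached_server=~".*:11214.*"}[5m])')

def query_range(query, start, end, step='60'):
    params = urllib.parse.urlencode(
        {'query': query, 'start': start, 'end': end, 'step': step})
    with urllib.request.urlopen(f'{PROM_URL}?{params}') as resp:
        return json.load(resp)['data']['result']

if __name__ == '__main__':
    # Time window copied from the Grafana links above (ms -> s).
    for series in query_range(QUERY, 1564930491, 1564932592):
        print(series['metric'].get('memcached_server'),
              f"{len(series['values'])} samples")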

Error log on mw1280 (none found on mw2250):

elukey@mw1280:~$ cat /var/log/mcrouter.log.1
Aug  4 15:12:26 mw1280 mcrouter[945]: I0804 15:12:26.124703  1044 AsyncSSLSocket.cpp:119] TCP connect failed: AsyncSocketException: connect timed out after 1000ms, type = Timed out
Aug  4 15:12:26 mw1280 mcrouter[945]: I0804 15:12:26.124752  1044 ProxyDestination.cpp:453] 10.192.16.54:11214 marked hard TKO. Total hard TKOs: 1; soft TKOs: 0. Reply: mc_res_connect_error
Aug  4 15:12:29 mw1280 mcrouter[945]: I0804 15:12:29.767894  1044 ProxyDestination.cpp:453] 10.192.16.54:11214 unmarked TKO. Total hard TKOs: 0; soft TKOs: 0. Reply: mc_res_ok

The only noticeable thing to report is that a few minutes before the connect timeout there was a rise in FOUND responses from the eqiad shards for mw1280:

https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&from=1564930491689&to=1564932592601&var-source=eqiad%20prometheus%2Fops&var-cluster=All&var-instance=mw1280&var-memcached_server=All&panelId=27&fullscreen

In theory I'd have expected a corresponding rise in traffic towards mw2255, but that didn't happen.

As described in T255511, we should think about adding a gutter pool for the mw2* proxies to better handle TKOs when they happen.
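
To make the idea a bit more concrete, below is a minimal sketch of how a gutter pool could be expressed as a FailoverWithExptimeRoute in the mcrouter config. The pool names, the gutter host and the failover exptime are invented for illustration; the real change would go through the puppet templates that generate the config.

#!/usr/bin/env python3
# Rough sketch of a gutter-pool route for the mcrouter config, built as a
# Python dict and dumped to JSON. Pool names, the gutter host and the 10s
# failover exptime are made up for illustration.
import json

config = {
    'pools': {
        'codfw-proxies': {'servers': ['10.192.0.61:11214', '10.192.16.54:11214']},
        'codfw-gutter': {'servers': ['10.192.32.10:11214']},  # hypothetical gutter host
    },
    'route': {
        'type': 'FailoverWithExptimeRoute',
        # Normal path: the existing codfw proxy pool.
        'normal': 'PoolRoute|codfw-proxies',
        # On errors/TKO, retry against the gutter pool...
        'failover': ['PoolRoute|codfw-gutter'],
        # ...with a short TTL so entries written there expire quickly
        # once the primary proxy recovers.
        'failover_exptime': 10,
    },
}

print(json.dumps(config, indent=2))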

We are not using the proxies anymore; the TKOs that we still see every now and then could be related to T291385. There is not much we can do, so I'm closing this.