https://logstash.wikimedia.org/goto/57d527a71a5b8b5b95281cdbd67aca07
Recently we switched the job runners from HHVM to PHP 7.0. This switch also changed how CirrusSearch pools SSL connections: instead of an HHVM-internal curl pool, it now uses an external nginx instance that proxies and pools connections on localhost of each app server.
Since the switch, we've seen a low volume of `504 Gateway Time-out` errors in the CirrusSearch application code: typically about 15 per 3 hours, but as high as 300 per 3 hours. Overall volume is in the low thousands of requests per second, so this is a tiny fraction of requests, but they still shouldn't be timing out. Perhaps 2/3 of the timeouts cluster around 1s; the other third range from 3s to 10+s.
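To put "a tiny fraction" in numbers, here is a rough back-of-the-envelope sketch. The figure of 1,000 requests per second is an assumption standing in for "low thousands"; because the real volume is at least that, the results are upper bounds on the error fraction.

```python
# Rough upper bound on the fraction of requests that hit a 504.
# ASSUMED_REQUESTS_PER_SECOND is an illustrative assumption ("low thousands"
# rounded down to 1,000), not a measured value.
WINDOW_SECONDS = 3 * 60 * 60
ASSUMED_REQUESTS_PER_SECOND = 1_000


def error_fraction(errors_in_window,
                   requests_per_second=ASSUMED_REQUESTS_PER_SECOND,
                   window_seconds=WINDOW_SECONDS):
    """Return errors divided by the total requests served in the window."""
    return errors_in_window / (requests_per_second * window_seconds)


for errors in (15, 300):  # typical and worst observed counts per 3 hours
    print(f"{errors} errors / 3h -> {error_fraction(errors):.2e} of requests")
```

Under that assumption, the typical case is roughly 1.4 requests in a million, and the worst case roughly 2.8 in 100,000.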
Network path traveled:
mw app server -> nginx (localhost) -> lvs -> nginx (elastic) -> elasticsearch
| | connect timeout | read timeout |
| cirrussearch | 5-10s | 10s-120s |
| nginx (mw app) | 1s | 600s |
| nginx (elastic) | 60s | 180s |
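As a sanity check on the table, the sketch below finds the tightest timeout of each kind and lists which configured timeouts an observed failure duration lands near. Where the table gives a range, the lower bound is used, and the 0.2s tolerance is an arbitrary choice; both are assumptions for illustration only. A match is merely consistent with that timeout firing and does not prove it was the cause.

```python
# Timeouts in seconds, copied from the table above. For ranges (5-10s,
# 10s-120s) the lower bound is used; that choice is an assumption.
TIMEOUTS = {
    "cirrussearch": {"connect": 5.0, "read": 10.0},
    "nginx (mw app)": {"connect": 1.0, "read": 600.0},
    "nginx (elastic)": {"connect": 60.0, "read": 180.0},
}


def tightest(kind):
    """Return (hop, seconds) for the smallest timeout of the given kind."""
    hop = min(TIMEOUTS, key=lambda name: TIMEOUTS[name][kind])
    return hop, TIMEOUTS[hop][kind]


def matching_timeouts(duration, tolerance=0.2):
    """List (hop, kind) pairs whose timeout is within `tolerance` of `duration`."""
    return [
        (hop, kind)
        for hop, kinds in TIMEOUTS.items()
        for kind, seconds in kinds.items()
        if abs(seconds - duration) <= tolerance
    ]


print(tightest("connect"))      # ('nginx (mw app)', 1.0)
print(tightest("read"))         # ('cirrussearch', 10.0)
print(matching_timeouts(1.05))  # [('nginx (mw app)', 'connect')]
```

With these values, a failure at about 1s lines up only with the connect timeout of the nginx proxy on the app server.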
As an initial fix: it looks like we previously had a 5s connect timeout, whereas the new proxy uses a 1s connect timeout. We can raise the connect timeout on the proxy as a temporary measure, but ideally a 1s connect timeout should be plenty.