We recently switched the job runners from HHVM to PHP 7.2. This switch also changed how CirrusSearch pools SSL connections: instead of HHVM's internal curl connection pool, each app server now runs a local nginx instance that handles the proxying and connection pooling on localhost.
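For context, that localhost proxy is roughly the shape sketched below: nginx listens on a local port and keeps a pool of persistent TLS connections to the elasticsearch endpoint. The listen port and upstream address are taken from the error log line further down; the directive values and overall layout are my assumption of a typical setup of this kind, not the actual production config.

```nginx
# Sketch of the per-appserver localhost proxy (illustrative values only).
upstream elasticsearch {
    server 10.2.1.30:9243;     # LVS VIP fronting the elastic-side nginx tier
    keepalive 100;             # pool of reusable upstream connections
}

server {
    listen 127.0.0.1:14243;

    location / {
        proxy_pass https://elasticsearch;
        proxy_http_version 1.1;          # required for upstream keepalive
        proxy_set_header Connection "";  # don't pass "Connection: close" upstream
        proxy_ssl_session_reuse on;      # reuse TLS sessions to cut handshake cost
    }
}
```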
Since the switch, though, we've seen a low volume of `504 Gateway Time-out` errors in the CirrusSearch application code: typically about 15 per 3 hours, but as high as 300 per 3 hours. Overall request volume is in the low thousands per second, so this is a tiny fraction of requests, but they still shouldn't be timing out. Roughly 2/3 of the timeouts cluster around 1s; the other third range from 3s to 10+s.
Network path traveled:
mw app server -> nginx (localhost) -> lvs -> nginx (elastic) -> elasticsearch
| | connect timeout | read timeout |
| --- | --- | --- |
| CirrusSearch | 5-10s | 10s-120s |
| nginx (mw app) | 1s | 600s |
| nginx (elastic) | 60s | 180s |
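In nginx terms, the mw-app row of that table corresponds roughly to the standard proxy timeout directives shown below. The directive names are real nginx settings, but exactly where and how they are set in our configs is my assumption:

```nginx
# nginx (mw app) localhost proxy -- values from the table above
proxy_connect_timeout 1s;    # covers TCP connect plus the TLS handshake to the upstream
proxy_read_timeout    600s;  # time allowed between successive reads from the upstream

# nginx (elastic), analogously in its own config on the elasticsearch-side proxies
# proxy_connect_timeout 60s;
# proxy_read_timeout    180s;
```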
I've been able to correlate the application timeout logs with `/var/log/nginx/errors.log` on the mw application servers; a typical matching log line is shown below. This suggests the timeout is most likely coming from the nginx->nginx TLS connection, which has the 1s connect timeout.
```
2019/07/15 13:36:48 [error] 18363#18363: *133964233 upstream timed out (110: Connection timed out) while SSL handshaking to upstream, client: 127.0.0.1, server: , request: "POST /_msearch HTTP/1.1", upstream: "https://10.2.1.30:9243/_msearch", host: "localhost:14243"
```
Before the switch we had a 5s connect timeout; the new proxy uses a 1s connect timeout. As an initial fix we can bump the connect timeout on the proxy, but that is only a temporary measure: 1s really ought to be plenty of time to establish the connection.
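Concretely, the temporary mitigation would look something like the following on the mw-app localhost proxy (again, the directive placement is illustrative, not the actual config change):

```nginx
# Temporary mitigation: give the upstream connect + TLS handshake the same
# headroom the old 5s curl connect timeout allowed.
proxy_connect_timeout 5s;   # was 1s
```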