
Intermittent connect timeout for CirrusSearch connections
Closed, Resolved · Public


Recently we switched the job runners from HHVM to PHP 7.2. This switch also changed how CirrusSearch pools SSL connections: from an HHVM-internal curl pool to an external nginx instance that performs proxying and pooling on localhost of each app server.

Since the switch, though, we've seen a low volume of 504 Gateway Time-out errors in the CirrusSearch application code: typically about 15 per 3 hours, but as high as 300 per 3 hours. The overall volume is in the low thousands of requests per second, so this is a tiny fraction of requests, but they still shouldn't be timing out. Perhaps 2/3 of the timeouts cluster around 1s; the remaining third range from 3s to 10+s.

Network Path traveled:

mw app server -> nginx (localhost) -> lvs -> nginx (elastic) -> elasticsearch

                   connect timeout    read timeout
  nginx (mw app)        1s                600s
  nginx (elastic)      60s                180s
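The timeouts above map onto nginx proxy directives roughly as follows. This is a sketch, not the actual puppet-managed config: upstream names are invented, and only the listen port 14243 (from the error log below) and the standard elasticsearch port 9200 are taken from context.

```nginx
# Localhost proxy on the mw app server (upstream name illustrative)
server {
    listen 127.0.0.1:14243;
    location / {
        proxy_pass https://elastic_lvs;
        proxy_connect_timeout 1s;     # the suspect timeout
        proxy_read_timeout    600s;
    }
}

# nginx in front of elasticsearch
server {
    listen 443 ssl;
    location / {
        proxy_pass http://127.0.0.1:9200;
        proxy_connect_timeout 60s;
        proxy_read_timeout    180s;
    }
}
```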

I've been able to correlate the application timeout logs with /var/log/nginx/errors.log on the mw application servers; the typical matching log line is shown below. This suggests the timeout is most likely coming from the nginx -> nginx TLS connection with its 1s connect timeout.

2019/07/15 13:36:48 [error] 18363#18363: *133964233 upstream timed out (110: Connection timed out) while SSL handshaking to upstream, client:, server: , request: "POST /_msearch HTTP/1.1", upstream: "", host: "localhost:14243"
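The correlation step amounts to counting these handshake-timeout lines per time bucket. A minimal sketch (the real input would be the nginx error log on an mw app server; here a couple of sample lines modeled on the one above stand in for the file):

```shell
# Bucket SSL-handshake upstream timeouts by hour.
printf '%s\n' \
  '2019/07/15 13:36:48 [error] upstream timed out (110: Connection timed out) while SSL handshaking to upstream' \
  '2019/07/15 13:41:02 [error] upstream timed out (110: Connection timed out) while SSL handshaking to upstream' \
  | grep 'while SSL handshaking' \
  | awk '{ print $1, substr($2, 1, 2) ":00" }' \
  | sort | uniq -c
```

Against the real log, replace the `printf` with `cat /var/log/nginx/errors.log`; spikes in a given hour can then be matched against the application-side timeout logs.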

As an initial fix: the old HHVM curl pool used a 5s connect timeout, while the new proxy uses a 1s connect timeout. We can bump the connect timeout on the proxy as a temporary fix, though ideally a 1s connect timeout should be plenty.

Event Timeline

jijiki triaged this task as Medium priority. Jul 15 2019, 2:42 PM
jijiki added a project: serviceops.
jijiki updated the task description. (Show Details)

Change 523194 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] Increase services proxy connect timeout to 5s

Change 523703 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] profile:service_proxy: Add more hiera variables

Change 523703 merged by Effie Mouzeli:
[operations/puppet@production] profile:service_proxy: Add more hiera variables

Change 523955 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hieradata: Set connect_timeout for cirrussearch

Change 523194 abandoned by EBernhardson:
Increase services proxy connect timeout to 5s

in favor of If64ae6fd2e2e5ebc3014773d206bb4f9968df673

Mentioned in SAL (#wikimedia-operations) [2019-07-18T10:15:12Z] <jijiki> Disable puppet on services_proxy hosts - T228063

Change 523955 merged by Effie Mouzeli:
[operations/puppet@production] hieradata: Set connect_timeout for cirrussearch
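The merged change sets a per-service connect timeout in hieradata. A hypothetical sketch of its shape; the actual key names and file layout in operations/puppet may differ, and the 5s value is assumed from the abandoned 523194 proposal:

```yaml
# hieradata override for the cirrussearch service proxy (illustrative)
profile::service_proxy::services:
  search:
    port: 14243
    connect_timeout: '5s'   # bumped from the 1s default
```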

Mentioned in SAL (#wikimedia-operations) [2019-07-18T10:37:49Z] <jijiki> enable puppet on services_proxy hosts - T228063

Let's reopen if the issue persists.

Checked back into this and it's looking much better. July 16 had 2500 gateway timeouts per 12 hours; since deploying, the highest 12-hour period is 250 gateway timeouts. Might be worth continuing to look into, but knocking this down an order of magnitude is probably sufficient.

Could you briefly summarise the impact of such a backend timeout for Cirrus? E.g. is there a retry or fallback? Does it affect responses to end users, and if so is that response non-fatal, non-500, and localised? If it affects things like POST requests or jobs, are they critical or eventually consistent? Thanks :)

It all depends on which code path gets the gateway timeout. Essentially every place Cirrus talks to elasticsearch can end up erroring out. For the most common code paths:

  • User searches are never retried. Depending on what invoked the search and what kind of search it was, the user might see an error message, or, for secondary searches, we might pretend no results exist.
  • Some page updates are retried and some are ignored, depending on what the data update was. Actual content updates get thrown back into the job queue to be retried, but the metric updates used for ranking are generally thrown out.