
Investigate brief CirrusSearch outage (MW exception spike for api.php)
Closed, Resolved · Public · PRODUCTION ERROR

Description

Error

Request ID: W6Fd5wrAMGkAALRjYREAAADA

message
[{exception_id}] {exception_url}   RuntimeException from line 67 of /srv/mediawiki/php-1.32.0-wmf.20/extensions/CirrusSearch/includes/Elastica/PooledHttp.php:

Timeout reached waiting for an available pooled curl connection!
stacktrace
#0 /srv/mediawiki/php-1.32.0-wmf.20/extensions/CirrusSearch/includes/Elastica/PooledHttp.php(67): curl_init_pooled(string)
#1 /srv/mediawiki/php-1.32.0-wmf.20/vendor/ruflin/elastica/lib/Elastica/Transport/Http.php(51): CirrusSearch\Elastica\PooledHttp->_getConnection(boolean)
#2 /srv/mediawiki/php-1.32.0-wmf.20/vendor/ruflin/elastica/lib/Elastica/Request.php(193): Elastica\Transport\Http->exec(Elastica\Request, array)
#3 /srv/mediawiki/php-1.32.0-wmf.20/vendor/ruflin/elastica/lib/Elastica/Client.php(674): Elastica\Request->send()
#4 /srv/mediawiki/php-1.32.0-wmf.20/vendor/ruflin/elastica/lib/Elastica/Multi/Search.php(150): Elastica\Client->request(string, string, string, array, string)
#5 /srv/mediawiki/php-1.32.0-wmf.20/extensions/CirrusSearch/includes/CompletionSuggester.php(208): Elastica\Multi\Search->search()
#6 /srv/mediawiki/php-1.32.0-wmf.20/extensions/CirrusSearch/includes/Util.php(119): Closure$CirrusSearch\CompletionSuggester::suggest()
#7 /srv/mediawiki/php-1.32.0-wmf.20/includes/poolcounter/PoolCounterWorkViaCallback.php(69): Closure$CirrusSearch\Util::wrapWithPoolStats()
#8 /srv/mediawiki/php-1.32.0-wmf.20/includes/poolcounter/PoolCounterWork.php(123): PoolCounterWorkViaCallback->doWork()
#9 /srv/mediawiki/php-1.32.0-wmf.20/extensions/CirrusSearch/includes/Util.php(177): PoolCounterWork->execute()
#10 /srv/mediawiki/php-1.32.0-wmf.20/extensions/CirrusSearch/includes/CompletionSuggester.php(221): CirrusSearch\Util::doPoolCounterWork(string, User, Closure$CirrusSearch\Util::wrapWithPoolStats;357)
#11 /srv/mediawiki/php-1.32.0-wmf.20/extensions/CirrusSearch/includes/CirrusSearch.php(438): CirrusSearch\CompletionSuggester->suggest(string, array)
#12 /srv/mediawiki/php-1.32.0-wmf.20/extensions/CirrusSearch/includes/CirrusSearch.php(531): CirrusSearch->getSuggestions(string, array, CirrusSearch\SearchConfig)
#13 /srv/mediawiki/php-1.32.0-wmf.20/includes/search/SearchEngine.php(566): CirrusSearch->completionSearchBackend(string)
#14 /srv/mediawiki/php-1.32.0-wmf.20/includes/search/SearchEngine.php(609): SearchEngine->completionSearchBackendOverfetch(string)
#15 /srv/mediawiki/php-1.32.0-wmf.20/extensions/CirrusSearch/includes/CirrusSearch.php(542): SearchEngine->completionSearch(string)
#16 /srv/mediawiki/php-1.32.0-wmf.20/includes/api/ApiOpenSearch.php(117): CirrusSearch->completionSearchWithVariants(string)
#17 /srv/mediawiki/php-1.32.0-wmf.20/includes/api/ApiOpenSearch.php(91): ApiOpenSearch->search(string, array)
#18 /srv/mediawiki/php-1.32.0-wmf.20/includes/api/ApiMain.php(1587): ApiOpenSearch->execute()
#19 /srv/mediawiki/php-1.32.0-wmf.20/includes/api/ApiMain.php(531): ApiMain->executeAction()
#20 /srv/mediawiki/php-1.32.0-wmf.20/includes/api/ApiMain.php(502): ApiMain->executeActionWithErrorHandling()
#21 /srv/mediawiki/php-1.32.0-wmf.20/api.php(87): ApiMain->execute()
#22 /srv/mediawiki/w/api.php(3): include(string)

Notes

1,512 API request failures were recorded during a roughly 2 minute spike within the 20:13–20:18 UTC window.

A few examples of affected URLs:

  • Search suggestions from desktop: /w/api.php?action=opensearch&format=json&formatversion=2&search=###&namespace=0&limit=10&suggest=true
  • Search suggestion from mobile: /w/api.php?action=query&format=json&prop=pageprops%7Cpageprops%7Cpageimages%7Cdescription&generator=prefixsearch&ppprop=displaytitle&piprop=thumbnail&pithumbsize=80&pilimit=15&redirects=&gpssearch=###&gpsnamespace=0&gpslimit=15
  • Miscellaneous queries that use prefixindex and/or generators.

Event Timeline

Given the high impact, I first reached out to SRE before filing the task. @RobH mentioned that this was due to the accidental disabling of a network port, which has been resolved and doesn't really have any actionable follow-up.

However, I'm filing this because according to the IRC conversation the search cluster was expected to be able to handle a partial failure. This task is to investigate whether that's correct and working as intended.

Indeed the initial cause was my bad network port change, documented on T204772.

At a high level, these logs are saying the connection pools filled up on API application servers.

In terms of overall impact, the search logs show we were serving ~100k search requests per minute and the error rate was on the order of ~1k/minute, so around 1% of requests failed due to the network hiccup. This only affected API application servers because they serve ~90% of the search requests.

We regularly restart multiple servers at a time without seeing this, so something about this event must be different. We are of course on codfw now instead of eqiad, so there could be variances there, but I can't think of anything obvious. One major difference from restarts is that the restart script de-pools servers before taking them out. Additionally, on restart the network port is closed cleanly, as opposed to leaving the connections hanging.

Unfortunately I don't have any insight into the actual state of the connection pools to draw conclusions about what state they were in. We are moving away from hhvm, and this connection pool only exists there, so I'm not sure building out that visibility would be the right solution.

Will have to ponder more what to do with this.

Looking at percentiles we can see tail latency spiked around the same time as this error: https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles-prometheus?orgId=1&from=1537301173784&to=1537303336668&var-cluster=codfw&var-smoothing=1

Looking over the logs some more, we can see PoolCounter rejected significantly more requests, at a rate of 800–1k/s for 70 seconds after the hiccup. Requests were then dropped at a rate of a few hundred per second for another 60 seconds before clearing up: https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles-prometheus?orgId=1&from=1537301173784&to=1537303336668&var-cluster=codfw&var-smoothing=1

Total request rate was around 1.8k/s, so around 50% of requests were rejected for 70 seconds.

For a separate but related bug, per-node latency percentiles failed to collect during the error. This is likely because we ask one node to collect the latencies from all the other nodes and report them at once. Maybe adjust the prometheus collection to only query the local node, so no inter-node network traffic is required.
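As a rough illustration of that idea (this is not the actual collection code; the host and the choice of endpoint are assumptions for the example), elasticsearch's nodes stats API accepts a `_local` filter, so the request is answered by the node that received it and never fans out over inter-node traffic:

```php
<?php
// Illustrative sketch only: query stats from the local node via `_local`,
// avoiding any dependence on other nodes being reachable.
$ch = curl_init( 'http://localhost:9200/_nodes/_local/stats' );
curl_setopt_array( $ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_CONNECTTIMEOUT_MS => 500,
    CURLOPT_TIMEOUT_MS => 2000,
] );
$body = curl_exec( $ch );
curl_close( $ch );
$stats = $body === false ? null : json_decode( $body, true );
```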

Overall, what this means is that elasticsearch started taking too long to respond, up to 40s at the 99th percentile. This, expectedly, ran into the configured concurrency limits. For the exact error message in this task I've uploaded a patch to turn the RuntimeException from hhvm into an Elastica exception, so it takes the normal error handling path and reports an error instead of returning a 500. That might be enough for this particular ticket, but we should perhaps open another task for figuring out why elasticsearch performed so poorly in the face of a network partition.
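For reference, a minimal sketch of the shape of that fix; this is not the actual content of the Gerrit change, and the pool name plus the stripped-down class body are hypothetical:

```php
<?php
namespace CirrusSearch\Elastica;

// Sketch only; the merged change may differ. The idea is to catch the
// RuntimeException thrown by hhvm's pooled-curl layer when the pool is
// exhausted and rethrow it as an Elastica exception, so it takes the existing
// Elastica error-handling path instead of surfacing as an uncaught 500.
class PooledHttp extends \Elastica\Transport\Http {
    /** @var string hypothetical name of the hhvm curl pool */
    private $poolName = 'cirrus-elastic';

    protected function _getConnection( $persistent = true ) {
        try {
            // curl_init_pooled() is hhvm-only; it blocks until a pooled handle
            // is free and throws RuntimeException when the wait times out.
            return curl_init_pooled( $this->poolName );
        } catch ( \RuntimeException $e ) {
            throw new \Elastica\Exception\RuntimeException( $e->getMessage(), 0, $e );
        }
    }
}
```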

Change 461260 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Rethrow RuntimeException from hhvm as an Elastica exception

https://gerrit.wikimedia.org/r/461260

Yet another thing to look into: our dashboard for PoolCounter rejections only went up to a few hundred per minute, but logstash recorded orders of magnitude more rejections. At a minimum, perhaps that graph is actually in rejections/sec rather than the labeled rejections/min?

https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles-prometheus?orgId=1&from=1537301173784&to=1537303336668&var-cluster=codfw&var-smoothing=1&panelId=65&fullscreen

To summarize some of the discussion I had with @dcausse:

Cirrus / elasticsearch are robust to losing nodes, in the sense that the cluster recovers automatically, no data is lost, and no manual intervention is required. That being said, losing 3 nodes unexpectedly has some impact:

  1. Ongoing requests to the failed nodes are obviously not going to complete.
  2. Any new requests sent to the failed nodes (before pybal is aware of the failure) are also going to fail.
  3. Internal cluster traffic is going to be affected until the nodes are identified as failed by elasticsearch (based on timeouts of internal communication and active discovery).
  4. Cirrus uses HTTP connection pools; open connections to the failed nodes are going to be reused until they fail, at which point they are recycled and new connections are created to the healthy nodes (see the sketch after this list).
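A minimal illustration of point 4; this is the general reuse-until-failure pattern, not the actual hhvm pool implementation, and the function and pool shape are invented for the example:

```php
<?php
// Pooled handles are reused until a request on them fails; only then is the
// handle discarded and a fresh connection opened, which the load balancer
// will route to a healthy node.
function executeWithPool( array &$pool, $url ) {
    $ch = array_pop( $pool ) ?: curl_init();
    curl_setopt_array( $ch, [
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT_MS => 10000, // hypothetical read timeout
    ] );
    $body = curl_exec( $ch );
    if ( $body === false ) {
        // Request to a dead node: drop the handle instead of returning it.
        curl_close( $ch );
        return null;
    }
    $pool[] = $ch; // healthy handle goes back into the pool for reuse
    return $body;
}
```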

Note that in the case of an orderly shutdown of elasticsearch (like during maintenance), ongoing connections are cleanly terminated, so 3) and 4) don't apply.

There isn't much we can do about 1) and 2), and that impact is minimal.

About 3) and 4), we could reduce various timeouts (either the connect / read timeouts at the curl level, or the query timeout inside the elasticsearch payload). But that needs to be done carefully, as some requests are expected to take time to complete. The most impacted requests are probably completion suggester queries, which in normal operation have a 95th percentile of ~20ms at ~1k queries per second. When nodes are lost, queries routed to those nodes suddenly jump to the timeout limit (it's not exactly clear which timeout we hit, but probably at least 10 seconds).

An additional complexity in tuning those timeouts is that the connection pools are shared between different types of requests, with different expectations in terms of nominal response times.
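For illustration, the two timeout layers being discussed look roughly like this; the host, index, and every value here are placeholders rather than our production settings:

```php
<?php
// Client-side timeouts (curl) bound how long we wait on an unreachable node,
// while the elasticsearch-side "timeout" in the payload only bounds work on
// shards that actually answered. Both would need tuning per request type.
$query = [
    'timeout' => '500ms', // elasticsearch-side: give up on slow shards
    'query' => [ 'match' => [ 'title' => 'example' ] ],
];
$ch = curl_init( 'http://localhost:9200/some_index/_search' );
curl_setopt_array( $ch, [
    CURLOPT_POST => true,
    CURLOPT_POSTFIELDS => json_encode( $query ),
    CURLOPT_HTTPHEADER => [ 'Content-Type: application/json' ],
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_CONNECTTIMEOUT_MS => 200, // client-side connect timeout
    CURLOPT_TIMEOUT_MS => 2000,       // client-side overall read timeout
] );
$response = curl_exec( $ch );
curl_close( $ch );
```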

We might want to discuss this in more detail, but it isn't clear at the moment that implementing a solution to fail faster / cleaner is worth it. It would come at the cost of additional complexity, which has its own impact on stability.

Change 461260 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Rethrow RuntimeException from hhvm as an Elastica exception

https://gerrit.wikimedia.org/r/461260

Potentially the reason this dragged on longer than it should have is timeouts updating the cluster state: https://logstash.wikimedia.org/goto/1f67b952e7da4dec76fc66addb6b901b

| timestamp | cluster version | hosts timed out |
| --- | --- | --- |
| 20:18:30 | 303739 | 6, 17, 35 |
| 20:19:07 | 303740 | 2, 3, 4, 6, 7, 9, 11, 17, 19, 24, 28, 31, 35 |
| 20:19:37 | 303741 | 2, 3, 4, 6, 7, 8, 9, 11, 17, 18, 19, 20, 21, 22, 24, 26, 27, 28, 29, 31, 33, 35, 36 |
| 20:20:08 | 303742 | 2, 3, 7, 8, 9, 11, 17, 19, 20, 21, 22, 23, 24, 26, 27, 28, 29, 31, 33, 35 |
| 20:20:39 | 303743 | 11, 17, 19, 20, 21, 28, 31, 35, 20 |
| 20:21:10 | 303744 | 21, 31, 35 |

Timeline:

  • 20:18:00 - Remove 2005 from cluster
  • 20:18:30 - timeout publishing removal of 2005
  • 20:18:37 - Finish remove 2005 from cluster
  • 20:19:07 - cluster state update to mark shards stale timeout
  • 20:19:37 - cluster state update to reroute unassigned shards timeout
  • 20:19:38 - Remove 2004, 2006 from cluster
  • 20:20:08 - timeout publishing removal of 2004,2006
  • 20:20:39 - cluster state update to mark shards stale timeout
  • 20:21:11 - cluster state update to reroute unassigned shards timeout
  • 20:54:39 - re-add 2006 to cluster
  • 20:54:46 - re-add 2004,2005 to cluster

It's worth noting the elasticsearch docs say at https://www.elastic.co/guide/en/elasticsearch/reference/5.0/modules-discovery-zen.html#_cluster_state_updates

If the master does not receive acknowledgement from at least discovery.zen.minimum_master_nodes nodes within a certain time (controlled by the discovery.zen.commit_timeout setting and defaults to 30 seconds) the cluster state change is rejected

The part about minimum_master_nodes is slightly confusing. Our discovery.zen.minimum_master_nodes is set to 2, but it isn't clear whether 2 master-capable nodes must ack the state update, or whether any 2 nodes could ack it. Additionally, the logstash messages only mention a timeout, not that the update was rejected.

Overall I still believe elasticsearch is misbehaving here. I'm optimistic though that the coming improvements to split the cluster will improve the cluster state behaviour.

> If the master does not receive acknowledgement from at least discovery.zen.minimum_master_nodes nodes within a certain time (controlled by the discovery.zen.commit_timeout setting and defaults to 30 seconds) the cluster state change is rejected
>
> The part about minimum_master_nodes is slightly confusing. Our discovery.zen.minimum_master_nodes is set to 2, but it isn't clear whether 2 master-capable nodes must ack the state update, or whether any 2 nodes could ack it. Additionally, the logstash messages only mention a timeout, not that the update was rejected.

After talking with dcausse, we are pretty sure this can only refer to master-capable nodes. Further review of the logs confirms the ack phase (which has this limit) had no problems, and only the publish phase was timing out. Essentially that means this wasn't a problem this time around, but it seems like something that could be a problem in the future: with our current setup, losing 2 master-capable nodes would stop all writes to the cluster, along with preventing nodes from being removed from the cluster state.

I think for the purposes of this ticket we can call it complete. There isn't a whole lot that can be done about the network issues from the mediawiki side besides the already merged patch to fail gracefully. From the elasticsearch side, the oversized impact (dropping requests for ~2 minutes) is expected to be mitigated by ongoing work to reduce the number of shards per cluster.

debt claimed this task.
mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:08 PM