
CirrusSearch/includes/Elastica/PooledHttp.php:67 Timeout reached waiting for an available pooled curl connection!
Closed, Resolved · Public · PRODUCTION ERROR

Description

When I rolled 1.30.0-wmf.9 forward to all wikis, there were a number of alerts that started firing in *-operations:

23:01:38        icinga-wm PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and
                          target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
23:01:38        icinga-wm PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and
                          target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
23:01:47        icinga-wm PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and
                          target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
23:02:07        icinga-wm PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and
                          target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
23:02:37        icinga-wm PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and
                          target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
23:02:38        icinga-wm PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and
                          target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
23:02:47        icinga-wm RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy
23:02:57        icinga-wm PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0]

This showed up in the error logs at the same time:

#0 /srv/mediawiki/php-1.30.0-wmf.9/extensions/CirrusSearch/includes/Elastica/PooledHttp.php(67): curl_init_pooled(string)
#1 /srv/mediawiki/php-1.30.0-wmf.9/vendor/ruflin/elastica/lib/Elastica/Transport/Http.php(51): CirrusSearch\Elastica\PooledHttp->_getConnection(boolean)
#2 /srv/mediawiki/php-1.30.0-wmf.9/vendor/ruflin/elastica/lib/Elastica/Request.php(171): Elastica\Transport\Http->exec(Elastica\Request, array)
#3 /srv/mediawiki/php-1.30.0-wmf.9/vendor/ruflin/elastica/lib/Elastica/Client.php(662): Elastica\Request->send()
#4 /srv/mediawiki/php-1.30.0-wmf.9/vendor/ruflin/elastica/lib/Elastica/Index.php(521): Elastica\Client->request(string, string, array, array)
#5 /srv/mediawiki/php-1.30.0-wmf.9/extensions/CirrusSearch/includes/CompletionSuggester.php(203): Elastica\Index->request(string, string, array, array)
#6 [internal function]: Closure$CirrusSearch\CompletionSuggester::suggest()
#7 /srv/mediawiki/php-1.30.0-wmf.9/extensions/CirrusSearch/includes/Util.php(116): call_user_func_array(Closure$CirrusSearch\CompletionSuggester::suggest;1635, array)
#8 /srv/mediawiki/php-1.30.0-wmf.9/includes/poolcounter/PoolCounterWorkViaCallback.php(69): Closure$CirrusSearch\Util::wrapWithPoolStats()
#9 /srv/mediawiki/php-1.30.0-wmf.9/includes/poolcounter/PoolCounterWork.php(123): PoolCounterWorkViaCallback->doWork()
#10 /srv/mediawiki/php-1.30.0-wmf.9/extensions/CirrusSearch/includes/Util.php(174): PoolCounterWork->execute()
#11 /srv/mediawiki/php-1.30.0-wmf.9/extensions/CirrusSearch/includes/CompletionSuggester.php(217): CirrusSearch\Util::doPoolCounterWork(string, User, Closure$CirrusSearch\Util::wrapWithPoolStats;1347)
#12 /srv/mediawiki/php-1.30.0-wmf.9/extensions/CirrusSearch/includes/CirrusSearch.php(472): CirrusSearch\CompletionSuggester->suggest(string, array)
#13 /srv/mediawiki/php-1.30.0-wmf.9/extensions/CirrusSearch/includes/CirrusSearch.php(601): CirrusSearch->getSuggestions(string, array, CirrusSearch\SearchConfig)
#14 /srv/mediawiki/php-1.30.0-wmf.9/includes/search/SearchEngine.php(520): CirrusSearch->completionSearchBackend(string)
#15 /srv/mediawiki/php-1.30.0-wmf.9/extensions/CirrusSearch/includes/CirrusSearch.php(612): SearchEngine->completionSearch(string)
#16 /srv/mediawiki/php-1.30.0-wmf.9/includes/api/ApiQueryPrefixSearch.php(54): CirrusSearch->completionSearchWithVariants(string)
#17 /srv/mediawiki/php-1.30.0-wmf.9/includes/api/ApiQueryPrefixSearch.php(41): ApiQueryPrefixSearch->run(ApiPageSet)
#18 /srv/mediawiki/php-1.30.0-wmf.9/includes/api/ApiPageSet.php(180): ApiQueryPrefixSearch->executeGenerator(ApiPageSet)
#19 /srv/mediawiki/php-1.30.0-wmf.9/includes/api/ApiPageSet.php(144): ApiPageSet->executeInternal(boolean)
#20 /srv/mediawiki/php-1.30.0-wmf.9/includes/api/ApiQuery.php(238): ApiPageSet->execute()
#21 /srv/mediawiki/php-1.30.0-wmf.9/includes/api/ApiMain.php(1583): ApiQuery->execute()
#22 /srv/mediawiki/php-1.30.0-wmf.9/includes/api/ApiMain.php(546): ApiMain->executeAction()
#23 /srv/mediawiki/php-1.30.0-wmf.9/includes/api/ApiMain.php(517): ApiMain->executeActionWithErrorHandling()
#24 /srv/mediawiki/php-1.30.0-wmf.9/api.php(94): ApiMain->execute()
#25 /srv/mediawiki/w/api.php(3): include(string)
#26 {main}

Event Timeline

Restricted Application added a subscriber: Aklapper.
thcipriani triaged this task as Unbreak Now! priority. · Jul 13 2017, 11:57 PM

Alerts and errors stopped when I rolled back to 1.30.0-wmf.7 so this seems to be caused by 1.30.0-wmf.9. Adding as blocker, raising priority.

While possible, I doubt the cirrus error is related to the train rollout. We have an ongoing intermittent issue related to elasticsearch with T168816 and more generally T169498.

Actually, this spike looks a little different than others we have seen recently, although the symptoms are the same. Specifically, the cache miss rate of morelike (used by recommendations) API requests went from ~80 misses per second up to 900. I'm not sure yet if this is some behaviour of the recommendation API retrying requests, or if there was an actual spike of external requests; I can probably dig through the web request logs and figure it out tomorrow. It could also perhaps be something like the MediaWiki API sending back different headers, and the varnish caching layer in front of the cirrus caching layer not caching the new responses. Unsure for now.

> While possible, I doubt the cirrus error is related to the train rollout. We have an ongoing intermittent issue related to elasticsearch with T168816 and more generally T169498.

I definitely thought this initially as well. But the correlation with the timing of the train is so strong: https://logstash.wikimedia.org/goto/e5b90b5409d0a38cecf0e191c13e280a

The first wave of error messages/icinga alerts began a few minutes after I rolled wmf.9 forward; the gap between the two waves in the graph on logstash is the time period during which I rolled back. Then, thinking that the timeouts must have been a coincidence, I rolled forward again, which produced the second wave of errors that continued until I finally rolled back, after which they ceased almost immediately.

I doubt that this task is pointing at the root cause of the error rate spike, but there does seem to be something in wmf.9 that caused this to happen.

Poked at the webrequest logs with:

select year, month, day, hour, cache_status, round(cast(ts as int)/60) as minute, count(1)
from webrequest
where year=2017 and month=7 and ((day=13 and hour between 22 and 23) or (day=14 and hour=0))
  and uri_query like '%gsrsearch=morelike:%'
group by year, month, day, hour, round(cast(ts as int)/60), cache_status
order by minute desc limit 999999;

The request rate was relatively steady, dropping from 4k per minute down to 3k per minute and under as time went by. This matches the previous day pretty closely. This seems to suggest there was no external increase in requests, but there was certainly an internal increase in requests. One oddity is that cache_status is almost entirely pass.

This doesn't look like it's retries: parsing out the query term from the cirrus logs for morelike searches and comparing the number of requests to the number of unique searches, I see ~96% unique searches in the hours before and after the load spike. The hour of the load spike saw 87% unique searches. This is indeed lower, but not enough to account for a 4 to 5x increase in request rate.
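
To put rough numbers on that (an illustrative PHP back-of-the-envelope, not a script that was actually run): if U is the unique-search fraction, the average number of requests per distinct search is roughly 1/U, so dropping from ~96% to ~87% unique only explains about a 1.1x increase in load, not 4 to 5x.

<?php
// Illustrative only: requests per distinct search is roughly 1 / unique fraction.
foreach ( [ [ 'before/after spike', 0.96 ], [ 'during spike', 0.87 ] ] as [ $label, $uniqueFraction ] ) {
	printf( "%s: ~%.2f requests per distinct search\n", $label, 1 / $uniqueFraction );
}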

Still poking around, but no answers yet

Turns out I was originally only looking at a portion of the traffic, missing requests that were urlencoded. The full data, from an hour before the deploy to an hour after the deploy, is:

year | month | day | hour | cache_status | count
2017 | 7     | 13  | 22   | hit          | 4927895
2017 | 7     | 13  | 22   | int          | 3683
2017 | 7     | 13  | 22   | miss         | 356418
2017 | 7     | 13  | 22   | pass         | 413178
2017 | 7     | 13  | 23   | -            | 1
2017 | 7     | 13  | 23   | hit          | 3190371
2017 | 7     | 13  | 23   | int          | 2644
2017 | 7     | 13  | 23   | miss         | 931815
2017 | 7     | 13  | 23   | pass         | 1068460
2017 | 7     | 14  | 0    | hit          | 4505927
2017 | 7     | 14  | 0    | int          | 2877
2017 | 7     | 14  | 0    | miss         | 304398
2017 | 7     | 14  | 0    | pass         | 295204

This works out to ~5.7M reqs before deploy, 5.2M during deploy, and 5.1M after deploy. So the overall request rate is pretty sane; what changed is the hit/miss/pass rates. Something caused cache hit rates to go down far enough to overload the elasticsearch cluster behind the caches.
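
For reference, summing the table above (ignoring the single "-" row) gives the per-hour totals and hit rates; a quick illustrative PHP calculation with the counts copied verbatim:

<?php
// Per-hour totals and cache hit rates from the webrequest counts above.
$hours = [
	'2017-07-13 22:00 (before deploy)' => [ 'hit' => 4927895, 'int' => 3683, 'miss' => 356418, 'pass' => 413178 ],
	'2017-07-13 23:00 (during deploy)' => [ 'hit' => 3190371, 'int' => 2644, 'miss' => 931815, 'pass' => 1068460 ],
	'2017-07-14 00:00 (after deploy)'  => [ 'hit' => 4505927, 'int' => 2877, 'miss' => 304398, 'pass' => 295204 ],
];
foreach ( $hours as $label => $counts ) {
	$total = array_sum( $counts );
	printf( "%s: %.1fM requests, %.0f%% hit, %.0f%% miss+pass\n",
		$label, $total / 1e6,
		100 * $counts['hit'] / $total,
		100 * ( $counts['miss'] + $counts['pass'] ) / $total );
}
// Prints roughly: 5.7M / 86% hit, then 5.2M / 61% hit, then 5.1M / 88% hit.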

@Jdlrobson

I think I've found our culprit: a change to the RelatedArticles extension made all existing cached API responses invalid, and the flood of requests trying to refill the varnish caches is overloading the search cluster.

https://gerrit.wikimedia.org/r/#/c/355636/

The change is basically that prior to this patch two query parameters were passed as blank, i.e.: ...&gsrlimit=&pilimit=&. After this change these two are filled with default values: ...&gsrlimit=3&pilimit=3&.... Changing anything that breaks caching on this endpoint is fairly dangerous. I will make some poolcounter adjustments in CirrusSearch to prevent a flood of this type of query from overloading the search cluster, and hopefully allow other request types to go by unmolested, but that won't directly help RelatedArticles; they will just get more queries that error out, because the pool counter queue for morelike requests will fill up.
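
For those not familiar with the mechanism, the adjustment amounts to giving morelike queries their own PoolCounter pool so they queue and fail independently of regular searches. A minimal sketch of what such a $wgPoolCounterConf entry looks like (the values here are placeholders, not what was actually deployed; see the patches linked below):

// Sketch only: a dedicated pool for morelike queries, with placeholder limits.
$wgPoolCounterConf['CirrusSearch-MoreLike'] = [
	'class'    => 'PoolCounter_Client',
	'timeout'  => 15,  // seconds a request will wait for a worker slot
	'workers'  => 50,  // concurrent morelike queries allowed
	'maxqueue' => 200, // beyond this, additional requests fail fast instead of queueing
];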

An immediate fix to get wmf.9 out will be either to revert the above patch or to adjust it so that it doesn't pass the default values.

Change 365405 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Add PoolCounter specifically for morelike

https://gerrit.wikimedia.org/r/365405

Change 365406 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Configure CirrusSearch-MoreLike pool counter

https://gerrit.wikimedia.org/r/365406

> I think I've found our culprit: a change to the RelatedArticles extension made all existing cached API responses invalid, and the flood of requests trying to refill the varnish caches is overloading the search cluster.

I'm a little confused: previously the limits were set to 3, but we made them configurable. The only wiki where they shouldn't be set to 3 is Wikivoyage... but that's not using Cirrus. The query looks exactly the same for me before and after that change... what am I missing? It is Friday...

Change 365420 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/RelatedArticles@master] Add limit via ResourceLoaderGetConfigVars

https://gerrit.wikimedia.org/r/365420
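
For context, ResourceLoaderGetConfigVars is the standard MediaWiki hook for exposing server-side configuration to client-side JavaScript through mw.config. A rough sketch of the general pattern, with hypothetical config/variable names (not necessarily what the patch itself does):

// Sketch of the general pattern only; the config/variable names are hypothetical.
$wgHooks['ResourceLoaderGetConfigVars'][] = function ( array &$vars ) {
	global $wgRelatedArticlesCardLimit; // hypothetical config variable
	// Client-side code can then read mw.config.get( 'wgRelatedArticlesCardLimit' )
	// rather than having the value hard-coded in JS or baked into a cached URL.
	$vars['wgRelatedArticlesCardLimit'] = $wgRelatedArticlesCardLimit;
	return true;
};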

Change 365420 merged by Jdlrobson:
[mediawiki/extensions/RelatedArticles@master] Add limit via ResourceLoaderGetConfigVars

https://gerrit.wikimedia.org/r/365420

Change 365647 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/RelatedArticles@wmf/1.30.0-wmf.9] Add limit via ResourceLoaderGetConfigVars

https://gerrit.wikimedia.org/r/365647

@EBernhardson thank you for your investigation on this task! Sorry to send you so far down the rabbit hole, but I'm glad you came back with the rabbit :) This is a very satisfying explanation.

And thanks too to @Jdlrobson for the quick patches :)

My current plan is to merge the outstanding patches as part of morning SWAT today and then roll the train forward.

Change 365406 merged by jenkins-bot:
[operations/mediawiki-config@master] Configure CirrusSearch-MoreLike pool counter

https://gerrit.wikimedia.org/r/365406

Mentioned in SAL (#wikimedia-operations) [2017-07-17T18:16:13Z] <niharika29@tin> Synchronized wmf-config/PoolCounterSettings.php: Configure CirrusSearch-MoreLike pool counter [mediawiki-config] - https://gerrit.wikimedia.org/r/365406 (T170648) (duration: 02m 54s)

Change 365677 had a related patch set uploaded (by Thcipriani; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@wmf/1.30.0-wmf.9] Add PoolCounter specifically for morelike

https://gerrit.wikimedia.org/r/365677

Change 365677 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@wmf/1.30.0-wmf.9] Add PoolCounter specifically for morelike

https://gerrit.wikimedia.org/r/365677

Mentioned in SAL (#wikimedia-operations) [2017-07-17T19:26:18Z] <thcipriani@tin> Synchronized php-1.30.0-wmf.9/extensions/CirrusSearch: [[gerrit:365677|Add PoolCounter specifically for morelike]] T170648 (duration: 03m 02s)

Change 365647 merged by jenkins-bot:
[mediawiki/extensions/RelatedArticles@wmf/1.30.0-wmf.9] Add limit via ResourceLoaderGetConfigVars

https://gerrit.wikimedia.org/r/365647

thcipriani assigned this task to EBernhardson.

Change 365405 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add PoolCounter specifically for morelike

https://gerrit.wikimedia.org/r/365405

mmodell changed the subtype of this task from "Task" to "Production Error". · Aug 28 2019, 11:10 PM