
RESTBase 500 spike of all /page/related/ hits following 1.35.0-wmf.19 all-wiki deployment
Closed, Resolved · Public · PRODUCTION ERROR



Spike in Varnish 5xx errors (~ 2k/min) following all-wiki promotion of 1.35.0-wmf.19.

Screen Shot 2020-02-13 at 12.42.29 PM.png (706×2 px, 142 KB)


Timeline from SAL:

20:33 marxarelli: rollback to group1 due to 500 spike (2k/min) (T233867)
20:32 dduvall@deploy1001: rebuilt and synchronized wikiversions files: (no justification provided)
20:30 marxarelli: varnish 500 spike. rolling back
20:20 gehel: restarting blazegraph + updater on wdqs2006
20:19 dduvall@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.19


Stack Trace
res.message	500: server_error
res.msg	MW API call error cirrussearch-too-busy-error	HTTPError
res.status	500
HTTPError: 500: server_error
    at apiError (/srv/deployment/restbase/deploy-cache/revs/bfdd342afd64d297821d508d9cf794806727f158/restbase/sys/action.js:127:12)
    at checkQueryResponse (/srv/deployment/restbase/deploy-cache/revs/bfdd342afd64d297821d508d9cf794806727f158/restbase/sys/action.js:136:15)
    at buildQueryResponse (/srv/deployment/restbase/deploy-cache/revs/bfdd342afd64d297821d508d9cf794806727f158/restbase/sys/action.js:142:5)
    at tryCatcher (/srv/deployment/restbase/deploy-cache/revs/bfdd342afd64d297821d508d9cf794806727f158/node_modules/bluebird/js/release/util.js:16:23)
    at Promise._settlePromiseFromHandler (/srv/deployment/restbase/deploy-cache/revs/bfdd342afd64d297821d508d9cf794806727f158/node_modules/bluebird/js/release/promise.js:547:31)
    at Promise._settlePromise (/srv/deployment/restbase/deploy-cache/revs/bfdd342afd64d297821d508d9cf794806727f158/node_modules/bluebird/js/release/promise.js:604:18)
    at Promise._settlePromise0 (/srv/deployment/restbase/deploy-cache/revs/bfdd342afd64d297821d508d9cf794806727f158/node_modules/bluebird/js/release/promise.js:649:10)
    at Promise._settlePromises (/srv/deployment/restbase/deploy-cache/revs/bfdd342afd64d297821d508d9cf794806727f158/node_modules/bluebird/js/release/promise.js:729:18)
    at _drainQueueStep (/srv/deployment/restbase/deploy-cache/revs/bfdd342afd64d297821d508d9cf794806727f158/node_modules/bluebird/js/release/async.js:93:12)
    at _drainQueue (/srv/deployment/restbase/deploy-cache/revs/bfdd342afd64d297821d508d9cf794806727f158/node_modules/bluebird/js/release/async.js:86:9)
    at Async._drainQueues (/srv/deployment/restbase/deploy-cache/revs/bfdd342afd64d297821d508d9cf794806727f158/node_modules/bluebird/js/release/async.js:102:5)
    at Immediate.Async.drainQueues (/srv/deployment/restbase/deploy-cache/revs/bfdd342afd64d297821d508d9cf794806727f158/node_modules/bluebird/js/release/async.js:15:14)
    at runCallback (timers.js:672:20)
    at tryOnImmediate (timers.js:645:5)
    at processImmediate [as _immediateCallback] (timers.js:617:5)

Event Timeline

dduvall triaged this task as Unbreak Now! priority. Feb 13 2020, 8:44 PM
dduvall created this task.
Jdforrester-WMF renamed this task from Varnish 500 spike following 1.35.0-wmf.19 all-wiki deployment to Varnish 500 spike of all /page/related/ hits following 1.35.0-wmf.19 all-wiki deployment. Feb 13 2020, 10:59 PM

We dug into the web request logs. All of the 500s were coming from /api/rest_v1/page/related/*. Presumably the new MW version produces output that differs from what RESTBase expects, which ends up emitting a 500 error to the user.

For the given time period, the hit rate of the CirrusSearch more_like cache, which is a second-level cache (the HTTP responses should be cached by the traffic infrastructure), dropped from ~75% to ~5%, climbing back to ~12% over the ten minutes of the deployment. The number of successful requests doubled from ~200 to ~400, but this was not enough to compensate for the dramatic drop in hit rate.

We will need to investigate what changed here. The way the cache key for these queries is generated is fairly fragile, especially given that we don't have the ability to drop and rebuild the cache live without dropping requests.

This RESTBase endpoint essentially wraps the Action API's generator=search endpoint with a morelike query and returns the result. The logs indicate the problem is in either the Action API or CirrusSearch.
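The exact request RESTBase generates is not reproduced here, but a morelike query through the Action API's search generator looks roughly like the sketch below. The parameter names come from the Action API; the particular values and helper function are illustrative, and the real RESTBase request carries additional parameters.

```python
# Sketch of the Action API "morelike" query that backs /page/related/.
# build_morelike_query is a hypothetical helper; the real RESTBase
# request includes further prop/format parameters.
from urllib.parse import urlencode

def build_morelike_query(title, limit=20):
    """Build query-string parameters for a morelike: search."""
    return {
        "action": "query",
        "format": "json",
        "generator": "search",
        # CirrusSearch's morelike: keyword finds pages similar to the title
        "gsrsearch": f"morelike:{title}",
        "gsrnamespace": 0,
        "gsrlimit": limit,
    }

params = build_morelike_query("Albert Einstein")
print(urlencode(params))
```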

15:19:49 <James_F> So this is likely needing a fix in Cirrus?
15:19:50 <+ebernhardson> the cache key is basically a hash of the query we send to elasticsearch, so changes in query building can change the key...
15:19:58 <+ebernhardson> James_F: hard to say, the requests could have changed shape too
15:20:06 <+ebernhardson> as in, api<->cirrus integration or whatnot
15:20:09 <James_F> Hmm.
15:20:38 → marxarelli joined (sid99685@wikimedia/dduvall-wmf)
15:22:22 <+ebernhardson> James_F: looking at actual changes to cirrussearch, between 18 and 19....nothing particularly suspicious
15:22:39 <James_F> Would type changes in the API break things?
15:22:47 <James_F> There's the new MW API type checking / enforcement system.
15:23:03 <James_F> Maybe it's just going to be bumpy and there's nothing we can do?
15:23:35 <+ebernhardson> James_F: hmm, well the cache key amounts to md5(serialize($searchQueryArray)), so almost anything that is passed changing would do it
15:23:43 <James_F> Hmm. OK.

Essentially what has happened here is that the namespace passed to elasticsearch changed type from a string to an integer. This changed the request, which invalidates the cache. As for the best path forward, I'm not sure. The most direct fix, of course, is to change that integer back into a string, but there isn't an obviously good place to do that.
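The fragility described above can be illustrated with a small sketch (this is not CirrusSearch's actual code, which is PHP's md5(serialize(...)); the serialization here is a stand-in): because the key hashes the serialized query, a mere type change like 0 → "0" in the namespace filter produces a brand-new key and an effectively cold cache.

```python
# Illustration of cache-key fragility: hashing a serialized query means
# any change to the query's shape, including a value's type, changes
# the key even though the search results would be identical.
import hashlib
import json

def cache_key(query):
    """Hash a canonical serialization of the query, analogous to md5(serialize(...))."""
    return hashlib.md5(json.dumps(query, sort_keys=True).encode()).hexdigest()

old = cache_key({"filter": {"terms": {"namespace": ["0"]}}})  # string namespace (wmf.18)
new = cache_key({"filter": {"terms": {"namespace": [0]}}})    # integer namespace (wmf.19)
print(old != new)  # True: every previously cached entry now misses
```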

Change 572119 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Force namespace filter to use strings

The provisional plan is to make a config change splitting half the traffic to codfw, roll the train with it, and then, once it has stabilised, point all traffic back to eqiad.

As an alternate solution, we may actually be able to drop the cache and repopulate it. There is a failover cluster in the codfw datacenter matching the primary cluster, and CirrusSearch has support for redirecting classes of queries to particular clusters. We can direct all of these related-articles requests to the codfw cluster, and since it is not occupied by the typical search load, we may be able to serve the full request load while the cache refills.
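The per-query-class routing idea can be pictured roughly as below. The names (DEFAULT_CLUSTER, CLUSTER_OVERRIDES, pick_cluster) are hypothetical and do not reflect CirrusSearch's actual configuration keys; the real change was made in operations/mediawiki-config.

```python
# Hypothetical sketch of routing a class of queries to a different
# search cluster; CirrusSearch's real configuration differs.
DEFAULT_CLUSTER = "eqiad"
CLUSTER_OVERRIDES = {
    # send more_like (related-articles) queries to the failover cluster
    "more_like": "codfw",
}

def pick_cluster(query_class):
    """Return the cluster a given query class should be sent to."""
    return CLUSTER_OVERRIDES.get(query_class, DEFAULT_CLUSTER)

print(pick_cluster("more_like"))  # codfw
print(pick_cluster("fulltext"))   # eqiad
```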

Our daily request rate varies from ~450 to ~700 qps of related-articles traffic. During the rollout, eqiad served ~400 qps of that before throttling additional requests at the PoolCounter (the cluster stayed healthy; we rejected requests well over typical rates). We can increase the PoolCounter capacity while directing related-articles traffic at codfw and likely serve the full 700 qps. The cache in question has a 24-hour TTL, so this can all be undone the next day.
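A back-of-the-envelope check using the figures above: at the steady ~75% hit rate, ~700 qps of related-articles traffic puts only ~175 qps on the search backend, but with the cache dropped nearly all of it reaches the backend, which is well past the ~400 qps eqiad sustained before throttling.

```python
# Rough capacity arithmetic from the numbers reported in this task.
peak_qps = 700         # peak related-articles traffic
steady_hit_rate = 0.75 # more_like cache hit rate before the deploy
eqiad_served = 400     # qps eqiad sustained before PoolCounter throttling

backend_qps_warm = peak_qps * (1 - steady_hit_rate)  # load with a warm cache
backend_qps_cold = peak_qps                          # load with a dropped cache
shortfall = backend_qps_cold - eqiad_served          # qps eqiad could not absorb
print(backend_qps_warm, shortfall)  # 175.0 300
```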

Sounds workable. Let's co-ordinate early tomorrow.

Change 572126 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] cirrus: redirect more_like to codfw to rebuild query cache

Change 572126 merged by jenkins-bot:
[operations/mediawiki-config@master] cirrus: redirect more_like to codfw to rebuild query cache

Mentioned in SAL (#wikimedia-operations) [2020-02-14T00:51:50Z] <jforrester@deploy1001> Synchronized wmf-config/PoolCounterSettings.php: T245202 cirrus: Increase the pool counter limits a bit (duration: 01m 05s)

Mentioned in SAL (#wikimedia-operations) [2020-02-14T00:53:35Z] <jforrester@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T245202 cirrus: Move all move_like traffic to codfw (duration: 01m 02s)

Change 572119 abandoned by EBernhardson:
Force namespace filter to use strings

Instead, shifting all more_like traffic to the codfw search cluster to support a full drop-and-rebuild of this cache.

Mentioned in SAL (#wikimedia-operations) [2020-02-18T17:47:21Z] <marxarelli> re-rolling wmf.19 to all wikis (T233867) with eyes particularly on (T245202)

dduvall claimed this task.

Today's re-roll of 1.35.0-wmf.19 showed no increase in PoolCounter rejections and no spikes in RESTBase/Varnish 500s.

Random data points so I can find them next time we have this issue:

  • Typical peak load is ~200 qps of more_like.
  • Peak more_like at deployment on codfw was ~675 qps.
  • Peak search thread pool occupancy was 675, about 50% higher than our typical daily rate when all traffic is on one cluster.
  • No PoolCounter rejections or search thread pool rejections were reported.
  • p75 latency increased from 180 ms to 250 ms; p95 from 350 ms to 550 ms.
  • After 30 minutes: 550 qps, p75 200 ms, p95 425 ms, hit rate 25%.