
PoolCounter queue full error delivered during news-driven traffic spike
Closed, Resolved (Public)

Description

It was reported on Twitter and in a blog post that "pool queue is full" errors were delivered to users during the traffic spike associated with Prince's death. This is not expected or acceptable.

The PoolCounter log showed a spike in "queue full" errors between 17:12 and 17:32. Over the same interval, there was a corresponding spike in pcache.miss.revid.

[Figure: Prince revid PC spike.png]

My theory is that this was due to https://gerrit.wikimedia.org/r/#/c/122847/ , which causes parser cache entries to be treated as expired even if $useOutdated = true. Perhaps PoolCounter queue overflow was at fault in T48014, but if so, that bug should probably be fixed some other way. The point of $useOutdated is to keep the site up during a news event, which is characterised by a high edit rate and exceptionally high traffic to a single article. There is no point in retaining the feature if that goal is not met.
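
To make the intended role of $useOutdated concrete, here is a minimal sketch in Python rather than MediaWiki's PHP. It is a toy model only: `ToyParserCache`, `CacheEntry`, and the `use_outdated` parameter are hypothetical names chosen for illustration and do not reproduce the real ParserCache code or the change under discussion.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CacheEntry:
    rev_id: int  # revision the cached HTML was rendered from
    html: str


class ToyParserCache:
    """Hypothetical stand-in for a parser cache; not MediaWiki's ParserCache."""

    def __init__(self) -> None:
        self._entries: Dict[str, CacheEntry] = {}

    def save(self, title: str, entry: CacheEntry) -> None:
        self._entries[title] = entry

    def get(self, title: str, current_rev_id: int,
            use_outdated: bool) -> Optional[CacheEntry]:
        entry = self._entries.get(title)
        if entry is None:
            return None  # true miss: the caller has to parse
        if entry.rev_id != current_rev_id and not use_outdated:
            return None  # stale entry rejected: the caller has to parse
        return entry  # fresh, or stale but acceptable to this caller


cache = ToyParserCache()
cache.save("Prince", CacheEntry(rev_id=100, html="<p>old render</p>"))

# The article has since been edited, so the current revision is 101.
strict = cache.get("Prince", current_rev_id=101, use_outdated=False)
lenient = cache.get("Prince", current_rev_id=101, use_outdated=True)
```

In this model the strict lookup misses and would force a reparse, while the lenient lookup returns the older render. If stale entries were rejected even for lenient callers, every request during an edit burst would behave like the strict one, which is the failure mode described above.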

Event Timeline

Change 285337 had a related patch set uploaded (by Tim Starling):
In ParserCache, respect $useOutdated

https://gerrit.wikimedia.org/r/285337

> Perhaps PoolCounter queue overflow was at fault in T48014

I don't think PoolCounter queue overflow was at fault there. The problem seemed to be out-of-order processing causing bad cache state:

  • User A creates revision R1.
  • Something (possibly A) starts a job to parse R1. The job is protected by PoolCounter, but I don't think that's actually relevant.
  • User B creates revision R2.
  • (Possibly a job to parse R2 starts and finishes in here, or possibly no job ever starts because B is a bot.)
  • The job to parse R1 finishes, and saves in the cache in such a way that other things think it's R2.
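
The sequence above can be sketched as a toy Python model with a logical clock. This assumes, purely for illustration, that freshness is judged by comparing the entry's save time with the page's last-touched time. All names here (`page`, `parser_cache`, `finish_parse_job`, and so on) are hypothetical and are not MediaWiki APIs.

```python
import itertools

clock = itertools.count(1)  # logical clock; each event takes the next tick

page = {"latest_rev": None, "touched": 0}
parser_cache = {}


def create_revision(rev_id):
    page["latest_rev"] = rev_id
    page["touched"] = next(clock)


def start_parse_job(rev_id):
    # The job reads the revision it was started for.
    return {"parsed_rev": rev_id}


def finish_parse_job(job):
    # The entry is stamped with the time it is saved, not the time the
    # job started, so a slow job can look newer than a later edit.
    parser_cache["entry"] = {
        "parsed_rev": job["parsed_rev"],
        "cache_time": next(clock),
    }


def looks_fresh_by_timestamp():
    return parser_cache["entry"]["cache_time"] >= page["touched"]


create_revision(1)        # User A creates R1
job = start_parse_job(1)  # the job to parse R1 starts
create_revision(2)        # User B creates R2
finish_parse_job(job)     # the R1 job finishes after R2 exists

entry = parser_cache["entry"]
```

At the end of the run, `entry` holds output parsed from R1 while `page["latest_rev"]` is 2, yet `looks_fresh_by_timestamp()` returns true, so a consumer relying only on the timestamp would treat the entry as current.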

I'd think we should be able to restore the functionality of $useOutdated as in https://gerrit.wikimedia.org/r/285337 without negatively affecting the fix for T48014, as anything that is explicitly asking for outdated parser cache entries should already be handling the possibility that it gets one. We might take a closer look at RefreshLinksJob to see if it could include a revid check instead of just a timestamp check, though.
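
As a rough illustration of the difference between a timestamp-only check and one that also compares revision IDs, consider the following Python sketch. The `Page` and `CachedOutput` types and both check functions are hypothetical; they are not taken from RefreshLinksJob, and the logical timestamps are made-up values.

```python
from dataclasses import dataclass


@dataclass
class Page:
    latest_rev_id: int
    touched: int  # logical time of the last change to the page


@dataclass
class CachedOutput:
    rev_id: int      # revision the output was actually parsed from
    cache_time: int  # logical time the entry was saved


def fresh_by_timestamp(page: Page, cached: CachedOutput) -> bool:
    return cached.cache_time >= page.touched


def fresh_by_timestamp_and_rev_id(page: Page, cached: CachedOutput) -> bool:
    # The extra comparison catches output that was saved late but was
    # parsed from an older revision.
    return (cached.rev_id == page.latest_rev_id
            and fresh_by_timestamp(page, cached))


page = Page(latest_rev_id=2, touched=20)
late_save_of_old_rev = CachedOutput(rev_id=1, cache_time=30)
current_output = CachedOutput(rev_id=2, cache_time=30)
```

Here `late_save_of_old_rev` passes the timestamp-only check but fails the combined check, while `current_output` passes both. A consumer that needs output for the latest revision can use the stricter check without affecting callers that explicitly accept outdated entries.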

Change 285449 had a related patch set uploaded (by Aaron Schulz):
Make refreshLinksJob explicitly check the cache rev ID

https://gerrit.wikimedia.org/r/285449

The wfDebug( '...' ) calls should be upgraded to wfDebugLog( 'ParserCache', '...' ) calls so we have an audit trail if this happens again.

Change 285575 had a related patch set uploaded (by Aaron Schulz):
Allow for logging cases when parser cache is rejected

https://gerrit.wikimedia.org/r/285575

Change 285575 merged by jenkins-bot:
Allow for logging cases when parser cache is rejected

https://gerrit.wikimedia.org/r/285575

Change 285337 merged by jenkins-bot:
In ParserCache, respect $useOutdated

https://gerrit.wikimedia.org/r/285337

Change 285449 merged by jenkins-bot:
Make refreshLinksJob explicitly check the cache rev ID

https://gerrit.wikimedia.org/r/285449

tstarling claimed this task.