Problem
During news-driven events that produce a simultaneously high edit rate and high view rate on a single Wikipedia page, large numbers of PHP workers have been observed waiting in PoolCounter for parsing to complete.
Currently, when a page is edited, requests for that page are queued until the new revision has been parsed and the new HTML is available in ParserCache. This queueing is done using a combination of ATS (Apache Traffic Server) request coalescing and MediaWiki's PoolCounter. If PoolCounter's maximum queue size of 100 workers is exceeded, or if its timeout of 15 seconds expires, MediaWiki checks whether there is a stale cache entry. If there is, it serves that instead, with an s-maxage of zero. If there is no stale cache entry, an error is shown. The PoolCounter timeout and maxqueue therefore cannot be reduced without risking the delivery of errors to users.
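For reference, the current behaviour corresponds to a pool configuration roughly like the following. This is a sketch: the 15-second timeout and maxqueue of 100 match the values above, but the class name and worker count are assumptions that vary by installation.

```php
// Illustrative $wgPoolCounterConf entry for article views.
$wgPoolCounterConf = [
	'ArticleView' => [
		'class' => 'PoolCounter_Client', // client for the poolcounterd daemon
		'timeout' => 15,   // seconds to wait for the lock before giving up
		'workers' => 2,    // concurrent parses allowed per key (illustrative)
		'maxqueue' => 100, // maximum number of workers queued per key
	],
];
```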
Proposal
If MediaWiki receives a parser cache miss request for a page, and parsing is already underway (i.e. the PoolCounter lock acquisition would block), it should serve the stale page immediately. The s-maxage could be several seconds, instead of zero, so that ATS can serve stale data at a high request rate while parsing continues. MediaWiki should only wait for the PoolCounter lock if stale data is not available.
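As a rough illustration of the caching change, assuming the stale output is sent through the request's OutputPage:

```php
// Sketch: when returning stale parser output while a reparse is underway,
// let ATS cache and re-serve it briefly instead of forbidding reuse.
// setCdnMaxage() controls the s-maxage value in the Cache-Control header;
// $out is assumed to be the OutputPage for the current request.
$out->setCdnMaxage( 5 ); // a few seconds, instead of the current 0
```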
Implementation details
- Add a timeout parameter to PoolCounter::acquireForAnyone(); if non-null, it overrides the default pool timeout (a possible signature is sketched after this list).
- Judging by locks.c line 129, the daemon already has a special case for timeout=0 and will respond synchronously with "TIMEOUT" if the worker slots are occupied (see the wire-protocol example after this list).
- The unused Redis implementation (PoolCounterRedis) could easily be modified to override the timeout on a per-request basis, although it would probably be less efficient.
- In PoolCounterWork::execute(), in the cacheable branch, initially call acquireForAnyone() with timeout=0 (the resulting flow is sketched after this list).
- If it fails (would have blocked), call fallback().
- If fallback() succeeds, return the response.
- If fallback() fails, call acquireForAnyone() with the default timeout and proceed as before.
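A possible shape for the acquireForAnyone() change from the first bullet. This is only a sketch of the signature on the abstract base class; the concrete daemon and Redis subclasses would interpret the parameter.

```php
abstract class PoolCounter {
	// ... other members elided ...

	/**
	 * Sketch of the proposed signature. If $timeout is null, the pool's
	 * configured timeout applies; if 0, the call returns immediately with
	 * a TIMEOUT status when all worker slots are busy.
	 *
	 * @param int|null $timeout Seconds to wait for the lock, or null for default
	 * @return Status
	 */
	abstract public function acquireForAnyone( $timeout = null );
}
```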
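On the daemon side, the expected wire exchange would then look something like this. The ACQ4ANY command and TIMEOUT response are part of the existing poolcounterd protocol; the key and the workers/maxqueue values here are made up.

```
> ACQ4ANY enwiki:ArticleView:12345 2 100 0
< TIMEOUT
```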
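Putting the last four bullets together, the cacheable branch would look roughly like this. It is a sketch only: status and error handling are simplified, the result of the second acquisition is not shown, and real code must also handle QUEUE_FULL and ERROR. The PoolCounter::LOCKED/DONE/TIMEOUT constants and the doWork()/getCachedWork()/fallback() methods are the existing ones from MediaWiki core.

```php
// Proposed flow in the cacheable branch of PoolCounterWork::execute().
$status = $this->poolCounter->acquireForAnyone( 0 ); // timeout=0: don't block

switch ( $status->value ) {
	case PoolCounter::LOCKED:
		// No parse in progress; we hold the lock, so do the work as before.
		return $this->doWork();

	case PoolCounter::DONE:
		// Another worker just finished; its result should be in the cache.
		return $this->getCachedWork();

	case PoolCounter::TIMEOUT:
		// A parse is already underway. Serve stale data immediately if any.
		$result = $this->fallback();
		if ( $result !== false ) {
			return $result;
		}
		// No stale data available: wait for the lock with the default
		// timeout, exactly as the current code does.
		$status = $this->poolCounter->acquireForAnyone();
		// ... proceed as before
}
```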