
High insertion rate of ParsoidCachePrewarmJob causes substantial backlog
Closed, Resolved, Public

Description

A high insertion rate of ParsoidCachePrewarmJob causes a substantial backlog, delaying prewarm jobs by two hours or more.

parsoid-prewarm-queue-enqueue-rate.png (500×1 px, 79 KB)
parsoid-prewarm-queue-wait-time.png (500×1 px, 50 KB)

It seems likely that the high insertion rates are caused by template edits that invalidate the cached rendering of many highly frequented pages. However, we were not able to identify any specific template edits that would have caused the spikes.

We suspect that the issue is amplified by a stampede effect on pages that are highly frequented but slow to parse. Stampede protection and deduplication may help.

NOTE: A ParsoidCachePrewarmJob gets scheduled on page view whenever the cached output of the old parser is detected to be stale. The old parser uses PoolCounter for stampede protection: if we fail to get a lock, we serve stale output and proceed to schedule a ParsoidCachePrewarmJob. This may happen numerous times before the old parser has finished rendering.
NOTE: ParsoidOutputAccess does not use PoolCounter, so several ParsoidCachePrewarmJobs for the same page may be executing concurrently, re-parsing the same page.
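
The flow in the two notes above can be sketched as a small simulation. This is only an illustration under stated assumptions, not MediaWiki code: `ToyJobQueue`, `handle_view`, `simulate`, and the job name string are hypothetical, and a plain `threading.Lock` stands in for PoolCounter. While the old parser holds the lock, every view fails to acquire it, serves stale output, and pushes a prewarm job. With deduplication, identical pending jobs collapse into one.

```python
import threading
from collections import deque


class ToyJobQueue:
    """In-memory stand-in for a job queue (hypothetical, not MediaWiki's API)."""

    def __init__(self, deduplicate):
        self.deduplicate = deduplicate
        self.pending = deque()
        self._pending_keys = set()  # only used when deduplicate is True

    def push(self, job_type, params):
        """Enqueue a job; return False if an identical job is already pending."""
        key = (job_type, tuple(sorted(params.items())))
        if self.deduplicate:
            if key in self._pending_keys:
                return False
            self._pending_keys.add(key)
        self.pending.append((job_type, params))
        return True


def handle_view(render_lock, queue, page_id):
    """One view of a page whose cached old-parser output is stale."""
    if render_lock.acquire(blocking=False):
        try:
            return "rendered"  # we got the lock, so we would re-render here
        finally:
            render_lock.release()
    # Lock is held by a render in progress: serve stale output and
    # schedule a prewarm job, as described in the first note.
    queue.push("parsoidCachePrewarm", {"page_id": page_id})
    return "stale"


def simulate(deduplicate, views=100, page_id=42):
    """Count jobs enqueued by `views` page views during one slow render."""
    queue = ToyJobQueue(deduplicate)
    render_lock = threading.Lock()
    render_lock.acquire()  # a slow old-parser render is in progress
    try:
        for _ in range(views):
            handle_view(render_lock, queue, page_id)
    finally:
        render_lock.release()
    return len(queue.pending)


if __name__ == "__main__":
    print("without deduplication:", simulate(deduplicate=False))
    print("with deduplication:", simulate(deduplicate=True))
```

In this toy model, the number of enqueued jobs without deduplication grows with the number of views that arrive during the slow render, while with deduplication at most one job per page is pending at a time. The sketch does not model the second note's concern, which is jobs for the same page executing concurrently once they have been dequeued.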

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 935716 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] Add stampede protection to ParsoidCachePrewarmJob.

https://gerrit.wikimedia.org/r/935716

Change 935722 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] ParsoidCachePrewarmJob: enable deduplication

https://gerrit.wikimedia.org/r/935722

akosiaris subscribed.

Moving to radar as we wanna keep an eye on this, but apparently nothing actionable for serviceops yet.

How does an edit to a template get propagated to the pages that use it?

To be more specific:

It seems likely that the high insertion rates are caused by template edits that invalidate the cached rendering of many highly frequented pages.

Does this mean editing a template would lead to these jobs being queued or it's simply just page_touched getting changed?

Does this mean editing a template would lead to these jobs being queued or it's simply just page_touched getting changed?

When a template is edited, one job is scheduled for the template itself, and page_touched is updated. But when pages that use the template are visited, jobs for rendering the pages will be scheduled.

Note that I'm still not certain that this is what is causing the spikes. But it's the best idea I have right now...
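
The propagation described in this reply can be modelled roughly as follows. This is a toy sketch, not MediaWiki's implementation: `ToyWiki` and all of its attributes are hypothetical, a logical counter replaces real timestamps, and it assumes that the `page_touched` being updated is that of the pages using the template.

```python
class ToyWiki:
    """Toy model of lazy invalidation via page_touched (hypothetical)."""

    def __init__(self):
        self.clock = 0          # logical clock instead of real timestamps
        self.page_touched = {}  # page -> time of last invalidation
        self.cache_time = {}    # page -> time the cached rendering was made
        self.uses = {}          # template -> set of pages that use it
        self.scheduled = []     # titles for which a render job was scheduled

    def _tick(self):
        self.clock += 1
        return self.clock

    def add_page(self, page, templates=()):
        """Create a page with a fresh cached rendering."""
        now = self._tick()
        self.page_touched[page] = now
        self.cache_time[page] = now
        for template in templates:
            self.uses.setdefault(template, set()).add(page)

    def edit_template(self, template):
        """Schedule one job for the template and touch the pages using it."""
        now = self._tick()
        self.scheduled.append(template)
        for page in self.uses.get(template, ()):
            self.page_touched[page] = now

    def view(self, page):
        """On view, a rendering older than page_touched schedules a job."""
        if self.cache_time[page] < self.page_touched[page]:
            self.scheduled.append(page)
            return "stale"
        return "fresh"


if __name__ == "__main__":
    wiki = ToyWiki()
    wiki.add_page("A", templates=["T"])
    wiki.add_page("B", templates=["T"])
    wiki.edit_template("T")
    print(wiki.view("A"), wiki.view("A"), wiki.scheduled)
```

In this model the template edit itself schedules only one job. Jobs for the dependent pages appear only when those pages are viewed, and repeated views of the same stale page schedule repeated jobs until its rendering is refreshed, which is consistent with the suspicion that frequently visited pages dominate the spikes.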

MSantos edited projects, added Parsoid (Tracking); removed Parsoid.
MSantos subscribed.

Change 935722 merged by jenkins-bot:

[mediawiki/core@master] ParsoidCachePrewarmJob: enable deduplication

https://gerrit.wikimedia.org/r/935722

daniel claimed this task.

Resolved by increasing processing concurrency for the cache warming jobs.

Change 935716 abandoned by Daniel Kinzler:

[mediawiki/core@master] Add stampede protection to ParsoidCachePrewarmJob.

Reason:

Using the job queue's built-in deduplication mechanism seems sufficient.

https://gerrit.wikimedia.org/r/935716

Change 971543 had a related patch set uploaded (by Paladox; author: Daniel Kinzler):

[mediawiki/core@REL1_40] ParsoidCachePrewarmJob: enable deduplication

https://gerrit.wikimedia.org/r/971543

Change 971543 merged by jenkins-bot:

[mediawiki/core@REL1_40] ParsoidCachePrewarmJob: enable deduplication

https://gerrit.wikimedia.org/r/971543