
Lots of parsoidCachePrewarm jobs are created, producing a huge, unsustainable backlog
Open, Needs Triage, Public

Description

Hi, we upgraded to MW 1.40 and have enabled the warmup for Parsoid. We have over 6,000 wikis. We tried a phased rollout, enabling wikis beginning with a given letter; we currently allow a-r and 0-9, but 30k+ jobs were registered and the count keeps growing. Running some manually is just slow. Really slow. A job looks like it has stalled but eventually runs; it's fast, then slow.

Event Timeline

For instance:

2023-11-06 16:47:24 parsoidCachePrewarm Special: revId=81137 pageId=28704 page_touched=20210804154025 causeAction=view options=0 rootJobIsSelf=1 rootJobSignature=70d5f684089f7781c4d51528fad18addc266d501 rootJobTimestamp=20231106162841 namespace=-1 title= requestId=xxx (uuid=xxxx,timestamp=1699288121) STARTING

The job took around 2 minutes:

2023-11-06 16:49:35 parsoidCachePrewarm Special: revId=81137 pageId=28704 page_touched=20210804154025 causeAction=view options=0 rootJobIsSelf=1 rootJobSignature=70d5f684089f7781c4d51528fad18addc266d501 rootJobTimestamp=20231106162841 namespace=-1 title= requestId=xxx (uuid=xxx,timestamp=1699288121) t=131010 good
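As a sanity check, the gap between the STARTING and finished log lines matches the t= field, which is in milliseconds:

```python
from datetime import datetime

# Timestamps copied from the two log lines above.
start = datetime.strptime("2023-11-06 16:47:24", "%Y-%m-%d %H:%M:%S")
end = datetime.strptime("2023-11-06 16:49:35", "%Y-%m-%d %H:%M:%S")
elapsed_ms = (end - start).total_seconds() * 1000
print(elapsed_ms)  # 131000.0, consistent with t=131010 ms in the log line
```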

I've also backported https://gerrit.wikimedia.org/r/c/mediawiki/core/+/971542/2 and https://gerrit.wikimedia.org/r/c/mediawiki/core/+/971543/1.

We use a Redis jobrunner (the one Wikimedia used to use). We remove root jobs, as they are unrunnable and just build up.
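For context, the rootJobSignature/rootJobTimestamp fields in the log lines above drive root-job deduplication. A simplified Python sketch of the idea (illustrative only, not MediaWiki's actual JobQueue code): a runner skips any job whose root timestamp is older than the newest timestamp recorded for the same signature.

```python
# Simplified sketch of root-job deduplication (illustrative, not MediaWiki's
# actual JobQueue code). Timestamps are fixed-width MW format, so plain string
# comparison orders them correctly.

latest_root: dict[str, str] = {}  # rootJobSignature -> newest rootJobTimestamp

def deduplicate_root_job(signature: str, timestamp: str) -> None:
    """On enqueue, remember the newest root-job timestamp for this signature."""
    if timestamp > latest_root.get(signature, ""):
        latest_root[signature] = timestamp

def is_stale_duplicate(signature: str, timestamp: str) -> bool:
    """True if a newer root job with the same signature superseded this one."""
    return timestamp < latest_root.get(signature, "")

deduplicate_root_job("70d5f684", "20231106162841")
deduplicate_root_job("70d5f684", "20231106170000")  # newer root job arrives
print(is_stale_duplicate("70d5f684", "20231106162841"))  # True: superseded
```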

Some thoughts on this:

Backporting https://gerrit.wikimedia.org/r/c/mediawiki/core/+/971543 is probably a good idea; it will prevent pileups when processing is slow.

Two minutes is a really long time to render a page. Something is going on there.

Prewarm jobs are triggered by edits, and by visits to a page that doesn't have a valid entry in the main parser cache (and no entry in the parsoid cache either). So a backlog of 30k jobs could be caused by a much-visited page (like the main page) failing to render for a long time. But new jobs are scheduled only if the main parser cache has a cache miss...
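The failure mode described above can be sketched as follows (hypothetical names, not MediaWiki's actual code): if a popular page never gets a valid parser cache entry, every view enqueues another prewarm job and the backlog grows without bound.

```python
# Illustrative sketch of the trigger condition described above (hypothetical
# function, not MediaWiki's actual code): a prewarm job is enqueued on view
# only when the main parser cache misses and the parsoid cache has no entry.

def maybe_schedule_prewarm(page, main_cache, parsoid_cache, queue):
    if main_cache.get(page) is not None:
        return  # main parser cache hit: nothing to do
    if parsoid_cache.get(page) is None:
        queue.append(("parsoidCachePrewarm", page))

queue = []
main_cache, parsoid_cache = {}, {}
for _ in range(5):  # five views of a page that never renders successfully
    maybe_schedule_prewarm("Main_Page", main_cache, parsoid_cache, queue)
print(len(queue))  # 5: one job per cache-missing view
```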

Does this site use PoolCounter for stampede protection? That would help I think.
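PoolCounter caps how many workers may render the same thing concurrently, so a stampede of viewers produces one expensive parse instead of many. A minimal sketch of the idea (assumed semantics, not the real PoolCounter daemon or protocol):

```python
import threading

# Minimal sketch of PoolCounter-style stampede protection (assumed semantics,
# not the real PoolCounter protocol): at most `workers` concurrent renders per
# key; losers wait, then re-check the cache and reuse the winner's result.

class PoolCounter:
    def __init__(self, workers: int = 1):
        self.workers = workers
        self.locks: dict[str, threading.Semaphore] = {}
        self.guard = threading.Lock()

    def acquire(self, key: str) -> threading.Semaphore:
        with self.guard:
            sem = self.locks.setdefault(key, threading.Semaphore(self.workers))
        sem.acquire()
        return sem

cache = {}
pc = PoolCounter(workers=1)

def render(key):
    sem = pc.acquire(key)
    try:
        if key not in cache:            # re-check after winning a slot
            cache[key] = f"html:{key}"  # expensive parse happens only once
        return cache[key]
    finally:
        sem.release()

threads = [threading.Thread(target=render, args=("Main_Page",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cache)  # {'Main_Page': 'html:Main_Page'}
```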