Introduce maintenance script for warming cache with Parsoid Outputs
Closed, Resolved (Public)

Description

This functionality already exists for WMF wikis, but as a job spec/job [1] rather than as a maintenance script. Third parties will need a mechanism to warm their parser cache with Parsoid output, because several extensions and core are now being changed to use Parsoid output for views, edits, etc.

So that third-party wikis do not see a performance degradation when they begin using the new backend, we should provide a maintenance script for them to run as a first step. The script would populate their caches with the appropriate parser output from Parsoid, so that when they switch to the new backend, performance stays the same as before (with the legacy output).

Ideally, the script should progressively go through the pages on the given wiki whose content model is supported by Parsoid, parse each page, and save the output in ParserCache (the backend for PC is configurable; see https://gerrit.wikimedia.org/g/mediawiki/core/+/ab1a809acc6633fd7ebd2027688d51c4813754d1/docs/config-schema.yaml#2465).
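
As a rough illustration of that flow (not the actual MediaWiki code, which would be a PHP maintenance script), the sketch below filters pages by content model, parses them, and stores the result. The `Page` class, the `SUPPORTED_MODELS` set, the `parse` callback, and the dict standing in for ParserCache are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

# Assumption for illustration only: the real set of content models
# that Parsoid supports is not stated in this task.
SUPPORTED_MODELS = frozenset({"wikitext"})


@dataclass(frozen=True)
class Page:
    page_id: int
    content_model: str
    text: str


def warm(pages: Iterable[Page],
         parse: Callable[[str], str],
         cache: Dict[int, str]) -> int:
    """Parse every supported page and store the output; return the count."""
    warmed = 0
    for page in pages:
        if page.content_model not in SUPPORTED_MODELS:
            continue  # skip content models the parser cannot handle
        cache[page.page_id] = parse(page.text)
        warmed += 1
    return warmed
```

Here `cache` is a plain dict keyed by page ID; a real parser cache would also key on parser options and revision.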

Because wikis can be relatively large, the script should operate on pages in batches (say 100 per batch) rather than attempting the operation for millions of pages at once.
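
One way to batch without loading every page at once is to page through IDs in ascending order, asking each time for the next 100 IDs greater than the last one seen. The sketch below assumes a hypothetical `fetch_ids_after(last_id, limit)` lookup; the in-memory version here only stands in for a database query.

```python
from typing import Callable, Iterator, List


def iter_batches(fetch_ids_after: Callable[[int, int], List[int]],
                 start_from: int = 0,
                 batch_size: int = 100) -> Iterator[List[int]]:
    """Yield ascending lists of page IDs, at most batch_size per list."""
    last_id = start_from - 1  # so that start_from itself is included
    while True:
        batch = fetch_ids_after(last_id, batch_size)
        if not batch:
            return
        yield batch
        last_id = batch[-1]


# Hypothetical stand-in for a query such as
# "SELECT page_id FROM page WHERE page_id > ? ORDER BY page_id LIMIT ?".
ALL_PAGE_IDS = list(range(1, 251))


def fetch_ids_after(last_id: int, limit: int) -> List[int]:
    return [pid for pid in ALL_PAGE_IDS if pid > last_id][:limit]
```

With 250 pages and the default batch size, this yields batches of 100, 100, and 50 IDs, and memory use stays bounded by the batch size rather than the size of the wiki.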

Options/Flags

  • --force - force parse even if there is an entry in PC
  • --namespace X - parse pages in a given namespace. Example: --namespace MediaWiki
  • --start-from X - the page ID to start the parse from
  • More TBA.
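
To make the intended semantics of these flags concrete, here is a small sketch of how they could be declared. It uses Python's `argparse` purely for illustration; the real script would be written in PHP, and the help texts and defaults below are assumptions rather than the final interface.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Warm the parser cache with Parsoid output (sketch)."
    )
    # Re-parse even when the parser cache already has an entry.
    parser.add_argument("--force", action="store_true",
                        help="force parse even if there is an entry in PC")
    # Restrict the run to one namespace, e.g. --namespace MediaWiki.
    parser.add_argument("--namespace", default=None,
                        help="parse only pages in the given namespace")
    # Assumed default of 0, meaning "start from the first page".
    parser.add_argument("--start-from", type=int, default=0,
                        help="the page ID to start the parse from")
    return parser
```

For example, `build_parser().parse_args(["--namespace", "MediaWiki", "--start-from", "4200"])` gives `namespace == "MediaWiki"`, `start_from == 4200`, and `force == False`.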

[1] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/806443

Event Timeline

Change 929677 had a related patch set uploaded (by Richika Rana; author: Richika Rana):

[mediawiki/core@master] Populate parser cache with parsoid output.

https://gerrit.wikimedia.org/r/929677

Also, the script should be stateful, so that if it is aborted or fails at a specific point, it can resume and continue from there.

I don't quite agree with "stateful". The script should output progress in a form that allows it to be restarted based on the last output. E.g. it could operate in batches of 100 pages, sorted by page ID. After each page, it would output the last page ID it worked on. And a --start-from option could allow the script to be restarted starting from that page ID.

This allows recovery after failure, without the need to save state to a file or database.
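
A minimal sketch of that idea, with hypothetical names (`warm_from`, `parse_and_store`, `report`) and no persistent state: pages are handled in ascending ID order, and each ID is reported once it has been processed.

```python
from typing import Callable, Iterable


def print_progress(page_id: int) -> None:
    print(f"Last processed page ID: {page_id}")


def warm_from(page_ids: Iterable[int],
              parse_and_store: Callable[[int], None],
              start_from: int = 0,
              report: Callable[[int], None] = print_progress) -> None:
    """Process pages with ID >= start_from in ascending order."""
    for page_id in sorted(pid for pid in page_ids if pid >= start_from):
        parse_and_store(page_id)
        # Reported only after success, so the last reported ID is
        # always a safe value to pass back in as start_from.
        report(page_id)
```

If a run dies while handling page 150, the last reported ID is 149. Restarting with `start_from=149` repeats that one page and then continues, so no page is skipped.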

This makes sense. Printing the ID and using that as a starting point for recovery makes sense. Thanks!

DAlangi_WMF changed the task status from Open to In Progress. (Jun 17 2023, 7:19 PM)
DAlangi_WMF assigned this task to R_Rana.
DAlangi_WMF triaged this task as Medium priority.

Change 931892 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/core@master] Process the cache warming activity in batches of 100 pages

https://gerrit.wikimedia.org/r/931892

Change 929677 merged by jenkins-bot:

[mediawiki/core@master] Populate parser cache with parsoid output.

https://gerrit.wikimedia.org/r/929677

Change 933081 had a related patch set uploaded (by Richika Rana; author: Richika Rana):

[mediawiki/core@master] Extend PrewarmParsoidParserCache script to allow filtering by namespace.

https://gerrit.wikimedia.org/r/933081

Change 931892 merged by jenkins-bot:

[mediawiki/core@master] Process the cache warming activity in batches of 100 pages

https://gerrit.wikimedia.org/r/931892

Change 933081 merged by jenkins-bot:

[mediawiki/core@master] Extend script to allow filtering by namespace

https://gerrit.wikimedia.org/r/933081

DAlangi_WMF lowered the priority of this task from Medium to Low.