
ParserCache / RESTBase / Parsoid integration
Open, MediumPublic1 Estimated Story Points

Description

Vision Table: https://www.mediawiki.org/wiki/Core_Platform_Team/Initiative/Caching_for_multiple_parsers/Initiative_Vision

Some context to help make sense of requirements further below:

  • A lot of code in MediaWiki assumes the presence of ParserCache and works with that internal function-level API to access parser output.
  • Parsoid has its content cached (stored) in RESTBase, and Parsoid clients interact with the RESTBase HTTP API to access Parsoid output and do transformations. However, some of these clients will switch over to accessing Parsoid internally via a function-level API instead of the HTTP API.
  • Our understanding is that Platform Engineering is phasing out RESTBase and transitioning that functionality into other components, and that, as part of this, RESTBase functionality will be transitioned over to ParserCache. So, that means:
    • ParserCache needs to provide multi-bucket support and ability to tie them together with a key (revid / tid, etc.). Parsoid produces 3 components per page: HTML, data-parsoid JSON blob, and data-mw JSON blob. For networking and computational efficiency reasons, these are stored separately in RESTBase (minor detail: data-mw is not stored separately right now, but will be if RESTBase continues to be around). Not all Parsoid clients need all blobs. So, the API needs to be able to fetch individual blobs.
    • ParserCache (or whatever code component it is) needs to support stashing functionality for editing clients, providing "storage semantics" (instead of caching semantics, where cached content can get evicted arbitrarily as far as clients are concerned) so that the presence of stashed content is guaranteed within session / time windows. RESTBase provides this.
    • The REST API needs to be integrated with ParserCache at some layer so that not every REST API request results in a fresh parse request to Parsoid.
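The multi-bucket requirement above can be sketched as a cache keyed by (page, revision, bucket), so that clients can fetch individual Parsoid blobs without pulling the whole bundle. This is a minimal illustrative sketch, not the actual MediaWiki API; all class, method, and bucket names are assumptions.

```python
from typing import Optional

class MultiBucketParserCache:
    """Hypothetical sketch: blobs for one parse are tied together by a
    shared (page_id, rev_id) key but stored and fetched per bucket."""

    BUCKETS = ("html", "data-parsoid", "data-mw")

    def __init__(self):
        self._store = {}  # (page_id, rev_id, bucket) -> blob

    def set(self, page_id: int, rev_id: int, bucket: str, blob: bytes) -> None:
        if bucket not in self.BUCKETS:
            raise ValueError(f"unknown bucket: {bucket}")
        self._store[(page_id, rev_id, bucket)] = blob

    def get(self, page_id: int, rev_id: int, bucket: str) -> Optional[bytes]:
        # Clients that only need HTML never deserialize data-parsoid / data-mw.
        return self._store.get((page_id, rev_id, bucket))
```

A client needing only the HTML for revision 100 would call `get(1, 100, "html")` and never touch the JSON blobs.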

In addition to supporting RESTBase functionality, @EvanProdromou has framed this enhanced-ParserCache functionality as a Multi-Parser-Cache (MPC from here on) solution. It has the following constraints / product requirements:

  • Switchover from core parser to Parsoid read views is going to be done in a phased manner and there might be reverts, etc. So, for quite a while, MPC needs to support caching of output from both core parser as well as Parsoid.
  • Parsoid's HTML blob is roughly the same size as the core parser's HTML blob. However, Parsoid produces two additional blobs (data-parsoid & data-mw) which also need to be stored in MPC.
  • Because of the two reasons above, MPC will have much higher storage needs compared to ParserCache.
  • MPC should provide a unified library interface that supports both ParserCache and RESTBase functionality, to minimize code churn for existing ParserCache and RESTBase / Parsoid clients.

In addition to legacy HTML / Parsoid HTML / data-mw / data-parsoid, there may be additional derived fields, discussed in the comments below. Without necessarily adding them to the requirements, these fields might include (for example) linter output, "structured comments" (i.e., the output of the DiscussionTools parser), and perhaps even auxiliary data to help track annotations on the DOM (like mappings between node IDs of this revision and previous revisions).

Another fact to consider: whether the cache is keyed on "revision ID", "timestamp", or something else -- that is, a given revision ID might have multiple parses, because its dependencies change. RESTBase exposes this via timestamp IDs (tids). This functionality doesn't exist in the core parser cache (as far as I know). FlaggedRevisions introduces another distinguishing factor that can cause parses to vary: it renders inclusions using the latest "flagged revision" of each template. This is perhaps a special case of "timestamp". Finally, you might consider (with an eye to the future) a way of combining the various dependencies into a Merkle tree, something like a git commit hash, so that the etag uniquely identifies the set of dependency versions and (in theory) parses at different timestamps could result in the same etag if none of the dependencies were updated between the timestamps.
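The Merkle-style etag idea above can be illustrated by hashing the page revision together with the revisions of all its dependencies; two parses at different timestamps then produce the same etag whenever no dependency changed in between. This is a hedged sketch of the concept only: the function name, key format, and flat (non-tree) hashing are assumptions, not an existing MediaWiki mechanism.

```python
import hashlib

def dependency_etag(page_rev: int, deps: dict) -> str:
    """Combine a page revision and its dependencies' revisions
    (e.g. {"Template:Foo": 10}) into one content-addressed etag."""
    h = hashlib.sha256()
    h.update(str(page_rev).encode())
    # Sort so the etag is independent of dependency iteration order.
    for title, rev in sorted(deps.items()):
        h.update(f"{title}@{rev}".encode())
    return h.hexdigest()
```

If a template is edited, its revision ID changes and so does the etag; if nothing changed, a re-parse at a later timestamp hashes to the identical etag and the cached output can be reused.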


Event Timeline

cscott created this task.Apr 17 2020, 4:17 PM
Restricted Application added a subscriber: Aklapper. Apr 17 2020, 4:17 PM

Briefly mentioning T205572: Optimize lang conversion and content negotiation combo - one of the decisions to make in the new design will be whether to cache un-language-converted content separately, as RESTBase does.

ssastry triaged this task as Medium priority.Apr 17 2020, 11:17 PM
ssastry moved this task from Needs Triage to Missing Functionality on the Parsoid board.
ssastry removed a project: Parsing-Team.
cscott added a comment.Jun 4 2020, 6:56 PM

Editing team also was wondering about caching/updating linter data: T253799: RESTBase linting API is very slow (not cached).

I'm interested in bulk access for dumps purposes.

Krinkle moved this task from Backlog to ParserCache on the MediaWiki-Parser board.
ssastry updated the task description. Aug 27 2020, 6:31 PM
ssastry added a subscriber: EvanProdromou.

The ParserOutput object also extends a base class, CacheTime, which contains a bunch of ParserCache-specific expiry code. If this is appropriate for the new MPC implementation, we can include it in the base class we'd like to factor out of ParserOutput; if it is not, then we should keep it out of the base class of ParserOutput and include it (maybe as a trait) in the LegacyParserOutput used by the legacy parsercache and legacy parser.
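The trait option above can be sketched in miniature: the expiry logic lives in a mixin so that the legacy ParserOutput includes it while a future shared base class stays free of ParserCache-specific code. All names below are illustrative stand-ins for the MediaWiki classes, and the only-shrink expiry rule is an assumption modeled on typical cache-expiry semantics.

```python
import time

class CacheTimeMixin:
    """Stand-in for CacheTime-as-a-trait: expiry bookkeeping only."""

    def __init__(self):
        self._cache_time = time.time()
        self._expiry = 86400  # seconds; placeholder default

    def set_cache_expiry(self, seconds: int) -> None:
        # The most restrictive expiry wins; callers can shrink the
        # window but never extend it.
        self._expiry = min(self._expiry, seconds)

    def is_expired(self, now: float) -> bool:
        return now > self._cache_time + self._expiry

class LegacyParserOutput(CacheTimeMixin):
    """Only the legacy output mixes in cache-expiry code; a future
    parser-agnostic base class would omit it."""

    def __init__(self, html: str):
        super().__init__()
        self.html = html
```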

cscott updated the task description. Mon, Aug 31, 10:33 PM

kaldari moved this task from Untriaged to Later on the Platform Team Workboards (Platform-Product Roadmap) board.

@Naike, @kaldari, while we don't need this before EOQ3 and while this is strictly "Later", it might be useful to think through the requirements sooner rather than later and see what kinds of designs they induce ... in case we need to work through and iterate on those.

Note that generating e.g. HTML dumps (see T254275) in a way that requires fetching only the changed items, rather than all content every time, may be facilitated by having revision ID or timestamp tags available.

kaldari removed a subscriber: kaldari.Tue, Sep 1, 7:56 PM
eprodromou updated the task description. Thu, Sep 10, 5:49 PM
Naike set the point value for this task to 1.Fri, Sep 11, 10:53 AM

ParserCache needs to provide multi-bucket support and ability to tie them together with a key (revid / tid, etc.). Parsoid produces 3 components per page: HTML, data-parsoid JSON blob, and data-mw JSON blob. For networking and computational efficiency reasons, these are stored separately in RESTBase (minor detail: data-mw is not stored separately right now, but will be if RESTBase continues to be around). Not all Parsoid clients need all blobs. So, the API needs to be able to fetch individual blobs.

Currently RESTBase stores all blobs together in a single JSON document { "html": "lalala", "data-parsoid": "trulala" }, fetches the whole thing on read, and returns only the requested portion. Previously we did store parts of the page bundle in separate tables, but eventually simplified this with no visible performance impact. The performance considerations are tied to the backend implementation (Cassandra in RESTBase's case), so they might not hold true for the MW ParserCache backend. However, I propose not to optimize prematurely and to start with storing the whole page bundle. We can revisit this later if we find the overhead of deserializing data-parsoid to be significant.
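The single-document approach described above can be sketched as follows: one JSON blob per parse, sliced on read so callers still see a per-part API. This is an illustrative sketch, not RESTBase's actual storage code; the key scheme and class name are assumptions.

```python
import json
from typing import Optional

class PageBundleStore:
    """Store the whole page bundle as one JSON document; return only
    the requested part on read, as RESTBase currently does."""

    def __init__(self):
        self._db = {}  # key -> serialized JSON document

    def put(self, key: str, html: str, data_parsoid: dict, data_mw: dict) -> None:
        self._db[key] = json.dumps({
            "html": html,
            "data-parsoid": data_parsoid,
            "data-mw": data_mw,
        })

    def get(self, key: str, part: str):
        # The whole document is fetched and deserialized even when the
        # caller wants only one part -- the overhead this comment thread
        # proposes to accept until measurements say otherwise.
        doc = self._db.get(key)
        return None if doc is None else json.loads(doc).get(part)
```

Splitting into per-bucket storage later would change only `put`/`get` internals, leaving the per-part read API intact.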

ParserCache (or whatever code component it is) needs to support the stashing functionality for editing clients to provide "storage semantics" (instead of caching semantics where cached content can get evicted arbitrarily as far as clients are concerned) so presence of stashed content is guaranteed within session / time windows. RESTBase provides this.

I would like to separate these concerns: have ParserCache concentrate on caching, and introduce a separate component for stashing at a later stage. The requirements for these two components are drastically different, with ParserCache having a 2-level cache deduplicated by the parser options used, different expiry semantics, and a different access key. ParserStash (name TBD) is a simple key-value store with TTL expiry.
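The proposed ParserStash really is that simple; a minimal sketch, with the caveat that the class does not exist yet (the name is TBD per the comment above) and the lazy-eviction strategy is an assumption:

```python
import time
from typing import Optional

class ParserStash:
    """Plain key-value store with TTL expiry: storage semantics within
    the TTL window, unlike a cache that may evict arbitrarily."""

    def __init__(self, ttl_seconds: int = 86400):
        self._ttl = ttl_seconds
        self._data = {}  # key -> (value, stored_at)

    def stash(self, key: str, value: bytes, now: Optional[float] = None) -> None:
        self._data[key] = (value, now if now is not None else time.time())

    def fetch(self, key: str, now: Optional[float] = None) -> Optional[bytes]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        t = now if now is not None else time.time()
        if t - stored_at > self._ttl:
            del self._data[key]  # expired; evict lazily on access
            return None
        return value
```

Within the TTL an editing client is guaranteed to get its stashed content back; the contrast with ParserCache is that nothing here depends on parser options or a layered key scheme.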

In addition to supporting RESTBase functionality, @EvanProdromou has framed this enhanced-ParserCache functionality as a Multi-Parser-Cache (MPC from here on) solution

This can probably be achieved in the beginning by introducing an entirely separate ParserCache service (ParsoidCache?) and using the appropriate one in the appropriate places. Once we have tighter integration, we can create a wrapper class that routes calls to the appropriate parser cache.

The ParserOutput object also extends a base class, CacheTime, which contains a bunch of ParserCache-specific expiry code. If this is appropriate for the new MPC implementation, we can include it in the base class we'd like to factor out of ParserOutput; if it is not, then we should keep it out of the base class of ParserOutput and include it (maybe as a trait) in the LegacyParserOutput used by the legacy parsercache and legacy parser.

I've been thinking of extracting the CacheTime interface (plus a few more methods) into a CacheableParserOutput interface and making ParserCache work with any instance of CacheableParserOutput. Then we could either make Parsoid's PageBundle implement it, or create a wrapper implementing the interface. This is still not decided, though; I will keep the ticket updated with the latest developments.
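The interface-plus-wrapper option above might look like the following sketch. The comment says this is still undecided, so everything here is an assumption: the method set is invented for illustration, and the adapter stands in for "a wrapper implementing the interface" around Parsoid's page bundle.

```python
import json
from abc import ABC, abstractmethod

class CacheableParserOutput(ABC):
    """Hypothetical interface extracted from CacheTime: the minimum
    ParserCache needs to store and expire any parser's output."""

    @abstractmethod
    def get_cache_expiry(self) -> int: ...

    @abstractmethod
    def serialize(self) -> bytes: ...

class PageBundleAdapter(CacheableParserOutput):
    """Wrapper letting a Parsoid page bundle be stored by a ParserCache
    that only knows about CacheableParserOutput."""

    def __init__(self, bundle: dict, expiry: int = 86400):
        self._bundle = bundle
        self._expiry = expiry

    def get_cache_expiry(self) -> int:
        return self._expiry

    def serialize(self) -> bytes:
        return json.dumps(self._bundle).encode()
```

With this shape, ParserCache never branches on output type: legacy ParserOutput and the Parsoid adapter both satisfy the same interface.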