
ParserCache / RESTBase / Parsoid integration
Open, MediumPublic1 Estimated Story Points

Description

Vision Table: https://www.mediawiki.org/wiki/Core_Platform_Team/Initiative/Caching_for_multiple_parsers/Initiative_Vision

Some context to help make sense of requirements further below:

  • A lot of code in MediaWiki assumes the presence of ParserCache and works with that internal function-level API to access parser output.
  • Parsoid has its content cached (stored) in RESTBase, and Parsoid clients interact with the RESTBase HTTP API to access Parsoid output and do transformations. However, some of these clients will switch over to accessing Parsoid internally via a function-level API instead of the HTTP API.
  • Our understanding is that Platform Engineering is phasing out RESTBase and transitioning that functionality into other components, and that, as part of this, RESTBase functionality will be transitioned over to ParserCache. So, that means:
    • ParserCache needs to provide multi-bucket support and ability to tie them together with a key (revid / tid, etc.). Parsoid produces 3 components per page: HTML, data-parsoid JSON blob, and data-mw JSON blob. For networking and computational efficiency reasons, these are stored separately in RESTBase (minor detail: data-mw is not stored separately right now, but will be if RESTBase continues to be around). Not all Parsoid clients need all blobs. So, the API needs to be able to fetch individual blobs.
    • ParserCache (or whatever code component it is) needs to support stashing functionality for editing clients, providing "storage semantics" (instead of caching semantics, where cached content can get evicted arbitrarily as far as clients are concerned) so that the presence of stashed content is guaranteed within session / time windows. RESTBase provides this.
    • The REST API needs to be integrated with ParserCache at some layer so that not every REST API request results in a fresh parse request to Parsoid.
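The multi-bucket requirement above can be sketched as a cache keyed by (page, revision, bucket), so that clients can fetch individual Parsoid blobs without pulling the whole bundle. This is a minimal illustrative sketch, not the actual MediaWiki API; all class, method, and bucket names are assumptions.

```python
from typing import Optional

class MultiBucketParserCache:
    """Hypothetical sketch: blobs for one parse are tied together by a
    shared (page_id, rev_id) key but stored and fetched per bucket."""

    BUCKETS = ("html", "data-parsoid", "data-mw")

    def __init__(self):
        self._store = {}  # (page_id, rev_id, bucket) -> blob

    def set(self, page_id: int, rev_id: int, bucket: str, blob: bytes) -> None:
        if bucket not in self.BUCKETS:
            raise ValueError(f"unknown bucket: {bucket}")
        self._store[(page_id, rev_id, bucket)] = blob

    def get(self, page_id: int, rev_id: int, bucket: str) -> Optional[bytes]:
        # Clients that only need HTML never deserialize data-parsoid / data-mw.
        return self._store.get((page_id, rev_id, bucket))
```

A client needing only the HTML for revision 100 would call `get(1, 100, "html")` and never touch the JSON blobs.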

In addition to supporting RESTBase functionality, @EvanProdromou has framed this enhanced-ParserCache functionality as a Multi-Parser-Cache (MPC from here on) solution. It has the following constraints / product requirements:

  • Switchover from core parser to Parsoid read views is going to be done in a phased manner and there might be reverts, etc. So, for quite a while, MPC needs to support caching of output from both core parser as well as Parsoid.
  • Parsoid's HTML blob is roughly the same size as the core parser's HTML blob. However, Parsoid produces two additional blobs (data-parsoid & data-mw) which also need to be stored in MPC.
  • Because of the two reasons above, MPC will have much higher storage needs compared to ParserCache.
  • MPC should provide a unified library interface that supports both ParserCache and RESTBase functionality, to minimize code churn for existing ParserCache and RESTBase / Parsoid clients.

In addition to legacy HTML / Parsoid HTML / data-mw / data-parsoid, there may be additional derived fields, discussed in the comments below. Without necessarily adding them to the requirements, these fields might include (for example) linter output, "structured comments" (i.e., the output of the DiscussionTools parser), and perhaps even auxiliary data to help track annotations on the DOM (like mappings between node IDs of this revision and previous revisions).

Another fact to consider: whether the cache is keyed on "revision ID", "timestamp", or something else -- that is, a given revision ID might have multiple parses, because its dependencies change. RESTBase exposes this via timestamp IDs (tids). This functionality doesn't exist in the core parser cache (as far as I know). FlaggedRevisions introduces another distinguishing factor that can cause parses to vary: it renders inclusions using the latest "flagged revision" of each template. This is perhaps a special case of "timestamp". Finally, you might consider (with an eye to the future) a way of combining the various dependencies into a Merkle tree, something like a git commit hash, so that the etag uniquely identifies the set of dependency versions and (in theory) parses at different timestamps could result in the same etag if none of the dependencies were updated between the timestamps.
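The Merkle-style etag idea above can be illustrated by hashing the page revision together with the revisions of all its dependencies; two parses at different timestamps then produce the same etag whenever no dependency changed in between. This is a hedged sketch of the concept only: the function name, key format, and flat (non-tree) hashing are assumptions, not an existing MediaWiki mechanism.

```python
import hashlib

def dependency_etag(page_rev: int, deps: dict) -> str:
    """Combine a page revision and its dependencies' revisions
    (e.g. {"Template:Foo": 10}) into one content-addressed etag."""
    h = hashlib.sha256()
    h.update(str(page_rev).encode())
    # Sort so the etag is independent of dependency iteration order.
    for title, rev in sorted(deps.items()):
        h.update(f"{title}@{rev}".encode())
    return h.hexdigest()
```

If a template is edited, its revision ID changes and so does the etag; if nothing changed, a re-parse at a later timestamp hashes to the identical etag and the cached output can be reused.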


Event Timeline

cscott created this task.Apr 17 2020, 4:17 PM
Restricted Application added a subscriber: Aklapper. Apr 17 2020, 4:17 PM

Briefly mentioning T205572: Optimize lang conversion and content negotiation combo - one of the decisions to make in the new design will be whether to cache un-language-converted content separately, as RESTBase does.

ssastry triaged this task as Medium priority.Apr 17 2020, 11:17 PM
ssastry moved this task from Needs Triage to Missing Functionality on the Parsoid board.
ssastry removed a project: Parsing-Team.
cscott added a comment.Jun 4 2020, 6:56 PM

Editing team also was wondering about caching/updating linter data: T253799: RESTBase linting API is very slow (not cached).

I'm interested in bulk access for dumps purposes.

Krinkle moved this task from Backlog to ParserCache on the MediaWiki-Parser board.
ssastry updated the task description. Aug 27 2020, 6:31 PM
ssastry added a subscriber: EvanProdromou.

The ParserOutput object also extends a base class, CacheTime, which contains a bunch of ParserCache-specific expiry code. If this is appropriate for the new MPC implementation, we can include it in the base class we'd like to factor out of ParserOutput; if it is not, then we should keep it out of the base class of ParserOutput and include it (maybe as a trait) in the LegacyParserOutput used by the legacy parsercache and legacy parser.
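The trait option above can be sketched in miniature: the expiry logic lives in a mixin so that the legacy ParserOutput includes it while a future shared base class stays free of ParserCache-specific code. All names below are illustrative stand-ins for the MediaWiki classes, and the only-shrink expiry rule is an assumption modeled on typical cache-expiry semantics.

```python
import time

class CacheTimeMixin:
    """Stand-in for CacheTime-as-a-trait: expiry bookkeeping only."""

    def __init__(self):
        self._cache_time = time.time()
        self._expiry = 86400  # seconds; placeholder default

    def set_cache_expiry(self, seconds: int) -> None:
        # The most restrictive expiry wins; callers can shrink the
        # window but never extend it.
        self._expiry = min(self._expiry, seconds)

    def is_expired(self, now: float) -> bool:
        return now > self._cache_time + self._expiry

class LegacyParserOutput(CacheTimeMixin):
    """Only the legacy output mixes in cache-expiry code; a future
    parser-agnostic base class would omit it."""

    def __init__(self, html: str):
        super().__init__()
        self.html = html
```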

cscott updated the task description. Mon, Aug 31, 10:33 PM

kaldari moved this task from Untriaged to Later on the Platform Team Workboards (Platform-Product Roadmap) board.

@Naike, @kaldari, while we don't need this before EOQ3 and while this is strictly "Later", it might be useful to think through the requirements sooner rather than later and see what kinds of designs they induce ... in case we need to work through and iterate on those.

Note that generating e.g. HTML dumps (see T254275) in a way that requires fetching only the changed items, rather than all content every time, may be facilitated by having revision ID or timestamp tags available.

kaldari removed a subscriber: kaldari.Tue, Sep 1, 7:56 PM
eprodromou updated the task description. Thu, Sep 10, 5:49 PM
Naike set the point value for this task to 1.Fri, Sep 11, 10:53 AM

ParserCache needs to provide multi-bucket support and ability to tie them together with a key (revid / tid, etc.). Parsoid produces 3 components per page: HTML, data-parsoid JSON blob, and data-mw JSON blob. For networking and computational efficiency reasons, these are stored separately in RESTBase (minor detail: data-mw is not stored separately right now, but will be if RESTBase continues to be around). Not all Parsoid clients need all blobs. So, the API needs to be able to fetch individual blobs.

Currently RESTBase stores all blobs together in a single JSON document { "html": "lalala", "data-parsoid": "trulala" }, fetches the whole thing on read, and returns only the requested portion. Previously we did store parts of the page bundle in separate tables, but eventually simplified this with no visible performance impact. The performance considerations are tied to the backend implementation (Cassandra in RESTBase's case), so they might not hold true for the MW ParserCache backend. However, I propose not to optimize prematurely and to start with storing the whole page bundle. We can revisit this later if we find the overhead of deserializing data-parsoid to be significant.
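The single-document approach described above can be sketched as follows: one JSON blob per parse, sliced on read so callers still see a per-part API. This is an illustrative sketch, not RESTBase's actual storage code; the key scheme and class name are assumptions.

```python
import json
from typing import Optional

class PageBundleStore:
    """Store the whole page bundle as one JSON document; return only
    the requested part on read, as RESTBase currently does."""

    def __init__(self):
        self._db = {}  # key -> serialized JSON document

    def put(self, key: str, html: str, data_parsoid: dict, data_mw: dict) -> None:
        self._db[key] = json.dumps({
            "html": html,
            "data-parsoid": data_parsoid,
            "data-mw": data_mw,
        })

    def get(self, key: str, part: str):
        # The whole document is fetched and deserialized even when the
        # caller wants only one part -- the overhead this comment thread
        # proposes to accept until measurements say otherwise.
        doc = self._db.get(key)
        return None if doc is None else json.loads(doc).get(part)
```

Splitting into per-bucket storage later would change only `put`/`get` internals, leaving the per-part read API intact.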

ParserCache (or whatever code component it is) needs to support the stashing functionality for editing clients to provide "storage semantics" (instead of caching semantics where cached content can get evicted arbitrarily as far as clients are concerned) so presence of stashed content is guaranteed within session / time windows. RESTBase provides this.

I would like to separate these concerns: have ParserCache concentrate on caching, and introduce a separate component for stashing at a later stage. The requirements for these two components are drastically different, with ParserCache having a 2-level cache deduplicated by the parser options used, different expiry semantics, and a different access key. ParserStash (name TBD) is a simple key-value store with TTL expiry.
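The proposed ParserStash really is that simple; a minimal sketch, with the caveat that the class does not exist yet (the name is TBD per the comment above) and the lazy-eviction strategy is an assumption:

```python
import time
from typing import Optional

class ParserStash:
    """Plain key-value store with TTL expiry: storage semantics within
    the TTL window, unlike a cache that may evict arbitrarily."""

    def __init__(self, ttl_seconds: int = 86400):
        self._ttl = ttl_seconds
        self._data = {}  # key -> (value, stored_at)

    def stash(self, key: str, value: bytes, now: Optional[float] = None) -> None:
        self._data[key] = (value, now if now is not None else time.time())

    def fetch(self, key: str, now: Optional[float] = None) -> Optional[bytes]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        t = now if now is not None else time.time()
        if t - stored_at > self._ttl:
            del self._data[key]  # expired; evict lazily on access
            return None
        return value
```

Within the TTL an editing client is guaranteed to get its stashed content back; the contrast with ParserCache is that nothing here depends on parser options or a layered key scheme.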

In addition to supporting RESTBase functionality, @EvanProdromou has framed this enhanced-ParserCache functionality as a Multi-Parser-Cache (MPC from here on) solution

This can probably be achieved in the beginning by introducing an entirely separate ParserCache service (ParsoidCache?) and using the appropriate one in the appropriate places. Once we have tighter integration, we can create a wrapper class that routes calls to the appropriate parser cache.

The ParserOutput object also extends a base class, CacheTime, which contains a bunch of ParserCache-specific expiry code. If this is appropriate for the new MPC implementation, we can include it in the base class we'd like to factor out of ParserOutput; if it is not, then we should keep it out of the base class of ParserOutput and include it (maybe as a trait) in the LegacyParserOutput used by the legacy parsercache and legacy parser.

I've been thinking of extracting the CacheTime interface (plus a few more methods) into a CacheableParserOutput interface and making ParserCache work with any instance of CacheableParserOutput. Then we could either make Parsoid's PageBundle implement it, or create a wrapper implementing the interface. This is still not decided, though; I will keep the ticket updated with the latest developments.
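The interface-plus-wrapper option above might look like the following sketch. The comment says this is still undecided, so everything here is an assumption: the method set is invented for illustration, and the adapter stands in for "a wrapper implementing the interface" around Parsoid's page bundle.

```python
import json
from abc import ABC, abstractmethod

class CacheableParserOutput(ABC):
    """Hypothetical interface extracted from CacheTime: the minimum
    ParserCache needs to store and expire any parser's output."""

    @abstractmethod
    def get_cache_expiry(self) -> int: ...

    @abstractmethod
    def serialize(self) -> bytes: ...

class PageBundleAdapter(CacheableParserOutput):
    """Wrapper letting a Parsoid page bundle be stored by a ParserCache
    that only knows about CacheableParserOutput."""

    def __init__(self, bundle: dict, expiry: int = 86400):
        self._bundle = bundle
        self._expiry = expiry

    def get_cache_expiry(self) -> int:
        return self._expiry

    def serialize(self) -> bytes:
        return json.dumps(self._bundle).encode()
```

With this shape, ParserCache never branches on output type: legacy ParserOutput and the Parsoid adapter both satisfy the same interface.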