
General ParserCache service class for large "current" page-derived data
Closed, Declined · Public

Description

The new service, let's call it "content cache", should work similarly to the existing parser cache:

  • semi-permanent
  • support splitting on things like target language
  • get automatically purged when page content changes

In addition, support

  • more than one object per page (per MCR slot, but also other things like graphoid data or Wikibase constraint checks)
  • multiple "stages" per page, e.g. "current" and "stable". This is presently hacked in by the FlaggedRev extension.
  • multiple "targets" per object, e.g. per language, but also annotated vs. folly resolved parsoid output, just pre-processed output, etc.
  • be accessible from outside MediaWiki code, so standalone service can use it.

This is currently just a draft, which would probably become an RFC in time.

Event Timeline

Overall I like the idea, but I'd want to see more details as to how all the "in addition support" stuff is actually going to work at an interface level that doesn't wind up turning it into just another generic key-value store.

  • be accessible from outside MediaWiki code, so standalone services can use it.

When you say "create a service" in the title of this task, are you talking about a service in the sense of the PHP MediaWikiServices class, or a concrete backend implementation of such a thing to be used at Wikimedia? This bullet seems relevant to the latter, but for the former it would be an implementation detail. Unless you're proposing exposing it via api.php (or rest.php) which I very much doubt we'd want to do since that would bring in significant access control complications.

Really I'd guess you probably mean both, in which case distinguishing between the two parts of the proposal would still be useful.

Overall I like the idea, but I'd want to see more details as to how all the "in addition support" stuff is actually going to work at an interface level that doesn't wind up turning it into just another generic key-value store.

It is mostly a K/V store, but not "generic". It would have clear rules for building the keys from well known components (hopefully a little less complex than what we do for the parser cache key now, but conceptually similar). The SLA would be geared towards the use case at hand (equivalent to the parser cache). And the purging/invalidation mechanism would be unified.

The functionality that goes beyond plain K/V is the bucketing for purging. That's actually the complicated bit, and we'll probably have to think about how it fits in with our ideas about dependency tracking. Right now, ParserCache has one bucket per page, which contains all renderings for that page, one per target language etc. In the future, we may want to bucket per slot and stage (current, stable) as well.
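
As a purely illustrative sketch (neither of the key layouts below is the actual ParserCache key format), the bucketing idea could be expressed as a composite key prefix:

// Hypothetical key layout for illustration only; not the real ParserCache key format.
// Today the "bucket" is effectively the page, and entries split on parser options:
//   {pageId}!{optionsHash}
// A future layout could add slot and stage to the prefix, so a purge can drop
// everything under "{pageId}!{slot}!{stage}" with a single prefix match:
function makeDerivedDataKey( int $pageId, string $slot, string $stage, string $optionsHash ): string {
	return implode( '!', [ $pageId, $slot, $stage, $optionsHash ] );
}

echo makeDerivedDataKey( 42, 'main', 'current', 'canonical|en' );
// 42!main!current!canonical|en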

When you say "create a service" in the title of this task, are you talking about a service in the sense of the PHP MediaWikiServices class, or a concrete backend implementation of such a thing to be used at Wikimedia?

...

Really I'd guess you probably mean both, in which case distinguishing between the two parts of the proposal would still be useful.

Indeed, I mean both: a service object in core, and an internal service accessible via HTTP, for use by other standalone services. This could be done in two ways: a standalone storage service which is then accessed from within core, or an implementation in core, which could then be accessed via a non-public REST API route. The former seems more straightforward, but I wouldn't dismiss the latter possibility just yet.

a standalone storage service which is then accessed from within core,

Kask,[1] accessed via RESTBagOStuff?

Using BagOStuff would also give us, for free, support for non-Wikimedia users of MediaWiki who want to continue using whatever they're currently using for ParserCache. Plus it might simplify Wikimedia's migration somewhat, as it would decouple the MediaWiki side of the project from the storage side.

[1]: Not the same instance of Kask used for sessions, of course. Just like we currently don't use the same instance of MariaDB for the main databases, externalstore, and parser cache.
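
For what it's worth, wiring that up on the MediaWiki side could be as small as the following LocalSettings.php sketch; the cache name and endpoint URL are made up here, and only the 'url' parameter of RESTBagOStuff is shown:

// A minimal sketch, assuming a Kask-like HTTP store at a hypothetical endpoint.
$wgObjectCaches['remote-parsercache'] = [
	'class' => RESTBagOStuff::class,
	'url' => 'https://kask-parsercache.svc.example.org/v1/',
];
// Point ParserCache at that backend instead of the default one.
$wgParserCacheType = 'remote-parsercache';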

or an implementation in core, which could then be accessed via a non-public REST API route

At the moment we don't have a concept of "non-public REST API routes", unless I missed something when I was on vacation recently. The closest would be a public route that we document as internal and that requires some sort of shared secret to function.

Kask,[1] accessed via RESTBagOStuff?

Probably not Kask, but perhaps something similar, or a derivative or successor of Kask. Though I'm not entirely sure that we want Cassandra as a backend for this.

At the moment we don't have a concept of "non-public REST API routes", unless I missed something when I was on vacation recently. The closest would be a public route that we document as internal and that requires some sort of shared secret to function.

Well, kind of - at least initially, the idea is to have the API routes for php-parsoid be internal only. But that will probably be done by making the entire REST API internal only. Not sure how easy or reliable it would be to have some routes be public and some internal. I think it's an option, but I don't know if it's a good one.

Kask,[1] accessed via RESTBagOStuff?

Probably not Kask, but perhaps something similar, or a derivative or successor of Kask.

This sounds an awful lot like file storage (where I'm defining "file" to mean some semi-large (for definition of large) chunk of opaque data), which Kask (and Cassandra) aren't well suited for.

Though I'm not entirely sure that we want Cassandra as a backend for this.

Same.

Krinkle renamed this task from Generalize ParserCache, create a service for storing data derived from a page's current content to Generalize ParserCache into a generic service class for large "current" page-derived data. Jul 16 2019, 11:17 PM

@daniel wrote in task description:

  • be accessible from outside MediaWiki code, so standalone services can use it.

What is the use case for external access?

How would an external consumer deal with value validation (e.g. matches known rev id), fragmentation parameters, and how would it deal with absence of the value? I see ParserCache as fundamentally a getWithSet-like interface (with very high persistence, poolcounter, etc., but nonetheless fundamentally lazy-populated).

Krinkle triaged this task as Medium priority. Jul 16 2019, 11:19 PM
Krinkle moved this task from Inbox to Watching on the TechCom board.

This is somewhat similar to what RESTBase is doing right now - caching results of page transformation (Parsoid HTML/DP, MCS content, summary) - and what change-prop does (invalidating things in the cache in the right order, purging varnish), so the two systems should be unified.

How would an external consumer deal with value validation (e.g. matches known rev id), fragmentation parameters, and how would it deal with absence of the value

I would agree that this is a VERY big concern if going with an external service. RESTBase attempted to do so by selectively replicating the revision table into Cassandra and redoing a bunch of MW logic for access control, but I think this is a lost cause. MW access control/validation logic is too complicated and no external service would ever be good enough at it, so let's not repeat past mistakes going forward.

Using BagOStuff would also give us, for free, support for non-Wikimedia users of MediaWiki who want to continue using whatever they're currently using for ParserCache.

Totally support it. Thinking very big and ambitious, we could even use several BagOStuffs with different backends for different content types, so we select the correct replication/latency/consistency/etc. guarantees for different content types and use the correct backend technology for the correct use cases.

What is the use case for external access?

To clarify: I was referring to access outside MW core, but inside the local (in our case, WMF) network. The intent is not to make this a public service that can be accessed directly by external clients.

Concrete use cases (some currently in core) for extracting data from page content and caching it for later access: Wikibase constraint validation, graphoid, kartographer, mathoid, template data, page summary...

How would an external consumer deal with value validation (e.g. matches known rev id), fragmentation parameters

Ideally, the cache service itself would know about these things and handle them correctly. E.g. before returning a cache entry, it would check that it's not stale, and when purging the entry for a given page, it would purge the entire "bucket" of cached variants.

and how would it deal with absence of the value? I see ParserCache as fundamentally a getWithSet-like interface (with very high persistence, poolcounter, etc., but nonetheless fundamentally lazy-populated).

Currently, ParserCache isn't getWithSet. If there is no entry cached or the cached entry is stale, you get nothing back. Generating and then caching is the caller's responsibility.

For the new component described here, I'd propose to keep it that way. Generally, a component that accesses the cache (inside mw core or as a standalone service) would be using the cache for a kind of derived resource it knows how to generate.

The idea is: there would be one place to go to for getting rendered content, one for getting extracted infobox data, one for graphoid output, etc - and each of these places knows how to generate the derived resources, and uses the unified cache internally. This makes more sense to me than a generic endpoint for fetching any kind of resource, with some kind of internal routing to generate each resource.

Why, then, should different components that derive different kinds of things from pages share the caching infrastructure, instead of writing their own? Because the purging mechanism is the same, the access keys are the same, and the scale is similar. Having to re-invent this wheel leads to duplication and annoyance, or to the abuse of less-than-ideal mechanisms that exist, like page props.
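
To make the caller-driven pattern above concrete, here is a rough sketch of how code uses ParserCache today; the method signatures are from memory and may differ slightly between MediaWiki versions, and error handling is omitted:

$services = \MediaWiki\MediaWikiServices::getInstance();
$parserCache = $services->getParserCache();

$wikiPage = WikiPage::factory( Title::newFromText( 'Example' ) );
$parserOptions = ParserOptions::newFromAnon();

// get() returns false on a miss or a stale entry; the caller decides what to do then.
$output = $parserCache->get( $wikiPage, $parserOptions );
if ( !$output ) {
	// Generate the derived data ourselves, then put it back into the cache.
	$output = $wikiPage->getContent()->getParserOutput(
		$wikiPage->getTitle(), null, $parserOptions
	);
	$parserCache->save( $output, $wikiPage, $parserOptions );
}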

Currently, ParserCache isn't getWithSet. If there is no entry cached or the cached entry is stale, you get nothing back. Generating and then caching is the caller's responsibility.

That's what I mean. It's not like primary data in MySQL or in an API module. There is a very normal and common scenario in which data asked from ParserCache logically exists, and the revision can be parsed, but it just isn't in the cache right now. And its presence in the cache is not something the caller can predict. In other words, consuming ParserCache is pointless unless the consumer can also generate and populate it (that is, it uses it like getWithSet, more or less).

Exposing it to external services doesn't seem useful as such, because things would randomly be missing, without a way for it to populate it. And I don't think we want external services to come up with their own way of populating the ParserCache, that should presumably be something exclusively done within MediaWiki core.

If we want to expose a public way over the MediaWiki API to get the ParserOutput, that would make more sense. E.g. it would use the ParserCache if available, and generate on-demand as-needed.

I guess there is some confusion over the terms "external service" and "parser cache".

I like the idea of having the ParserCache being a more generalized caching mechanism for MediaWiki. I have serious doubts about other things hinted here, specifically exposing a caching endpoint to other services. I'd argue that such a caching service should be separated from MediaWiki, have a simple API, and probably be structured around the page/revision identifier. We also probably don't want such a system to be written in PHP, as we would aim for the highest possible throughput.

We do not want an application doing some business logic to also be the cache storage for everything else. It was wrong with restbase, it would be wrong here. Each application should manage its own caching logic. This logic should not be delegated to another application and should not rely on the automagic properties of some centralized management system that then becomes the brain of the whole architecture. The only exception I see to this could be some purging logic.

So if we want such a system to be generalized and usable outside of MediaWiki it should be a thin service in front of a storage system[1] and it should:

  • Have primitives that reproduce whatever API we use with e.g. BagOStuff
  • Be able to work across datacenters in write/write mode
  • Associate data it stores with a series of tags
  • Be able to purge/invalidate its contents based on a query on those tags
  • (probably) Be able to consume such invalidate events from our event systems directly, not needing a mediator
  • Aim for the best bargain in terms of cost/GB/latency/maintenance required
  • Support TTLs for objects

Such a system would allow us to do complex purging logic like "invalidate all entries where tags template=<tid>,wiki=<wiki>". I think that for this to be effective we need to allow a nearly-arbitrary set of tags.
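
As a purely hypothetical illustration (the action name, field names and tag vocabulary below are all invented), a tag-based purge request to such a thin service might carry a payload like this:

// Invented payload shape for illustration; <tid> and <wiki> are placeholders as above.
$purgeRequest = [
	'action' => 'purge',
	'match' => [
		'tags' => [
			'template' => '<tid>',
			'wiki' => '<wiki>',
		],
	],
];
echo json_encode( $purgeRequest, JSON_PRETTY_PRINT );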

Summary of TechCom discussion on 2019-08-14:

  • general support for the idea of having a generic store for artifacts derived from page content ("cache" seems a misleading term)
  • functionality should be implemented internally to MW core for now, based on the storage infrastructure of the existing ParserCache
  • making such functionality available to standalone services raises concerns, e.g. about checking access permissions
  • we may want a way to add access flags to the cached objects
  • ParserCache can already store multiple variants. The new system would just add the ability to add other things beyond variants, and to store them for more than just the latest revision of a page.

Rough sketch of an interface, off the top of my head:

put( $page, $stage, $field, $data, $meta )
get( $page, $stage, $field )
  • $page is the page ID
  • $stage is something like "current" or "stable".
  • $field is the thing to store - e.g. ParserOutput (one per variant), ConstraintViolations, GraphData, etc.
  • $data is an arbitrary blob
  • $meta contains restrictions and expiry information, like:
    • expiry timestamp
    • revision id
    • access restrictions
    • etag

Edited to add:

purge( $page, $stage = null, $field = null )

The idea is that $page and also $page+$stage serve as "tags" or "buckets" for purging all objects associated with them, using a prefix match or partial composite key. $page+$stage+$field forms the primary key used for put and get.
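
Spelled out as a PHP interface (names are illustrative, not an existing MediaWiki API), the sketch might read:

interface DerivedPageDataCache {
	/** Store a blob under the composite key $page + $stage + $field. */
	public function put( int $page, string $stage, string $field, string $data, array $meta ): void;

	/** Fetch a blob, or null if it is absent or has been invalidated. */
	public function get( int $page, string $stage, string $field ): ?string;

	/**
	 * Purge by prefix: $page alone drops everything for the page,
	 * $page + $stage drops one stage, and all three drop a single entry.
	 */
	public function purge( int $page, ?string $stage = null, ?string $field = null ): void;
}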

  • $stage is something like "current" or "stable".

This could probably use more detail.

  • $field is the thing to store - e.g. ParserOutput (one per variant), ConstraintViolations, GraphData, etc.

This makes it sound like $field is "ParserOutput", but I suppose it would actually have to be something like "ParserOutput" plus maybe the slot name plus the ParserOptions hash (which I guess is what you mean by "variant").

  • $meta contains restrictions and expiry information, like:
    • access restrictions

Why is this in here? Are you really intending to have to invalidate the cache every time some access restriction changes, even if $data doesn't (which it usually won't)?

Unless you're now planning to serve data right out of the cache rather than the cache always being behind some frontend, it would probably be better to leave access control to the frontend.

  • etag

I guess by this you mean "some hash of all the source data that went into $data"?

  • $stage is something like "current" or "stable".

This could probably use more detail.

The intent is to remove the need for the FlaggedRevs extension to hack in a second ParserCache.

  • $field is the thing to store - e.g. ParserOutput (one per variant), ConstraintViolations, GraphData, etc.

This makes it sound like $field is "ParserOutput", but I suppose it would actually have to be something like "ParserOutput" plus maybe the slot name plus the ParserOptions hash (which I guess is what you mean by "variant").

Yes, that's exactly what I meant.

  • $meta contains restrictions and expiry information, like:
    • access restrictions

Why is this in here? Are you really intending to have to invalidate the cache every time some access restriction changes, even if $data doesn't (which it usually won't)?

Not invalidate the cache, just update the metadata associated with the cache entry.
Access restrictions change very rarely, so performance-wise, this shouldn't be a problem either way.

Unless you're now planning to serve data right out of the cache rather than the cache always being behind some frontend, it would probably be better to leave access control to the frontend.

The idea is being able to use the new cache for things like the page summary, or graphoid data. These would be served externally. Which access restrictions apply may not just depend on the page itself, it may be specific to the field. So the code that writes the blob also sets the access restrictions. They would then be checked automatically when retrieving or serving the blob.

  • etag

I guess by this you mean "some hash of all the source data that went into $data"?

Whatever we want to use to identify the content to outer layers of caching, e.g. the web cache.
This is just an example though. The idea is that the metadata can be anything we may want to read or update without touching the full blob.
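
For illustration, the metadata record for one blob might look roughly like this (none of these field names are fixed; they just mirror the bullets in the sketch above):

// Hypothetical shape of the $meta record.
$meta = [
	'expiry' => '2019-09-01T00:00:00Z',   // expiry timestamp
	'revisionId' => 123456,               // revision the blob was derived from
	'restrictions' => [ 'read' ],         // access restrictions, checked when serving the blob
	'etag' => '123456/rendered-html',     // identifier for outer caching layers, e.g. the web cache
];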

This discussion inspired me to write patches for Graph and Kartographer that make their API modules retrieve the graph/map JSON blobs from the ParserOutput object in the parser cache, rather than from the page_props table. This avoids the issues with gzipping the data stored in the page_props table, and with it getting truncated if it's too large despite that. See also T98940 and T119043.
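
The pattern those patches follow is roughly the fragment below; the 'graph_specs' key is only illustrative here, the exact retrieval code differs per extension, and $parserOutput, $specs and $wikiPage are assumed to be in scope:

// At parse time, the extension attaches its blob to the ParserOutput instead of page_props:
$parserOutput->setExtensionData( 'graph_specs', $specs );

// Later, the API module reads it back via the parser cache:
$services = \MediaWiki\MediaWikiServices::getInstance();
$cached = $services->getParserCache()->get( $wikiPage, ParserOptions::newFromAnon() );
$specs = $cached ? $cached->getExtensionData( 'graph_specs' ) : null;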

Kask,[1] accessed via RESTBagOStuff?

Probably not Kask, but perhaps something similar, or a derivative or successor of Kask.

This sounds an awful lot like file storage (where I'm defining "file" to mean some semi-large (for definition of large) chunk of opaque data), which Kask (and Cassandra) aren't well suited for.

Though I'm not entirely sure that we want Cassandra as a backend for this.

Same.

I want to take the opportunity to walk this back just a bit. Cassandra is not ideal for storing files, but neither is MySQL/MariaDB, and yet we have done this in the past with both. Either may be suitable for doing so again here, but if it goes that way we should do our homework first. We have some limits in place now (limits that may or may not make sense for a service like this), but there were instances in the past where RESTBase persisted Parsoid output 10s of GB in size (and I believe MySQL's LONGBLOB is limited to 4GB FWIW).

What is the use case for external access?

To clarify: I was referring to access outside MW core, but inside the local (in our case, WMF) network. The intent is not to make this a public service that can be accessed directly by external clients.

Concrete use cases (some currently in core) for extracting data from page content and caching it for later access: Wikibase constraint validation, graphoid, kartographer, mathoid, template data, page summary...

How would an external consumer deal with value validation (e.g. matches known rev id), fragmentation parameters

Ideally, the cache service itself would know about these things and handle them correctly. E.g. before returning a cache entry, it would check that it's not stale, and when purging the entry for a given page, it would purge the entire "bucket" of cached variants.

and how would it deal with absence of the value? I see ParserCache as fundamentally a getWithSet-like interface (with very high persistence, poolcounter, etc., but nonetheless fundamentally lazy-populated).

Currently, ParserCache isn't getWithSet. If there is no entry cached or the cached entry is stale, you get nothing back. Generating and then caching is the caller's responsibility.

For the new component described here, I'd propose to keep it that way. Generally, a component that accesses the cache (inside mw core or as a standalone service) would be using the cache for a kind of derived resource it knows how to generate.

The idea is: there would be one place to go to for getting rendered content, one for getting extracted infobox data, one for graphoid output, etc - and each of these places knows how to generate the derived resources, and uses the unified cache internally. This makes more sense to me than a generic endpoint for fetching any kind of resource, with some kind of internal routing to generate each resource.

Why, then, should different components that derive different kinds of things from pages share the caching infrastructure, instead of writing their own? Because the purging mechanism is the same, the access keys are the same, and the scale is similar. Having to re-invent this wheel leads to duplication and annoyance, or to the abuse of less-than-ideal mechanisms that exist, like page props.

I'm trying to grok the discussion here and keep coming up a bit confused on one point, and I think it's a really important one.

You're proposing a service that would be accessible to external services (internal to WMF, external to MW), but access that would be limited to managing their own data, is that correct? IOW, the economy of scale here comes from a shared data model, interface, and mechanism for purging; Accessing content cross-service would still be brokered through the service responsible for it. Is that correct?

I like the idea of having the ParserCache being a more generalized caching mechanism for MediaWiki. I have serious doubts about other things hinted here, specifically exposing a caching endpoint to other services. I'd argue that such a caching service should be separated from MediaWiki, have a simple API, and probably be structured around the page/revision identifier. We also probably don't want such a system to be written in PHP, as we would aim for the highest possible throughput.
.
We do not want an application doing some business logic to also be the cache storage for everything else. It was wrong with restbase, it would be wrong here. Each application should manage its own caching logic. This logic should not be delegated to another application and should not rely on the automagic properties of some centralized management system that then becomes the brain of the whole architecture. The only exception I see to this could be some purging logic.
.
So if we want such a system to be generalized and usable outside of MediaWiki it should be a thin service in front of a storage system[1] and it should:

  • Have primitives that reproduce whatever API we use with e.g. BagOStuff
  • Be able to work across datacenters in write/write mode

What drives this requirement?

  • Associate data it stores with a series of tags
  • Be able to purge/invalidate its contents based on a query on those tags

I interpret this as an indexing requirement. In addition, I assume that purges need to be atomic at least, and perhaps isolated as well. This is all very worth pointing out; Previous comments referred to this as k/v storage, and cited the possibility of using BagOStuff, and based on these requirements, I do not think that is the case.

  • (probably) Be able to consume such invalidate events from our event systems directly, not needing a mediator
  • Aim for the best bargain in terms of cost/GB/latency/maintenance required
  • Support TTLs for objects

Such a system would allow us to do complex purging logic like "invalidate all entries where tags template=<tid>,wiki=<wiki>". I think that for this to be effective we need to allow a nearly-arbitrary set of tags.

I'm trying to grok the discussion here and keep coming up a bit confused on one point, and I think it's a really important one.

You're proposing a service that would be accessible to external services (internal to WMF, external to MW), but access that would be limited to managing their own data, is that correct? IOW, the economy of scale here comes from a shared data model, interface, and mechanism for purging; Accessing content cross-service would still be brokered through the service responsible for it. Is that correct?

That sounds correct to me.

The Platform Engineering team will soon start work on a project that will cover part of what is proposed here, in the context of the Parsoid migration: T262571: Parser Cache Support for Multiple Parsers.

The current plan is to keep ParserCache pretty much as it is, but to extract interfaces from ParserOutput and ParserOptions that can then be implemented by Parsoid and other code. The idea is then to create multiple ParserCache instances managed by a ParserCacheFactory, each configured with a different key prefix (and potentially also with different backend caches). This would cover the needs of Parsoid, the old parser, as well as FlaggedRevisions. There is currently no plan to work on a similar cache that would be usable outside MediaWiki.
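
A minimal sketch of what that might look like from calling code; the 'parsoid' instance name is hypothetical and the method names are from memory, so treat this as illustrative only:

$factory = \MediaWiki\MediaWikiServices::getInstance()->getParserCacheFactory();

// Each named instance gets its own key prefix (and possibly its own backend cache).
$legacyCache = $factory->getParserCache( 'pcache' );    // the default/legacy parser cache
$parsoidCache = $factory->getParserCache( 'parsoid' );  // a separately keyed instance, e.g. for Parsoid output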

Krinkle renamed this task from Generalize ParserCache into a generic service class for large "current" page-derived data to General ParserCache service class for large "current" page-derived data. Sep 30 2020, 5:25 PM

@daniel just wondering if you have an update?

I'm abandoning this proposal. I think it's covered by the following:

  • We now have a ParserCacheFactory (T263583), so we can cleanly store output generated in different ways or for different purposes (e.g. Parsoid or FlaggedRevs)
  • Caching output for different slots separately would still be nice. It could be done by restructuring ParserCache, or by introducing a CompositeParserOutput class. We investigated this a while back, but it wasn't prioritized at the time, see T192817.
  • We may still want to put other kinds of data besides ParserOutput into a ParserCache. With the serialization mechanism now using JSON, it should be trivial to modify the class hierarchy around ParserCache to support this, see T268848.

Making this mechanism accessible outside of MediaWiki, to standalone services, doesn't seem desirable if the implementation still lives in MediaWiki. Moving the entire ParserCache to a standalone service would be a substantial project, for which I don't see a justification at the moment... I guess TermBox would be a potential use case. If Wikidata wants to dig into this, I suggest filing a new ticket for the more specific use case. We can then revisit the idea of using the same service for core's needs as well.

I think there is currently one last bit missing to address the needs that led to the creation of this task, see T270710: Allow values other than ParserOutput to be stored in a ParserCache instance.