Page MenuHomePhabricator

TemplateData extension sets non-json-serializable page properties
Open, LowPublic

Description

TemplateData extension uses the following method to generate page properties:

	/**
	 * @return string JSON (gzip compressed)
	 */
	public function getJSONForDatabase() : string {
		if ( $this->jsonDB === null ) {
			// Cache for repeat calls
			$this->jsonDB = gzencode( $this->getJSON() );
		}
		return $this->jsonDB;
	}

Obviously the gzipped data can not be serialized to JSON as a part of the ParserOutput. We should probably remove the gzipping from this layer - if we do need to save on space, we can compress data on the storage layer.

Original parent is T263579

Event Timeline

Pchelolo added subscribers: saper, Krinkle.

So, the problem is that when we switch ParserCache to JSON, we loose the ability to store gzip-compressed data in page properties. T203850 already has restricted gzipping to just mysql, but it seems now we will have to eliminate gzipping from TemplateData altogether.

@Krinkle and @saper has been working on this last - do you think it's ok if I just drop gzipping from TemplateData altogether? Do we know if that will result in catastrophic increase in table size? Can we move gzipping to a different level - PageProps can zip the content in writes/reads, and we could potentially make a ZippedBagOStuff for ParserCache, if there's concerns about the size. But I do not think individual extensions should bother compressing their own page properties - that seems like a too low-level concern. What do you think?

Last I checked, the gzip is required for the data to fit in the database column, it was not an optimisation afaik. There is also the matter of display in the UI on pages such as Special:PagesWithProp where "gzip" is kind of a proy for "don't display big blobs here, use it as a boolean prop for whether the field is or isn't set".

I don't understand what the page_props value has to do with the ParserCache value though, why does it matter for the ParserOutput object whether one of its string values happens to use gzip encoding?

I don't understand what the page_props value has to do with the ParserCache value though, why does it matter for the ParserOutput object whether one of its string values happens to use gzip encoding?

json_encode sometimes fails to encode gzipped content. The page_props table is populated from ParserOutput::mProperies set by parser and extensions, so data set into PageProps is also present in ParserOutput object and is cached in the ParserCache. Switching ParserCache to json from PHP serialization makes us crash on some pages due to gzipped data in mProperties.

One option could be to drop page_props from ParserCache, and then read it on-demand from the database when ParserOutput::getProperty is called, but this might have negative performance impact.

Another option is to detect non-json-serializable strings when serializing ParserOutput for ParserCache and encode them somehow (base64, whatever), decode back on deserialization. It's totally doable, but given that there's other issues with gzip-encoded page properties, like T210548, I thought that perhaps we could fix it once and for all by either removing the gzipping altogether or moving it way down to the storage layer.

For JSON encoding, strings must be valid UTF-8, which gz-encoded binary is not. That makes sense. Thanks, forgot about that!

I'd be open to simply not gzipping these in page_props. It just means someone will need to take time to investigate why we did that, and whether the issues still apply. At the time, the issue was that many templates were unable to be documented propertly due to the limited size that page_props allows. If the size of that table field been increased since then, we might be able to disable compression.

Adding first-class support for this at the storage layer could work as well, we'd likely want that to be opt-in and transparent as much as possible, with some way for TemplateData to declare or hook-filter to indicate its desire to compress these. Sounds good to me. I wrote this extension for VisualEditor whilst on the Editing team. It'd be a fairly small change, and afaik Editing team still stewards this, so be sure to coordinate with them.

Ok, thank you for your input. I think at first I will pursue the simpler solution - look whether it's possible to remove compression and not break anything. Will try to hunt down original reasons and will write up what I discover.

Quick note: tried to find other instances of this problem via codeserch, it seems like TemplateData might be the only one that is compressing page properties.

If we don't care about the wasted space, we can just write TemplateData to the BlobStore, and record the blob address to pageprops. Maybe plus a hash, so we only write a new blob if it changed.

Or we can have a separate ParserCache instance, just for TemplateData. Perhaps that would be a pattern that could work for all kinds of data derived from the primary content. This is the kind of use case I had in mind for T227776.

Or we can have a separate ParserCache instance, just for TemplateData.

I'm not sure how will this help. The problem is that any page that has <templatedata> tag in it is not uncacheable in ParserCache with JSON serialization. Are you proposing to add php-serializing parser cache alongside with JSON-serializing cache?

If we don't care about the wasted space, we can just write TemplateData to the BlobStore, and record the blob address to pageprops. Maybe plus a hash, so we only write a new blob if it changed.

I think that nothing except TemplateData itself can be using these page properties, cause nothing really knows how to properly deserialize them. So technically we don't really need to have a page property for it at all? We can just use SqlBagOStuff with page_id as a key, or create a dedicated DB table for TemplateData, and drop the page property entirely.

Or we can have a separate ParserCache instance, just for TemplateData.

I'm not sure how will this help. The problem is that any page that has <templatedata> tag in it is not uncacheable in ParserCache with JSON serialization. Are you proposing to add php-serializing parser cache alongside with JSON-serializing cache?

The problem is that it's trying to squeeze the data into pageprops. Note that the data is already JSON, but then gets compressed to fit into the pageprops field.

I'm proposing to not put it into pageprops at all. And also not into the main ParserOutput. It should go into a separate ParserCache. This would require ParserCache to accept a thing that isn't a ParserOutput, but a slightly broader interface. It would need to know the concrete class to allow for deserialization from JSON... this needs a little more more thought for the details, but my point is: we want a generalized parser cache for holding this kind of derived data instead of putting it into pageprops.

I think that nothing except TemplateData itself can be using these page properties, cause nothing really knows how to properly deserialize them. So technically we don't really need to have a page property for it at all? We can just use SqlBagOStuff with page_id as a key, or create a dedicated DB table for TemplateData, and drop the page property entirely.

That's exactly the idea behind putting it into a separate ParserCache :)

An easier / more lightweight / intermediate solution is to just use setExtensionData() instead of setProperty(), so it's still part of the ParserOutput, but doesn't need to be written to the pageprops table. So no need for compression. We would need to re-parse in case of a cache miss, but those are rare (and TemplateData is rare as well).

Change 636754 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/TemplateData@master] [WIP] Don't write template data to pageprops table.

https://gerrit.wikimedia.org/r/636754

Change 637712 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/TemplateData@master] Add logging for parser cache misses.

https://gerrit.wikimedia.org/r/637712

The data requires strong persistence and fast access in batches without requring re-parsing under any circumstances. If it was needed only on page views, it would have already used setExtensionData. It is actually not needed at all on page views, so if there was precedent for excluding a page prop from ParserOutput, then TemplateData could make use of that. However, as of now we don't have that as a primitive, and I'm not sure we need that either. It seems harmless to keep there, and benefical for simplicity and consistency.

The main issue as I see is it that support for gzip compression was added locally in the TemplateData extension, thus leading to non-UTF8 binary data to seen by core where the new JSON-compatible ParserCache would not expect this.

Chatting with @Pchelolo I suggested to make gzip compression for page props supported and recognised by core. E.g. declaratively via an attribute in the extension registry (like for user rights, and such) or through a filter hook where we opt-in by append the key to a by-ref list of keys to compress. Core would then use this list and encode/decode as needed, transparent to ParserOutput and TemplateData.

There isn't a need for compat there imho, but if we want to make it easy to enable for pre-existing things in other extension, then we could make it e.g tolerate uncompressed values when reading values for keys that have opted-in. I'd recommend against trying to blindly decode any/all page props as gzip as that would imho be wasteful at run-time, and also remove ability to detect errors. If we know something is meant to be compressed and fails, we can emit an error for data corruption. We can't do that if it's implicit with no registered intent.

Long-term it would make sense to migrate TemplateData to MCR since it is literally just a blob from the source meant as a separately authored slot. It's not derived in any meaningful way apart from extraction. And if TemplateData is the only use case for this core feature, then perhaps it makes sense not to do that, or to only support it unofficially/internally for a short while and inves toward MCR instead. Alternatively, TemplateData could be given its own table if we don't want this to be in page_props for some reason.

The data requires strong persistence and fast access in batches without requring re-parsing under any circumstances.

My understanding was that TemplateData is used primarily by VisualEditor for generating an edit form for template parameters. Is that right?

I find two calls to ApiTemplateData via codesearch: one in GWToolset, and one in Parsoid's DataAccess. Both are for a single title. I also see the onParserFetchTemplateData handler in TemplateDataHooks, which sais "This inefficient implementation is currently tuned for Parsoid's use case where it requests info for exactly one title. For a real batch use case, this code will need an overhaul."

What's the use case that requires fast access in batches? For which of the above use cases would a reparse (if rare) be unacceptable, and why?
I would like to explore real world access patterns, so we can make an informed decision.

Chatting with @Pchelolo I suggested to make gzip compression for page props supported and recognised by core. E.g. declaratively via an attribute in the extension registry (like for user rights, and such) or through a filter hook where we opt-in by append the key to a by-ref list of keys to compress. Core would then use this list and encode/decode as needed, transparent to ParserOutput and TemplateData.

The filter hook approach seems attractive, but doing it cleanly isn't trivial. The problem is that while the retrieval logic resides in the PageProps class, the logic for updating the page_props table is entangled with the mechanism for updating links tables in the LinksUpdate, and requires comparison of serialized values for optimizing writes. To provide proper encapsulation of the serialization mechanism, a bit of refactoring would be needed.

By the way: ParserOutput::setProperty is documented to supports scalar values only. Adding a serialization mechanism would mean we start supporting (some) objects there. It seems like we are introducing support for somethign that we don't really want. PageProps was never intended as a storage mechanism for arbitrary data, but for things like counts, flags, and attributes.

Here's another thing we could do: we could write TemplateData to the BlobStore, and store the resulting BlobAddress (plus a hash) to PageProps. When updating, we could check the hash of the new data against the stored hash, and if it matches, just use the old address. If it changed, store a new blob. Old blobs would just stay in the blob store and never be deleted. We do that for revision data, so it's probably not a big deal.

Alternatively we can make the page prop MEDIUMBLOB instead of BLOB in mysql?

The data requires strong persistence and fast access in batches without requring re-parsing under any circumstances.

My understanding was that TemplateData is used primarily by VisualEditor for generating an edit form for template parameters. Is that right?

No, that came much later. The main use case is the interactive discovery and edit interface itself, which uses a prefixindex generator. (source)

Change 637712 merged by jenkins-bot:
[mediawiki/extensions/TemplateData@master] Add logging for parser cache misses.

https://gerrit.wikimedia.org/r/637712

Alternatively we can make the page prop MEDIUMBLOB instead of BLOB in mysql?

Sure, that would fix it. But the schema change is going to take months. And We would need to dercide whether this is really a use case we want to support with pageprops.

Change 638471 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] ParserOutput: add support for binary properties in JSON.

https://gerrit.wikimedia.org/r/638471

Change 638666 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] ParserOutput: only allow scalar page props.

https://gerrit.wikimedia.org/r/638666

Here's another thing we could do: we could write TemplateData to the BlobStore, and store the resulting BlobAddress (plus a hash) to PageProps. When updating, we could check the hash of the new data against the stored hash, and if it matches, just use the old address. If it changed, store a new blob. Old blobs would just stay in the blob store and never be deleted. We do that for revision data, so it's probably not a big deal.

This would be unprecedented, in that afaik we've never used BlobStore/ExternalStore for derived data.

It is imho a big deal because unlike revision data, they would have no way to be accessed, validated, or oversighted. It also isn't clear to me how this would work given that it is currently a dynamic operation based on parser expansion. This means using a keyword like {{CURRENTIME}} and {{#time}} would cause a lot of churn without a clear way to rate limit this in an way that is intuitive to users. Not to mention PC variations by language and other preferences. Writing to the BlobStore from within a parser function seems better to avoid. A post-save hook could work and more closely model MCR, but that still leaves us with a lot of potentiall uncollectable gargage.

This doesn't seem like a viable direction short-term, but it it could work long-term with more thought behind it. However, for the long-term, I think MCR would be a offer a better return on investment and would be less "novel" and surprising from an achitecture perspective for future maintainers.

Summary of options considered so far

  1. Remove gzencode call, but otherwise keep as-is. T266200#6571491
    • Not viable, the compression is needed in order to fit the allowed templatedata size in the page_props schema.
  2. Move gzencode() call from TemplateData/ParserOutput writer to core/PageProps writer, opt-in support from core. T266200#6571491, T266200#6592676
    • Viable, adds a small capability to core.
  3. Change core to remove all page props from ParseOutput/ParserCache, read/parse on-demand as needed. – T266200#6572904
    • Not viable. Database only contains current version of canonical parser variant. Needs to be available in general query API and for many different generator/batch queries including for use cases other than TemplateData. Parsing on-demand in such context is not acceptable and would make content inaccessible due to reaching execution timeouts.
  4. Move from page_props to a custom table owned by TemplateData.T266200#6592676
    • Needlessly expensive. Would add significant complexity and maint cost long-term, syncing via LinksUpdate? Re-using scalable core infra should be preferred. E.g. MCR, PageProps, or BlobStore.
  5. Move from page_props to MCR. – T266200#6592676
    • Viable long-term and "the right thing" to do. But non-trivial in the short-term. Need something else first to unblock Json-ParserCache.
  6. Remove call and increase size of page_props schema. – T266200#6596647
    • Viable long-term, but need something short-term first to unblock Json-ParserCache.
  7. Move from page_props to a custom BlobStore (like ExternalStore), and track address in page_props.T266200#6595428
    • Needlessly expensive, but could work long-term. However seems inferior to MCR.

I think supporting gzip in core is the simplest way to get this unblocked. If we don't want to have to support the gzip capability in PageProps long-term, I propose we mark it @internal and effectively deprecated on-arrival. E.g. just for one release cycle while we move it to MCR and-or increase PageProps, and remove the option again from core after that.

I think the most primising short term solution right now is to apply base64 encoding to binary properties in ParserOutput, see https://gerrit.wikimedia.org/r/638471. That should just work (tm). I'll give my thoughts on your comment anyway, for future consideration.

This would be unprecedented, in that afaik we've never used BlobStore/ExternalStore for derived data.

AbuseFilter has been using ExternalStore for that, and is about to start using BlobStore, see T261889.

It is imho a big deal because unlike revision data, they would have no way to be accessed, validated, or oversighted.

Blobs for old revisions would become inaccessible, so no need for oversight. Access to the latest would be possible via the address in page_props-

It also isn't clear to me how this would work given that it is currently a dynamic operation based on parser expansion. This means using a keyword like {{CURRENTIME}} and {{#time}} would cause a lot of churn without a clear way to rate limit this in an way that is intuitive to users. Not to mention PC variations by language and other preferences.

I don't see how that is different from the current situation.

Writing to the BlobStore from within a parser function seems better to avoid. A post-save hook could work and more closely model MCR, but that still leaves us with a lot of potentiall uncollectable gargage.

Yea, it should probably not happen inline, but once parsing is complete. Must happen though before the ParserOutput goes into ParserCache.

This doesn't seem like a viable direction short-term, but it it could work long-term with more thought behind it. However, for the long-term, I think MCR would be a offer a better return on investment and would be less "novel" and surprising from an achitecture perspective for future maintainers.

Absolutely, yes.

This would be unprecedented, in that afaik we've never used BlobStore/ExternalStore for derived data.

AbuseFilter has been using ExternalStore for that, and is about to start using BlobStore, see T261889.

That data isn't derived or otherwise ephemeral or variable (afaik). It's moving from one append-only store (custom table) to another, both times each record remains referenced by the app. Very much a textbook use case for ES like revision text and Flow content is.

It also isn't clear to me how this would work given that it is currently a dynamic operation based on parser expansion. This means using a keyword like {{CURRENTIME}} and {{#time}} would cause a lot of churn without a clear way to rate limit this in an way that is intuitive to users. Not to mention PC variations by language and other preferences.

I don't see how that is different from the current situation.

Right now we only store the current-canonical version, nothing else. The problem is that this would seem to cause uncontrolled growth based on past revisions and many parser cache variations, in a store that isn't ephemeral.

I think the most primising short term solution right now is to apply base64 encoding to binary properties in ParserOutput, see https://gerrit.wikimedia.org/r/638471. That should just work (tm). I'll give my thoughts on your comment anyway, for future consideration.

This seems fine by me because afaik we never need this data via ParserOutput anyway. encoding it in a way that doesn't break JSON, or even omitting it entirely, either would be fine by me. I'd also be fine with it being marked temporary or internal and not long-term supported, if that helps you with long-term maintenance.

This draws a nice compromise I think by making it opt-in rather than omitting page props from PC more generally, which has compat concerns. If TemplataData were to want to support PC, then I could write the same data in non-compressed form to setExtensionData. However as said, we don't yet need that afaik, so base64-tolerating it or omitting is fine.

This seems fine by me because afaik we never need this data via ParserOutput anyway. encoding it in a way that doesn't break JSON, or even omitting it entirely, either would be fine by me. I'd also be fine with it being marked temporary or internal and not long-term supported, if that helps you with long-term maintenance.

I was indeed wondering if we could just skip it during JSON serialization. Are we sure we never run LinksUpdate based on a ParserOutput we loaded from the ParserCache? If we did, we'd end up removing templatedata from page_props. I think RefreshLinksJob can cause this situation.

This seems fine by me because afaik we never need this data via ParserOutput anyway. encoding it in a way that doesn't break JSON, or even omitting it entirely, either would be fine by me. I'd also be fine with it being marked temporary or internal and not long-term supported, if that helps you with long-term maintenance.

I was indeed wondering if we could just skip it during JSON serialization. Are we sure we never run LinksUpdate based on a ParserOutput we loaded from the ParserCache? If we did, we'd end up removing templatedata from page_props. I think RefreshLinksJob can cause this situation.

Change 638471 merged by jenkins-bot:
[mediawiki/core@master] ParserOutput: add support for binary properties in JSON.

https://gerrit.wikimedia.org/r/638471

Change 636754 abandoned by Daniel Kinzler:
[mediawiki/extensions/TemplateData@master] Don't write template data to pageprops table.

Reason:
We go with base64 encoding for now, we may come back to this later

https://gerrit.wikimedia.org/r/636754

Bildschirmfoto von 2020-11-17 21-59-14.png (697×1 px, 73 KB)

The data gathered shows a good 10% of parser cache misses, amounting to over a thousand re-parses per hour. That means my idea of using setExtensionData() instead of setProperty() doesn't work. I have already abandoned the corresponding patch. I'm now submitting patches to undo the statistics gathering, which is just adding unnecessary ParserCache lookups.

Change 641532 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/TemplateData@master] Revert "Add logging for parser cache misses."

https://gerrit.wikimedia.org/r/641532

I'm lowering the priority. We are for now working around this issue by transparently applying base64 encoding to binary strings in JSON data.

This ticket should however remain open, since the underlying problem remains unsolved. The long term solution should probably be migrating TemplateData to MCR.

daniel lowered the priority of this task from High to Low.

Change 641532 merged by jenkins-bot:
[mediawiki/extensions/TemplateData@master] Revert "Add logging for parser cache misses."

https://gerrit.wikimedia.org/r/641532

Pchelolo claimed this task.
Pchelolo moved this task from Ready to Deploy to Done on the Platform Team Workboards (Green) board.

Ouch, jumped the gun on closing it.

Will move it to roadmap decision making and unparent.

Change 638666 abandoned by Daniel Kinzler:
[mediawiki/core@master] ParserOutput: only allow scalar page props.

Reason:
obsolete

https://gerrit.wikimedia.org/r/638666

Sorry, but I do want to access TemplateData via API. May be within pageprops or anywhere else.

TemplateData is a valuable information which may be exploited by various tools and gadgets, also on searching for the appropriate template. At least template description is required via API. There is a project at WMDE which is heading for interactive browsing through all templates matching some criteria.

Another application brought me here:

Now imagine a gadget changing that by screengrabbing on demand for 200 templates:

<dl>
<dt><a href="/Template:Quotation">Quotation</a></dt>
<dd>Adds a block quotation.</dd>
<dt><a href="/Template:Quotation_with_original">Quotation with original</a></dt>
<dd>Adds a block quotation and original quote as well.</dd>
<dt><a href="/Template:Quote">Quote</a></dt>
<dd>adds a block quotation to a page ...
  • Another issue is to query on list of used templates within a page to equip them all with explanation on request.
  • Those need general description statement only, but in all languages to match current user language. Maybe pageprops can be reduced to multilingual overall description, even plain JSON.

I wondered why API returned corrupted gzip data while a colleague is unzipping in the Cloud, but directly accessing replica fields.

  • You may also visit this prototype with most frequently used templates in German Wikipedia article namespace (the green things are TemplateData), and explore a single template and its parameters (the green things are TemplateData parameter descriptions) but in the Cloud.

Since MCR has been mentioned:

  • Please note that it is common practice to transclude parts of JSON by {{#tag:templatedata rather than static <templatedata> within one page.
  • Please see this category full of shared doc bodies including TemplateData, e.g. this one or feeding 2140 template doc pages with centrally maintained TemplateData and more.

@PerfektesChaos That sounds great. What is your question or concern in relation to this task? TemplateData is available for the API, always has been, and is as far as I know also the only supported way of accessing the data. Nothing about that has changed as far as I know.

This task is about how the data is internally stored in the private ParserCache structure, of which the internals do not affect tools, gadgets, or API queries.

If you are finding corrupted or wrongly compresed/decompressed data somewhere, please file a high priority task and explain how to reproduce the issue. If you suspect it is related to this one, feel free to mention this task there.

@PerfektesChaos That sounds great. What is your question or concern in relation to this task? TemplateData is available for the API, always has been, and is as far as I know also the only supported way of accessing the data. Nothing about that has changed as far as I know.

It has been taken into account to drop storage of TemplateData, be it compressed or not, within DB.

  • That should be kept available.
  • The track is rather confusing, distributed over three tasks, and I do not see a clear solution.
  • One issue is complaining about violation of pageprops for graph and kartographer, considering to drop storing at all. Others seem to think about a new pagepropsBlob for large things, while pageprops is kept for usual tiny things.

If you are finding corrupted or wrongly compresed/decompressed data somewhere, please file a high priority task and explain how to reproduce the issue. If you suspect it is related to this one, feel free to mention this task there.

I found that yesterday. Not in DB tables but API. That’s why I looked for this task here. Just try. This issue is already on rails, somewhere. Probably comp/decomp is okay at DB, but result is not yet ASCII-encoded for safe JSON.

Why not use https://www.mediawiki.org/wiki/Special:ApiSandbox#action=templatedata&titles=Template%3AQuote? Using the pageprops API for that never worked, as far as I'm aware of. The string was always compressed, i.e. all you got was a bogus binary blob that might or might not be gzip compressed. Yes, that's confusing, and something should be done about this. But restoring the old binary format is not a good solution, in my humble opinion. Not sure what to do instead. Is it possible to make the pageprops API aware and deliver the decoded string instead?

Yes, thanks, meanwhile I helped myself with action=templatedata now. However, I do need displaytitle from pageprops as well, so I wanted one API call only. Now I make two calls every 50 templates. I can save decompression of result now, consuming more bandwith.

The opinions in the three tasks are quite confusing, including discontinuation of something.

I guess best approach is to create a new DB field pagedata or pageblobs with unlimited size. I learnt that pageprops are used as a place to store result of templatedata and graph evaluation etc. Apparently they broke the pageprops limitation, which is supposed to contain tiny things.

If a new DB field is used, the binaries would vanish from Special:PagesWithProp options where they need special treatment since they cannot be displayed.