
Move GeoJSON for <mapframe> tags to its own table (in order to fix Flagged Revs with mapframe)
Open, Medium, Public

Description

Instead of storing GeoJSON for <mapframe> tags in the parser cache (previously the page_props table), we should store it in a table that's really just a key-value store, where the keys are the hashes and the values are the GeoJSON blobs. This is basically solution #1 from T119043: Graph/Graphoid/Kartographer - data storage architecture but without the second rev tracking table.

We would need to modify the code that writes the GeoJSON blobs (indirectly, through ParserOutput methods) to write to this new table instead, and modify the api.php module to read from this new table. We'd also need a migration script (or have the API module read from both).
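As a rough sketch of the proposed storage, the write and read paths could look like this. This is illustrative Python rather than the extension's actual PHP; the table and function names are invented, and a dict stands in for the new SQL table:

```python
import hashlib
import json

# Hypothetical sketch of the proposed key-value table: keys are content
# hashes of GeoJSON blobs, values are the blobs themselves.
_map_data_table = {}  # stands in for the new SQL table

def store_blob(geojson):
    """Parse-time write path: key the blob by its own content hash."""
    blob = json.dumps(geojson, sort_keys=True, separators=(",", ":"))
    key = hashlib.sha1(blob.encode("utf-8")).hexdigest()
    _map_data_table[key] = blob  # roughly INSERT IGNORE in SQL terms
    return key

def fetch_blob(key):
    """Read path for the api.php module: look the blob up by hash."""
    return _map_data_table.get(key)

key = store_blob({"type": "Point", "coordinates": [13.4, 52.5]})
assert fetch_blob(key) is not None
assert fetch_blob("no-such-hash") is None
```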


Event Timeline

Catrope created this task.

I think @MaxSem was working on this a while back, and might have even had some code.

This should not be done exclusively for mapframe because graph has identical issue, and needs the same type of storage. It should be a table that all extensions can use - just add an extra column for "type" of sorts.

> I think @MaxSem was working on this a while back, and might have even had some code.

https://gerrit.wikimedia.org/r/#/c/291300/

jmatazzoni subscribed.

Note that we have not promised this; it was judged out of scope. But we are including it on the project board, and will keep an eye on things in case we have time to accomplish this task, which we think would fix the incompatibility of mapframe and Flagged Revs.

Storing just the key-value hash → spec mapping is not a very good solution, because graphs/map templates/Lua may rely on dynamic values like the current time, which means the data will keep changing with every map rendering, polluting this table without any hope of cleaning it up.

> Storing just the key-value hash → spec mapping is not a very good solution, because graphs/map templates/Lua may rely on dynamic values like the current time, which means the data will keep changing with every map rendering, polluting this table without any hope of cleaning it up.

Thanks for explaining that, now I understand why you proposed a second table for tracking which revision owns which blobs.

jmatazzoni renamed this task from Move GeoJSON for <mapframe> tags to its own table instead of page_props to Move GeoJSON for <mapframe> tags to its own table instead of page_props (in order to fix Flagged Revs with mapframe).Apr 26 2018, 5:42 PM
jmatazzoni added a subscriber: Bmueller.

This turns out to be much trickier than I thought. There isn't necessarily one single blob of GeoJSON for one revision, because there isn't one parser cache entry for each revision: the parser cache is fragmented on various things that have the potential to change the GeoJSON (like values of magic words). You can still store key-value pairs, but it's hard-to-impossible to figure out which ones are in use and which ones you can clean up. If someone puts {{CURRENTTIME}} inside a GeoJSON blob, that'll end up creating lots of entries in that table, all but one of them outdated, but there's no good way (that I saw) to figure out which ones those are.

Can't you just store it in the parser cache? FlaggedRevs provides a secondary parser cache for stable revisions and hooks into the ParserOutput retrieval process so as long as you use something like ParserOutput::get/setExtensionData you shouldn't have to care whether the wiki uses FlaggedRevs or not.

In the longer term, mapframe data might be a good candidate for per-revision storage via a dedicated Multi-Content-Revisions slot.

@Tgr the main problem with the cache is that sometimes it might not be there when needed, and it would have to be rebuilt. But to rebuild something, you have to somehow provide the context required for the rebuilding. Currently each graph is identified by its hash, making that impossible. The map/graph would have to use a different URL structure that includes all the needed info (e.g. page revision ID and graph index), and graphoid/kartotherian would have to pass all that info to the MW API for regeneration. But the moment you include the revision ID, you destroy the image cache, because each image would have to be regenerated on every page save - and regenerating these images may take longer than the page parsing itself.

What's the difference between a page save and a parser cache rebuild in terms of the context available for deriving the hash?

Similar - you re-run the parser for the specific page revision (either new or old), for the whole page, and as part of that process you get a JSON blob. That JSON blob is what the graphoid/kartotherian services need to render the image. The hash is just a way to uniquely identify that blob. Most of the time, the blob would be identical between revisions, or even across multiple pages on the same wiki. But in some cases, e.g. when time-related constants are used, the blob will be different with each parsing.
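That instability can be reduced to a small Python sketch. The hashing scheme here is illustrative (not necessarily the one Kartographer actually uses): a static spec always maps to the same key, while a spec embedding time-dependent content produces a fresh key on every parse.

```python
import hashlib
import json

def blob_hash(spec):
    """Identify a JSON blob by a hash of its canonical serialization."""
    blob = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(blob.encode("utf-8")).hexdigest()

# A static spec hashes identically on every parse, across revisions:
static = {"type": "Point", "coordinates": [0, 0]}
assert blob_hash(static) == blob_hash(static)

# A spec embedding a time-dependent value (as an expanded {{CURRENTTIME}}
# would) produces a different hash on each parse, orphaning the old key:
noon = {"title": "As of 12:00"}
a_minute_later = {"title": "As of 12:01"}
assert blob_hash(noon) != blob_hash(a_minute_later)
```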

Is that a bad thing? If time-related constants are used, that means editors expect the image to change over time.

Sure, that's fine. The problem is the cache. If you include the revision ID in the URL, your image MUST be regenerated on each save (regardless of whether it depends on time). This is an expensive operation, plus users may be annoyed that their graphs/maps don't show up right after saving. If you don't include the rev ID, you cannot regenerate the page, only get the JSON blob from the cache when available. BTW, currently JSON blobs are already stored in memcached, with the longest possible expiration.

I still don't understand how this is specific to FlaggedRevs. Right now, ApiQueryMapData uses the page title to look up the kartographer page property for that page; if it used the canonical parser cache entry instead, it would receive the property based on the correct revision (stable or latest, depending on user and page config and whatnot). In all other aspects it seems identical to me.

Yes, that might require the page to be reparsed; so does fetching the page HTML via the API. Is that bad? I'd imagine the map data is usually fetched in some context where the page HTML also gets displayed, so if the page is not in the parser cache, it needs to be reparsed anyway.

Architecture: parser for graph/kartographer generates JSON blobs and their hashes. The hash goes into img URL. Browser fetches image from backend service, which uses the hash to get the original JSON blob from MW API and renders it into an image. There are two levels of caching: image cache (Varnish), and memcached used by the MW API (per hash).

Caching aspect: if the parser cache uses the same memcached instance, then it doesn't matter if something is in the latest vs. a non-latest rev -- the JSON blob gets cached for all revisions the same way, by its hash value. So a stable page rev, when parsed, would place the JSON blob into the same cache as the latest rev, by the blob's hash, with the maximum expiration.
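A minimal sketch of that sharing (names invented for illustration): because the cache key is the blob's own hash, the stable and the latest revision populate one and the same entry whenever their GeoJSON is identical.

```python
import hashlib
import json

# A memcached-like store keyed by the blob's content hash. Both the
# stable and the latest revision write through the same function.
blob_cache = {}

def cache_blob(spec):
    blob = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    key = hashlib.sha1(blob.encode("utf-8")).hexdigest()
    blob_cache[key] = blob
    return key

key_latest = cache_blob({"type": "Point", "coordinates": [1, 2]})  # latest rev
key_stable = cache_blob({"type": "Point", "coordinates": [1, 2]})  # stable rev
assert key_latest == key_stable
assert len(blob_cache) == 1  # one shared entry serves both revisions
```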

The issue happens when you get a cache miss. The URL only contains the hash and the page ID. If the MW API fails to find the given hash in the cache, it needs to somehow get/regenerate the original blob. Currently it simply pulls it by hash from page_props. To do a full regeneration, you have to have two things: the page revision ID (instead of the page ID), and the graph index (there could be multiple graphs on a page, so you need to identify which graph you need). The hash is useless for graph identification because it may be different for every parse.
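The cache-miss fallback just described could look roughly like this. All names are hypothetical, and `reparse` stands in for a full page parse of one revision:

```python
# Hypothetical lookup with a regeneration fallback: with only a hash
# there is nothing to regenerate from; with a revision ID and a graph
# index the blob can be rebuilt by reparsing that revision.
blob_cache = {}

def get_blob(key, rev_id=None, graph_index=None, reparse=None):
    blob = blob_cache.get(key)
    if blob is not None:
        return blob  # fast path: memcached hit
    if rev_id is None or graph_index is None or reparse is None:
        return None  # the hash alone cannot identify the graph to rebuild
    blobs = reparse(rev_id)    # expensive: re-run the parser for that rev
    blob = blobs[graph_index]  # pick the Nth graph on the page
    blob_cache[key] = blob
    return blob

# Demo with a stubbed parser that yields two graph blobs for revision 7:
parsed = {7: ['{"graph": "a"}', '{"graph": "b"}']}
assert get_blob("h1", rev_id=7, graph_index=1,
                reparse=parsed.__getitem__) == '{"graph": "b"}'
assert get_blob("h1") == '{"graph": "b"}'  # now served from cache
assert get_blob("unknown-hash") is None    # miss with no rebuild context
```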

So if you include the revision ID and graph index in the URL, the MW API may be able to regenerate the needed JSON blob in most cases. The biggest drawback is that your Varnish cache will get a lower hit rate - because now the image URL is per revision, not per page. This is a big deal for a page with a few hundred graphs (I have seen those - e.g. a table with small pie charts). Every update will require a full regeneration of all images. Overall, this might actually not be that bad - if the majority of graph-using pages are not changed often.

Note that even blob regeneration still doesn't fully solve it. Let's say your page has a single graph, and uses some template before the graph. Now let's say someone modifies the template and adds a graph to it - so now there are two graphs on the page. The next time a user views the page, the HTML might still be in the Varnish cache, with only one graph, but the graph image itself might be gone from cache. The service would request blob #1, and reparsing would return the wrong blob, because now it would also count the graph inside the template.
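That index-shift failure mode can be reduced to a couple of lines (illustrative only): the cached HTML was generated when the page contained one graph, so it refers to graph index 0, but after the transcluded template gains a graph of its own, a reparse yields different blobs at those indices.

```python
# Blob order at the time the HTML was cached vs. after the template edit:
graphs_when_cached = ["page-graph"]
graphs_after_template_edit = ["template-graph", "page-graph"]

# The stale HTML asks for index 0 and a fresh reparse hands back the
# template's graph instead of the page's own:
requested_index = 0
assert graphs_after_template_edit[requested_index] != graphs_when_cached[requested_index]
```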

> I still don't understand how this is specific to FlaggedRevs. Right now, ApiQueryMapData uses the page title to look up the kartographer page property for that page; if it used the canonical parser cache entry instead, it would receive the property based on the correct revision (stable or latest, depending on user and page config and whatnot). In all other aspects it seems identical to me.

That would probably alleviate the issue a bit, but you'd still just be moving it. The problem is that it's hard for ApiQueryMapData to give you a GeoJSON blob for a revision that isn't the canonical one. If you change the definition of "canonical" to be the stable revision instead of the latest one, you've fixed the rendering of maps on the stable version while breaking it on the latest version. Viewing maps on historical versions of pages would also remain broken.

Because our stack assumes that you're always viewing the latest version even when you're not, and because it uses a hash of the GeoJSON as the key, maps do work (more or less by accident) on the stable version and other old versions if the GeoJSON is the same as on the latest version. But edits to the GeoJSON break rendering of old revisions, and that's what we're trying to fix here.

> Yes, that might require the page to be reparsed; so does fetching the page HTML via the API. Is that bad? I'd imagine the map data is usually fetched in some context where the page HTML also gets displayed, so if the page is not in the parser cache, it needs to be reparsed anyway.

Old revisions are not in the parser cache (AFAIK), so they'd have to be reparsed every time.

> Old revisions are not in the parser cache (AFAIK), so they'd have to be reparsed every time.

The older hash → JSON blob mapping stays in memcached for a long time.

Removing task assignee due to inactivity, as this open task has been assigned to the same person for more than two years (see the emails sent to the task assignee on Oct27 and Nov23). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.
(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.)

thiemowmde renamed this task from Move GeoJSON for <mapframe> tags to its own table instead of page_props (in order to fix Flagged Revs with mapframe) to Move GeoJSON for <mapframe> tags to its own table (in order to fix Flagged Revs with mapframe).Mar 23 2021, 10:28 AM
thiemowmde updated the task description. (Show Details)
thiemowmde subscribed.

Note: Since https://gerrit.wikimedia.org/r/531159 (merged 2019-08-21), page_props isn't used any more; the parser cache is used instead. I updated this task's description a bit to reflect this.