
Move GeoJSON for <mapframe> tags to its own table (in order to fix Flagged Revs with mapframe)
Open, Medium, Public

Description

Instead of storing GeoJSON for <mapframe> tags in the parser cache (previously the page_props table), we should store it in a table that's really just a key-value store, where the keys are the hashes and the values are the GeoJSON blobs. This is basically solution #1 from T119043: Graph/Graphoid/Kartographer - data storage architecture, but without the second rev-tracking table.

We would need to modify the code that writes the GeoJSON blobs to page_props (indirectly through ParserOutput methods) to instead write to this new table, and modify the api.php module to read from this new table instead of page_props. We'd also need a migration script (or have the API module read from both).
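
As a rough sketch of the "read from both" option, assuming a hypothetical new table kartographer_data (columns kd_page, kd_hash, kd_blob) and a hypothetical page_props name; the real schema and property name would still need to be decided:

```php
// Rough sketch only: the table, column, and property names below are placeholders.
use MediaWiki\MediaWikiServices;

$dbr = MediaWikiServices::getInstance()->getDBLoadBalancer()
	->getConnection( DB_REPLICA );

$blob = $dbr->selectField(
	'kartographer_data',
	'kd_blob',
	[ 'kd_page' => $pageId, 'kd_hash' => $hash ],
	__METHOD__
);
if ( $blob === false ) {
	// Row not migrated yet: fall back to the old page_props storage.
	$blob = $dbr->selectField(
		'page_props',
		'pp_value',
		[ 'pp_page' => $pageId, 'pp_propname' => 'kartographer' ],
		__METHOD__
	);
}
```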

Related Objects

Status | Assigned
Invalid | None
Resolved | Lena_WMDE
Resolved | JGirault
Resolved | JGirault
Duplicate | None
Resolved | Yurik
Resolved | JGirault
Resolved | JGirault
Resolved | Esanders
Resolved | JGirault
Resolved | Yurik
Resolved | Yurik
Resolved | MaxSem
Resolved | Yurik
Resolved | Yurik
Resolved | JGirault
Resolved | Yurik
Resolved | Yurik
Resolved | Catrope
Resolved | Yurik
Resolved | JGirault
Resolved | debt
Resolved | debt
Resolved | debt
Resolved | Dereckson
Resolved | debt
Resolved | debt
Resolved | DatGuy
Resolved | Urbanecm
Resolved | debt
Resolved | Jayprakash12345
Invalid | Jayprakash12345
Resolved | Catrope
Resolved | Urbanecm
Resolved | Urbanecm
Resolved | Zabe
Resolved (BUG REPORT) | TheDJ
Resolved | 4nn1l2
Resolved | JTannerWMF
Open | None

Event Timeline

Catrope triaged this task as Medium priority. Apr 21 2018, 12:09 AM
Catrope created this task.

I think @MaxSem was working on this a while back, and might have even had some code.

This should not be done exclusively for mapframe, because the Graph extension has the identical issue and needs the same type of storage. It should be a table that all extensions can use; just add an extra column for the "type" of data.
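
Purely as an illustration of that idea (none of these names exist anywhere yet), a shared table keyed on (type, hash) could be written to roughly like this:

```php
// Hypothetical shared table: ext_blob_store( ebs_type, ebs_hash, ebs_blob ),
// unique on (ebs_type, ebs_hash). All names are placeholders.
use MediaWiki\MediaWikiServices;

$dbw = MediaWikiServices::getInstance()->getDBLoadBalancer()
	->getConnection( DB_PRIMARY ); // DB_MASTER on older MediaWiki

$dbw->replace(
	'ext_blob_store',
	[ [ 'ebs_type', 'ebs_hash' ] ], // one row per (type, hash)
	[
		'ebs_type' => 'kartographer', // or 'graph' for the Graph extension
		'ebs_hash' => $hash,
		'ebs_blob' => $json,
	],
	__METHOD__
);
```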

> I think @MaxSem was working on this a while back, and might have even had some code.

https://gerrit.wikimedia.org/r/#/c/291300/

jmatazzoni subscribed.

Note that we have not promised this; it was judged out of scope. But we are including it on the project board and will keep an eye on things in case we have time to accomplish this task, which we think would fix the incompatibility of mapframe and Flagged Revs.

Storing just the key-value mapping of hash -> spec is not a very good solution, because graphs/map templates/Lua may rely on dynamic things like the current time, which means the data will keep changing with every map rendering, polluting this table without any hope of cleaning it up.

> Storing just the key-value mapping of hash -> spec is not a very good solution, because graphs/map templates/Lua may rely on dynamic things like the current time, which means the data will keep changing with every map rendering, polluting this table without any hope of cleaning it up.

Thanks for explaining that, now I understand why you proposed a second table for tracking which revision owns which blobs.

jmatazzoni renamed this task from Move GeoJSON for <mapframe> tags to its own table instead of page_props to Move GeoJSON for <mapframe> tags to its own table instead of page_props (in order to fix Flagged Revs with mapframe). Apr 26 2018, 5:42 PM
jmatazzoni added a subscriber: Bmueller.

This turns out to be much trickier than I thought. There isn't necessarily one single blob of GeoJSON for one revision, because there isn't one parser cache entry for each revision: the parser cache is fragmented on various things that have the potential to change the GeoJSON (like the values of magic words). You can still store key-value pairs, but it's hard to impossible to figure out which ones are in use and which ones you can clean up. If someone puts {{CURRENTTIME}} inside a GeoJSON blob, that'll end up creating lots of entries in that table, all but one of them outdated, but there's no good way (that I saw) to figure out which ones those are.

Can't you just store it in the parser cache? FlaggedRevs provides a secondary parser cache for stable revisions and hooks into the ParserOutput retrieval process, so as long as you use something like ParserOutput::getExtensionData/setExtensionData, you shouldn't have to care whether the wiki uses FlaggedRevs or not.
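
A minimal sketch of that approach, assuming a made-up extension-data key ('kartographer-blobs'); Kartographer's actual data layout may differ:

```php
// In the <mapframe> tag handler, during parsing (hypothetical key name):
$blobs = $parserOutput->getExtensionData( 'kartographer-blobs' ) ?: [];
$blobs[ $hash ] = $geoJson;
$parserOutput->setExtensionData( 'kartographer-blobs', $blobs );

// In ApiQueryMapData, read from whichever ParserOutput the wiki hands back
// (FlaggedRevs can substitute the stable revision's ParserOutput here):
$blobs = $parserOutput->getExtensionData( 'kartographer-blobs' );
$geoJson = $blobs[ $requestedHash ] ?? null;
```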

In the longer term, mapframe data might be a good candidate for per-revision storage via a dedicated Multi-Content-Revisions slot.

@Tgr the main problem with a cache is that sometimes it might not be there when needed, and it would have to be rebuilt. But to rebuild something, you have to somehow provide the context required for the rebuild. Currently each graph is identified only by its hash, which makes that impossible. The map/graph would have to use a different URL structure that includes all the needed info (e.g. page revision ID and graph index), and Graphoid/Kartotherian would have to pass all that info to the MW API for regeneration. But the moment you include the revision ID, you destroy the image cache, because each image would have to be regenerated on every page save, and regenerating these images may take longer than the page parsing itself.

What's the difference between a page save and a parser cache rebuild in terms of the context available for deriving the hash?

Similar: you re-run the parser for the specific page revision (either new or old), for the whole page, and as part of that process you get a JSON blob. That JSON blob is what the Graphoid/Kartotherian services need to render the image. The hash is just a way to uniquely identify that blob. Most of the time the blob is identical between revisions, or even across multiple pages on the same wiki. But in some cases, e.g. when time-related constants are used, the blob will be different with each parse.

Is that a bad thing? If time-related constants are used, that means editors expect the image to change over time.

Sure, that's fine. The problem is the cache. If you include the revision ID in the URL, your image MUST be regenerated on each save (regardless of whether it depends on time or not). This is an expensive operation, plus users may be annoyed that their graphs/maps don't show up right after saving. If you don't include the revision ID, you cannot regenerate the blob; you can only get it from the cache when it's available. BTW, the JSON blobs are currently already stored in memcached, with the longest possible expiration.

I still don't understand how this is specific to FlaggedRevs. Right now, ApiQueryMapData uses the page title to look up the Kartographer page property for that page; if it used the canonical parser cache entry instead, it would receive the property based on the correct revision (stable or latest, depending on user and page config and whatnot). In all other aspects it seems identical to me.

Yes, that might require the page to be reparsed; so does fetching the page HTML via the API. Is that bad? I'd imagine the map data is usually fetched in some context where the page HTML also gets displayed, so if the page is not in the parser cache, it needs to be reparsed anyway.

Architecture: the parser for Graph/Kartographer generates JSON blobs and their hashes. The hash goes into the img URL. The browser fetches the image from the backend service, which uses the hash to get the original JSON blob from the MW API and renders it into an image. There are two levels of caching: the image cache (Varnish), and memcached used by the MW API (per hash).
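
As a rough illustration of that flow (the exact hash function and URL layout used in production may differ):

```php
// Parser side: derive a content hash from the JSON blob and embed it in the
// image URL. The URL shape below is illustrative, not the production one.
$json = FormatJson::encode( $geoJson );
$hash = sha1( $json );
$imgUrl = "https://graphoid.example.org/{$pageId}/{$hash}.png";

// Service side: the backend uses the hash from the URL to ask the MW API for
// the blob, then renders it. If the API no longer has that hash, rendering fails.
```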

Caching aspect: if the parser cache uses the same memcached instance, then it doesn't matter whether something is in the latest or a non-latest revision: the JSON blob gets cached for all revisions the same way, by its hash value. So a stable page revision, when parsed, would place its JSON blob into the same cache as the latest revision, keyed by the blob's hash, with the maximum expiration.
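
Sketched with MediaWiki's WAN object cache, with the cache key name made up for illustration:

```php
// Both the latest and the stable parse would funnel their blobs through the
// same hash-keyed cache entry, so whichever parse runs first fills it.
use MediaWiki\MediaWikiServices;

$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
$key = $cache->makeKey( 'kartographer-blob', $hash ); // hypothetical key

$cache->set( $key, $json, $cache::TTL_INDEFINITE );
// ...later, from the API module:
$json = $cache->get( $key ); // false on a cache miss
```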

The issue happens when you get a cache miss. The URL only contains the hash and the page ID. If the MW API fails to find the given hash in the cache, it needs to somehow get/regenerate the original blob. Currently it simply pulls it by hash from page_props. To do a full regeneration, you need two things: the page revision ID (instead of the page ID), and the graph index (there could be multiple graphs on a page, so you need to identify which graph you need). The hash is useless for graph identification because it may be different on every parse.
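
If the URL did carry a revision ID and a graph index, a cache-miss rebuild could look roughly like the sketch below; the core service calls exist, but the wiring and the 'kartographer-blobs' key are guesses:

```php
// Reparse the exact revision and pick the N-th blob out of the parser output.
use MediaWiki\MediaWikiServices;

$services = MediaWikiServices::getInstance();
$revRecord = $services->getRevisionLookup()->getRevisionById( $revId );
$parserOutput = $services->getRevisionRenderer()
	->getRenderedRevision( $revRecord )
	->getRevisionParserOutput();

$blobs = $parserOutput->getExtensionData( 'kartographer-blobs' ) ?: [];
$blob = array_values( $blobs )[ $graphIndex ] ?? null; // null if the index shifted
```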

So if you include the revision ID and graph index in the URL, the MW API may be able to regenerate the needed JSON blob in most cases. The biggest drawback is that your Varnish cache will get a lower hit rate, because the image URL is now per revision, not per page. This is a big deal for a page with a few hundred graphs (I have seen those, e.g. a table with small pie charts). Every update will require a full regeneration of all images. Overall, this might actually not be that bad, if the majority of graph-using pages are not changed often.

Note that even blob regeneration still doesn't fully solve it. Let's say your page has a single graph, and uses some template before the graph. Now let's say someone modifies the template and adds a graph to it, so there are now two graphs on the page. The next time a user views the page, the HTML might still be in the Varnish cache, with only one graph, but the graph image itself might be gone from the cache. The service would request blob #1, and reparsing would return the wrong blob, because it would now count the graph inside the template.

> I still don't understand how this is specific to FlaggedRevs. Right now, ApiQueryMapData uses the page title to look up the Kartographer page property for that page; if it used the canonical parser cache entry instead, it would receive the property based on the correct revision (stable or latest, depending on user and page config and whatnot). In all other aspects it seems identical to me.

That would probably alleviate the issue a bit, but you'd still just be moving the problem around. The core issue is that it's hard for ApiQueryMapData to give you a GeoJSON blob for a revision that isn't the canonical one. If you change the definition of "canonical" to be the stable revision instead of the latest one, you've fixed the rendering of maps on the stable version while breaking it on the latest version. Viewing maps on historical versions of pages would also remain broken.

Because our stack assumes that you're always viewing the latest version even when you're not, and because it uses a hash of the GeoJSON as the key, maps do work (more or less by accident) on the stable version and other old versions if the GeoJSON is the same as on the latest version. But edits to the GeoJSON break rendering of old revisions, and that's what we're trying to fix here.

> Yes, that might require the page to be reparsed; so does fetching the page HTML via the API. Is that bad? I'd imagine the map data is usually fetched in some context where the page HTML also gets displayed, so if the page is not in the parser cache, it needs to be reparsed anyway.

Old revisions are not in the parser cache (AFAIK), so they'd have to be reparsed every time.

> Old revisions are not in the parser cache (AFAIK), so they'd have to be reparsed every time.

The older hash -> JSON blob mapping stays in memcached for a long time.

Removing task assignee due to inactivity, as this open task has been assigned to the same person for more than two years (see the emails sent to the task assignee on Oct 27 and Nov 23). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.
(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.)

thiemowmde renamed this task from Move GeoJSON for <mapframe> tags to its own table instead of page_props (in order to fix Flagged Revs with mapframe) to Move GeoJSON for <mapframe> tags to its own table (in order to fix Flagged Revs with mapframe). Mar 23 2021, 10:28 AM
thiemowmde updated the task description. (Show Details)
thiemowmde subscribed.

Note: Since https://gerrit.wikimedia.org/r/531159 (merged 2019-08-21), page_props isn't used any more; the parser cache is used instead. I updated this task's description a bit to reflect this.