
Graph/Graphoid/Kartographer - data storage architecture
Status: Stalled · Priority: Low · Visibility: Public

Description

Graph/Graphoid needs to solve the "data" problem, which appears to be identical for the <map> tag (Kartographer). Both <graph> and <map> have similar data storage needs, so I will only talk about <graph>.

Intro

The <graph> tag contains a JSON graph spec that describes how the graph should be drawn:

<graph>{ "data": ..., "marks": ..., ... }</graph>

During the page parse, the spec needs to be extracted and stored so that it can later be used by the Graphoid service or the browser.
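For illustration, the "spec to stable ID" step that both solutions below rely on might look like this (a sketch; the hashing scheme and the function name are assumptions, not the extension's actual code):

```python
import hashlib
import json

def spec_hash(spec: dict) -> str:
    """Hash a graph spec to a stable ID (hypothetical scheme)."""
    # Canonicalise so that key order does not change the hash.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

spec = {"data": [], "marks": []}
h = spec_hash(spec)
# The browser would then request e.g. .../Graphoid/<hash>.png
```

The key property is that the same spec always maps to the same URL, so browser and server-side caches can be shared across pages and revisions.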

Requirements

  1. parser needs to store the graph definition (spec) in some storage
  2. Graphoid service needs the spec to generate the image
  3. browser needs to get the graph image generated by the Graphoid service
  4. browser needs to get the JSON spec when the user wants to interact with the graph
  5. BonusA: on save preview (not VE), it would be good to show the graph from Graphoid instead of drawing it client-side, but allow a "click-to-interact" mode.
  6. BonusB: When a user looks at an older revision of an article, they should see the graphs for that revision.
  7. BonusC: A special case of BonusB - if the graph spec uses external data that changes, or if the Vega library changes the way it works, users should still see the original graph when looking at older revisions.

Current limitations

The current approach of storing the spec in page_props SQL table has these issues:

  • 64KB limit per article - breaks for bigger/multiple graphs
  • for a new graph, it may take time for the change to replicate to slave DBs - this sometimes prevents Graphoid from getting the spec from the API, causing an error (which may get cached)
  • the URL with the graph ID (hash) becomes invalid if the spec changes
  • viewing older revisions does not show graphs unless they are identical to those in the latest revision
  • getting the spec via api.php is not cached by Varnish (? TBD)

Special case - Visual Editor plugin

Since the VE plugin can only support simple, non-templated graphs, it would make sense to always render the graph on the client, without using Graphoid. On edit, if the user opens the graph editor, VE should load the graph libraries and render the content of the <graph> tag.

Solution #1 - SQL storage

Introduce new graph-specific SQL tables:

table graph_specs - contains just the graph specs, indexed by their hashes
    fields: hash (string key), last_used(timestamp), spec (blob)

table graph_revs - lists which graphs are used by which page revisions
    fields: id (autonum key), page_rev_id (int), hash (string)
  • When creating a new article revision, or updating an existing one (null edit), the graph ext will ensure that each graph exists in the graph_specs table and update the last_used timestamp on each. The graph ext will also ensure that the graph_revs table contains only the right graph hashes. The graph extension will never delete any rows from graph_specs, thus preventing any dead links from browser caches. This assumes that at that point the graph ext will know the revid of the new/existing revision.
  • A cleanup batch job will delete any graph_specs rows that are older than 1 month (?) and not referenced by the graph_revs table.
  • Graph ext will implement an API to get a graph spec by hash: api.php?action=getgraph&hash=XXX
  • Browser will get an image via a Graphoid/<hash>.png URL (3). Graphoid will access the graph spec via the API (2), and so can the browser (4). For page preview (BonusA), the graph ext will add/update a row in the graph_specs table, but will not touch graph_revs.
  • This approach solves BonusB unless we choose to clean up older page revision rows in graph_revs.
  • BonusC is not solved because we do not store images, but regenerate them when needed.
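A minimal sketch of the proposed schema and the save/cleanup logic, using sqlite3 as a stand-in for the wiki's MySQL (table and column names follow the description above; everything else is assumed):

```python
import sqlite3
import time

# In-memory stand-in for the wiki database.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE graph_specs (
    hash      TEXT PRIMARY KEY,
    last_used INTEGER NOT NULL,
    spec      BLOB NOT NULL
);
CREATE TABLE graph_revs (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    page_rev_id INTEGER NOT NULL,
    hash        TEXT NOT NULL
);
""")

def on_save(page_rev_id, graphs):
    """graphs: dict of hash -> spec blob used by the new revision."""
    now = int(time.time())
    for h, spec in graphs.items():
        # Ensure the spec exists and bump last_used; never delete here.
        db.execute(
            "INSERT INTO graph_specs(hash, last_used, spec) VALUES (?, ?, ?) "
            "ON CONFLICT(hash) DO UPDATE SET last_used = excluded.last_used",
            (h, now, spec))
        db.execute(
            "INSERT INTO graph_revs(page_rev_id, hash) VALUES (?, ?)",
            (page_rev_id, h))
    db.commit()

def cleanup(max_age_s=30 * 24 * 3600):
    """Batch job: drop specs that are old AND unreferenced."""
    cutoff = int(time.time()) - max_age_s
    db.execute(
        "DELETE FROM graph_specs WHERE last_used < ? "
        "AND hash NOT IN (SELECT hash FROM graph_revs)", (cutoff,))
    db.commit()
```

Note that deletion happens only in the batch job, and only for specs that are both stale and unreferenced, matching the "never delete on save" rule above.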

PROs: data is part of the wiki MySQL, under proper DB management (monitoring, backups, etc.); per-user access control checks
CONs: the slave replication bug is still there; api.php caching issues; no BonusC

Solution #2 - RestBASE POST storage

In the near future, RestBASE plans to provide an alternative Cassandra-based storage with TTL support, operating via HTTP (T101093).

  • When saving/null editing, the graph ext will POST the graph spec to RestBASE, which will store it in Cassandra and return the hash ID as a header. The hash will be used to construct the <img src="...Graphoid/(hash).png" /> tag (or RB can return the full URL). Additionally, RB will POST the spec to the Graphoid service (2), and store the resulting image in Cassandra.
  • Graphoid service will only support POST requests - posting spec will return an image.
  • When the browser requests an image via GET (3), RestBASE will simply return the stored image, without calling Graphoid. When the browser requests the graph spec via GET (4), RestBASE will return it from Cassandra as well. For page preview (BonusA), the graph ext would POST the spec to RB with an additional "temp" parameter/header. RB will check whether this spec already exists in Cassandra and, if not, store it with a 1 month (?) TTL. On browser request, the image will not be cached.
  • For BonusB, older images are available from RB, or can be regenerated on the fly because the spec is also stored.
  • BonusC is solved only partially: the hash is generated from the spec, so if the spec stays the same but the external data or the Vega library changes, the newer image will overwrite the older one.
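The POST-and-fetch flow could be modeled like this (an in-memory stand-in; the class, method names, "temp" handling, and the TTL value are assumptions for illustration, not RestBASE's actual API):

```python
import hashlib
import time

class SpecStore:
    """In-memory stand-in for the proposed RestBASE/Cassandra store."""

    def __init__(self):
        self._specs = {}  # hash -> (spec bytes, expires_at or None)

    def post_spec(self, spec: bytes, temp: bool = False) -> str:
        h = hashlib.sha1(spec).hexdigest()
        if h not in self._specs:
            # "temp" previews get a ~1 month TTL; saved specs are permanent.
            expires = time.time() + 30 * 24 * 3600 if temp else None
            self._specs[h] = (spec, expires)
        return h  # returned to the extension, e.g. as a header

    def get_spec(self, h: str):
        entry = self._specs.get(h)
        if entry is None:
            return None
        spec, expires = entry
        if expires is not None and expires < time.time():
            del self._specs[h]  # lazily expire, as Cassandra TTL would
            return None
        return spec
```

Because the hash is derived from the spec alone, re-posting an unchanged spec is a no-op, which is exactly why BonusC is only partially solved.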

How to solve BonusC?

BonusC can only be solved if we store each generated image forever (except those generated for the "preview" mode). At the same time, we do not want to store multiple identical images. But we won't know that the image is identical until after we generate it.

  • A user should be able to view an old page revision with all additional resources (e.g. graph) - as of the moment when the newer page revision was added
  • For the "HEAD" revision of the page, and ONLY for it, the page should be in a flux state - changes to any templates, or external resources such as images, should be reflected in the HEAD revision automatically.
  • We want to optimize both storage and cache, so that if two revisions contain a reference to an identical resource, that resource is only stored once, and has the same URL to improve browser and server-side caching.
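The dedup requirement in the last bullet suggests content-addressed storage: hash the generated image after rendering, store the blob once, and map (revision, spec) pairs onto it. A sketch (all names hypothetical):

```python
import hashlib

class ImageArchive:
    """Content-addressed image storage sketch for BonusC:
    identical images are stored once and share one URL."""

    def __init__(self):
        self._blobs = {}   # image_hash -> png bytes
        self._by_rev = {}  # (page_rev_id, spec_hash) -> image_hash

    def archive(self, page_rev_id: int, spec_hash: str, png: bytes) -> str:
        # We only learn the image hash after rendering, so dedup happens here.
        image_hash = hashlib.sha1(png).hexdigest()
        self._blobs.setdefault(image_hash, png)
        self._by_rev[(page_rev_id, spec_hash)] = image_hash
        return image_hash  # stable URL component for caching

    def image_for(self, page_rev_id: int, spec_hash: str):
        image_hash = self._by_rev.get((page_rev_id, spec_hash))
        return self._blobs.get(image_hash)
```

Two revisions whose graphs render identically end up pointing at the same blob and the same URL, which is the storage/cache optimization described above.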

So it seems we need two mechanisms: one to keep the HEAD revision up to date (null edits), and another for archiving - preserving the state of the current HEAD, plus all referenced resources/images/graphs/etc., before adding a new revision. It is OK if there was a rendering error in an old revision, because that is an accurate representation of the past. TBD: a mechanism to manually delete old images for legal reasons.

Assuming we use Cassandra (solution #2) for image storage, when the graph ext POSTs the spec to RB, RB could first generate the image and only then return the hash. This way the HTML will contain a permanent URL to the image. This works, but at a significant performance cost: parsing will pause until image generation is done. Another problem is null edits - each null edit could generate a different image for the same article revision, and all of them would be stored forever. To delete them, we would have to track which image belongs to which revision.


Event Timeline

Yurik added a comment (edited). Jan 8 2016, 7:33 PM

A much simpler proposal that will solve most issues with very little effort:

  • On parse (normal & preview), store hash->data in Memcached with indefinite TTL
  • On parse (normal only), store title->timestamp in Memcached (this is only needed to prevent DDoS)

When a client tries to get data via API by title + hash:

  • check Memcached for hash
  • if not exists, check Memcached for title, and fail if title exists and its timestamp is recent
  • if the title does not exist or has an old timestamp, perform a full title re-parse
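The lookup flow above can be sketched like this (a dict stands in for Memcached; key names and the throttle window are illustrative):

```python
import time

# Dict-backed stand-in for Memcached.
cache = {}  # "hash:<h>" -> spec, "title:<t>" -> last parse timestamp

def get_spec(title, h, reparse, throttle_s=60):
    """Fetch a spec by title+hash, re-parsing at most once per throttle_s."""
    spec = cache.get("hash:" + h)
    if spec is not None:
        return spec
    last = cache.get("title:" + title)
    if last is not None and time.time() - last < throttle_s:
        # Title was parsed recently and still has no such hash: fail fast.
        raise LookupError("spec missing and title was parsed recently")
    # Full re-parse of the title; expected to repopulate hash entries.
    reparse(title)
    cache["title:" + title] = time.time()
    return cache.get("hash:" + h)  # may still be None if the hash changed
```

The title timestamp is what throttles repeated re-parses, which is the anti-DDoS role described above; the final comment also shows the open question raised in the reply below: after a re-parse the requested hash may simply no longer exist.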

I'm not sure what this is trying to achieve. Is it some sort of PoolCounter substitute?

  • if the title does not exist or has an old timestamp, perform a full title re-parse

When you reparse you often won't see the same hash again, so what graph will you deliver to the user?

Change 263160 had a related patch set uploaded (by Yurik):
Cache JSON objects in memcached

https://gerrit.wikimedia.org/r/263160

Change 263160 merged by jenkins-bot:
Cache JSON objects in memcached

https://gerrit.wikimedia.org/r/263160

Yurik moved this task from Unsorted to Tracking on the Maps (Kartographer) board. Jan 30 2016, 1:47 PM
Yurik added a project: Maps. Feb 2 2016, 5:59 PM
Restricted Application added a project: Discovery. Feb 2 2016, 5:59 PM
Yurik moved this task from All map-related tasks to Kartographer on the Maps board. Feb 2 2016, 5:59 PM
Deskana moved this task from Needs triage to Maps on the Discovery board. Feb 3 2016, 6:14 PM
RobLa-WMF mentioned this in Unknown Object (Event). May 4 2016, 7:33 PM
Yurik assigned this task to MaxSem. May 6 2016, 12:52 PM
Yurik triaged this task as High priority.
Yurik moved this task from Tracking to General on the Maps (Kartographer) board. May 8 2016, 12:54 PM
Yurik moved this task from Backlog to In progress on the Maps-Sprint board. May 27 2016, 10:24 PM
Yurik moved this task from In progress to To-do on the Maps-Sprint board. Jun 7 2016, 10:32 PM
Pchelolo moved this task from Backlog to watching on the Services board. Oct 12 2016, 11:27 PM
Pchelolo edited projects, added Services (watching); removed Services.
Yurik removed a project: Maps. Dec 15 2016, 4:40 AM
MaxSem removed MaxSem as the assignee of this task. Jun 20 2017, 10:53 PM
MaxSem added a subscriber: MaxSem.
brion added a subscriber: brion. Jan 9 2018, 8:22 PM

Does this still need to be open? Currently Graphs and Maps development is on hold, and we may rethink some of how this is done later, but it's not on the immediate agenda.

akosiaris changed the task status from Open to Stalled. Jan 16 2019, 6:11 PM
akosiaris lowered the priority of this task from High to Low.
akosiaris added a subscriber: akosiaris.

Does this still need to be open? Currently Graphs and Maps development is on hold, and we may rethink some of how this is done later, but it's not on the immediate agenda.

Three years since the last comment/patch, I am guessing this is at the very least Stalled and Low priority. I'll set it as such; feel free to revert.

In the meantime there have been changes in the storage layer with Multi-Content Revisions support, and per the RFC[1] the Graph extension SHOULD (is this correct?) be able to store its data in an MCR slot, hopefully rendering some parts of this discussion moot?

Note that this has been brought up in T211881

[1] https://www.mediawiki.org/wiki/Requests_for_comment/Multi-Content_Revisions

In the meantime there have been changes in the storage layer with Multi-Content Revisions support, and per the RFC[1] the Graph extension SHOULD (is this correct?) be able to store its data in an MCR slot, hopefully rendering some parts of this discussion moot?

The Graph extension could potentially use an MCR slot to store the Vega JSON rather than embedding it in the wikitext inside a <graph> tag. But that wouldn't support the existing uses where templates and modules are being used to generate the Vega JSON.

At one point during the planning/development of MCR there was an idea of "derived content" and "virtual slots" for caching things like this, but as far as I know that's currently not planned for implementation.

Yurik added a comment. Jan 16 2019, 6:54 PM

Most of the time, Vega is used via a template, because otherwise you have massive copy/paste of code without any benefit, and no way to fix issues or improve the appearance of all graphs en masse. Thus, per what @Anomie said, MCR is orthogonal (in its current form) to the generated content. This actually has more similarities with the image thumb service than with MCR (content is generated from the "master" wiki markup, and cached for use both by a rendering service like Graphoid and directly by the client via dynamic graph loading).

The Graph extension could potentially use an MCR slot to store the Vega JSON rather than embedding it in the wikitext inside a <graph> tag. But that wouldn't support the existing uses where templates and modules are being used to generate the Vega JSON.

True. But it could replace the problematic use of a page_property for the graph, at least fixing the issue in T184128 in a better way (I am assuming here that MCR can hold more than 64 KB; please correct me if I am wrong). That would allow having graphs per revision and potentially regenerating PNGs on demand for a revision (actually that's not entirely true, because of Vega library changes and possible/probable compatibility issues between Vega library versions).

At one point during the planning/development of MCR there was an idea of "derived content" and "virtual slots" for caching things like this, but as far as I know that's currently not planned for implementation.

Good to know. Thanks for pointing it out.

Most of the time, Vega is used via a template, because otherwise you have massive copy/paste of code without any benefit, and no way to fix issues or improve the appearance of all graphs en masse. Thus, per what @Anomie said, MCR is orthogonal (in its current form) to the generated content. This actually has more similarities with the image thumb service than with MCR (content is generated from the "master" wiki markup, and cached for use both by a rendering service like Graphoid and directly by the client via dynamic graph loading).

This does, however, contradict requirement 6 ("BonusB: When a user looks at an older revision of an article, they should see the graphs for that revision") given above. Just noting it, effectively reiterating what I think Tim has phrased better in his comment at T119043#1868557.

Yurik added a comment. Jan 17 2019, 5:54 PM

Most of the time, Vega is used via a template, because otherwise you have massive copy/paste of code without any benefit, and no way to fix issues or improve the appearance of all graphs en masse. Thus, per what @Anomie said, MCR is orthogonal (in its current form) to the generated content. This actually has more similarities with the image thumb service than with MCR (content is generated from the "master" wiki markup, and cached for use both by a rendering service like Graphoid and directly by the client via dynamic graph loading).

This does, however, contradict requirement 6 ("BonusB: When a user looks at an older revision of an article, they should see the graphs for that revision") given above. Just noting it, effectively reiterating what I think Tim has phrased better in his comment at T119043#1868557.

@akosiaris why is it a contradiction? BonusB is similar to being able to see an older version of an article with every type of dependent resource, not just graphs: older images, templates, Lua modules, and even data tables from Commons. On the other hand, the current (master) version should auto-refresh when dependencies are updated.

I think we should have a "snapshot" capability - whenever a page is edited, the previous page version in HTML should be preserved, and shown whenever one looks at the history. This removes the requirement of using older templates/modules when rendering history. Links to multimedia (images/movies) should be replaced with immutable URLs. Links to blob data (maps/graphs) would also be switched to non-mutable versions - producing the same graph/map if needed. BTW, I think Gabriel was trying to do something like that with his API service.

Does this still need to be open? Currently Graphs and Maps development is on hold, and we may rethink some of how this is done later, but it's not on the immediate agenda.

Three years since the last comment/patch, I am guessing this is at the very least Stalled and Low priority. I'll set it as such; feel free to revert.

FWIW this issue has recently come up in T210548: gzip-encoded page properties can't be exported from the API.

I think we should have a "snapshot" capability - whenever a page is edited, the previous page version in HTML should be preserved, and shown whenever one looks at the history.

That would be nice to have, but it should be done for everything, or nothing. Having some things show the current version of included data, and other things show the old version, is very confusing.

Yurik added a comment. Jan 18 2019, 3:10 PM

I think we should have a "snapshot" capability - whenever a page is edited, the previous page version in HTML should be preserved, and shown whenever one looks at the history.

That would be nice to have, but it should be done for everything, or nothing. Having some things show the current version of included data, and other things show the old version, is very confusing.

I think the first step is to save HTML (preserve template/module parsing results). Next step - when snapshotting, switch to image permalinks. Lastly, implement "computed blobs storage" as this ticket describes, and also use permalinks when snapshotting.
These things don't need to happen at the same time. Even making it possible to view the proper text of an older version of an article is a good first step.

I think the first step is to save HTML (preserve template/module parsing results). Next step - when snapshotting, switch to image permalinks. Lastly, implement "computed blobs storage" as this ticket describes, and also use permalinks when snapshotting.

Step 0: Provision more storage for the relevant servers. Probably a lot more storage.

Yurik added a comment. Jan 18 2019, 6:43 PM

Step 0: Provision more storage for the relevant servers. Probably a lot more storage.

I agree, this would require some storage space, but I am not sure it will be that huge. The total number of edits is 3.8B (this includes redirects, and probably Wikidata). If we say the average per-page HTML is 10 KB, that's ~38 TB. Also, I'm sure half if not more of the revisions are reverts or otherwise identical HTML content, so if data is stored under hashes, multiple revisions can link to the same blob. Plus gzip. So my back-of-the-napkin calculation brings the total requirement to about 5-10 TB, or even lower if we exclude Wikidata (which can be re-rendered from the revision).
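The estimate above can be checked with quick arithmetic (the dedup and compression factors are assumptions for illustration, not measured values):

```python
revisions = 3.8e9             # total edits, per the comment above
avg_html_bytes = 10 * 1024    # assumed ~10 KB of HTML per revision

raw_tb = revisions * avg_html_bytes / 1e12  # roughly 39 TB uncompressed

# Content-addressed storage lets identical revisions (reverts) share
# one blob, and gzip shrinks HTML further; both factors are guesses.
dedup_factor = 0.5   # assume half the revisions are duplicates
gzip_ratio = 0.25    # assume ~4:1 compression of HTML

stored_tb = raw_tb * dedup_factor * gzip_ratio  # lands in the ~5 TB range
```

Under these assumptions the result falls into the 5-10 TB band quoted above, so the order of magnitude of the back-of-the-napkin estimate holds.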

Most of the time, Vega is used via a template, because otherwise you have massive copy/paste of code without any benefit, and no way to fix issues or improve the appearance of all graphs en masse. Thus, per what @Anomie said, MCR is orthogonal (in its current form) to the generated content. This actually has more similarities with the image thumb service than with MCR (content is generated from the "master" wiki markup, and cached for use both by a rendering service like Graphoid and directly by the client via dynamic graph loading).

This does, however, contradict requirement 6 ("BonusB: When a user looks at an older revision of an article, they should see the graphs for that revision") given above. Just noting it, effectively reiterating what I think Tim has phrased better in his comment at T119043#1868557.

@akosiaris why is it a contradiction? BonusB is similar to being able to see an older version of an article with every type of dependent resource, not just graphs: older images, templates, Lua modules, and even data tables from Commons. On the other hand, the current (master) version should auto-refresh when dependencies are updated.

Maybe I am reading that sentence wrong, but I think it leaves the impression that content is always generated from master; hence my comment.

Yurik added a comment. Jan 19 2019, 7:38 PM

@akosiaris at the moment, yes, graphs are shown if they are in the page_props DB - i.e. generated from the last page revision. If you try to view an older version of the page, you will only see the graph image if it hasn't changed since then; otherwise you will see a broken image (because the hash would be different, and that hash would not be stored in page props). On the other hand, you can view the older variant if you use page preview -- graphs will be rendered on the client, without using page props.

The desired behavior is to be able to show older versions of the graph - regardless of whether the graph came via a template or was part of the main page content, it should work transparently. The problem is that we don't have a system (to my knowledge) to use older template & module revisions when rendering an older page revision. Also, we would have to simulate the old time for all time-based functions. Ideally we should also properly check whether linked pages existed at the time, and show them as red links if not. And all this also means the graph data should be preserved as well - for eternity, for each revision, but handling duplicates.

Inspired by this comment, I submitted a patch to make Graph use the (existing) ParserCache instead of the page_props table, see T98940#5420144

Change 531159 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/extensions/Kartographer@master] Stop storing gzipped JSON blobs in page_props

https://gerrit.wikimedia.org/r/531159

Tagging CPT for code review.

Change 531159 merged by jenkins-bot:
[mediawiki/extensions/Kartographer@master] Stop storing gzipped JSON blobs in page_props

https://gerrit.wikimedia.org/r/531159

WDoranWMF added a subscriber: WDoranWMF.

Untagging CPT since CR was completed by another team.