Graph/Graphoid needs to solve the "data" problem, which appears to be identical to the one faced by the <map> tag (Kartographer). Since <graph> and <map> have similar data storage needs, I will only discuss <graph>.
Intro
The <graph> tag contains a JSON graph spec that describes how the graph should be drawn:
<graph>{ "data": ..., "marks": ..., ... }</graph>
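For illustration only, a trivial spec might look like the following (the field names follow the Vega grammar; real specs are larger and may pull in external data):

```
<graph>{
  "width": 200,
  "height": 200,
  "data": [ { "name": "table", "values": [ {"x": 1, "y": 3}, {"x": 2, "y": 5} ] } ],
  "marks": [ { "type": "rect", "from": { "data": "table" } } ]
}</graph>
```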
During page parsing, the spec needs to be extracted and stored so that it can later be used by the Graphoid service or by the browser.
Requirements
- the parser needs to store the graph definition (spec) in some storage
- the Graphoid service needs the spec to generate the image
- the browser needs to get the graph image generated by the Graphoid service
- the browser needs to get the JSON spec when the user wants to interact with the graph
- BonusA: on save preview (not VE), it would be good to show the graph from Graphoid instead of drawing it client-side, but allow a "click-to-interact" mode.
- BonusB: When user looks at an older revision of an article, they should see the graphs for that revision.
- BonusC: A special case of BonusB - if the graph spec uses external data that changes, or if Vega library changes the way it works, users should still see the original graph when looking at the older revisions.
Current limitations
The current approach of storing the spec in page_props SQL table has these issues:
- 64KB limit per article - breaks for bigger/multiple graphs
- for a new graph, it takes time for the spec to replicate to the slave DBs, which sometimes prevents Graphoid from getting the spec via the API, causing an error (which may get cached)
- the URL with graph ID (hash) becomes invalid if the spec changes
- viewing older revisions does not show graphs unless they are identical to those in the latest revision
- getting the spec via api.php is not cached by Varnish (? TBD)
Special case - Visual Editor plugin
Since the VE plugin can only support simple non-templated graphs, it makes sense to always render the graph on the client, without using Graphoid. On edit, if the user opens the graph editor, VE should load the graph libraries and render the content of the <graph> tag.
Solution #1 - SQL storage
Introduce new graph-specific SQL tables:
table graph_specs - contains just the graph specs, indexed by their hashes
    fields: hash (string key), last_used (timestamp), spec (blob)
table graph_revs - lists which graphs are used by which page revisions
    fields: id (autonum key), page_rev_id (int), hash (string)
- When creating a new article revision, or updating an existing one (null edit), the graph extension will ensure that each graph exists in the graph_specs table, updating the last_used timestamp on each. It will also ensure that the graph_revs table contains only the right graph hashes. The graph extension will never delete rows from graph_specs, thus preventing any dead links from the browser cache. This assumes that at that point the extension knows the revid of the new/existing revision.
- A cleanup batch job will delete any graph_specs rows that are older than 1 month (?) and not referenced by the graph_revs table.
- The graph extension will implement an API to get a graph spec by hash: api.php?action=getgraph&hash=XXX
- The browser will get an image via the Graphoid/<hash>.png URL (3). Graphoid will access the graph spec via the API (2), and so can the browser (4). For page-preview BonusA, the graph extension will add/update a row in the graph_specs table, but will not touch graph_revs.
- This approach solves BonusB unless we choose to clean up older page revision rows in graph_revs.
- BonusC is not solved, because we do not store images but regenerate them when needed.
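The bookkeeping described above can be sketched roughly as follows. This is an illustration only, using SQLite and hypothetical function names (the real tables would live in the wiki's MySQL database, and the hashing scheme is an assumption):

```python
import hashlib
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE graph_specs (
    hash      TEXT PRIMARY KEY,
    last_used INTEGER NOT NULL,   -- unix timestamp
    spec      BLOB NOT NULL
);
CREATE TABLE graph_revs (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    page_rev_id INTEGER NOT NULL,
    hash        TEXT NOT NULL
);
""")

def spec_hash(spec: dict) -> str:
    # Canonicalize so key order does not change the hash (assumption).
    blob = json.dumps(spec, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha1(blob).hexdigest()

def save_revision(rev_id: int, specs: list, now: int) -> None:
    # On save/null edit: ensure each graph exists in graph_specs and
    # refresh last_used; make graph_revs hold only this revision's hashes.
    # Nothing is ever deleted from graph_specs here - only by cleanup().
    db.execute("DELETE FROM graph_revs WHERE page_rev_id = ?", (rev_id,))
    for spec in specs:
        h = spec_hash(spec)
        db.execute(
            "INSERT INTO graph_specs(hash, last_used, spec) VALUES (?, ?, ?) "
            "ON CONFLICT(hash) DO UPDATE SET last_used = excluded.last_used",
            (h, now, json.dumps(spec, sort_keys=True).encode()))
        db.execute(
            "INSERT INTO graph_revs(page_rev_id, hash) VALUES (?, ?)",
            (rev_id, h))
    db.commit()

def cleanup(now: int, max_age: int = 30 * 24 * 3600) -> int:
    # Batch job: drop specs older than ~1 month (?) that no revision references.
    cur = db.execute(
        "DELETE FROM graph_specs WHERE last_used < ? "
        "AND hash NOT IN (SELECT hash FROM graph_revs)",
        (now - max_age,))
    db.commit()
    return cur.rowcount
```

Because the hash is derived from a canonicalized spec, two revisions containing the same graph share a single graph_specs row, which is what keeps the image URL stable across edits that do not touch the graph.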
PROs: data is part of the wiki's MySQL database, under proper DB management (monitoring, backups, etc.); per-user access control checks
CONs: the slave replication bug is still there; api.php caching issues; no BonusC
Solution #2 - RESTBase POST storage
In the near future, RESTBase plans to provide an alternative Cassandra-based storage with TTL support, operating via HTTP (T101093).
- When saving/null editing, the graph extension will POST the graph spec to RESTBase, which will store it in Cassandra and return the hash ID as a header. The hash will be used to construct the <img src="...Graphoid/(hash).png" /> tag (or RESTBase can return the full URL). Additionally, RESTBase will POST the spec to the Graphoid service (2) and store the resulting image in Cassandra.
- Graphoid service will only support POST requests - posting spec will return an image.
- When the browser requests an image via GET (3), RESTBase will simply return the stored image, without calling Graphoid. When the browser requests the graph spec via GET (4), RESTBase will return it from Cassandra as well. For page-preview BonusA, the graph extension would POST the spec to RESTBase with an additional "temp" parameter/header; RESTBase will check whether the spec already exists in Cassandra and, if not, store it with a 1 month (?) TTL. On browser request, the image will not be cached.
- For BonusB, older images are available from RESTBase, or can be regenerated on the fly because the spec is also stored.
- BonusC is only partially solved: the hash is generated from the spec, so if the spec stays the same but the external data or the Vega library changes, the newer image will overwrite the older one.
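The store-and-serve behavior described above can be sketched with an in-memory stand-in for the Cassandra-backed store. All names here are hypothetical, and the render callback stands in for the POST to Graphoid:

```python
import hashlib
import json

class MockRestbaseStore:
    """Illustrative in-memory stand-in for the Cassandra-backed store."""

    TEMP_TTL = 30 * 24 * 3600  # ~1 month (?), the TTL suggested for previews

    def __init__(self, render_fn):
        self.render_fn = render_fn  # stand-in for the POST to Graphoid
        self.specs = {}             # hash -> (spec, expires_at or None)
        self.images = {}            # hash -> PNG bytes

    def post_spec(self, spec: dict, temp: bool = False, now: float = 0.0) -> str:
        """POST handler: store the spec (and rendered image), return its hash."""
        blob = json.dumps(spec, sort_keys=True, separators=(",", ":")).encode()
        h = hashlib.sha1(blob).hexdigest()
        if h not in self.specs:
            expires = now + self.TEMP_TTL if temp else None
            self.specs[h] = (spec, expires)
            # Render eagerly so GETs never have to call Graphoid.
            self.images[h] = self.render_fn(spec)
        return h  # the caller builds .../Graphoid/<hash>.png from this

    def get_image(self, h: str) -> bytes:
        # GET (3): serve the stored image without calling Graphoid.
        return self.images[h]

    def get_spec(self, h: str) -> dict:
        # GET (4): serve the stored spec for client-side interaction.
        return self.specs[h][0]
```

Note how the BonusC gap shows up directly in this sketch: the image is keyed only by the spec hash, so re-rendering an unchanged spec against changed external data would replace the single stored image.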
How to solve BonusC?
BonusC can only be solved if we store each generated image forever (except those generated for the "preview" mode). At the same time, we do not want to store multiple identical images. But we won't know that the image is identical until after we generate it.
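One way out of the "we don't know it's identical until we generate it" problem is to key stored images by the hash of the rendered bytes rather than the spec hash. The sketch below is a hypothetical illustration, not a committed design:

```python
import hashlib

class ImageArchive:
    """Content-addressed image store: identical renders share one entry."""

    def __init__(self):
        self.images = {}     # image_hash -> PNG bytes (kept forever)
        self.rev_index = {}  # (page_rev_id, spec_hash) -> image_hash

    def archive(self, page_rev_id: int, spec_hash: str, png: bytes) -> str:
        # Hash the *generated* bytes, so dedup happens after rendering.
        img_hash = hashlib.sha1(png).hexdigest()
        self.images.setdefault(img_hash, png)  # stored at most once
        self.rev_index[(page_rev_id, spec_hash)] = img_hash
        return img_hash  # URL key for this revision's image
```

With this scheme, re-rendering the same spec against changed external data or a changed Vega library yields a new image hash, so older revisions keep pointing at the old bytes, while two revisions whose renders are byte-identical share a single stored image and a single URL.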
- A user should be able to view an old page revision with all additional resources (e.g. graph) - as of the moment when the newer page revision was added
- For the "HEAD" revision of the page, and ONLY for it, the page should be in a flux state - changes to any templates, or external resources such as images, should be reflected in the HEAD revision automatically.
- We want to optimize both storage and cache, so that if two revisions contain a reference to an identical resource, that resource is only stored once, and has the same URL to improve browser and server-side caching.
So it seems we need two mechanisms: one to keep the HEAD revision up to date (null edits), and another for archiving - preserving the current HEAD state, plus all referenced resources/images/graphs/etc., before adding a new revision. It is OK if an old revision contains a rendering error - that is an accurate representation of the past. TBD: a mechanism to manually delete old images for legal reasons.
Assuming we use Cassandra (Solution #2) for image storage, when the graph extension POSTs the spec to RESTBase, RESTBase could first generate the image and only then return the hash. This way the HTML will contain a permanent URL to the image. This works, but with significant performance degradation - parsing will pause until image generation is done. Another problem is null edits: each null edit could generate a different image for the same article revision, and all of them would be stored forever. To delete them, we would have to track which image belongs to which revision.