Change Details

Graph/Graphoid need to solve the "data" problem.

== Requirements
1. The graph definition (spec) is generated by the graph extension during parse and stored.
2. The browser needs to get the image generated by Graphoid.
3. The Graphoid service needs the spec to generate the image.
4. The browser needs to get the spec when the user wants to interact with the graph.
5. BonusA: on save preview (not VE), it would be good to show the graph from Graphoid instead of drawing it client-side, but allow a "click-to-interact" mode.
6. BonusB: when a user looks at an older revision of an article, they should see the graphs for that revision.
7. BonusC: a special case of BonusB. If the graph spec uses external data that changes, or if the Vega library changes the way it works, users should still see the original graph when looking at older revisions.

== Current limitations
The current approach of storing the spec in the page_props SQL table has these issues:
* 64KB limit per article - breaks for bigger/multiple graphs
* the URL with the graph ID (hash) becomes invalid if the spec changes
* getting the spec via api.php is not cached by Varnish (? TBD)

== Special case - Visual Editor plugin
Since the VE plugin can only support simple non-templated graphs, it would make sense to always render the graph on the client, without using Graphoid. On edit, if the user opens the graph editor, VE should load the graph libraries and render the content of the <graph> tag.

== Solution #1 - SQL storage
Introduce new graph-specific SQL tables:
```
table graph_specs - contains just the graph specs, indexed by their hashes
  fields: hash (string key), last_used (timestamp), spec (blob)

table graph_revs - lists which graphs are used by which page revisions
  fields: id (autonum key), page_rev_id (int), hash (string)
```
* When creating a new article revision, or updating an existing one (null edit), the graph ext will ensure that each graph exists in the graph_specs table and update the last_used timestamp on each.
Graph ext will also ensure that the graph_revs table contains only the right graph hashes. Graph ext will never delete any rows from graph_specs, thus preventing dead links from the browser cache. This assumes that at that point graph ext knows the revid of the new/existing revision.
* A cleanup batch job will delete any graph_specs rows that are older than 1 month (?) and not referenced by the graph_revs table.
* Graph ext will implement an API to get a graph spec by hash: `api.php?action=getgraph&hash=XXX`
* The browser will get an image via the `Graphoid/<hash>.png` URL (2). Graphoid will access the graph spec via the API (3), and so can the browser (4). For page preview bonusA (5), the graph ext will add/update a row in the graph_specs table, but will not touch graph_revs.
* This approach solves bonusB (6) unless we choose to clean up older page revision rows in graph_revs.
* BonusC (7) is not solved because we do not store images, but regenerate them when needed.

PROs: data is part of the wiki MySQL, under proper DB management (monitoring, backups, etc.); per-user access control checks
CONs: api.php caching issues, unable to support BonusC (7)

== Solution #2 - RestBASE POST storage
In the near future RestBASE plans to provide an alternative Cassandra-based storage with TTL support, operating via HTTP.
* When saving/null editing, graph ext will POST the graph spec to RestBASE, which will store it in Cassandra and return the hash ID as a header. The hash will be used to construct the `<img src="...Graphoid/(hash).png" />` tag (or it can return the full URL). Additionally, RB will POST the spec to the Graphoid service (3), and store the resulting image in Cassandra.
* The Graphoid service will only support POST requests: posting a spec will return an image.
* When the browser requests an image via GET (2), RestBASE will simply return the stored image, without calling Graphoid. When the browser requests the graph spec via GET (4), RestBASE will return it from Cassandra as well.
For page preview bonusA (5), graph ext would POST the spec to RB with an additional "temp" parameter/header. RB will check if this spec already exists in Cassandra, and if not, store it with a 1 month (?) TTL. On browser request, the image will not be cached.
* For BonusB (6), older images are available from RB, or can be regenerated on the fly because the spec is also stored.
* BonusC (7) is solved only partially: the hash is generated from the spec, so if the spec stays the same but external data or the Vega lib changes, the newer image will override the older one.

== How to solve BonusC?
BonusC can only be solved if we store each generated image forever (except those generated for the "preview" mode). At the same time, we do not want to store multiple identical images, but we won't know that an image is identical until after we generate it.

Assuming we use Cassandra (solution #2) for image storage, when graph ext POSTs the spec to RB, RB could first generate the image, and only then return the hash. This way the HTML will contain a permanent URL to the image. This works, but at a significant performance cost: parsing will pause until image generation is done.

Another problem is null edits: each null edit could generate a different image for the same article revision, and all of them would be stored forever. To delete them, we would have to track which images belong to which revision.
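
The deduplication and revision-tracking concerns above can be illustrated with a minimal sketch. It assumes (this is not stated in the proposal) that the permanent ID is a hash of the rendered image bytes rather than of the spec. All names here are hypothetical: `ImageStore`, `record_render`, `prune_revision`, and the in-memory dicts standing in for Cassandra are illustrative only and are not part of Graph, Graphoid, or RestBASE.

```python
import hashlib
from collections import defaultdict


class ImageStore:
    """Hypothetical in-memory stand-in for image storage (not a RestBASE API)."""

    def __init__(self):
        self.images = {}                    # image hash -> image bytes
        self.rev_images = defaultdict(set)  # page revision id -> image hashes

    def record_render(self, rev_id, image_bytes):
        """Store a rendered image once and link it to the revision."""
        image_hash = hashlib.sha256(image_bytes).hexdigest()
        # Identical renders map to the same key, so they are stored only once.
        self.images.setdefault(image_hash, image_bytes)
        self.rev_images[rev_id].add(image_hash)
        return image_hash

    def prune_revision(self, rev_id, keep_hash):
        """Unlink every image of a revision except keep_hash; drop orphans."""
        dropped = self.rev_images[rev_id] - {keep_hash}
        self.rev_images[rev_id] -= dropped
        # An image is deleted only if no other revision still references it.
        still_used = set().union(*self.rev_images.values())
        for image_hash in dropped - still_used:
            del self.images[image_hash]
```

In this sketch, two null edits that produce byte-identical images yield the same hash and a single stored copy, while a null edit that renders differently adds a second entry under the same revision. The revision-to-hash map is the tracking that later cleanup would need. The performance cost is unchanged: the hash is only known after the image has been rendered.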