
Allow Cassandra-based blob caching with expiration
Closed, Resolved · Public

Description

In order for Graphoid to be able to store and hash configurations, we need temporary storage (1-2 months) for the JSON blobs. Gabriel mentioned that Cassandra is capable of this task, and we would need to expose it to Node.js services.

Event Timeline

Yurik raised the priority of this task to Needs Triage.
Yurik updated the task description. (Show Details)
Yurik added projects: Services, Graphoid.
Yurik added subscribers: Yurik, mobrovac, GWicke.

IIRC, the main issue here is the fact that Graph IDs are not stable - when the JSON changes, the ID changes with it. Hence, there's no way for RESTBase to know which ID succeeds which one. If there were such a mechanism, we could use existing revision-retention policies (CCing @Eevans as he could possibly give a better answer here). This part of the request could be accommodated by creating a new retention policy, though.

Another issue (maybe more pertinent to this concrete task) is about PUTting or POSTing stuff to Cassandra. Up until now, RESTBase's workflow has been (a sketch follows the list):

  • client requests item X
  • if X is found in the DB, return it
  • otherwise place a call to a (WMF-internal) service
  • store the response
  • return it to the client
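A minimal sketch of that read-through workflow; the db client and fetchFromBackendService helper are hypothetical stand-ins, not actual RESTBase APIs:

'use strict';

// Read-through cache: return the stored item if present, otherwise fetch
// from the backend service, store the response, and return it.
async function getItem(db, key, fetchFromBackendService) {
    const cached = await db.get(key);                 // client requests item X; look it up
    if (cached !== undefined) {
        return cached;                                // X found in the DB: return it
    }
    const fresh = await fetchFromBackendService(key); // otherwise call the WMF-internal service
    await db.put(key, fresh);                         // store the response
    return fresh;                                     // return it to the client
}

module.exports = { getItem };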

I suspect that this is not what you want, @Yurik. Rather, you'd need Graphoid to store the JSON directly, keyed on the ID (and perhaps revision?). This might work, given that we take some precautions, namely restricting access to internal IPs. I do not find this ideal, though, and think we should come up with a general, secure approach to publicly-available PUT and friends before proceeding.

Thoughts are welcome.

I was thinking of the following workflow for Graphoid (a sketch follows the list):

  • Either browser or MediaWiki PHP uses a POST request to submit the graph definition (JSON)
  • Graphoid renders the PNG/SVG image and stores image in Cassandra under the hash of that same image
  • Graphoid returns the hash of the resulting image in a response header, and optionally returns the image itself (image is not needed if MW internally asks for it, as it only needs the hash ID)
  • HTML will contain an <img> tag with /graphoid/hash.png or /graphoid/hash.svg (note that there will be no need for the domain, page title, or revision id). Graphoid will simply pull the image from Cassandra and serve it (or, if I understood correctly, this can be done by RESTBase). In theory, if the hash is missing from Cassandra, Graphoid could force-refresh the page contained in the Referer header, but that might be too problematic.
  • All MW pages are refreshed at least once a month, during which the MW renderer will make a request to Graphoid to re-retrieve the hash for the definition. Graphoid will re-generate the image and update Cassandra with it.
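A minimal sketch of this flow as a pair of Express 4 routes; the in-memory Map stands in for Cassandra, and renderGraph() is a hypothetical placeholder for the actual rendering step:

'use strict';
const crypto = require('crypto');
const express = require('express');

const imageStore = new Map();                       // stands in for Cassandra

// Hypothetical placeholder for the real graph-to-image rendering.
function renderGraph(definition) {
    return Buffer.from('fake-png-bytes');
}

const app = express();
app.use(express.json());

// POST: render the submitted definition and store the image under the hash
// of that same image; return the hash in a response header plus the image.
app.post('/graphoid', (req, res) => {
    const image = renderGraph(req.body);
    const hash = crypto.createHash('sha256').update(image).digest('hex');
    imageStore.set(hash, image);
    res.set('X-Graphoid-Hash', hash).type('image/png').send(image);
});

// GET: serve a previously stored image by hash, as referenced from <img> tags.
app.get('/graphoid/:hash.png', (req, res) => {
    const image = imageStore.get(req.params.hash);
    if (!image) {
        return res.sendStatus(404);
    }
    res.type('image/png').send(image);
});

app.listen(3000);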

Can we already accomplish this with Cassandra from the service?

Clarification from the IRC chat with @mobrovac:

Yes, I would like users to be able to upload graph definitions to the service. The service creates the hash + image. The image is cached in Cassandra (the hash is the key). Subsequent requests should get the image from Cassandra.

IRC conversation excerpt:

(18:46:44) mobrovac: yurik_: question: so, if i get the current workings correctly, when a user edits a graph def, it's stored in mysql/mariadb and then graphoid fetches it via the mw api to render the image
(18:46:46) mobrovac: correct/
(18:46:46) mobrovac: ?
(18:47:14) yurik_: mobrovac,  currently the graph is stored in prop-pages
(18:47:20) yurik_: doesn't work well - max 64kb
(18:47:23) yurik_: per page, not per graph
(18:47:41) yurik_: but yes, overall process is correct
(18:48:06) yurik_: and i want to get rid of it - only store the number of graphs in the graph_spec
(18:48:11) yurik_: (prop pages)
(18:48:40) mobrovac: so basically this boils down to - you'd like to move the JSONs from mysql to cassandra?
(18:49:18) yurik_: mobrovac, incorrect
(18:49:24) yurik_: i want json to never be stored
(18:49:30) yurik_: i want to store PNG/SVG in cassandra
(18:49:46) mobrovac: ah
(18:49:46) mobrovac: so
(18:50:37) mobrovac: the idea is the user/page sends the json blob, it's hashed, the img is rendered and stored, and the hash is remembered in the page somewhere?
(18:51:14) yurik_: two use cases:  during page parsing, json is generated, posted to graphoid, which returns hash. The hash is output to HTML (can also be stored in pageprops just for fun)
(18:51:28) yurik_: another use case - page preview - the json is generated by javascript on the client and posted
(18:51:49) yurik_: gets back the image, shows it
(18:52:05) yurik_: the caching time in the second case should be much lower
(18:52:22) mobrovac: ok, but page preview should not store anything then
(18:52:25) mobrovac: until it's saved
(18:52:38) yurik_: page preview simply gets the image from graphoid
(18:52:44) yurik_: it doesn't store anything anywhere
(18:53:04) yurik_: again - graphoid will accept two calls:    /hash   and   POST:graph def
(18:53:17) yurik_: will return image in the first case
(18:53:26) yurik_: will return image or hash in the second case
(18:54:20) yurik_: when the page is saved, parser will call graphoid again

@Yurik, would it be possible to separate the graph def from the data and hash only the definition? The data may then be collected from /title/revision/data_id, giving us stable graph hashes with variable data.

The workflow could then be something like:

  • user/client puts the graph definition, which is stored in Cassandra
  • client requests /svg/title/revision/graph_hash/data_hash (or /latest)
  • RB looks up whether an image is available; if so, returns it
  • if not, sends the graph definition and data to Graphoid to render the image
  • stores it and returns it to the client

@mobrovac, this won't work - there is really no clear separation between graph definition and data - the definition may include some of the data, reference other data by URL, and reference other images (used as drawing elements) also by URL. Separating it would be pointless. I would much rather have RB look up the PNG/SVG image in Cassandra directly with /svg/hash or /png/hash, and rely on Cassandra to store everything longer than the MW page refresh cycle (1 month). Every page re-render will update Cassandra's cache and extend the expiration.

@Yurik, the main issues I see with storing / requesting only the graph image by its content hash are:

  • we can't ever delete or expire any of those graph images, as we don't have any idea if they are still referenced
  • we can't re-generate them (for example, after changes or bugs in the render pipeline)

If the image request supplied enough information to allow access to the original graph definition (via /{title}/{revision}/{graph_def_hash} or just /{graph_def_hash}, if that's stored by hash in RB), then this would be possible. Which disadvantages do you see in supplying more information in the image request?

Per IRC with @GWicke, the new workflow will look like this (a sketch follows the two lists):
On page rendering:

  • Calculate hash value based on original unexpanded <graph> tag content
  • Expand tag content to proper graph definition (JSON)
  • Pass hash and json to Graphoid service
  • Graphoid service renders image and stores it in Cassandra under the given hash
  • HTML will refer to the image by hash + pageid
  • Graphoid will require a shared secret for this mode, to prevent external image replacement

On page preview:

  • The browser will pass the proper graph spec to the Graphoid service without the hash, and get back the image without storing it.
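A minimal sketch of the page-rendering path from the MW side. The endpoint URL, the secret header name, and the use of SHA-256 are illustrative assumptions (and it assumes a Node 18+ runtime with global fetch):

'use strict';
const crypto = require('crypto');

// Illustrative assumptions, not deployed interfaces.
const GRAPHOID_URL = 'http://graphoid.internal/store';
const GRAPHOID_SECRET = process.env.GRAPHOID_SECRET || 'change-me';

// The hash is computed over the *unexpanded* <graph> tag content, so the
// same wikitext always maps to the same key.
function graphHash(unexpandedTagContent) {
    return crypto.createHash('sha256').update(unexpandedTagContent).digest('hex');
}

// Pass hash + expanded JSON definition to Graphoid; the shared secret guards
// this mode against external image replacement.
async function storeRenderedGraph(unexpandedTagContent, expandedDefinition) {
    const hash = graphHash(unexpandedTagContent);
    await fetch(`${GRAPHOID_URL}/${hash}`, {
        method: 'POST',
        headers: {
            'content-type': 'application/json',
            'x-graphoid-secret': GRAPHOID_SECRET
        },
        body: JSON.stringify(expandedDefinition)
    });
    return hash;   // HTML then refers to the image by hash (+ pageid)
}

module.exports = { storeRenderedGraph };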

@GWicke, re your comment - I don't mind having title/revid in the request. We cannot treat two graphs as identical just because they have identical hashes, because the hash will be based on the graph definition before template expansion. On the other hand, if the hash belongs to the same pageID, that should be enough. RevisionID might be excessive, unless we create a two-level redirection - revisionID+hash -> image-based hash -> actual image data, thus allowing multiple pages and revisions to reference the same graph image.

The workflows seem a bit outdated, so here's a new revision.

Graphoid changes:

  • Graph definition data will now be stored in RESTBase, so Graphoid no longer needs the page's title and revision; these should (or at least could) be removed from its endpoints.

RESTBase endpoints:

  • POST /{domain}/v1/media/graph/store - this endpoint is used by the Graph extension to POST a graph definition as the request body. Internally, the data is posted to Graphoid; if all goes well, the graph definition is stored in the post_data module and a hash is returned in the response body. Note that although the Graphoid result is disregarded, this step is needed to verify that the graph definition is correct. Note also that this endpoint is allowed only for internal IPs.
  • POST /{domain}/v1/media/graph/view/{format} - this endpoint is used in preview mode; the data is simply passed to Graphoid without storage.
  • GET /{domain}/v1/media/graph/view/{format}/{hash} - this endpoint is used to reference a rendered graph by the hash returned by the store endpoint.

Also, an existing graphoid endpoint in RESTBase becomes outdated with this change, so it should be deprecated and potentially removed.

@mobrovac, @GWicke, @Eevans, @Yurik, thoughts?

> The workflows seem a bit outdated, so here's a new revision.
>
> Graphoid changes:
>
>   • Graph definition data will now be stored in RESTBase, so Graphoid no longer needs the page's title and revision; these should (or at least could) be removed from its endpoints.
>
> RESTBase endpoints:
>
>   • POST /{domain}/v1/media/graph/store - this endpoint is used by the Graph extension to POST a graph definition as the request body. Internally, the data is posted to Graphoid; if all goes well, the graph definition is stored in the post_data module and a hash is returned in the response body. Note that although the Graphoid result is disregarded, this step is needed to verify that the graph definition is correct.

So there is no performance gain. Two solutions come to mind:

  • set up a specialised validator route in Graphoid, which only returns 200 or 400
  • store the result, since we wait for it anyway even though the Graph extension isn't expecting it; allowing a slightly longer delay would even let us get all of the formats and store them right away (much like Mathoid's /complete route); that would need a new Graphoid route, though

We can go either way, but the latter would provide fast access even for the first retrieval (regardless of Varnish caching).

>   • POST /{domain}/v1/media/graph/view/{format} - this endpoint is used in preview mode; the data is simply passed to Graphoid without storage.

I presume this would get called from the Graph extension as well? I.e., this would also be restricted to internal IPs?

>   • GET /{domain}/v1/media/graph/view/{format}/{hash} - this endpoint is used to reference a rendered graph by the hash returned by the store endpoint.

I must admit I don't like sending this to Graphoid and waiting for it to render. Granted, most requests will hit the cache, but the cache is not infinite: in order to put something in, something else has to come out. From that I infer that keeping all of the rendered graph images in cache for 30 days is not a realistic supposition. So I'd be very much in favour of storing the responses in RESTBase with the ttl retention policy, so that they can expire.

> Also, an existing graphoid endpoint in RESTBase becomes outdated with this change, so it should be deprecated and potentially removed.

Yup, agreed, I'm not aware of any client using it.

Graphoid differs from Mathoid because its result also depends on the external data, not just the graph spec. So I worry about introducing yet another caching layer that has to be invalidated.

On the other hand, if RB cache is invalidated every time a POST request is made, even if the posted data produces an identical hash value, this route could also be made to work. MW could then re-post the graph to RB during action=purge or action=save, invalidating the Varnish and RB caches at the same time. I wonder if there could be a regeneration race condition here - the cache is purged but Varnish still gets an older image and re-caches it.

I can introduce a validator route to graphoid - shouldn't be very hard.
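A sketch of what such a validator route could look like (Express 4; validateGraphSpec() is a hypothetical stand-in for the real graph-spec validation):

'use strict';
const express = require('express');

// Hypothetical stand-in: the actual check would try to compile the spec with
// the rendering library instead of this shape test.
function validateGraphSpec(spec) {
    return spec !== null && typeof spec === 'object' && !Array.isArray(spec);
}

const app = express();
app.use(express.json());   // malformed JSON bodies already fail here with 400

// Validation only, no rendering: 200 if the definition is acceptable, 400 otherwise.
app.post('/validate', (req, res) => {
    res.sendStatus(validateGraphSpec(req.body) ? 200 : 400);
});

app.listen(3000);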

Upon further reflection, I think we should stay with just one POST request available to internal IPs only - but make it expire data after 1.5 months because every wiki page is regenerated at least once a month, thus refreshing the data. This way we will treat page previews just like any other page save, which will keep its hash valid for a while.

I think we should also keep the title/revid when posting - mostly for debugging purposes. They should not be required when rendering an image. (A sketch of these two endpoints follows the list.)

  • POST /{domain}/v1/media/graph/store/{title}/{revid} + JSON body (Allowed from internal IPs only)
    1. Validate JSON with Graphoid, which will also return the hash key (we could generate it in RB, but we might as well allow Graphoid to later change how the key is produced).
    2. On success, store the JSON body in Cassandra under the generated hash key. Store title and revid for debugging purposes, not as part of the key. The storage should expire in 1+ month.
    3. Return the hash id as the response body
  • GET /{domain}/v1/media/graph/view/{format}/{hash} (public)
    1. Gets JSON from Cassandra by hash (domain is not needed here), and passes domain, format, and json to Graphoid
    2. Graphoid returns an image
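A sketch of those two endpoints under the stated assumptions. A Map with expiry timestamps stands in for Cassandra's row TTL, the validation round-trip to Graphoid is collapsed into a local hash for brevity, and renderWithGraphoid() is a hypothetical stand-in for the service call:

'use strict';
const crypto = require('crypto');
const express = require('express');

const TTL_MS = 45 * 24 * 3600 * 1000;   // roughly 1.5 months, per the note above
const defs = new Map();                 // hash -> { json, title, revid, expires }

// Stand-in for step 1: in reality this would POST the JSON to Graphoid for
// validation and take the hash from its response; here we hash locally.
function validateAndHash(json) {
    return crypto.createHash('sha256').update(JSON.stringify(json)).digest('hex');
}

// Stand-in for the render call to the Graphoid service.
async function renderWithGraphoid(domain, format, json) {
    return Buffer.from(`fake ${format} for ${domain}`);
}

const app = express();
app.use(express.json());

// POST .../store/{title}/{revid}: validate, store the definition under its
// hash (title/revid kept for debugging only, not part of the key), return hash.
app.post('/:domain/v1/media/graph/store/:title/:revid', (req, res) => {
    const hash = validateAndHash(req.body);
    defs.set(hash, {
        json: req.body,
        title: req.params.title,
        revid: req.params.revid,
        expires: Date.now() + TTL_MS
    });
    res.send(hash);
});

// GET .../view/{format}/{hash}: fetch the definition by hash and have
// Graphoid render it.
app.get('/:domain/v1/media/graph/view/:format/:hash', async (req, res) => {
    const entry = defs.get(req.params.hash);
    if (!entry || entry.expires < Date.now()) {
        return res.sendStatus(404);
    }
    const image = await renderWithGraphoid(req.params.domain, req.params.format, entry.json);
    res.type(req.params.format === 'svg' ? 'image/svg+xml' : 'image/png').send(image);
});

app.listen(3000);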

> On the other hand, if RB cache is invalidated every time a POST request is made, even if the posted data produces an identical hash value, this route could also be made to work.

We could make both endpoints support the cache-control: no-cache header, which would regenerate the content, and send such a request when there's a need for invalidation.
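A sketch of that invalidation hook; getStored and regenerate are hypothetical callbacks:

'use strict';

// Does the request ask us to bypass the stored copy?
function hasNoCache(req) {
    return /\bno-cache\b/i.test(req.headers['cache-control'] || '');
}

// getStored() resolves to the stored response or undefined; regenerate()
// re-renders the content and overwrites storage.
async function serve(req, getStored, regenerate) {
    if (!hasNoCache(req)) {
        const stored = await getStored();
        if (stored !== undefined) {
            return stored;
        }
    }
    return regenerate();
}

module.exports = { hasNoCache, serve };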

> I think we should also keep the title/revid when posting - mostly for debugging purposes. They should not be required when rendering an image.

I'm not sure we need to expose title/revid at the URI level; if it's just for debugging, we could add those fields to the request body and make them optional. A downside of storing the title/rev is that they would also become part of the hash, which we don't want (just because that's how the post_data module works).

In that case the store and view parts are not required and we can have

  • POST /{domain}/v1/media/graph with a formData body containing title, revision, and data fields.
  • GET /{domain}/v1/media/graph/{format}/{hash}
>   1. Validate JSON with Graphoid, which will also return the hash key (we could generate it in RB, but we might as well allow Graphoid to later change how the key is produced).

There should be only one source of truth (the hash), and I vote for generating it in RESTBase, as RESTBase is responsible for storing and using the hash later on.

>> On the other hand, if RB cache is invalidated every time a POST request is made, even if the posted data produces an identical hash value, this route could also be made to work.
>
> We could make both endpoints support the cache-control: no-cache header, which would regenerate the content, and send such a request when there's a need for invalidation.

If Graphoid returns no-cache to a GET request, Varnish wouldn't cache it (I presume), which would be very bad. I need behavior similar to MW's - pages are returned and cached, but MW can send out a special UDP packet to the Varnishes to flush individual pages. If RESTBase starts caching as well, we would have two identical disk caching layers, which doesn't make much sense IMO.

>> I think we should also keep the title/revid when posting - mostly for debugging purposes. They should not be required when rendering an image.
>
> I'm not sure we need to expose title/revid at the URI level; if it's just for debugging, we could add those fields to the request body and make them optional. A downside of storing the title/rev is that they would also become part of the hash, which we don't want (just because that's how the post_data module works).
>
> In that case the store and view parts are not required and we can have
>
>   • POST /{domain}/v1/media/graph with a formData body containing title, revision, and data fields.
>   • GET /{domain}/v1/media/graph/{format}/{hash}

That's exactly why I think hash generation should not be done in RESTBase - I might decide that two very different bodies actually produce the same result and give you the same hash. Hash generation is a data-dependent process and, as such, should be done only where the code knows the data. The above POST and GET look OK to me - I'm OK with passing title & revid in either the URL, formdata, or headers. It might be good for debugging to return title & revid as headers with the image.

Without RESTBase, Graphoid would have the following algorithm (a sketch follows the list):

  • POST(IN: json blob, OUT: hash): generates hash, saves data to database with key=hash, returns hash
  • GET(IN: hash, OUT: image): get json blob from database where key=hash, render and return image
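A direct transcription of those two operations, with a Map standing in for Graphoid's own database and render() a hypothetical placeholder:

'use strict';
const crypto = require('crypto');

const db = new Map();   // stands in for Graphoid's own database

// Hypothetical placeholder for the actual rendering step.
function render(jsonBlob) {
    return Buffer.from('fake image for: ' + jsonBlob);
}

// POST(IN: json blob, OUT: hash): generate hash, save with key=hash, return hash.
function post(jsonBlob) {
    const hash = crypto.createHash('sha256').update(jsonBlob).digest('hex');
    db.set(hash, jsonBlob);
    return hash;
}

// GET(IN: hash, OUT: image): fetch json blob where key=hash, render, return.
function get(hash) {
    const jsonBlob = db.get(hash);
    if (jsonBlob === undefined) {
        throw new Error('unknown hash');
    }
    return render(jsonBlob);
}

module.exports = { post, get };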

Let's try to figure out how we can make this work with RESTBase while keeping it simple. Having RESTBase removes the need for each service to set up and manage its own database, but the proxy's complexity should not end up higher than a database setup's ))

Additional debug use case: for a given hash, get the original JSON data. We could do it via format=json, e.g. by supporting a GET /.../graph/json/hash.json URL. This might be useful for figuring out various issues, and for me to export existing graphs for external testing.

Per vchat today:

  • Let's keep the hash generation in RESTBase
  • On POST, save the JSON blob and the resulting Graphoid image in Cassandra
  • For debugging, the POST request will pass title and revision as optional fields. These fields should be stored, passed to Graphoid, and returned on each GET request as extra response headers. These fields should NOT be involved in the hash calculation. The optional request fields could be sent as headers, query params, or form fields. At some point we might decide to omit these headers from the response unless some magic debug flag is passed in.

Potential future async optimization: graph generation could take a long time, so RESTBase could store the data blob and return the hash right away, letting MW proceed with HTML generation. In the meantime, RESTBase will wait for Graphoid to finish image generation and store the resulting image. If RESTBase receives a GET request and the image is not there but the data is stored, RESTBase should re-render, unless it can support collapsed forwarding.
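A sketch of this async variant; all storage and render calls are hypothetical stand-ins:

'use strict';
const crypto = require('crypto');

const defs = new Map();     // hash -> json blob
const images = new Map();   // hash -> rendered image

// Hypothetical stand-in for the render call to Graphoid.
async function renderWithGraphoid(jsonBlob) {
    return Buffer.from('fake image');
}

// Store the blob and return the hash immediately; the render continues in
// the background so MW can proceed with HTML generation.
function post(jsonBlob) {
    const hash = crypto.createHash('sha256').update(jsonBlob).digest('hex');
    defs.set(hash, jsonBlob);
    renderWithGraphoid(jsonBlob)
        .then(img => images.set(hash, img))
        .catch(() => { /* leave re-rendering to the GET path */ });
    return hash;
}

// If the image isn't stored yet but the definition is, re-render on demand
// (absent collapsed forwarding, concurrent GETs may each render once).
async function get(hash) {
    if (images.has(hash)) {
        return images.get(hash);
    }
    const jsonBlob = defs.get(hash);
    if (jsonBlob === undefined) {
        throw new Error('unknown hash');
    }
    const img = await renderWithGraphoid(jsonBlob);
    images.set(hash, img);
    return img;
}

module.exports = { post, get };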

The graphoid service (not yet deployed) has been updated to support POST requests. See the comment in https://gerrit.wikimedia.org/r/#/c/248284/
Usage: /www.mediawiki.org/v2/png with the graph data as the JSON body

CC: @Pchelolo

> The graphoid service (not yet deployed) has been updated to support POST requests. See the comment in https://gerrit.wikimedia.org/r/#/c/248284/
> Usage: /www.mediawiki.org/v2/png with the graph data as the JSON body

Nice work, @Yurik! (I'm going to bug you a bit here and express my wish to see some tests for this new v2 route as well).

@Pchelolo and I discussed it a bit more this morning and here are some considerations.

In v2 routes, Graphoid should also be capable of supplying all of the possible outputs as a JSON blob. The reasoning here is that there is a planned expansion to render SVG as well. To ensure quick responses for hashes in different formats, there should be a separate route that responds with the following body:

{
  "png": {
    "headers": {
      "content-type": "image/png"
    },
    "body": // binary PNG blob
  },
  "svg": {
    // idem as above
  }
}

That would allow RESTBase to store the complete response for all of the current (and future) formats once, and to serve any of them just by using the hash.
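A sketch of how a single format could be picked out of a stored /all blob of the shape above (pickFormat is a hypothetical helper, not RESTBase code):

'use strict';

// Returns { headers, body } for the requested format out of a stored /all
// response, or null if that format isn't present.
function pickFormat(storedAllResponse, format) {
    const entry = storedAllResponse[format];
    return entry ? { headers: entry.headers, body: entry.body } : null;
}

// e.g. pickFormat(stored, 'png') serves a PNG request from the same stored
// row that also answers SVG requests.
module.exports = { pickFormat };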

As for the public route itself, since graphoid's response doesn't depend at all on the title and revid, we feel it's best to go with a route without them. Apart from the logical implications, that also facilitates caching. For debugging purposes, we might want to come up with a scheme where the Graph extension sends RESTBase the title and rev in a header.

  • I could add an "/all" route to get all formats. Will be done shortly.
  • Will RESTBase forward those debug headers to Graphoid?
  • Will RESTBase store extra response headers and return them?
    • If yes, how do you handle the headers from the /all response?
This comment was removed by Yurik.
>   • Will RESTBase forward those debug headers to Graphoid?

As you wish :)

>   • Will RESTBase store extra response headers and return them?

Yes.

>   • If yes, how do you handle the headers from the /all response?

We will store the result in the following format:

{
  headers: { global headers from the Graphoid /all response go here },
  data: {
    "png": {
      headers: { png-specific headers like content-type go here },
      body: blob
    }
  }
}

The resulting headers would be constructed like this: Object.assign({}, {global headers}, {format-specific headers}, {'x-resource-location': 'blablabla'});

> The resulting headers would be constructed like this: Object.assign({}, {global headers}, {format-specific headers}, {'x-resource-location': 'blablabla'});

To clarify, if {format-specific headers} and {global headers} contain the same header, the one from {format-specific headers} will be returned to the client, since later arguments to Object.assign win. One case where that is important is, e.g., the Content-Type header.
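A small self-contained illustration of that merge order (the header values are made up):

'use strict';

// Later arguments to Object.assign overwrite earlier ones, so the
// format-specific content-type wins over the global one.
const globalHeaders = { 'cache-control': 'max-age=86400', 'content-type': 'application/json' };
const pngHeaders = { 'content-type': 'image/png' };

const resulting = Object.assign({}, globalHeaders, pngHeaders,
    { 'x-resource-location': 'blablabla' });

console.log(resulting['content-type']);   // image/png - format-specific wins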

Implemented /all and /svg. The headers "Title" and "RevisionId" are always returned if passed in.

https://github.com/wikimedia/mediawiki-services-graphoid/commit/fc015123a55220603b55b41aeec682bc806e6c00

GWicke claimed this task.