
RFC: Generalize content-addressable POST request storage
Closed, Resolved · Public

Description

We have recently encountered several use cases that

  • need to POST some data to a service, and
  • would like to store the response of this request at a compact GET-accessible location for future use.

Examples include:

  • Graphs: POST from the extension or VisualEditor, GET from the rendered page
  • Math: POST from the extension or VisualEditor, GET from the reference in the rendered page

We might be able to implement this generically:

  • store the request as a JSON blob in a key-value bucket, using the sha-1 of the JSON
  • store the response from the backend service in a different bucket, using the same hash
  • return hash location (in Content-Location header?) and original response
    • alternatively, redirect to GET location

To facilitate a clean separation between content types, it probably makes sense to use separate request / response buckets per service. By storing the request, we can re-render all requests when necessary.
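
To make the keying concrete, here is a rough sketch of the flow (illustrative only; the in-memory buckets and function names below stand in for the real per-service buckets):

'use strict';
const crypto = require('crypto');

// Illustrative in-memory stand-ins for the per-service request / response buckets.
const requestBucket = new Map();
const responseBucket = new Map();

// Content address: the sha-1 of the request serialized as JSON.
function hashRequest(requestData) {
    return crypto.createHash('sha1')
        .update(JSON.stringify(requestData))
        .digest('hex');
}

// Store the original request and the backend response under the same hash,
// so a later GET on the hash can retrieve either, and all requests can be
// re-rendered from the stored request blobs if needed.
function storePair(requestData, responseBody) {
    const hash = hashRequest(requestData);
    requestBucket.set(hash, JSON.stringify(requestData));
    responseBucket.set(hash, responseBody);
    return hash; // e.g. returned to the client in a Content-Location header
}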

For expiry, we could consider a lazy LRU scheme. In any case, the direct correspondence between request and response key should help with systematic cleanup of both components.

Other considerations:

  • We might want to limit the sources of POST requests (e.g. to authenticated users, or by IP) and restrict public read access to the original requests.

Strawman config stanza

This is an example of how this could be implemented as an extension of the simple_service module we already have:

/{module:postservice}:
   x-modules:
     - name: simple_service
       type: file
       options:
         paths:
           /posttest/{hash}/png:
             get:
               # First, check storage for png. 
               # If found, simply return & be done.
               storage:
                 # result storage for pngs
                 bucket_request: 
                   uri: /{domain}/sys/key_value/postservice.png
                 item_request:
                   uri: /{domain}/sys/key_value/postservice.png/{hash}

               # If png not found, load original post request
               # Return 404 if missing
               post_request_storage:
                 item_request:
                   uri: /{domain}/sys/key_value/postservice.post/{hash}

               # Then, send templated backend request. POST request data
               # object is available in `post_request`.
               # On success, store back via put to `storage.item_request`
               # and return to client.
               backend_request:
                 method: post
                 uri: http://some.post.service/png
                 body: '{$.post_request.body}'


           /posttest/:
             post:
               # Store or verify presence of POST data.
               post_request_storage:
                 bucket_request: 
                   uri: /{domain}/sys/key_value/postservice.post
                 item_request:
                   uri: /{domain}/sys/key_value/postservice.post/{$.post_request.hash}
               # What to do on success:
               response:
                 status: 200
                 headers:
                   content-location: '{$.post_request.hash}'

Main changes are:

  • added post_request_storage stanza, with implicit storage / hashing behavior
    • hash & possibly update if method is POST
    • load into post_request variable if method is GET, before making backend request on storage miss
  • added optional response template (possibly useful for general request templating)
  • added ability to reference post_request.body in backend_request (see T86737)

Event Timeline

GWicke raised the priority of this task to Medium.
GWicke updated the task description.
GWicke added a project: RESTBase.
GWicke renamed this task from "Generalize content-addressable POST request storage" to "RFC: Generalize content-addressable POST request storage". Jul 15 2015, 11:47 PM
GWicke updated the task description.
GWicke set Security to None.
GWicke added a subscriber: Yurik.

I like this idea. It lets us have a (permanent) link between the request and response (and its location), which enables easy regeneration if needed (think Cache-Control: no-cache). That does, however, require the SHA1s to be stable. Otherwise, we'd be left with a lot of garbage collection to do.

return hash location (in Content-Location header?) and original response

Yes, this. Alternatively, we could also provide an X-Restbase-Content-SHA1 header or something of the sort. Redirects POST -> GET are not defined, AFAIK (or POST redirects in general, for that matter).

Let me outline the strategy I envision for the math extension and the mathoid service, after reading this task.
The Math extension sends a POST request to the URL
http://mathoid-service/convert/SHA1(postdata)
with the payload postdata.
The mathoid service would internally decide whether the URL http://mathoid-service/convert/SHA1(postdata) has been queried before and either recalculate the response or use the cached response.
Basically, all that needs to change in the existing services is:

  1. make the Math extension calculate the SHA1 of postdata and append it to the mathoid-service URL
  2. reconfigure the mathoid service to cache results

With regard to 2, @GWicke pointed me to https://github.com/wikimedia/restbase/blob/master/config.test.yaml#L122-L138, which I do not fully understand at the moment.
With regard to 1, I started on https://gerrit.wikimedia.org/r/#/c/225054/

@Physikerwelt, my proposal is to handle the hashing and caching in RESTBase, so that the logic can be shared between services. If I understand you right, you are thinking about doing this in mathoid instead?

The mathoid service would internally decide whether the URL http://mathoid-service/convert/SHA1(postdata) has been queried before and either recalculate the response or use the cached response.

Note that this means Mathoid would need to get direct DB support or some other means of IPC, since it is running on two hosts with multiple processes on each of them. IMHO, doing it the other way around, via RESTBase, would be more beneficial:

  1. RESTBase gets a request, checks if the resource with the given SHA1 needs to be re-rendered
  2. If so, it sends a request to Mathoid and stores the result
  3. Returns the response to the client (from its storage) and refreshes the SHA1's TTL (if caching is needed)

@GWicke Ok. That was a misunderstanding. I was thinking that mathoid would inherit this functionality from wikimedia-services.
So let me correct my statement above:

Let me outline the strategy I envision for the math extension, after reading this task.
The Math extension sends a POST request to the URL
http://restbase-service/math/SHA1(postdata)
with the payload postdata.
The restbase service would internally decide whether the URL http://restbase-service/math/SHA1(postdata) has been queried before and either forward the request to mathoid or use the cached response.
Basically, all that needs to change in the existing services is:

  1. make the Math extension calculate the SHA1 of postdata and append it to the restbase-service URL

With regard to 1, I started on https://gerrit.wikimedia.org/r/#/c/225054/

I think the crux is in repeat requests from clients like extensions, when they don't know if a render already exists. Options seem to be:

  1. repeat the POST, with RB hashing the request, recognizing that the result already exists & returning the same hash again.
  2. let the client calculate the hash as well, and then perform a GET to check if a render already exists, followed by a POST as in 1) if it doesn't
  3. reuse expansion HTML, as is already the case in Parsoid

Discussion:

  1. is fairly straightforward, but might not result in the best performance, especially in the PHP parser, which processes these requests synchronously without any concurrency.
  2. suffers basically from the same issue, and additionally runs the risk of inconsistent request serializations and thus hash calculations. It is also significantly more complex than 1), by requiring clients to implement serialization, hashing, and two requests instead of one. It might be a tiny bit faster for cache hits, but will be slower for misses.
  3. avoids the performance issues, and can be combined with either 1) or 2). However, it is currently only available in Parsoid.

In the longer term, an interesting option would be to make the service return the HTML for the extension expansion as well. Basically, this would be the equivalent of the API request that Parsoid already does for tag extensions, with the option of caching the response (and along with it, the hash of derived media etc).

@GWicke: At https://gerrit.wikimedia.org/r/#/c/225054/3/MathMathML.php I drafted the method with only one request. Is that in accordance with what you had in mind?

@Physikerwelt, I think it would be better if the extension just repeated the POST, and left any hashing to RB.

@GWicke: In the spirit of selection push-down, the first request only needs to get MathML and a pointer to the SVG image. In a second request (I have not drafted this in the WIP change) the special page would fetch the image. The Math extension does not even need to know the pointer; the pointer could be exchanged between restbase and mathoid. If restbase redirects to mathoid, restbase would need to calculate the key and pass it on to mathoid.

Change 225054 had a related patch set uploaded (by Physikerwelt):
WIP: Restbase support for Math extension

https://gerrit.wikimedia.org/r/225054

I think the crux is in repeat requests from clients like extensions, when they don't know if a render already exists. Options seem to be:

  1. repeat the POST, with RB hashing the request, recognizing that the result already exists & returning the same hash again.
  2. let the client calculate the hash as well, and then perform a GET to check if a render already exists, followed by a POST as in 1) if it doesn't
  3. reuse expansion HTML, as is already the case in Parsoid

Without limiting this discussion specifically to extensions, IMHO, a combination of (1) and (2) sounds plausible:

  • a client POSTs the blob
  • RESTBase deduces the hash and checks if it can be retrieved directly from storage; if not, the back-end service is called and the result is stored (together with the request)
  • RESTBase returns the hash location in a header (whenever a POST request is issued)

This implies clients will need to manage the location themselves, but it also means that it is up to the client to decide which way to go - the easy way or the more performant way.

Not sure what you mean by (3).

In the longer term, an interesting option would be to make the service return the HTML for the extension expansion as well. Basically, this would be the equivalent of the API request that Parsoid already does for tag extensions, with the option of caching the response (and along with it, the hash of derived media etc).

Right, but this pertains more to the way RESTBase and its back-end services internally work, while the first part relates to the SLA between RESTBase and outside clients. Or am I confusing stuff?

Not sure what you mean by (3).

The current practice of reusing HTML content for tag extensions. This eliminates the need to perform *any* kind of request to the service end point, as the hash etc. is already in the HTML.

Right, but this pertains more to the way RESTBase and its back-end services internally work, while the first part relates to the SLA between RESTBase and outside clients. Or am I confusing stuff?

It pertains to the scope of what the service implements, and implicitly to opportunities to generate several kinds of content (e.g. HTML, MathML, SVG, PNG) from a single request. It increases the scope of caching, and minimizes the logic needed for a tag extension to a simple POST of the tag extension's name, attributes and body, plus perhaps some page metadata (title and revision?). The result is basically content-addressable tag extension expansions, usable from both PHP & Parsoid.

But, you are right that it does not avoid the need to perform a request.

Not sure what you mean by (3).

The current practice of reusing HTML content for tag extensions. This eliminates the need to perform *any* kind of request to the service end point, as the hash etc. is already in the HTML.

Oh, right. Yes, this! +1000.

This, with, as you already pointed out, a combo of (1) and (2), leaves the client the possibility to still choose which way to go, which implies that current clients are not forced to adopt a certain way, but can converge to it in time. Prospective clients should, however, then rely on (3) being the de-facto standard.

This, with, as you already pointed out, a combo of (1) and (2), leaves the client the possibility to still choose which way to go, which implies that current clients are not forced to adopt a certain way, but can converge to it in time. Prospective clients should, however, then rely on (3) being the de-facto standard.

To clarify, what I mean by clients can choose which way to go is that clients can store the hash they've gotten back from RESTBase and then reuse it on subsequent requests to check if the render is still available. They should not, however, calculate the hashes themselves, as that might lead to inconsistencies. There can be only one source of truth on IDs, and that should be RESTBase in this instance because it manages the storage for said resources.

So:

  • client issues a POST request with the data
  • RESTBase returns the location of the resource in a header
  • if wanted, the client can persist the hash somehow to avoid POSTing and use GET on subsequent requests
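
For illustration, a client following this flow could look roughly like this (host, end point and payload are placeholders, not a settled API):

// Using the global fetch (Node >= 18 or a browser).
async function renderAndCache(payload) {
    // POST the blob; RESTBase hashes it and returns the location in a header.
    const res = await fetch('https://rest.example.org/posttest/', {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify(payload)
    });
    const hash = res.headers.get('content-location');

    // If wanted, persist the hash and use a plain GET on subsequent requests
    // instead of re-POSTing the full payload.
    return fetch(`https://rest.example.org/posttest/${hash}/png`);
}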

Just a comment regarding the config stanza. It seems to me that post_request is being created out of thin air, in the sense that it's not well-defined (but that's probably a topic of discussion for T86737). What bothers me here is that post_request has a different meaning in the two routes mentioned:

  • in /posttest/{hash}/png, post_request refers to the stored request object
  • in /posttest/ it should be dereferenced using the original incoming request

Also, $.post_request.hash is referenced in /posttest/ : post : post_request_storage : item_request. I'm guessing this is a typo that should be replaced with $.post_request.params.hash, but even then it is not really clear from this that the original request has such a parameter. In other words, it seems it silently forces the public API endpoint to have one such parameter.

Just a comment regarding the config stanza. It seems to me that post_request is being created out of thin air, in the sense that it's not well-defined (but that's probably a topic of discussion for T86737). What bothers me here is that post_request has a different meaning in the two routes mentioned:

  • in /posttest/{hash}/png, post_request refers to the stored request object
  • in /posttest/ it should be dereferenced using the original incoming request

This is actually fairly similar to the dual nature of storage.item_request, which is implicitly used both with GET and PUT. The main difference is that the same item_request is referenced in two separate stanzas, and one of them supports POST.

I'm open to suggestions on how to make this cleaner / more intuitive.

Also, $.post_request.hash is referenced in /posttest/ : post : post_request_storage : item_request. I'm guessing this is a typo that should be replaced with $.post_request.params.hash,

I don't think so; it's a computed property of the request, and not a request parameter. There might be better ways to encode it (post_request._hash or an x-request-hash header?), but I don't think that parameters is the right place for it.

but even then it is not really clear from this that the original request has such a parameter. In other words, it seems it silently forces the public API endpoint to have one such parameter.

I'm open to suggestions on how to pass this in instead.

This is actually fairly similar to the dual nature of storage.item_request, which is implicitly used both with GET and PUT. The main difference is that the same item_request is referenced in two separate stanzas, and one of them supports POST.

I'd say the main difference is actually the fact that item_request is local to the module and request, while post_request purports to be a global variable (but local to the request nevertheless). It is my understanding that request templating is going to be done in lib/{server,restbase}.js, and not in the modules. Hence my concern of a module having two different contexts for post_request.

I'm open to suggestions on how to make this cleaner / more intuitive.

Yes, I know my comment wasn't really constructive, sorry about that. Well, if we have $.req always set (as the original incoming request), then post_request in /posttest/ could be replaced by $.req as they essentially refer to the same object. For the other route/case, it seems we'd need to find a way for a module to tell RESTBase to set a global. Or, alternatively, settle on the fact that the module is to return always the stored body (or one of its parts), thus eliminating the need for templating/dereferencing in that case.

Also, $.post_request.hash is referenced in /posttest/ : post : post_request_storage : item_request. I'm guessing this is a typo that should be replaced with $.post_request.params.hash,

I don't think so; it's a computed property of the request, and not a request parameter. There might be better ways to encode it (post_request._hash or an x-request-hash header?), but I don't think that parameters is the right place for it.

Ok, so that means that we'd need to add to lib/server.js the logic of computing the hash for all POST requests?

but even then it is not really clear from this that the original request has such a parameter. In other words, it seems it silently forces the public API endpoint to have one such parameter.

I'm open to suggestions on how to pass this in instead.

I like your idea about setting the header for it the best :)

This is actually fairly similar to the dual nature of storage.item_request, which is implicitly used both with GET and PUT. The main difference is that the same item_request is referenced in two separate stanzas, and one of them supports POST.

I'd say the main difference is actually the fact that item_request is local to the module and request, while post_request purports to be a global variable (but local to the request nevertheless). It is my understanding that request templating is going to be done in lib/{server,restbase}.js, and not in the modules. Hence my concern of a module having two different contexts for post_request.

The post_request variable is just a global that's defined by the simple_service module for some requests, and passed into the generic request templating method (which might warrant its own module). It happens to share the name with the config stanza, but is not the same thing. We could change that if you think that would be clearer. Ideas? $.post_req for symmetry with $.req?

I'm open to suggestions on how to make this cleaner / more intuitive.

Yes, I know my comment wasn't really constructive, sorry about that. Well, if we have $.req always set (as the original incoming request), then post_request in /posttest/ could be replaced by $.req as they essentially refer to the same object.

In /posttest/ $.post_request is the processed / mangled post request. It might have headers etc stripped & other stuff normalized, and an annotation for the hash. It differs from the original, incoming request. It matches what's written and later loaded back from the DB as $.post_request.

For the other route/case, it seems we'd need to find a way for a module to tell RESTBase to set a global. Or, alternatively, settle on the fact that the module is to return always the stored body (or one of its parts), thus eliminating the need for templating/dereferencing in that case.

The way I imagine request templating to work, is to simply pass an object with the globals ('model') into the compiled evaluation function. Often that would be just { req: theRequest }, but in others it's { req: theRequest, post_request: loadedPostRequest }, or any others we might want later.
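
To make the 'model' idea concrete, a toy version of the evaluation could look like this (a simplified stand-in, not the actual templating implementation):

// Compile once, then evaluate against a 'model' object carrying the globals.
function compileTemplate(template) {
    return function evaluate(model) {
        return template.replace(/\{\$\.([\w.]+)\}/g, (match, path) =>
            path.split('.').reduce((obj, key) => obj && obj[key], model));
    };
}

const bodyTemplate = compileTemplate('{$.post_request.body}');

// Model for the content-addressable GET path: the incoming request plus the
// POST request loaded back from storage.
const model = {
    req: { method: 'get', params: { hash: 'abc123' } },
    post_request: { body: 'stored POST payload' }
};
console.log(bodyTemplate(model)); // -> 'stored POST payload'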

Also, $.post_request.hash is referenced in /posttest/ : post : post_request_storage : item_request. I'm guessing this is a typo that should be replaced with $.post_request.params.hash,

I don't think so; it's a computed property of the request, and not a request parameter. There might be better ways to encode it (post_request._hash or an x-request-hash header?), but I don't think that parameters is the right place for it.

Ok, so that means that we'd need to add to lib/server.js the logic of computing the hash for all POST requests?

No, I think this should live in simple_service.js.

For the browser preview use case, we also need to support direct POSTing to a content entry point without storing the POST data (and directly returning some preview result). We can set that up as a separate POST end point without any reference to post_request, but it might be a bit repetitive to document it all.

Not a big deal perhaps, but useful to keep in mind. Can we do better?

Overall I think this is good to go. Most of this should probably live in simple_service.js, which might no longer be completely simple afterwards. We could consider renaming that module to something like generic_service to reflect the added powers.

This looks good to me. One additional consideration: for private wikis, especially if the graphs are hosted on special pages (per user permission), what is the expected dataflow?

@Yurik, we are in the process of adding private wiki support (see T88016). Initially, that's focused on read access, but we should be able to expand to simple writes like these POSTs soon. Initially, we will likely want to limit support to graphs that are readable to anybody with 'read' rights. Storing POSTs could initially be locked down to MediaWiki extensions only.

So, the data flow would be basically the same, but with added authentication.

I just realized that we will need to support some kind of revision suppression / deletion for the case that sensitive data was included in a POST. This means that we'd need to keep track of the connection to the original resource, essentially the entire dependency graph. The same tracking could also let us eventually delete POSTs and derived content that are no longer referenced.

In practical terms, this might mean adding a required URL parameter to identify the dependency relationship, and recording it in a dependency tracking system (ex: T105766).

VisualEditor might require a tricky workflow: a large number of POST requests, followed by "save", at which point it will re-POST the last request with a significantly longer TTL. Just a thought.

VE should be able to directly POST to the individual content end points, without storage. That's what I meant in #1473135.

We might actually be able to generalize this further into something more flexible:

/{module:postservice}:
  x-modules:
    - name: simple_service
      type: file
      options:
        paths:
          /posttest/{hash}/png:
            get:
              on_setup:
                - setup_png_storage:
                    method: put 
                    uri: /{domain}/sys/key_value/postservice.png

              on_request:
                - try_return_on_success!:
                    method: get # can be omitted, get is default
                    # Support cache-control refreshes;
                    # key_value bucket returns 404 when receiving no-cache header
                    headers:
                      cache-control: '{cache-control}'
                    uri: /{domain}/sys/key_value/postservice.png/{hash}
                - get_post:
                    # If png not found, load original post request
                    uri: /{domain}/sys/post_data/postservice/{hash}
                - new_png:
                    method: post
                    uri: http://some.post.service/png
                    body: '{$.get_post.body}'
                - save_new_png:
                    method: put
                    uri: /{domain}/sys/key_value/postservice.png
                    headers: '{$.new_png.headers}'
                    body: '{$.new_png.body}'
                - return!: '{$.new_png}'

          /posttest/:
            post:
              on_setup:
                - setup_post_data_bucket:
                    method: put
                    uri: /{domain}/sys/post_data/postservice

              on_request:
                - save_post_request:
                    method: post
                    uri: /{domain}/sys/post_data/postservice/
                    body: '{$.request.body}'
                    # The post_data module returns a 
                    # `Content-Location: <hash>` header, which we forward.
                - return!: '{$.save_post_request}'

Another example, equivalent to current simple_service entry points:

/simple_service/{key}:
  get:
    on_setup:
      - uri: /{domain}/sys/key_value/example_service
    on_request:
      - try_return_on_success!:
          headers: '{headers}'
          uri: /{domain}/sys/key_value/example_service/{key}
      - new_content:
          uri: http://service/entrypoint/{key}
      - save_new_content:
          method: put
          headers: '{$.new_content.headers}'
          uri: /{domain}/sys/key_value/example_service/{key}
          body: '{$.new_content.body}'
      - return!: '{$.new_content}'

Notes:

  • A handler is made up of an array defining several steps, which are executed sequentially.
  • Each step can contain multiple requests, which are executed in parallel. The failure of any named request aborts the entire execution, and returns the failure.
  • Each request is named, which a) also defines the variable under which the response will be available for the remainder of the handler execution, and b) provides a quick description of what each request is doing.
  • Control structures (currently return! and try_return_on_success!) end on ! to avoid naming conflicts. While this might not be strictly needed, having a syntactic way to identify control structures can make this easier to read, and helps them stand out as branches in the request flow.
  • The hash is returned in a header by a new post_data module mounted at /{domain}/sys/post_data/, a thin wrapper over a key_value bucket to add the hashing. This makes the origin of the hash value less magic.
  • The cache control feature is pushed down into storage modules, which return 404 if no-cache is specified. This avoids the need for full-blown conditionals on headers. To disable no-cache refresh support, the header can simply be omitted from the storage request.
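
To illustrate the last point, the storage-side no-cache handling could amount to something like this (a sketch only; the bucket here is an in-memory stand-in for the real key_value module):

// Treat an explicit no-cache as a miss, so the handler falls through to
// re-rendering; omitting the cache-control header from the storage request
// disables this behaviour.
function getFromBucket(bucket, key, headers = {}) {
    if (/no-cache/i.test(headers['cache-control'] || '')) {
        return { status: 404 };
    }
    return bucket.has(key)
        ? { status: 200, body: bucket.get(key) }
        : { status: 404 };
}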

We could further expand this with catch! or other try_*! control structures later. The sequential & flat structure might make the request flow easier to understand and more flexible without adding (much) verbosity. We could even consider supporting this kind of handler definition in regular end point stanzas, thus removing the need to define simple_service entry points under /sys/.

What do you think?

To clarify, a multi-request step would look like this:

on_request:
  - do_one_thing:
      uri: http://bla
    do_another_thing:
      uri: http://another
  - next_request:
      uri: /blub/
      body:
        # both responses are available
        one: '{$.do_one_thing.body}'
        another: '{$.do_another_thing.body}'


This looks great and is exactly what we need: sequential requests with fail-over. The modularity of this approach also allows us to easily implement totally custom modules underneath it.

One point on:

  • Control structures (currently return! and try_return_on_2xx!) end on ! to avoid naming conflicts. While this might not be strictly needed, having a syntactic way to identify control structures can make this easier to read, and helps them stand out as branches in the request flow.

I like the ! approach, but try_return_on_2xx! implies special parsing and casing, while, at least in the given examples, return! seems superfluous because it does not manipulate the last response, nor is there another request in the sequence chain to execute after it. I'm proposing to go with the following general rule for names ending with !: if a request name ends in !, then the sequential chain should not be continued if a positive/given return code is encountered. Users can supply their own codes which are considered good in the request object, with 200 being the default. Also, the xxx format should be allowed to denote a group of acceptable codes, i.e. codes for which the request chain is not continued. If the whole chain has been consumed, then the last response is returned to the caller. So, the example could be:

/{module:postservice}:
  x-modules:
    - name: simple_service
      type: file
      options:
        paths:
          /posttest/{hash}/png:
            get:
              on_setup:
                - setup_png_storage:
                    method: put 
                    uri: /{domain}/sys/key_value/postservice.png

              on_request:
                - try_storage!:
                    return_codes: ['2xx', 500]
                    method: get # can be omitted, get is default
                    # Support cache-control refreshes;
                    # key_value bucket returns 404 when receiving no-cache header
                    headers:
                      cache-control: '{cache-control}'
                    uri: /{domain}/sys/key_value/postservice.png/{hash}
                - get_post:
                    # If png not found, load original post request
                    uri: /{domain}/sys/post_data/postservice/{hash}
                - new_png:
                    method: post
                    uri: http://some.post.service/png
                    body: '{$.get_post.body}'
                - save_new_png:
                    method: put
                    uri: /{domain}/sys/key_value/postservice.png
                    headers: '{$.new_png.headers}'
                    body: '{$.new_png.body}'

Here is another option:

/{module:postservice}:
  x-modules:
    - name: simple_service
      type: file
      options:
        paths:
          /posttest/{hash}/png:
            get:
              on_setup:
                - setup_png_storage:
                    method: put 
                    uri: /{domain}/sys/key_value/postservice.png

              on_request:
                - return_response_if_success!:
                    method: get # can be omitted, get is default
                    # Support cache-control refreshes;
                    # key_value bucket returns 404 when receiving no-cache header
                    headers:
                      cache-control: '{cache-control}'
                    uri: /{domain}/sys/key_value/postservice.png/{hash}
                - get_post:
                    # If png not found, load original post request
                    uri: /{domain}/sys/post_data/postservice/{hash}
                - new_png:
                    method: post
                    uri: http://some.post.service/png
                    body: '{$.get_post.body}'
                - save_new_png:
                    method: put
                    uri: /{domain}/sys/key_value/postservice.png
                    headers: '{$.new_png.headers}'
                    body: '{$.new_png.body}'
                - return!: '{$.new_png}'

          /posttest/:
            post:
              on_setup:
                - setup_post_data_bucket:
                    method: put
                    uri: /{domain}/sys/post_data/postservice

              on_request:
                - return_response!:
                    method: post
                    uri: /{domain}/sys/post_data/postservice/
                    body: '{$.request.body}'
                    # The post_data module returns a 
                    # `Content-Location: <hash>` header, which we forward.

Control structures:

  • return!: response templating
  • return_response!: directly return the response of a single request
  • return_response_if_success!: directly return the response of a single request, if successful

Here is another option:

/{module:postservice}:
  x-modules:
    - name: simple_service
      type: file
      options:
        paths:
          /posttest/{hash}/png:
            get:
              on_setup:
                - setup_png_storage:
                    method: put 
                    uri: /{domain}/sys/key_value/postservice.png

              on_request:
                - return_response_if_success!:
                    method: get # can be omitted, get is default
                    # Support cache-control refreshes;
                    # key_value bucket returns 404 when receiving no-cache header
                    headers:
                      cache-control: '{cache-control}'
                    uri: /{domain}/sys/key_value/postservice.png/{hash}
                - get_post:
                    # If png not found, load original post request
                    uri: /{domain}/sys/post_data/postservice/{hash}
                - new_png:
                    method: post
                    uri: http://some.post.service/png
                    body: '{$.get_post.body}'
                - save_new_png:
                    method: put
                    uri: /{domain}/sys/key_value/postservice.png
                    headers: '{$.new_png.headers}'
                    body: '{$.new_png.body}'
                - return!: '{$.new_png}'

          /posttest/:
            post:
              on_setup:
                - setup_post_data_bucket:
                    method: put
                    uri: /{domain}/sys/post_data/postservice

              on_request:
                - return_response!:
                    method: post
                    uri: /{domain}/sys/post_data/postservice/
                    body: '{$.request.body}'
                    # The post_data module returns a 
                    # `Content-Location: <hash>` header, which we forward.

+1

I'm still not in favour of special names; I think adding ! to the name and/or a response: stanza to signal response templating would cover our needs. Apart from that, I'm ok with the last suggestion, except I'd generalise the criterion for breaking the request chain and would add return_codes: or return_on: stanzas to !-requests so that we can specify the response codes explicitly.

A disadvantage I see with using special attributes in the body instead of top-level control structures is that it makes the control flow less clear. It's easier to overlook such a statement among the other template properties. Marking them with ! helps a bit, but I'm not sure that it's enough.

For arbitrary conditional returns, we could also add something like this later:

- return_response_if!:
    request:
      uri: http://example/
    condition:
      status: 200

This avoids mixing request / response templates with other attributes, which I could see getting confusing.

@mobrovac, do you see a situation where we need a name for a return_response_if_success! request?

Edit: Another point is that top-level control structures ultimately give us a bit more flexibility. For example, we could later add a catch_return! stanza that can be used to format error messages, somewhat similar to JS Promise.catch().

on_request:
  - do_this:
      uri: ...
  - do_that:
       uri: ...
  - catch_return!:
       status: 500
       body:
         type: this_that
         details: Something went wrong when processing this or that
         this_response: '{$.do_this}'
         that_response: '{$.do_that}'

A disadvantage I see with using special attributes in the body instead of top-level control structures is that it makes the control flow less clear. It's easier to overlook such a statement among the other template properties. Marking them with ! helps a bit, but I'm not sure that it's enough.

I'd argue it's a matter of perspective. Special naming means users must be familiar with these magical names in order to know they are different, whereas when encountering special stanzas they can know right away that there is something special there.

I fear that special naming may induce confusion. But, one could make the argument that in that case documentation is everything (which, note, is not as needed for special stanzas). I guess we could try and keep it updated if you feel strongly about special naming.

@mobrovac, do you see a situation where we need a name for a return_response_if_success! request?

I'm confused. Isn't return_response_if_success! its name?

Edit: Another point is that top-level control structures ultimately give us a bit more flexibility. For example, we could later add a catch_return! stanza that can be used to format error messages, somewhat similar to JS Promise.catch().

Right. The same can be achieved with a response stanza signifying we want to return something special.

Special naming means users must be familiar with these magical names in order to know they are different

The rule would be that anything ending in ! is a control structure, and not a name. It does not register the response under the control structure's name.

The rule would be that anything ending in ! is a control structure, and not a name. It does not register the response under the control structure's name.

Right, !-suffixed names are special, but I understood your proposal as return_if_success! != catch_return! based on the name, not the stanzas it contains. If not, then we are talking about the same thing.

@mobrovac: Yes, different control structures implement different semantics.

I think it could be helpful to look at an example of what you have in mind.

/{module:postservice}:
  x-modules:
    - name: simple_service
      type: file
      options:
        paths:
          /posttest/{hash}/png:
            get:
              on_setup:
                - setup_png_storage:
                    method: put
                    uri: /{domain}/sys/key_value/postservice.png

              on_request:
                - get_from_storage!:
                    return_on: ['2xx', '5xx']  # breaks the chain if any of the status codes is encountered
                    method: get # can be omitted, get is default
                    # Support cache-control refreshes;
                    # key_value bucket returns 404 when receiving no-cache header
                    headers:
                      cache-control: '{cache-control}'
                    uri: /{domain}/sys/key_value/postservice.png/{hash}
                - get_post:
                    # If png not found, load original post request
                    uri: /{domain}/sys/post_data/postservice/{hash}
                - new_png:
                    method: post
                    uri: http://some.post.service/png
                    body: '{$.get_post.body}'
                - save_new_png:
                    method: put
                    uri: /{domain}/sys/key_value/postservice.png
                    headers: '{$.new_png.headers}'
                    body: '{$.new_png.body}'
                - return!:
                    response: '{$.new_png}'

          /posttest/:
            post:
              on_setup:
                - setup_post_data_bucket:
                    method: put
                    uri: /{domain}/sys/post_data/postservice

              on_request:
                - do_post:
                    method: post
                    uri: /{domain}/sys/post_data/postservice/
                    body: '{$.request.body}'

Notes:

  • a !-named block can be the only property of the array element
  • a !-named block can contain two special stanzas (either or both):
    • return_on - an array of status codes for which, if encountered, the sequential chain is broken and the response returned
    • response - a response template object signifying a custom returned response
  • the response of the last request in the chain is returned, entailing we don't need a special return! block if we don't need to change the response (because of this, on_request of /posttest/ does not need a !-block)

If there is a !-named block, but no special stanza is present, the defaults are:

my_block!:
  return_on: [200]
  response: '{$.response}'

Simple service illustration:

/simple_service/{key}:
  get:
    on_setup:
      - uri: /{domain}/sys/key_value/example_service
    on_request:
      - try_return_on_success!:
          headers: '{headers}'
          uri: /{domain}/sys/key_value/example_service/{key}
      - new_content:
          uri: http://service/entrypoint/{key}
      - save_new_content:
          method: put
          headers: '{$.new_content.headers}'
          uri: /{domain}/sys/key_value/example_service/{key}
          body: '{$.new_content.body}'
      - return!:
          response: '{$.new_content}'

Important note: I left most of the names the same as in your example(s), but they are not fixed; no importance is given to the name of a !-block.

@mobrovac, thanks for creating the example. That helps!

@Eevans, @Pchelolo: what are your preferences?

I generally agree that it's nice to be able to specify the return_on and response templates. However, I don't like the mixture of special reserved keywords and request parts. What about adding a new reserved word request, so

on_request:
  - get_from_storage!:
      return_on: ['2xx', '5xx']  # breaks the chain if any of the status codes is encountered
      request:
        method: get # can be omitted, get is default                  
        headers:
          cache-control: '{cache-control}'
        uri: /{domain}/sys/key_value/postservice.png/{hash}

So, if the request name ends with !, it's a control structure with a special structure that can contain return_on, response and request fields. If the request name doesn't end with !, it contains the request definition without special control structures. Also, responses of requests named with ! are not saved to the context.

@Pchelolo's option looks cleaner to me. The downside is more verbosity, with an extra level of indentation and an extra line per request.

I'm still not convinced that return- and catch-like control flow is clear enough in this scheme. Most of the options inside of a request stanza only pertain to the particular request, while some (return, catch) change the control flow of the entire request processing chain. I think there are advantages to highlighting the major control flow points at the top level, so that reading the outline already gives you an idea of the control flow and possible return points.

We discussed this in a meeting and came to an agreement.

Model

There are two high-level blocks: on_setup and on_request. The former is run during RESTBase start-up, while the latter is executed on every incoming request and can be made up of multiple requests (further referred to as request blocks).

Request Block

Each request block can be a control one; no special naming (including the ! suffix) is used, as everything can be determined by the block's properties. A block's name is remembered and put in the current scope, so that it can be referenced in a later block; its reference is bound to the response returned for it. Therefore, request block names must be unique inside the same on_setup or on_request fragment.

A block is allowed to have the following properties:

  • request: a request template of the request to issue in the current block
  • return: a response template documenting the response to return
  • return_if: a conditional stanza specifying the condition to be met for stopping the execution of the sequential chain
  • catch: a conditional stanza specifying which error conditions should be ignored

Request blocks in the on_setup fragment are allowed to have only the request stanza.

Execution Flow

It is mandatory that a block has a request stanza, a return stanza, or both. When a return stanza is specified without a matching return_if conditional, the sequence chain is stopped unconditionally and the response is returned. In the presence of a return_if stanza, its conditional is evaluated and, if satisfied, the chain is broken; otherwise, the execution flow switches to the next block.

Errors are defined as responses whose status code is higher than 399. When an error is encountered, the chain is broken, unless there is a catch stanza listing that status code, in which case the execution continues with the next block.
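
For intuition, the execution flow could be interpreted roughly as follows (a sketch only: parallel blocks and response templating are omitted, and issueRequest stands in for the actual request dispatch):

// Evaluate a status condition such as { status: [404, '5xx'] } against a code.
function statusMatches(condition, status) {
    if (!condition) { return false; }
    const codes = Array.isArray(condition.status) ? condition.status : [condition.status];
    return codes.some((c) =>
        typeof c === 'string' && /xx$/i.test(c)
            ? String(status)[0] === String(c)[0]
            : Number(c) === status);
}

// blocks: array of [name, definition] pairs, in chain order.
async function runChain(blocks, context, issueRequest) {
    let lastResponse;
    for (const [name, block] of blocks) {
        if (block.request) {
            lastResponse = await issueRequest(block.request, context);
            context[name] = lastResponse; // referenceable by later blocks
            if (lastResponse.status > 399 && !statusMatches(block.catch, lastResponse.status)) {
                return lastResponse;      // uncaught error breaks the chain
            }
            if (statusMatches(block.return_if, lastResponse.status)) {
                return block.return || lastResponse; // conditional return
            }
        }
        if (block.return && !block.return_if) {
            return block.return;          // unconditional return (templating elided)
        }
    }
    return lastResponse;                  // end of chain: last response
}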

Sequential and Parallel Blocks

The on_request fragment is an array of objects each of which is a named request block. All of the array elements are executed sequentially one by one until either the chain is broken (because of a return, return_if or an error) or the end of the array has been reached:

on_request:
  - req_block1: ...
  - req_block2: ...
  - ...
  - req_blockN: ...

Should an array element contain more than one request block, they are executed in parallel:

on_request:
  - req_block1: ...
  - req_block2a: ...
    req_block2b: ...
  - ...
  - req_blockN: ...

Parallel request blocks can have neither the return nor the return_if stanza, and cannot be the last element of the chain array.

Outline

Here's the complete outline of the possible blocks/stanzas:

on_setup:
  - name1:
      uri: uri1
      method: m1
      headers:
        h1: v1
      body: body1
on_request:
  - name1:
      request:
        method: m2
        uri: uri2
        headers:
          h2: v2
        body: body2
      catch:
        status: [404, '5xx']
    name2:
      request:
        method: m3
        uri: uri3
        headers:
          h3: v3
        body: body3
  - name3:
      request:
        method: m4
        uri: uri4
        headers:
          h4: v4
        body: body4
      return_if:
        status: ['2xx', 404]
  - name4:
      return: '{$.name2}'

POST Service Definition

Using this specification, we can define the POST service configuration as follows:

/{module:postservice}:
  x-modules:
    - name: simple_service
      type: file
      options:
        paths:
          /posttest/{hash}/png:
            get:
              on_setup:
                - setup_png_storage:
                    method: put 
                    uri: /{domain}/sys/key_value/postservice.png
              on_request:
                - get_from_storage:
                    request:
                      method: get
                      headers:
                        cache-control: '{cache-control}'
                      uri: /{domain}/sys/key_value/postservice.png/{hash}
                    return_if:
                      status: 200
                    catch:
                      status: 404
                - get_post:
                    request:
                      uri: /{domain}/sys/post_data/postservice/{hash}
                - new_png:
                    request:
                      method: post
                      uri: http://some.post.service/png
                      body: '{$.get_post.body}'
                - save_new_png:
                    request:
                      method: put
                      uri: /{domain}/sys/key_value/postservice.png
                      headers: '{$.new_png.headers}'
                      body: '{$.new_png.body}'
                - return_png:
                    return: '{$.new_png}'

          /posttest/:
            post:
              on_setup:
                - setup_post_data_bucket:
                    method: put
                    uri: /{domain}/sys/post_data/postservice
              on_request:
                - do_post:
                    request:
                      method: post
                      uri: /{domain}/sys/post_data/postservice/
                      body: '{$.request.body}'

I think the last posttest stanza is missing a return:

on_request:
  - do_post:
      request:
        method: post
        uri: /{domain}/sys/post_data/postservice/
        body: '{$.request.body}'
      return: '{$.do_post}'

I think the last posttest stanza is missing a return:

No need for it. When there is no explicit return stanza, the last response of the last request in the chain is to be returned. As there is only one here, that one will be returned.

return: '{$.on_request}'

s/on_request/do_post/

No need for it. When there is no explicit return stanza, the last response of the last request in the chain is to be returned. As there is only one here, that one will be returned.

Alright ;)

The PR is merged, so this can now be used for development; this ticket can be closed.

Reopening the ticket as we have found out that we completely neglected the hashing part of the equation. We have iterated over IRC, and there seem to be two problems with it.

Eager or Lazy Hashing

The idea is that RESTBase, as the authoritative entity for storage, hashes the POST requests and returns the hash in the Content-Location header, so that clients may reuse it on subsequent requests without needing to send the whole payload. Now, should this be done automatically for each request or not? Since we cannot be sure each request would actually use the hash, it seems like an unnecessary overhead to do so. Instead, calculating the hash only if {$.req.hash} is specified makes more sense. Alternatively, a special config stanza could be introduced that specifies whether the hash should be computed at all.
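
One way the lazy variant could look (an assumption for illustration, not the merged implementation) is to expose the hash as a getter, so the SHA-1 is only computed when a template actually dereferences {$.req.hash}:

const crypto = require('crypto');

function withLazyHash(req) {
    let cached;
    Object.defineProperty(req, 'hash', {
        get() {
            // Computed on first access only, then memoized.
            if (cached === undefined) {
                cached = crypto.createHash('sha1')
                    .update(JSON.stringify(req.body))
                    .digest('hex');
            }
            return cached;
        }
    });
    return req;
}

const req = withLazyHash({ method: 'post', body: { tex: 'E = mc^2' } });
// No hashing has happened yet; it occurs on first access:
console.log(req.hash);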

Hash Context

A more fundamental problem is which part of the request to hash. The obvious answer, the complete request object, might not be what we want. In the case of subsequent GET requests specifying the hash, it doesn't really matter as long as that hash (and the complete request) can be found in storage. However, what about request de-duplication? If the whole request is hashed, no two requests will ever have the same hash, if nothing else then because their X-Request-Id headers will always differ. The options seem to be:

  • Hash only the body. This approach assumes headers in POST requests have much less weight (which is mostly true, as everything relevant to the request tends to be specified in the body).
  • Hash the complete request, but remove useless headers, like X-Request-Id. While this approach seems more correct, it opens up new questions, such as how to define the set of useful and/or useless headers. Moreover, these sets might not be absolute and correct for all requests.
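
A sketch of the second option, hashing the whole request minus a to-be-agreed exclusion list of headers (note that key ordering in the serialization would also need to be normalized for de-duplication to work):

const crypto = require('crypto');

// Headers excluded from the hash because they vary per request; the exact
// list is precisely the open question raised above.
const EXCLUDED_HEADERS = ['x-request-id'];

function requestHash(req) {
    const headers = Object.assign({}, req.headers);
    EXCLUDED_HEADERS.forEach((name) => delete headers[name]);
    const normalized = {
        method: req.method,
        uri: req.uri,
        headers,
        body: req.body
    };
    return crypto.createHash('sha1')
        .update(JSON.stringify(normalized))
        .digest('hex');
}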

Here's the PR with a basic implementation that hashes the whole request except for the x-request-id header.

And here I've tried to prototype the graphoid spec with POST storage, which uncovered one more issue: we can't substitute a hash into the URI of a request to the key_value post store, because we don't support templates like this in the URI. As an optimisation, we only template params that were present in the original URI params. It's an important optimisation, so I'm not completely sure what should be done here.

The post_data module PR has been merged and deployed, so this task can finally be closed.

Change 225054 abandoned by Physikerwelt:
WIP: Restbase support for Math extension

https://gerrit.wikimedia.org/r/225054