Page MenuHomePhabricator

Check stored / returned mime type vs. swagger spec; support content migration / re-generation
Closed, ResolvedPublic

Description

We are generally using mime types (content-type header) to version content types. This is very important especially for stored content:

  • The format of stored HTML / JSON might change, requiring a format migration.
  • Security issues might have left dangerous content in HTML we store. We'll need to re-run the serializer (via T114413) and migrate to the latest HTML mime type.

There are special considerations when comparing mime types, which are discussed in https://github.com/wikimedia/restbase/pull/312#discussion_r39576615. In short: We'll have to normalize the content-type before comparing. The content-type module might come in handy for this.

If the comparison fails, there are two main ways to handle this:

  1. Repeat the request with cache-control: no-cache: This is a fairly universal mechanism that should work out of the box, for example for Parsoid. The downside is that this might be a very expensive way to upgrade content, and it might not actually result in the same content, for example when templates are re-expanded.
  2. Call a specific migration service, like the html2html end point described for Parsoid in T114413. Then, save the response back to the database.

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description. (Show Details)
GWicke subscribed.
GWicke renamed this task from Check stored / returned mime type against the ones specified in the swagger spec, and support upgrades / re-generation to Check stored / returned mime type vs. swagger spec; support content migration / re-generation.Oct 22 2015, 10:09 PM
GWicke updated the task description. (Show Details)

As a general way to hook this up, I propose to extend the handler templating feature with a request post processor. A new stanza x-response-handler (we could bike-shed on the name) will be introduced along with x-request-handler and x-setup-handler. The syntax would be like this:

x-request-handler:
  - get_from_backend:
      request:
         uri: /bla/bla/bla
x-response-handler:
  - validate_mime_type:
      request:
        uri: /sys/post_process/validate_mime

The semantics of this would be the following: after a main request if done, a POST request is sent to each postprocessor, sequentially. Parallel postprocessors are not allowed. The request body is a JSON with original_request, original_response and endpoint_spec fields. The postprocessor can decide, what to do with the request - it can schedule a new request with no-cache, or do nothing, or modify the response somehow. None of the additional x-request-handler stanza will be supported in x-response-handler stanza.

Although this creates some overhead with posting more requests instead of doing it in place, these requests are internal within RESTBase, so they a pretty fast. Anyway, config-based solution is an only way to do that for purely config-based modules, like mobileapps or graphoid.

Also, the post_processor module would potentially end up as a collection of useful processors, which could be introduced on the endpoint with a couple of config lines. Such processors could be mime-type validation, content sanitisation of some kind, may be more.

@mobrovac @GWicke @Eevans Does this sound sane to you?

I support the idea of a response validator. There should also be a request validator in the same vein. @Pchelolo why a new handler stanza for this? Can't we have that as the last request in the x-request-handler chain? After all, it's a parametrised request.

There should also be a request validator in the same vein.

The request validator is hooked up into every public endpoint by default, so I don't think we need a special stanza or mandatory config changes for it.

Why a new handler stanza for this? Can't we have that as the last request in the x-request-handler chain? After all, it's a parametrised request.

Originally I thought about a new stanza for two reasons:

  • We need to provide a spec to the validator, for example to get an expected mime-type. However, we can inject the spec in the context.
  • To clarify separation between the actual request handler, and a request post-processor.

However, in the end it's all a question of taste, which config looks better:

x-request-handler:
  - get_from_backend:
      request:
        uri: /bla/bla/bla
  - validate_mime_type:
      request:
        uri: '{domain}/sys/response_processor/mime'
        method: post
        body:
          original_request: '{$.request}'
          original_response: '{$.get_from_backend}'
          endpoint_spec: '{$.spec}'

or

x-request-handler:
  - get_from_backend:
      request:
         uri: /bla/bla/bla
x-response-handler:
  - validate_mime_type:
      request:
        uri: '{domain}/sys/response_processor/validate_mime'

The first one gives more control over what's sent into a specific validator, but in general, we don't need that control, so this would be copy-pasta.

BTW, I've made a proof-of-concept implementation of the x-response-handler idea, and benchmarked it a bit. Obviously, we get some performance degradation introducing one more request, but it's not too big. On my machine it decreased from ~450 req/s to 430 req/s for the /page/html/{title} endpoint, which is about 5% overhead. Obviously, the data is for a most common case, when regenerating content is not needed.

I was originally thinking that we should also bake the response validation into the regular processing, based on the swagger spec. To me, the main need for a handler is to customize what happens when the default validation fails. We might provide a default implementation that repeats the request with no-cache headers set, but end points like Parsoid HTML might want to do something smarter like upgrading the HTML with an html2html end point.

As for syntax / naming.. perhaps x-invalid-response-handler ?

Hm, I'd like to keep this generalised. Here we are interested in the content-type header, but it's easy to envision that responses might also be validated against the body defined in the swagger spec.

Maybe call it x-response-validator and there we could specify what endpoint should be called in case of a validation failure?

@mobrovac, imho the validation should be mostly driven by the spec, rather than custom code. We already have a means to hook up custom code via modules (and can use this for custom validation), but what we don't have yet is a general way to leverage the spec for response validation.

@mobrovac, imho the validation should be mostly driven by the spec, rather than custom code. We already have a means to hook up custom code via modules (and can use this for custom validation), but what we don't have yet is a general way to leverage the spec for response validation.

Although I like the idea to have it all controlled by the spec, like "run this request if response does not conform to the spec", I don't think this isn't possible.

The problem is that an action could be different depending on what's invalid. It's easy to imagine, that for content-type mismatch, we'd need to call endpoint1, and for body-schema mismatch - endpoint2, for lack of some header - endpoint3 etc. Controlling this on a spec level is too complicated imho. Another problem, is that running an additional request could be optional, depending on the original request. For example, we need to repeat a request with no-cache only if the original request didn't contain no-cache to avoid infinite recursion.

So, we can do purely spec-based solution, but that would mean a significant cut in generality.

@mobrovac, imho the validation should be mostly driven by the spec, rather than custom code. We already have a means to hook up custom code via modules (and can use this for custom validation), but what we don't have yet is a general way to leverage the spec for response validation.

Exactly. My point was merely that x-invalid-response-header pertains to only one thing that could be validated using the spec (other include other headers, body format, etc).

Although I like the idea to have it all controlled by the spec, like "run this request if response does not conform to the spec", I don't think this isn't possible.

The problem is that an action could be different depending on what's invalid. It's easy to imagine, that for content-type mismatch, we'd need to call endpoint1, and for body-schema mismatch - endpoint2, for lack of some header - endpoint3 etc. Controlling this on a spec level is too complicated imho. Another problem, is that running an additional request could be optional, depending on the original request. For example, we need to repeat a request with no-cache only if the original request didn't contain no-cache to avoid infinite recursion.

So, we can do purely spec-based solution, but that would mean a significant cut in generality.

At first glance, I'd say no-cache requests should be exempted from it. If we get a bad response from such a request, there really isn't something we can do about it. At that point, we have the choice of either delivering the response as-is, or claim a 500.

But, more generally I agree that there might be a lot of situations where making it spec-driven will simply be too complicated. So how about allowing some simple and most-common use cases in the spec, while for others expect people to write custom modules?

My point was merely that x-invalid-response-header pertains to only one thing that could be validated using the spec (other include other headers, body format, etc).

Oh, I assumed that the handler would be general, for any kind of response validation failure. We might want to support some validation options in the spec, and define an interface for a failed validation to pass information about which part is at fault to the handler.

So how about allowing some simple and most-common use cases in the spec, while for others expect people to write custom modules?

+1. Lets start with something simple & cheap like content-type validation, and then see which other use cases make sense to integrate into the general validation, vs. implement as a module.

+1. Lets start with something simple & cheap like content-type validation, and then see which other use cases make sense to integrate into the general validation, vs. implement as a module.

I agree.

So, for now we validate only the response mime-type, only if no-cache is not set, and in case of a failure - run a request (possibly the whole request chain), specified in x-invalid-response-handler. When we find more use-cases, we can generalise that.

So, for now we validate only the response mime-type, only if no-cache is not set,

We can still validate & log validation failures from no-cache requests, but lets not call the handler (default or otherwise) if no-cache is already set.

and in case of a failure - run a request (possibly the whole request chain), specified in x-invalid-response-handler. When we find more use-cases, we can generalise that.

+1

This has been solved with ensure_content_type filter for quite a long time now. Resolving.