re-generation
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• GWicke
	Oct 22 2015, 10:06 PM

Description

We are generally using mime types (content-type header) to version content types. This is very important especially for stored content:

The format of stored HTML / JSON might change, requiring a format migration.
Security issues might have left dangerous content in HTML we store. We'll need to re-run the serializer (via T114413) and migrate to the latest HTML mime type.

There are special considerations when comparing mime types, which are discussed in https://github.com/wikimedia/restbase/pull/312#discussion_r39576615. In short: We'll have to normalize the content-type before comparing. The content-type module might come in handy for this.

If the comparison fails, there are two main ways to handle this:

Repeat the request with cache-control: no-cache: This is a fairly universal mechanism that should work out of the box, for example for Parsoid. The downside is that this might be a very expensive way to upgrade content, and it might not actually result in the same content, for example when templates are re-expanded.
Call a specific migration service, like the html2html end point described for Parsoid in T114413. Then, save the response back to the database.

Related Objects
Search...

Status	Assigned	Task
Resolved	• GWicke	T98540 Dummy version number bump for Parsoid html to nail down process for making changes to HTML format
Duplicate	• Pchelolo	T104462 Bump content type version in RESTBase
Resolved	• Pchelolo	T116333 Check stored / returned mime type vs. swagger spec; support content migration / re-generation
Open	None	T114413 Support various conversions in Parsoid's pb2pb endpoint

Event Timeline

• GWicke created this task.Oct 22 2015, 10:06 PM

• GWicke raised the priority of this task from to Needs Triage.

• GWicke updated the task description. (Show Details)

• GWicke subscribed.

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 22 2015, 10:06 PM

• GWicke triaged this task as Medium priority.Oct 22 2015, 10:07 PM

• GWicke added a project: RESTBase.

• GWicke set Security to None.

• GWicke added a parent task: T98540: Dummy version number bump for Parsoid html to nail down process for making changes to HTML format.

• GWicke added a parent task: T104462: Bump content type version in RESTBase.

• GWicke added a subtask: T114413: Support various conversions in Parsoid's pb2pb endpoint.

• GWicke edited subscribers, added: • Pchelolo; removed: Aklapper.

• GWicke renamed this task from Check stored / returned mime type against the ones specified in the swagger spec, and support upgrades / re-generation to Check stored / returned mime type vs. swagger spec; support content migration / re-generation.Oct 22 2015, 10:09 PM

• GWicke updated the task description. (Show Details)

As a general way to hook this up, I propose to extend the handler templating feature with a request post processor. A new stanza x-response-handler (we could bike-shed on the name) will be introduced along with x-request-handler and x-setup-handler. The syntax would be like this:

x-request-handler:
  - get_from_backend:
      request:
         uri: /bla/bla/bla
x-response-handler:
  - validate_mime_type:
      request:
        uri: /sys/post_process/validate_mime

The semantics of this would be the following: after a main request if done, a POST request is sent to each postprocessor, sequentially. Parallel postprocessors are not allowed. The request body is a JSON with original_request, original_response and endpoint_spec fields. The postprocessor can decide, what to do with the request - it can schedule a new request with no-cache, or do nothing, or modify the response somehow. None of the additional x-request-handler stanza will be supported in x-response-handler stanza.

Although this creates some overhead with posting more requests instead of doing it in place, these requests are internal within RESTBase, so they a pretty fast. Anyway, config-based solution is an only way to do that for purely config-based modules, like mobileapps or graphoid.

Also, the post_processor module would potentially end up as a collection of useful processors, which could be introduced on the endpoint with a couple of config lines. Such processors could be mime-type validation, content sanitisation of some kind, may be more.

@mobrovac @GWicke @Eevans Does this sound sane to you?

• Pchelolo claimed this task.Nov 4 2015, 9:29 AM

I support the idea of a response validator. There should also be a request validator in the same vein. @Pchelolo why a new handler stanza for this? Can't we have that as the last request in the x-request-handler chain? After all, it's a parametrised request.

In T116333#1781534, @mobrovac wrote:

There should also be a request validator in the same vein.

The request validator is hooked up into every public endpoint by default, so I don't think we need a special stanza or mandatory config changes for it.

Why a new handler stanza for this? Can't we have that as the last request in the x-request-handler chain? After all, it's a parametrised request.

Originally I thought about a new stanza for two reasons:

We need to provide a spec to the validator, for example to get an expected mime-type. However, we can inject the spec in the context.
To clarify separation between the actual request handler, and a request post-processor.

However, in the end it's all a question of taste, which config looks better:

x-request-handler:
  - get_from_backend:
      request:
        uri: /bla/bla/bla
  - validate_mime_type:
      request:
        uri: '{domain}/sys/response_processor/mime'
        method: post
        body:
          original_request: '{$.request}'
          original_response: '{$.get_from_backend}'
          endpoint_spec: '{$.spec}'

x-request-handler:
  - get_from_backend:
      request:
         uri: /bla/bla/bla
x-response-handler:
  - validate_mime_type:
      request:
        uri: '{domain}/sys/response_processor/validate_mime'

The first one gives more control over what's sent into a specific validator, but in general, we don't need that control, so this would be copy-pasta.

BTW, I've made a proof-of-concept implementation of the x-response-handler idea, and benchmarked it a bit. Obviously, we get some performance degradation introducing one more request, but it's not too big. On my machine it decreased from ~450 req/s to 430 req/s for the /page/html/{title} endpoint, which is about 5% overhead. Obviously, the data is for a most common case, when regenerating content is not needed.

I was originally thinking that we should also bake the response validation into the regular processing, based on the swagger spec. To me, the main need for a handler is to customize what happens when the default validation fails. We might provide a default implementation that repeats the request with no-cache headers set, but end points like Parsoid HTML might want to do something smarter like upgrading the HTML with an html2html end point.

As for syntax / naming.. perhaps x-invalid-response-handler ?

Hm, I'd like to keep this generalised. Here we are interested in the content-type header, but it's easy to envision that responses might also be validated against the body defined in the swagger spec.

Maybe call it x-response-validator and there we could specify what endpoint should be called in case of a validation failure?

@mobrovac, imho the validation should be mostly driven by the spec, rather than custom code. We already have a means to hook up custom code via modules (and can use this for custom validation), but what we don't have yet is a general way to leverage the spec for response validation.

In T116333#1782100, @GWicke wrote:

@mobrovac, imho the validation should be mostly driven by the spec, rather than custom code. We already have a means to hook up custom code via modules (and can use this for custom validation), but what we don't have yet is a general way to leverage the spec for response validation.

Although I like the idea to have it all controlled by the spec, like "run this request if response does not conform to the spec", I don't think this isn't possible.

The problem is that an action could be different depending on what's invalid. It's easy to imagine, that for content-type mismatch, we'd need to call endpoint1, and for body-schema mismatch - endpoint2, for lack of some header - endpoint3 etc. Controlling this on a spec level is too complicated imho. Another problem, is that running an additional request could be optional, depending on the original request. For example, we need to repeat a request with no-cache only if the original request didn't contain no-cache to avoid infinite recursion.

So, we can do purely spec-based solution, but that would mean a significant cut in generality.

In T116333#1782100, @GWicke wrote:

@mobrovac, imho the validation should be mostly driven by the spec, rather than custom code. We already have a means to hook up custom code via modules (and can use this for custom validation), but what we don't have yet is a general way to leverage the spec for response validation.

Exactly. My point was merely that x-invalid-response-header pertains to only one thing that could be validated using the spec (other include other headers, body format, etc).

In T116333#1782176, @Pchelolo wrote:

Although I like the idea to have it all controlled by the spec, like "run this request if response does not conform to the spec", I don't think this isn't possible.

The problem is that an action could be different depending on what's invalid. It's easy to imagine, that for content-type mismatch, we'd need to call endpoint1, and for body-schema mismatch - endpoint2, for lack of some header - endpoint3 etc. Controlling this on a spec level is too complicated imho. Another problem, is that running an additional request could be optional, depending on the original request. For example, we need to repeat a request with no-cache only if the original request didn't contain no-cache to avoid infinite recursion.

So, we can do purely spec-based solution, but that would mean a significant cut in generality.

At first glance, I'd say no-cache requests should be exempted from it. If we get a bad response from such a request, there really isn't something we can do about it. At that point, we have the choice of either delivering the response as-is, or claim a 500.

But, more generally I agree that there might be a lot of situations where making it spec-driven will simply be too complicated. So how about allowing some simple and most-common use cases in the spec, while for others expect people to write custom modules?

My point was merely that x-invalid-response-header pertains to only one thing that could be validated using the spec (other include other headers, body format, etc).

Oh, I assumed that the handler would be general, for any kind of response validation failure. We might want to support some validation options in the spec, and define an interface for a failed validation to pass information about which part is at fault to the handler.

So how about allowing some simple and most-common use cases in the spec, while for others expect people to write custom modules?

+1. Lets start with something simple & cheap like content-type validation, and then see which other use cases make sense to integrate into the general validation, vs. implement as a module.

In T116333#1782987, @GWicke wrote:

+1. Lets start with something simple & cheap like content-type validation, and then see which other use cases make sense to integrate into the general validation, vs. implement as a module.

I agree.

So, for now we validate only the response mime-type, only if no-cache is not set, and in case of a failure - run a request (possibly the whole request chain), specified in x-invalid-response-handler. When we find more use-cases, we can generalise that.

So, for now we validate only the response mime-type, only if no-cache is not set,

We can still validate & log validation failures from no-cache requests, but lets not call the handler (default or otherwise) if no-cache is already set.

and in case of a failure - run a request (possibly the whole request chain), specified in x-invalid-response-handler. When we find more use-cases, we can generalise that.

• GWicke added a subscriber: ssastry.Nov 4 2015, 6:55 PM

• GWicke added a subscriber: cscott.

Arlolra closed subtask T114413: Support various conversions in Parsoid's pb2pb endpoint as Resolved.Jul 13 2016, 4:49 AM

Arlolra reopened subtask T114413: Support various conversions in Parsoid's pb2pb endpoint as Open.Aug 9 2016, 6:39 PM

• Pchelolo mentioned this in T104462: Bump content type version in RESTBase.Oct 12 2016, 9:45 PM

• Pchelolo merged a task: T104462: Bump content type version in RESTBase.

• Pchelolo added a project: Services (later).

• Pchelolo added subscribers: KartikMistry, • santhosh, Esanders, Jdforrester-WMF.

Liuxinyu970226 subscribed.Dec 28 2016, 7:45 AM

• mobrovac added a project: Platform Team Legacy (Later).Dec 20 2018, 12:07 PM

This has been solved with ensure_content_type filter for quite a long time now. Resolving.