Page MenuHomePhabricator

HTTP response headers compatibility for "Proton + RESTBase" vs "Proton directly"
Closed, ResolvedPublic

Description

This task is to check for compatibility of "Proton + RESTBase" vs "Proton direct" to make sure response headers, body etc received are the same. If not, ensure Proton's responses are compatible with what RESTBase was sending back to the client.

Event Timeline

NOTE: Below is an example of sample requests and their corresponding responses:
RESTBase + ProtonProton directly
cURL request: curl -X 'GET' http://mediawiki.development.instance:7231/en.wikipedia.beta.wmflabs.org/v1/page/pdf/Cat' -H 'accept: application/pdf'cURL request: curl -X 'GET' 'http://mediawiki.development.instance:3030/en.wikipedia.beta.wmflabs.org/v1/pdf/Cat/a4/desktop' -H 'accept: application/pdf' --output ~/Desktop/Cat.pdf -v
HTTP request URL: http://mediawiki.development.instance:7231/en.wikipedia.beta.wmflabs.org/v1/page/pdf/CatHTTP request URL: http://mediawiki.development.instance:3030/en.wikipedia.beta.wmflabs.org/v1/pdf/Dog/a4/desktop
Response code: 200Response code: 200
Response body: PDF fileResponse body: PDF file

Response headers (Proton + RESTBase):

< HTTP/1.1 200
< content-disposition: attachment; filename="Dog.pdf"; filename*=UTF-8''Dog.pdf
< content-type: application/pdf
< content-length: 820369
< cache-control: s-maxage=600, max-age=600
< etag: "550357/d3986c00-ad3f-11ed-a6ce-33b2da6a6557"
< content-location: https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/pdf/Dog
< access-control-allow-origin: *
< access-control-allow-methods: GET,HEAD
< access-control-allow-headers: accept, content-type, content-length, cache-control, accept-language, api-user-agent, if-match, if-modified-since, if-none-match, dnt, accept-encoding
< access-control-expose-headers: etag
< x-content-type-options: nosniff
< x-frame-options: SAMEORIGIN
< referrer-policy: origin-when-cross-origin
< x-xss-protection: 1; mode=block
< content-security-policy: default-src 'none'; frame-ancestors 'none'
< x-content-security-policy: default-src 'none'; frame-ancestors 'none'
< x-webkit-csp: default-src 'none'; frame-ancestors 'none'
< x-request-id: d1f586a0-addd-11ed-8ed2-5b91053923e1
< server: wikimedia-foundation.local
< Date: Thu, 16 Feb 2023 09:39:41 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5

Response headers (Proton directly):

< HTTP/1.1 200 OK
< access-control-allow-origin: *
< access-control-allow-headers: accept, x-requested-with, content-type
< access-control-expose-headers: etag
< x-xss-protection: 1; mode=block
< x-content-type-options: nosniff
< x-frame-options: SAMEORIGIN
< content-security-policy: default-src 'self'; object-src 'none'; media-src *; img-src *; style-src *; frame-ancestors 'self'
< x-content-security-policy: default-src 'self'; object-src 'none'; media-src *; img-src *; style-src *; frame-ancestors 'self'
< x-webkit-csp: default-src 'self'; object-src 'none'; media-src *; img-src *; style-src *; frame-ancestors 'self'
< Content-Type: application/pdf
< Content-Disposition: download; filename="Cat.pdf"; filename*=UTF-8''Cat.pdf
< Date: Thu, 16 Feb 2023 08:05:42 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
< Transfer-Encoding: chunked

@pmiazga, the response gotten from hitting proton directly needs to be compatible with RESTBase responses. I see Proton has less stuff but some of them are already exactly the same like RESTBase responses but it lacks more information. I'll have to make sure we add more things to Proton's response headers.

What do you think?

The difference in the response headers that need to be added or modified for direct Proton to be compatible with how RESTBase + Proton works is:

To add:

< content-length: [content length of PDF]
< cache-control: s-maxage=600, max-age=600
< etag: [set response etag]
< content-location: https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/pdf/[Article-Handle]
< access-control-allow-methods: [Methods]
< x-request-id: [Request Id]
< referrer-policy: origin-when-cross-origin
< server: [server that processed the request]

To modify:

***< content-disposition: attachment; filename="Dog.pdf"; filename*=UTF-8''Dog.pdf
***< content-security-policy: default-src 'none'; frame-ancestors 'none'
***< x-content-security-policy: default-src 'none'; frame-ancestors 'none'
***< x-webkit-csp: default-src 'none'; frame-ancestors 'none'
***< access-control-allow-headers: accept, content-type, content-length, cache-control, accept-language, api-user-agent, if-match, if-modified-since, if-none-match, dnt, accept-encoding
NOTE: the ones with *** is already available in direct proton but needs to be modified to match what RESTBase emits.

Change 889765 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/services/chromium-render@master] [WIP] Make Proton's response headers RESTBase compat

https://gerrit.wikimedia.org/r/889765

Regarding headers:

  • All the filename related headers are very specific to be implemented in the API gateway so they should be re-implemented in the node service
  • Security headers are generic enough to be added in the API gateway. There is a relevant ticket for that.
  • Cache control is also app specific for proton

Something worth discussing is etag which we don't have decided on a way to implement without RESTBase in the req/res path.

DAlangi_WMF changed the task status from Open to In Progress.Mar 2 2023, 8:02 PM
DAlangi_WMF renamed this task from HTTP Response compatibility with "Proton + RESTBase" vs "Proton directly" to HTTP response headers compatibility for "Proton + RESTBase" vs "Proton directly".Mar 6 2023, 5:15 PM

The following response headers are still in TODO phase: etag, content-location and server.

I'm thinking about the etag one. This is something RESTBase emits but I'm wondering if it's useful in the context of PDF rendering? What could it be used for at the client side?

Cc @MSantos, @pmiazga, @daniel

The following response headers are still in TODO phase: etag, content-location and server.

I'm thinking about the etag one. This is something RESTBase emits but I'm wondering if it's useful in the context of PDF rendering? What could it be used for at the client side?

Cc @MSantos, @pmiazga, @daniel

ETags would be for caching in the Varnish layer. But Last-Modified is sufficient. If we want ETags, they could just be random. That would be sufficient for use in Varnish. How does RESTbase generate them?

I had a detailed conversation with Piotr yesterday about these 3 and it turns out we can ignore them. Below is our reasoning:

  1. etag: Since we don't cache or persist PDFs for long periods of time, we don't need an etag because the purpose of etag is for persistent caching. We are already setting a cache-control of 5 minutes which is already sufficient. PDFs are not cached on purpose since there is no purging for this service and content is mostly likely to change so caching a PDF we won't use is not ideal. On this note, we can ignore this response header and not send it because nothing should really be using it.
  1. content-location: The content location is the HTTP URL and it's already what the user provides to print PDF (including PDF format, device etc). This is already known before making the request so it can be ignored as well from the response header.
  1. server: Piotr was pretty serious about this one, that we shouldn't expose internal information about where the HTTP request gets handled within our WMF network. Sending this kind of information back to the client can expose details about our internal network for an attack. That is, if we send back an IP address for example, someone may study our subnet and use proton to gain access to our internal network resources and snapshot it using the PDF renderer. On this note, we should not send this back to the client.

@pmiazga, let me know if I captured everything here and @daniel @Jgiannelos, do you have any thoughts?

To follow-up on headers:

  1. etag @Jgiannelos could you make sure that we have caching for PDFs and there is no special handling in Varnish when etag header is not provided? If yes, we should be able to skip it.

The only thing I found is

headers: {
    'Content-type': 'text/html',
    // ETag used as deterministic uuid by server
    'If-Match': '<uuid>'
},

in this doc https://github.com/wikimedia/restbase/blob/master/doc/Transactions.md. So I kinda assume it may be used as a key in cassandra store if items are cached.

  1. content-location -> this one is usually around due to content negotiation. If Client asks for specific type of content, for example text/json, the content-location header can point client directly for specific content without having client renegotating content each time. Proton handles only application/pdf, therefore there is no requirement to handle this header unless it is required to any reasons.
  2. server - this one is tricky, as Server header supposed to be used to describe the software used by the origin server. But looks like we were sending the internal hostname, which IMHO is not the best approach. It could be used for some tracing, or some kind of internal activity. Most of the software sends Server header like "AmazonS3" or "Apache". IMHO we could skip it unless it has some special meaning to our system.

cc @Joe and @akosiaris - if you could take a look on @DAlangi_WMF and my comments - it would be awesome. Thank you

For varnish config I would defer to @Joe or @akosiaris if we have any special treatment for etag headers.
From a quick look at changeprop we don't send any purge events for /page/pdf so I assume that cache-control will be good enough.

I know @Joe is going to be out over the next few weeks. I'm uncertain how spun up on this @akosiaris is.

Are you all just looking for feedback on your current thinking for these headers? Or is the specific question whether or not you are able to omit the etag header for this service?

For varnish config I would defer to @Joe or @akosiaris if we have any special treatment for etag headers.
From a quick look at changeprop we don't send any purge events for /page/pdf so I assume that cache-control will be good enough.

I agree with this Yiannis.

Also, since eTags are used for caching and we don't have any aggressive caching for proton, technically speaking, we don't need the etag response header. We discussed about this (see above comments).

I know @Joe is going to be out over the next few weeks. I'm uncertain how spun up on this @akosiaris is.

Are you all just looking for feedback on your current thinking for these headers? Or is the specific question whether or not you are able to omit the etag header for this service?

@thcipriani, the conclusion was leaning towards us discarding the etag response header entirely because we don't cache PDFs aggressively (but however we do have a cache-control header for varnish).

So the concern here was whether varnish needs to etag response header for something internally which I'm not sure. Yiannis confirmed that changeprop doesn't send purge events for the /page/pdf endpoint so I think we're good here.

We just need approval of the patch from content transform so this can go in.

Change 907892 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/services/chromium-render@master] Enable feature parity for RESTBase/Proton content-disposition

https://gerrit.wikimedia.org/r/907892

Change 889765 merged by jenkins-bot:

[mediawiki/services/chromium-render@master] Make Proton's response headers RESTBase compat

https://gerrit.wikimedia.org/r/889765

Change 907892 merged by jenkins-bot:

[mediawiki/services/chromium-render@master] Enable feature parity for RESTBase/Proton content-disposition

https://gerrit.wikimedia.org/r/907892

Thanks @pmiazga for reviewing and +2'ing. We can call this resolved. Will create a task for deploying next week.

NOTE: The outputs of PDFs when gotten from proton directly and when gotten from Proton via RESTBase is a little different in that the one with direct proton will add to the PDF the source article it's redirecting from in the PDF itself. See the two PDFs for the scenarios below

Keeping this here for documentation purposes.