Page MenuHomePhabricator

Parsoid should not set Vary: Content-Type
Closed, ResolvedPublic

Description

Currently Parsoid sets the Vary: Content-Type. That doesn't make sense, we can only Vary on request headers while content-type is a response header and cannot be set on a get request, and POST/PUT requests are not cached anyway. This should be removed from Parsoid

Event Timeline

Pchelolo created this task.Jun 19 2018, 4:09 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 19 2018, 4:09 PM
ssastry assigned this task to cscott.Jun 19 2018, 4:19 PM
ssastry triaged this task as High priority.

You're right, we should probably be using Vary: Accept instead of Vary: Content-Type. There are cases where we send content-type in a POST, but as you say those aren't cached (and we typically embed the header in a JSON blob in that case anyway).

(For the record, Content-Type is an entity-header, so it is equally valid on request or responses, but as you say would typically not be set on a GET.)

Change 441313 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Default Parsoid Vary header should be Accept not Content-Type

https://gerrit.wikimedia.org/r/441313

@cscott when would Vary: accept be useful? Each end point has a distinct content type and clear usage. What would be the benefits of (generally) varying accept ? The less we fragment the cache the better, and things tend to get complicated if/when we introduce variations on the theme like gzip et al. I would be inclined to support only Vary: accept-language since this is the only header that we will realistically vary content for.

Perhaps for content negotiation? Accept HTML vs. JSON vs. N3 vs. … on the same end point? (Though really that'd be a 302 endpoint rather than a 200 endpoint.)

The only thing that pops to mind is varying by content version (Parsoid HTML v1.5 vs v1.6 for example), but for that to be a valid cache fragmention criterion we'd need to be able to convert between various formats.

We already do vary by Parsoid HTML version, and have a pb2pb endpoint to do conversion between various formats. It's not very capable (doesn't support many conversions), but the up/downgrade endpoint is present in the API in case we need to use that in the future. (In the past we used this for the data-parsoid split, and we would probably use it again to split data-mw when that time comes.)

ssastry added a comment.EditedJun 21 2018, 7:51 PM

We already do vary by Parsoid HTML version, and have a pb2pb endpoint to do conversion between various formats. It's not very capable (doesn't support many conversions), but the up/downgrade endpoint is present in the API in case we need to use that in the future. (In the past we used this for the data-parsoid split, and we would probably use it again to split data-mw when that time comes.)

We currently don't vary by the html version ... as far as I know, RB can only store the latest version and could down-convert to an older version via pb2pb. But, right now, content version negotiation isn't fully implemented in RB, so, in practice the version conversion isn't exercised yet. I don't think there is any plan to store / fragment cache by html version. The plan is always to store the current version and for RB to call pb2pb if an older version is requested (which Parsoid can refuse to honour if it is a really old version that we've dropped support for).

So, let us separate this conversation from the language variant accept-language vary strategy. Maybe open a separate ticket and restrict this patch for now to only deal with the accept-language vary.

The pagebundle/to/pagebundle endpoint definitely looks at the Accept header and returns different things depending on what it is set to. If we are going to set a Vary header at all, it should include Accept.

I think this is/ought to be completely orthogonal to the question of what Vary header RESTBase and/or MCS returns. RESTBase can decide itself that it is only going to store one and return one version of the HTML and so it will not report Vary: Accept. But parsoid does Vary based on the Accept header, at least for the pb2pb endpoint.

Right, Parsoid's output does vary by that .. but, I think where we are seeing this differently is that you are saying we should emit that Vary header since it is a property of Parsoid output ... but I am interpreting the Vary header as a directive to RESTBase to vary its storage on that header type, but where it isn't clear that we actually want to do that at all.

It is informing RESTBase that the output depends on those headers. RESTBase is under no obligation to split its cache if it knows that it is holding one or more of those headers constant.

So, I suppose the question is which interpretation makes more sense ... you are saying that we shouldn't be concerned about what caching / storage layers decide to do or not do with it, and so it is upto RB to do what it wishes.

See also https://gerrit.wikimedia.org/r/441320 / T197792: MCS should proxy Vary and Content-Language headers header it gets from Parsoid -- there are multiple downstream consumers, not just the "big RESTBase cache". So IMO Parsoid should report what is correct for Parsoid, and let the downstream consumers filter it as/if they please.

Beside, the original bug wasn't about the presence of Vary, it was that I was using the wrong field name (Content-Type instead of Accept), and that's the narrow issue that https://gerrit.wikimedia.org/r/441313 fixes.

So, I suppose the question is which interpretation makes more sense ... you are saying that we shouldn't be concerned about what caching / storage layers decide to do or not do with it, and so it is upto RB to do what it wishes.

Right, makes sense. I can go with that.

For the record, I'm open to adding a new bug for whether/if to remove Vary: Accept and/or to constrain it to the pb2pb endpoint more narrowly -- or even to adding different Vary options and/or indicating that language conversion is active for a given page using some other non-HTTP-header mechanism. I'd just like to keep this bug/patch focused on the narrow issue of "cscott got confused by the HTTP spec and used the wrong header name" and fix that obvious wrong to cover my manifest embarrassment before we get distracted by the broader philosophical issues. ;)

Change 441313 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Default Parsoid Vary header should be Accept not Content-Type

https://gerrit.wikimedia.org/r/441313

cscott closed this task as Resolved.Jun 22 2018, 10:04 PM
Vvjjkkii renamed this task from Parsoid should not set Vary: Content-Type to lnaaaaaaaa.Jul 1 2018, 1:03 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed cscott as the assignee of this task.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii edited subscribers, added: cscott; removed: gerritbot, Aklapper.
mobrovac renamed this task from lnaaaaaaaa to Parsoid should not set Vary: Content-Type.Jul 1 2018, 11:35 AM
mobrovac closed this task as Resolved.
mobrovac assigned this task to cscott.
mobrovac updated the task description. (Show Details)
mobrovac removed a subscriber: cscott.