Page MenuHomePhabricator

Split slash decoding from general percent normalization in Varnish VCL
Closed, ResolvedPublic

Description

Varnish has normalization logic for URLs, which avoids fragmenting the cache on slightly different percent encodings. While mostly standards-conform, this normalization pass also decodes slashes by default, in line with MediaWiki's policy of allowing literal slashes in titles.

For anything but /wiki/{title}, we would like to only apply standards-compliant normalizations, without decoding slashes.

Options for doing this would be:

  • Modify the normalization logic in VCL to (optionally) leave encoded slashes alone, and apply the conservative version to anything but /wiki/.
  • Split slash decoding from the general normalization logic, and only apply it for /wiki/.

Event Timeline

Considering the lack of argument support in VCL subs, the simplest approach to me seems to be:

  1. Rename normalize_path to normalize_mediawiki_path.
  2. Create a variant without slash decoding as normalize_path.
  3. Decide which to call based on !~ "^/api/", or possibly a whitelist match.

@BBlack, does this sound sane to you?

I had to look at the code and the RFC again to refresh my memory (it's been a while). Basically, the normalize_path code isn't standards-compliant at all in the generic sense.

To recap RFC3986, three character classes are defined:

unreserved: [-_~.0-9A-Za-z] (alphanumeric, dash, underscore, tilde, period)
gen-delim: [:/?#@] as well as the square brackets [ and ] themselves
sub-delim: [!$&'()*+,;=]

The two sets of delimiters are both "reserved" as potential delimiters for application-layer schemes. A generic URI-processor which doesn't know application-specific things can't legally decode %-encoded forms of the reserved delimiters, whereas the consuming application can (because it can split the data up on those un-encoded delimiter characters, and then decode %-encoded delimiter characters within each parsed chunk as data). Only the unreserved set can be decoded at any arbitrary time without application-layer knowledge (and those chars never should've been encoded in the first place, but it is possible and legal).

So for the case of a generic VCL meant to handle multiple applications applying URI %-encoding normalization, we can only (and probably should!) decode the unreserved set.

What normalize_path does is decode an arbitrary subset of the reserved set: [;@$!*(),/:]. Every character it decodes is from one of the two reserved lists. This is only "legal" because we have inside information about MediaWiki - we know that it doesn't care about any of those as field delimiters.

So, I get why normalize_path does what it does for MediaWiki specifically, although I'd say for completeness, there's probably an argument that we should go ahead and decode the unreserved set while we're in there, just in case.

What I don't get is why RestBase wants the exact same arbitrary subset of delimiters pre-decoded, except the /. I'd think either RB would want no decoding, or only unreserved-decoding, but not "decode almost the exact same set of delimiters as MW, minus one"...

What I don't get is why RestBase wants the exact same arbitrary subset of delimiters pre-decoded, except the /.

Just as for MediaWiki, the goal is to normalize the percent encoding to ensure consistent purging & avoid cache fragmentation. As far as I am aware, none of the REST APIs we are interested in here use application-specific delimiter schemes within a path component. This is definitely true for RESTBase. This means that different encodings of any of those non-delimiter characters within path components are a source of undesirable variance, which we would like to avoid by normalizing their encoding:

  • Consistently use upper (or lowercase hex) for percent encoded parts.
  • Consistently decode (or encode) reserved characters other than '/'.

Side-tracking on the forward-slash a bit: I kind-of understand why we do it for the MW case and why it happens to break nothing there, but isn't a great idea in general for others. Ignoring the forward-slash problem and moving on:

I guess maybe a better question for me to ask is why MediaWiki choses the exact subset of delimiters to decode that it does. Partially-answering my own question: it's clear that [?&#] aren't in the set because MW still cares about those delimiters for query/frag purposes. That leaves ['+=] as the mystery set of "delimiters which we don't decode/normalize for no obvious (to me!) reason yet".

it's clear that [?&#] aren't in the set because MW still cares about those delimiters for query/frag purposes

Since we are only concerned about the path component of the URL, just [?#] should be sufficient.

That leaves ['+=] as the mystery set of "delimiters which we don't decode/normalize for no obvious (to me!) reason yet".

The reason for + might be to work around a bug in PHP's urlencode. I don't see a reason for the others. = is only significant in query strings, which we don't normalize.

it's clear that [?&#] aren't in the set because MW still cares about those delimiters for query/frag purposes

Since we are only concerned about the path component of the URL, just [?#] should be sufficient.

Well for that matter, since parsing stops on the first unescape ? for the query part, and the fragment comes after that, even '#' should be decodable, right?

That leaves ['+=] as the mystery set of "delimiters which we don't decode/normalize for no obvious (to me!) reason yet".

The reason for + might be to work around a bug in PHP's urlencode. I don't see a reason for the others. = is only significant in query strings, which we don't normalize.

Ok, so even if we allow the + to vary for bug-reasons, in the net it still seems like the set we decode for MediaWiki should be expanded to include ['=#] (all before the query-string starts), but not '/', for RB. I don't know if I want to mess with it for the MW case, but I at least want to understand it. It still seems to me like the only generic decoding we can do for all traffic is the unreserved set (none of which we try to decode now), and then MW and RB get their own custom sets as a per-service thing.

Well for that matter, since parsing stops on the first unescape ? for the query part, and the fragment comes after that, even '#' should be decodable, right?

Normally clients don't send fragment identifiers at all, but I think there are enough broken clients that still do. Most URLs we are interested in don't have a query string, so stopping decoding when encountering a literal # as well should be more robust.

It still seems to me like the only generic decoding we can do for all traffic is the unreserved set (none of which we try to decode now), and then MW and RB get their own custom sets as a per-service thing.

That sounds sane to me. The RB decoding should be applicable to basically all REST services. I'll look into creating a variant that's applicable to RB, possibly later tonight.

@BBlack: A patch implementing REST path encoding normalization based on a blacklist of delimiters is now available at https://gerrit.wikimedia.org/r/#/c/273146/. This is untested. I'm not familiar with how different C blocks interact during VCL compilation. Hopefully there won't be name conflicts in a big file after VCL is converted to C. The string.h include is also only done once, assuming that it is scoped to the entire VCL config.

Status update: I'm refactoring the above patch a bit to try to eliminate the code duplication, should be ready for some testing later today...

Change 274083 had a related patch set uploaded (by BBlack):
normalize_path: move to own include file

https://gerrit.wikimedia.org/r/274083

Change 274084 had a related patch set uploaded (by BBlack):
normalize_path: make it a C function

https://gerrit.wikimedia.org/r/274084

Change 274085 had a related patch set uploaded (by BBlack):
normalize_path: optional forward-slash, RB variant

https://gerrit.wikimedia.org/r/274085

Change 274086 had a related patch set uploaded (by BBlack):
normalize_path: stop on fragment marker

https://gerrit.wikimedia.org/r/274086

Change 274087 had a related patch set uploaded (by BBlack):
normalize_path: assert(url)

https://gerrit.wikimedia.org/r/274087

Change 274088 had a related patch set uploaded (by BBlack):
normalize_path: refactor control flow

https://gerrit.wikimedia.org/r/274088

Change 274089 had a related patch set uploaded (by BBlack):
normalize_path: fully parameterize the decoded set

https://gerrit.wikimedia.org/r/274089

Having dug into this, the other problem with the existing patch is that it would still do a lot of unacceptable decodings, such as the space character, and all of the US-ASCII unprintable range.

I've uploaded a 7-patch set here (untested yet): https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:what-is-normal,n,z

The first 4 are pretty easy to review/verify, and would get us to a state where RB gets the same decoding as MW, minus the slash, without a bunch of duplication. The next two are general refactors/improvements which might require more review/verification and take longer to merge safely. The final patch is similarly harder to review/verify/deploy, and does the larger changes talked about here: always decodes the unreserved set, and makes the decoded set fully controllable by the caller if you want to add more decoded characters than MW currently uses. I'm not sure if we even want to go down that path, but I think this is how we'd get there. It would take another followup commit to actually add the new characters for the RB case.

Change 274083 merged by BBlack:
normalize_path: move to own include file

https://gerrit.wikimedia.org/r/274083

Change 274084 merged by BBlack:
normalize_path: make it a C function

https://gerrit.wikimedia.org/r/274084

Change 274085 merged by BBlack:
normalize_path: optional forward-slash, RB variant

https://gerrit.wikimedia.org/r/274085

I've merged the first 3 patches, which essentially gets us to "RB gets the same path decoding as MW, except for forward slashes". I suspect this is actually enough to get past the immediate issue, right?

I still question the next patch (fragment support) - clients shouldn't be sending it in the first place, and even if they do, either we already stopped on the query-string, or there was no query-string and we don't care if we mangle the invalid/unused fragment identifier. If anything, we should probably just strip fragments for all inbound requests early.

Regardless of the fragment stuff, if we want to go further and support a custom set of decodings and such, those patches will need more reviewing and testing than these first few did, and I don't think I'll have time to validate all of that in the next couple of days.

Change 274282 had a related patch set uploaded (by BBlack):
text VCL: move bits-events and normalization above PURGE

https://gerrit.wikimedia.org/r/274282

Change 274282 merged by BBlack:
ext VCL: move bits-events and normalization above PURGE

https://gerrit.wikimedia.org/r/274282

Change 274458 had a related patch set uploaded (by GWicke):
WIP: Add a cluster_be_recv_pre_purge handler & normalize paths

https://gerrit.wikimedia.org/r/274458

Change 274458 merged by BBlack:
Add a cluster_be_recv_pre_purge handler & normalize paths

https://gerrit.wikimedia.org/r/274458

Mentioned in SAL [2016-03-03T19:58:13Z] <urandom> rolling restart of restbase in staging to apply config change : T127387

Mentioned in SAL [2016-03-03T20:03:35Z] <urandom> rolling restart of restbase staging complete : T127387

Mentioned in SAL [2016-03-03T20:20:41Z] <urandom> rolling restart of restbase production to apply config change : T127387

Mentioned in SAL [2016-03-03T20:26:39Z] <urandom> rolling restart of restbase production complete : T127387

Mentioned in SAL [2016-03-03T21:40:09Z] <urandom> rolling restart of restbase staging (config change) : T127387

Mentioned in SAL [2016-03-03T21:45:16Z] <urandom> rolling restart of restbase staging complete : T127387

Mentioned in SAL [2016-03-03T21:53:52Z] <urandom> rolling restart of restbase production (config change) : T127387

Mentioned in SAL [2016-03-03T22:01:54Z] <urandom> rolling restart of restbase production complete : T127387

Summary of the latest status:

  • Path normalization is applied to regular & PURGE requests.
  • We verified purging of special characters (including : and parentheses).
  • Cache TTL is now at 12h.

Next steps:

  • See if there are any more purging issues.
  • Later / at our leisure, improve normalization a bit further:
    • Normalize case of quoted delimiters ([/#?%] for RB), e.g. normalize %2f to %2F.
    • More complete normalization of non-delimiters:
      • Decode all unreserved characters.
      • Encode all other characters.

We are also auditing the RESTBase code to make sure that we redirect non-canonical requests to the canonical URL.

GWicke lowered the priority of this task from High to Medium.Mar 4 2016, 1:19 AM
Jdlrobson subscribed.

I don't think there are any open patches...

Change 274086 abandoned by BBlack:
normalize_path: stop on fragment marker

https://gerrit.wikimedia.org/r/274086

Change 274087 abandoned by BBlack:
normalize_path: assert(url)

https://gerrit.wikimedia.org/r/274087

Change 274088 abandoned by BBlack:
normalize_path: refactor control flow

https://gerrit.wikimedia.org/r/274088

Change 274089 abandoned by BBlack:
normalize_path: fully parameterize the decoded set

https://gerrit.wikimedia.org/r/274089

Change 391216 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] [WIP] normalize_path: fully normalize MW RB URL paths

https://gerrit.wikimedia.org/r/391216

Change 391216 merged by BBlack:
[operations/puppet@production] Fully normalize upload paths

https://gerrit.wikimedia.org/r/391216

Change 407488 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] URL Path Normalization: refactor, add to cache_text

https://gerrit.wikimedia.org/r/407488

Change 407489 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] URL Path Normalization: add to cache_misc

https://gerrit.wikimedia.org/r/407489

Change 407643 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] URL Path Normalization: fully normalize cache_text

https://gerrit.wikimedia.org/r/407643

Change 407670 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] URL Normalization: strip fragment

https://gerrit.wikimedia.org/r/407670

Change 407488 merged by BBlack:
[operations/puppet@production] URL Path Normalization: refactor, add to cache_text

https://gerrit.wikimedia.org/r/407488

Pchelolo claimed this task.

doesn't seem to be any more movement or things to be done here.