Split slash decoding from general percent normalization in Varnish VCL
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• GWicke
	Feb 18 2016, 10:48 PM

Description

Varnish has normalization logic for URLs, which avoids fragmenting the cache on slightly different percent encodings. While mostly standards-conform, this normalization pass also decodes slashes by default, in line with MediaWiki's policy of allowing literal slashes in titles.

For anything but /wiki/{title}, we would like to only apply standards-compliant normalizations, without decoding slashes.

Options for doing this would be:

Modify the normalization logic in VCL to (optionally) leave encoded slashes alone, and apply the conservative version to anything but /wiki/.
Split slash decoding from the general normalization logic, and only apply it for /wiki/.

Details

Subject	Repo	Branch	Lines +/-
URL Normalization: strip fragment	operations/puppet	production	+29 -18
URL Path Normalization: add to cache_misc	operations/puppet	production	+14 -1
URL Path Normalization: fully normalize cache_text	operations/puppet	production	+9 -8
URL Path Normalization: refactor, add to cache_text	operations/puppet	production	+172 -188
Fully normalize upload paths	operations/puppet	production	+305 -6
normalize_path: fully parameterize the decoded set	operations/puppet	production	+69 -25
normalize_path: refactor control flow	operations/puppet	production	+9 -14
normalize_path: assert(url)	operations/puppet	production	+36 -36
normalize_path: stop on fragment marker	operations/puppet	production	+2 -2
Add a cluster_be_recv_pre_purge handler & normalize paths	operations/puppet	production	+28 -9
ext VCL: move bits-events and normalization above PURGE	operations/puppet	production	+2 -2
normalize_path: optional forward-slash, RB variant	operations/puppet	production	+13 -7
normalize_path: make it a C function	operations/puppet	production	+73 -67
normalize_path: move to own include file	operations/puppet	production	+84 -84

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Resolved		• GWicke	T127370 Varnish not purging RESTBase URIs
		Resolved		• Pchelolo	T127387 Split slash decoding from general percent normalization in Varnish VCL

Event Timeline

• GWicke created this task.Feb 18 2016, 10:48 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 18 2016, 10:48 PM

• GWicke updated the task description. (Show Details)Feb 18 2016, 10:49 PM

• GWicke updated the task description. (Show Details)

Considering the lack of argument support in VCL subs, the simplest approach to me seems to be:

Rename normalize_path to normalize_mediawiki_path.
Create a variant without slash decoding as normalize_path.
Decide which to call based on !~ "^/api/", or possibly a whitelist match.

@BBlack, does this sound sane to you?

• GWicke mentioned this in T127370: Varnish not purging RESTBase URIs.Feb 19 2016, 10:29 PM

I had to look at the code and the RFC again to refresh my memory (it's been a while). Basically, the normalize_path code isn't standards-compliant at all in the generic sense.

To recap RFC3986, three character classes are defined:

unreserved: [-_~.0-9A-Za-z] (alphanumeric, dash, underscore, tilde, period)
gen-delim: [:/?#@] as well as the square brackets [ and ] themselves
sub-delim: [!$&'()*+,;=]

The two sets of delimiters are both "reserved" as potential delimiters for application-layer schemes. A generic URI-processor which doesn't know application-specific things can't legally decode %-encoded forms of the reserved delimiters, whereas the consuming application can (because it can split the data up on those un-encoded delimiter characters, and then decode %-encoded delimiter characters within each parsed chunk as data). Only the unreserved set can be decoded at any arbitrary time without application-layer knowledge (and those chars never should've been encoded in the first place, but it is possible and legal).

So for the case of a generic VCL meant to handle multiple applications applying URI %-encoding normalization, we can only (and probably should!) decode the unreserved set.

What normalize_path does is decode an arbitrary subset of the reserved set: [;@$!*(),/:]. Every character it decodes is from one of the two reserved lists. This is only "legal" because we have inside information about MediaWiki - we know that it doesn't care about any of those as field delimiters.

So, I get why normalize_path does what it does for MediaWiki specifically, although I'd say for completeness, there's probably an argument that we should go ahead and decode the unreserved set while we're in there, just in case.

What I don't get is why RestBase wants the exact same arbitrary subset of delimiters pre-decoded, except the /. I'd think either RB would want no decoding, or only unreserved-decoding, but not "decode almost the exact same set of delimiters as MW, minus one"...

What I don't get is why RestBase wants the exact same arbitrary subset of delimiters pre-decoded, except the /.

Just as for MediaWiki, the goal is to normalize the percent encoding to ensure consistent purging & avoid cache fragmentation. As far as I am aware, none of the REST APIs we are interested in here use application-specific delimiter schemes within a path component. This is definitely true for RESTBase. This means that different encodings of any of those non-delimiter characters within path components are a source of undesirable variance, which we would like to avoid by normalizing their encoding:

Consistently use upper (or lowercase hex) for percent encoded parts.
Consistently decode (or encode) reserved characters other than '/'.

Side-tracking on the forward-slash a bit: I kind-of understand why we do it for the MW case and why it happens to break nothing there, but isn't a great idea in general for others. Ignoring the forward-slash problem and moving on:

I guess maybe a better question for me to ask is why MediaWiki choses the exact subset of delimiters to decode that it does. Partially-answering my own question: it's clear that [?&#] aren't in the set because MW still cares about those delimiters for query/frag purposes. That leaves ['+=] as the mystery set of "delimiters which we don't decode/normalize for no obvious (to me!) reason yet".

it's clear that [?&#] aren't in the set because MW still cares about those delimiters for query/frag purposes

Since we are only concerned about the path component of the URL, just [?#] should be sufficient.

That leaves ['+=] as the mystery set of "delimiters which we don't decode/normalize for no obvious (to me!) reason yet".

The reason for + might be to work around a bug in PHP's urlencode. I don't see a reason for the others. = is only significant in query strings, which we don't normalize.

In T127387#2051279, @GWicke wrote:

it's clear that [?&#] aren't in the set because MW still cares about those delimiters for query/frag purposes

Since we are only concerned about the path component of the URL, just [?#] should be sufficient.

Well for that matter, since parsing stops on the first unescape ? for the query part, and the fragment comes after that, even '#' should be decodable, right?

That leaves ['+=] as the mystery set of "delimiters which we don't decode/normalize for no obvious (to me!) reason yet".

The reason for + might be to work around a bug in PHP's urlencode. I don't see a reason for the others. = is only significant in query strings, which we don't normalize.

Ok, so even if we allow the + to vary for bug-reasons, in the net it still seems like the set we decode for MediaWiki should be expanded to include ['=#] (all before the query-string starts), but not '/', for RB. I don't know if I want to mess with it for the MW case, but I at least want to understand it. It still seems to me like the only generic decoding we can do for all traffic is the unreserved set (none of which we try to decode now), and then MW and RB get their own custom sets as a per-service thing.

Well for that matter, since parsing stops on the first unescape ? for the query part, and the fragment comes after that, even '#' should be decodable, right?

Normally clients don't send fragment identifiers at all, but I think there are enough broken clients that still do. Most URLs we are interested in don't have a query string, so stopping decoding when encountering a literal # as well should be more robust.

It still seems to me like the only generic decoding we can do for all traffic is the unreserved set (none of which we try to decode now), and then MW and RB get their own custom sets as a per-service thing.

That sounds sane to me. The RB decoding should be applicable to basically all REST services. I'll look into creating a variant that's applicable to RB, possibly later tonight.

@BBlack: A patch implementing REST path encoding normalization based on a blacklist of delimiters is now available at https://gerrit.wikimedia.org/r/#/c/273146/. This is untested. I'm not familiar with how different C blocks interact during VCL compilation. Hopefully there won't be name conflicts in a big file after VCL is converted to C. The string.h include is also only done once, assuming that it is scoped to the entire VCL config.

Status update: I'm refactoring the above patch a bit to try to eliminate the code duplication, should be ready for some testing later today...

Change 274083 had a related patch set uploaded (by BBlack):
normalize_path: move to own include file

https://gerrit.wikimedia.org/r/274083

Change 274084 had a related patch set uploaded (by BBlack):
normalize_path: make it a C function

https://gerrit.wikimedia.org/r/274084

Change 274085 had a related patch set uploaded (by BBlack):
normalize_path: optional forward-slash, RB variant

https://gerrit.wikimedia.org/r/274085

Change 274086 had a related patch set uploaded (by BBlack):
normalize_path: stop on fragment marker

https://gerrit.wikimedia.org/r/274086

Change 274087 had a related patch set uploaded (by BBlack):
normalize_path: assert(url)

https://gerrit.wikimedia.org/r/274087

Change 274088 had a related patch set uploaded (by BBlack):
normalize_path: refactor control flow

https://gerrit.wikimedia.org/r/274088

Change 274089 had a related patch set uploaded (by BBlack):
normalize_path: fully parameterize the decoded set

https://gerrit.wikimedia.org/r/274089

Having dug into this, the other problem with the existing patch is that it would still do a lot of unacceptable decodings, such as the space character, and all of the US-ASCII unprintable range.

I've uploaded a 7-patch set here (untested yet): https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:what-is-normal,n,z

The first 4 are pretty easy to review/verify, and would get us to a state where RB gets the same decoding as MW, minus the slash, without a bunch of duplication. The next two are general refactors/improvements which might require more review/verification and take longer to merge safely. The final patch is similarly harder to review/verify/deploy, and does the larger changes talked about here: always decodes the unreserved set, and makes the decoded set fully controllable by the caller if you want to add more decoded characters than MW currently uses. I'm not sure if we even want to go down that path, but I think this is how we'd get there. It would take another followup commit to actually add the new characters for the RB case.

Change 274083 merged by BBlack:
normalize_path: move to own include file

https://gerrit.wikimedia.org/r/274083

Change 274084 merged by BBlack:
normalize_path: make it a C function

https://gerrit.wikimedia.org/r/274084

Change 274085 merged by BBlack:
normalize_path: optional forward-slash, RB variant

https://gerrit.wikimedia.org/r/274085

I've merged the first 3 patches, which essentially gets us to "RB gets the same path decoding as MW, except for forward slashes". I suspect this is actually enough to get past the immediate issue, right?

I still question the next patch (fragment support) - clients shouldn't be sending it in the first place, and even if they do, either we already stopped on the query-string, or there was no query-string and we don't care if we mangle the invalid/unused fragment identifier. If anything, we should probably just strip fragments for all inbound requests early.

Regardless of the fragment stuff, if we want to go further and support a custom set of decodings and such, those patches will need more reviewing and testing than these first few did, and I don't think I'll have time to validate all of that in the next couple of days.

• Mholloway subscribed.Mar 1 2016, 7:40 PM

Change 274282 had a related patch set uploaded (by BBlack):
text VCL: move bits-events and normalization above PURGE

https://gerrit.wikimedia.org/r/274282

Change 274282 merged by BBlack:
ext VCL: move bits-events and normalization above PURGE

https://gerrit.wikimedia.org/r/274282

Change 274458 had a related patch set uploaded (by GWicke):
WIP: Add a cluster_be_recv_pre_purge handler & normalize paths

https://gerrit.wikimedia.org/r/274458

Mentioned in SAL [2016-03-02T19:21:32Z] <mobrovac> restbase rolling restart for https://gerrit.wikimedia.org/r/274456 T127387

Change 274458 merged by BBlack:
Add a cluster_be_recv_pre_purge handler & normalize paths

https://gerrit.wikimedia.org/r/274458

Mentioned in SAL [2016-03-03T19:58:13Z] <urandom> rolling restart of restbase in staging to apply config change : T127387

Mentioned in SAL [2016-03-03T20:03:35Z] <urandom> rolling restart of restbase staging complete : T127387

Mentioned in SAL [2016-03-03T20:20:41Z] <urandom> rolling restart of restbase production to apply config change : T127387

Mentioned in SAL [2016-03-03T20:26:39Z] <urandom> rolling restart of restbase production complete : T127387

• Mholloway unsubscribed.Mar 3 2016, 8:29 PM

• Pchelolo mentioned this in T109742: Set up Varnish purging and caching for RESTBase end points.Mar 3 2016, 8:30 PM

Mentioned in SAL [2016-03-03T21:40:09Z] <urandom> rolling restart of restbase staging (config change) : T127387

Mentioned in SAL [2016-03-03T21:45:16Z] <urandom> rolling restart of restbase staging complete : T127387

Mentioned in SAL [2016-03-03T21:53:52Z] <urandom> rolling restart of restbase production (config change) : T127387

Mentioned in SAL [2016-03-03T22:01:54Z] <urandom> rolling restart of restbase production complete : T127387

Summary of the latest status:

Path normalization is applied to regular & PURGE requests.
We verified purging of special characters (including : and parentheses).
Cache TTL is now at 12h.

Next steps:

See if there are any more purging issues.
Later / at our leisure, improve normalization a bit further:
- Normalize case of quoted delimiters ([/#?%] for RB), e.g. normalize %2f to %2F.
- More complete normalization of non-delimiters:
  - Decode all unreserved characters.
  - Encode all other characters.

We are also auditing the RESTBase code to make sure that we redirect non-canonical requests to the canonical URL.

• GWicke lowered the priority of this task from High to Medium.Mar 4 2016, 1:19 AM

• bearND moved this task from Incoming to Tracking on the Mobile-Content-Service board.Mar 10 2016, 5:33 PM

• Mholloway subscribed.May 27 2016, 11:15 PM

• GWicke mentioned this in T136346: Links with plus signs are not urlencoded.Jul 19 2016, 10:11 PM

• ema moved this task from Backlog to Caching on the Traffic board.Sep 30 2016, 3:16 PM

• GWicke edited projects, added Services (later); removed Services.Oct 12 2016, 10:32 PM

• NHarateh_WMF added a project: Product-Infrastructure-Team-Backlog-Deprecated.Apr 25 2017, 12:24 PM

• NHarateh_WMF moved this task from Needs triage to Tracking on the Product-Infrastructure-Team-Backlog-Deprecated board.Apr 25 2017, 12:25 PM

• NHarateh_WMF moved this task from Tracking to Backlog on the Mobile-Content-Service board.Apr 25 2017, 4:31 PM

• NHarateh_WMF moved this task from Backlog to Incoming on the Mobile-Content-Service board.Apr 25 2017, 4:37 PM

I don't think there are any open patches...

• Pchelolo mentioned this in T178037: Parsoid uses non-canonical URL encoding in <link> in <head>.Oct 17 2017, 10:00 AM

Change 274086 abandoned by BBlack:
normalize_path: stop on fragment marker

https://gerrit.wikimedia.org/r/274086

Change 274087 abandoned by BBlack:
normalize_path: assert(url)

https://gerrit.wikimedia.org/r/274087

Change 274088 abandoned by BBlack:
normalize_path: refactor control flow

https://gerrit.wikimedia.org/r/274088

Change 274089 abandoned by BBlack:
normalize_path: fully parameterize the decoded set

https://gerrit.wikimedia.org/r/274089

Change 391216 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] [WIP] normalize_path: fully normalize MW RB URL paths

https://gerrit.wikimedia.org/r/391216

gerritbot added a project: Patch-For-Review.Nov 14 2017, 2:44 PM

Change 391216 merged by BBlack:
[operations/puppet@production] Fully normalize upload paths

https://gerrit.wikimedia.org/r/391216

Change 407488 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] URL Path Normalization: refactor, add to cache_text

https://gerrit.wikimedia.org/r/407488

Change 407489 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] URL Path Normalization: add to cache_misc

https://gerrit.wikimedia.org/r/407489

Change 407643 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] URL Path Normalization: fully normalize cache_text

https://gerrit.wikimedia.org/r/407643

Change 407670 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] URL Normalization: strip fragment

https://gerrit.wikimedia.org/r/407670

• Mholloway unsubscribed.Feb 2 2018, 5:26 PM

Change 407488 merged by BBlack:
[operations/puppet@production] URL Path Normalization: refactor, add to cache_text

https://gerrit.wikimedia.org/r/407488

• Mholloway removed projects: Product-Infrastructure-Team-Backlog-Deprecated, Mobile-Content-Service.Jun 27 2018, 6:15 PM

BBlack mentioned this in T207615: Varnish won't purge thumbnails of specific file.Oct 29 2018, 11:44 AM

• ema mentioned this in T210295: ATS path normalization.Nov 23 2018, 3:14 PM

• mobrovac added a project: Platform Team Legacy (Later).Dec 20 2018, 12:07 PM

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board.Jan 26 2019, 8:32 PM