Page MenuHomePhabricator

Varnish not purging RESTBase URIs
Closed, ResolvedPublic

Description

As of Gerrit 271436 we have enabled purging of the Mobile Content Service's RESTBase URIs. However, they are not being properly purged.

This a currently a stub, we'll fill in the info as we perform tests and explain the workflow.

Event Timeline

I've created a page on en.wiki and started curling it and changing the content. Here's the result:

X-Cache: cp1054 miss(0), cp4008 miss(0), cp4008 frontend miss(0)
Revision 1 returned

X-Cache: cp1054 miss(0), cp4008 miss(0), cp4008 frontend hit(1)
Revision 1 returned

Edited page, created new revision.
Update propagated
Purges should’ve happened

X-Cache: cp1054 hit(1), cp4008 miss(0), cp4008 frontend miss(0)
Revision 1 returned

X-Cache: cp1054 hit(1), cp4008 miss(0), cp4008 frontend hit(1)
Revision 1 returned

X-Cache: cp1054 hit(1), cp4008 miss(0), cp4008 frontend hit(2)
Revision 1 returned

Edited page
Update propagated to RESTBase
Waited ~5 minutes

X-Cache: cp1054 hit(2), cp4008 miss(0), cp4008 frontend miss(0)
Revision 1 returned

X-Cache: cp1054 hit(2), cp4008 miss(0), cp4008 frontend hit(1)
Revision 1 returned

Waited 5 more minutes

X-Cache: cp1054 hit(2), cp4008 hit(1), cp4018 frontend miss(0)
Revision 1 returned

One issue we are seeing is inconsistent percent encoding. We should probably modify the normalization logic in VCL to (optionally) leave encoded slashes alone, and then apply that to all requests, including purges.

However, this is not the full explanation, as we are also seeing purge failures for titles that don't contain any special characters. We have not been able to establish whether purges get lost on the way to Varnishes, or whether the purged URL is incorrect. To verify this, we'd need to edit a page, and then see if /api/rest_v1/page/mobile-sections/{title} is being purged.

If you can give me an example I can work through on my own, it would probably be easier for me to work through the layers. Is this hooked up for testwiki as well, where I can comfortably create nonsense?

@BBlack, we tested with a user page and https://www.mediawiki.org/wiki/HyperSwitch (which has since been deleted). test2wiki should definitely work, not sure about the latest status of testwiki wrt Varnish.

The URL rewriting in the backend is only happening after the recv_purge call, and in recv_purge there is a [return (lookup);](https://github.com/wikimedia/operations-puppet/blob/11d2e3faa37f21a8fd4f14619048596de9c65797/modules/varnish/templates/vcl/wikimedia.vcl.erb#L249). So, it seems that we might be purging the un-rewritten URL in the backend, but cache the rewritten version.

I've run a test in labs:

First Request:

X-Cache: deployment-cache-text04 hit (5), deployment-cache-text04 frontend miss (0)
Varnish frontend:
deployment-cache-text04.deployment-prep.eqiad.wmflabs 24 2016-02-18T23:13:44 0.000707626 198.73.209.1 miss/200 106 GET http://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/summary/PurgeTestPage - - - - curl/7.46.0
Varnish backend:
deployment-cache-text04.deployment-prep.eqiad.wmflabs 88 2016-02-18T23:13:44 0.000105858 127.0.0.1 hit/200 107 GET http://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/summary/PurgeTestPage - - - 198.73.209.1 curl/7.46.0

Second request:
X-Cache: deployment-cache-text04 hit (5), deployment-cache-text04 frontend hit (1)
Varnish frontend:
deployment-cache-text04.deployment-prep.eqiad.wmflabs 94 2016-02-18T23:15:15 0.000109196 198.73.209.1 hit/200 106 GET http://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/summary/PurgeTestPage - - - - curl/7.46.0
Varnish backend:
 — nothing —

Purge the page:
Varnish frontend:
deployment-cache-text04.deployment-prep.eqiad.wmflabs 121 2016-02-18T23:16:35 0.000170708 127.0.0.1 hit/204 0 PURGE http://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/summary/PurgeTestPage - - - - vhtcpd
Varnish backend:
deployment-cache-text04.deployment-prep.eqiad.wmflabs 1848 2016-02-18T23:16:35 0.000109434 127.0.0.1 miss/204 0 PURGE http://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/summary/PurgeTestPage - - - - vhtcpd

Third request:
X-Cache: deployment-cache-text04 hit (6), deployment-cache-text04 frontend miss (0)
Varnish frontend:
deployment-cache-text04.deployment-prep.eqiad.wmflabs 137 2016-02-18T23:17:38 0.000854731 198.73.209.1 miss/200 106 GET http://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/summary/PurgeTestPage - - - - curl/7.46.0
Varnish backend:
deployment-cache-text04.deployment-prep.eqiad.wmflabs 2495 2016-02-18T23:17:38 0.000097036 127.0.0.1 hit/200 107 GET http://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/summary/PurgeTestPage - - - 198.73.209.1 curl/7.46.0

Purge requests get to both backend and fronted. In frontend it's correctly purged, but in backend a purge has no effect. That supports @GWicke theory about rewrites.

Yeah, that makes a lot of sense. In the short term, we can find a way to refactor VCL around this but...

In the long term, this really is just another example of the pain of the rewrite in the first place. The public API paths should match the internal ones with no rewrites happening.

The rewrite is:

if (req.url ~ "^/api/rest_v1/") {
        set req.url = "/" + req.http.host + regsub(req.url, "^/api/rest_v1/", "/v1/");

There's nothing in the rewritten version that wasn't already available in the original. The hostname is already available in the Host: header sent to restbase, and the rewrite is basically swapping the string /api/rest_v1/ for /$hostname/v1/. Even if you don't want to change anything else (e.g. about API paths available to internal services, or the structure of code/paths within the RB code), surely in the JS code you can insert a snippet that hooks very early in request processing and mangles all incoming requests similarly...

One other interesting piece of information is that we used the wrong format for the Cache-Control header. This has been fixed in PR 522 and will be deployed shortly.

Mentioned in SAL [2016-02-19T00:55:28Z] <mobrovac> restbase deploy start of fa1207e95 for T127370

Change 271754 had a related patch set uploaded (by BBlack):
cache_text: mangle RB requests just before fetch

https://gerrit.wikimedia.org/r/271754

Change 271754 merged by BBlack:
cache_text: mangle RB requests just before fetch

https://gerrit.wikimedia.org/r/271754

@BBlack, thank you for the VCL change. Now purging is working great for titles without reserved characters. The only remaining issue is T127387, where your input would be very welcome.

We are also looking into getting rid of the need for the Host header to path rewrite in Varnish. We might see movement on that next week.

GWicke claimed this task.

Resolving this in favor of T127387.