Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content
Open, Medium, Public

Description

For all Wikimedia sites, add the following rewrite:

RewriteRule /data/(.*)/(.*)$ /wiki/Special:PageData/$1/$2 [QSA]

See T161527: RFC: Canonical data URLs for machine readable page content for the rationale.
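
A rough sketch of the expected behaviour once the rewrite is deployed (the item Q42 on www.wikidata.org is only an illustrative example, not taken from this task):

# Fetch the headers only and look at the redirect target.
curl -s -I 'https://www.wikidata.org/data/main/Q42' | grep -iE '^(HTTP|Location)'
# Expected once the rewrite is live: a 3xx status line and
# Location: https://www.wikidata.org/wiki/Special:PageData/main/Q42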

Event Timeline

The patch is merged but not deployed, so I think we should wait a little. But given how the RFC is implemented here, I think we need to change the code base to accept the slot too (even though it ignores it for now). Let me clarify: right now, Special:PageData/foo/bar goes to page foo and ignores bar; it should go to bar and ignore foo.
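
To illustrate the intended interpretation (the main slot and Q42 are just example values):

# /data/{slot}/{title} is rewritten to /wiki/Special:PageData/{slot}/{title},
# where the *second* path segment is the page title:
#   /data/main/Q42  ->  /wiki/Special:PageData/main/Q42
# Special:PageData should resolve page Q42 and (for now) ignore the "main" slot.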

@Ladsgroup I agree that the rewrite should only be done once Special:PageData is live.

The fix for the foo/bar problem is now also merged. Thanks for noticing, I completely missed that, even though I wrote the spec :)

daniel triaged this task as Medium priority. Jun 13 2017, 5:03 PM

Change 360887 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] Add /data/ Redirect for commons

https://gerrit.wikimedia.org/r/360887

Change 360891 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] Add /data/ url redirect in beta cluster (Wikipedia only)

https://gerrit.wikimedia.org/r/360891

Mentioned in SAL (#wikimedia-releng) [2017-06-22T19:02:18Z] <Amir1> cherry-picking gerrit:360891/1 (T163922)

Change 360891 merged by Filippo Giunchedi:
[operations/puppet@production] Add /data/ url redirect in beta cluster (Wikipedia only)

https://gerrit.wikimedia.org/r/360891

daniel added a subscriber: Dereckson.

@Dereckson you marked this ticket as blocked on the ops boards - but I don't see what it's blocked on. How do we move forward?

Is this just blocked on the question of HTTP 301 vs. 303, which is still open on Gerrit? Or is there something else? We should really get this redirect in place; we’ve already been exposing /data/ URLs in RDF exports and the query service for a while now.

There are other comments from yours truly in the last review as well, namely about maintaining the status quo of how the redirect is configured, aside from the 303 vs. 301 part, on which I can be convinced with a good enough argument; but I haven't yet seen a reply.

@akosiaris I replied on the patch. Basically: 301, 302, 303 are all wrong. Pick one and give us a redirect :)

Using an absolute target URL for consistency seems like a good idea, even if it's not necessary.

As there is no argument for 303, we only have to choose between 301 and 302, and the main question for that is whether the target URL is stable or not.

If the redirect will be stable and always point to the same resource at the same URL, we can use a 301 (permanent). If not, it should be a 302.
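
A quick, hedged way to check which status code a deployed redirect actually returns (the page name below is purely illustrative):

# Print just the status code of the first response, without following redirects.
curl -s -o /dev/null -w '%{http_code}\n' 'https://commons.wikimedia.org/data/main/Data:Example.map'
# 301 if the redirect is configured as permanent, 302 (or 303) otherwise.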

Dereckson raised the priority of this task from Medium to High. Sep 26 2017, 11:02 AM

[ Setting priority to High per Lucas' statement that the URL is already documented. ]

*sigh* 301 it is, then. I wrote some more on the patch. Let's just ignore the pesky RFC ;)

Change 360887 merged by Alexandros Kosiaris:
[operations/puppet@production] Add /data/ Redirect for commons

https://gerrit.wikimedia.org/r/360887

Change 380774 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[operations/puppet@production] Fix /data/ redirect for commons

https://gerrit.wikimedia.org/r/380774

Change 380774 merged by Elukey:
[operations/puppet@production] Fix /data/ redirect for commons

https://gerrit.wikimedia.org/r/380774

daniel raised the priority of this task from High to Unbreak Now!. Edited Sep 27 2017, 9:53 AM

I see the fix got merged, but it doesn't seem to be live yet.

In general, this raises the question of how to test this kind of patch. Do we have an environment where this would be possible, in particular for people who don't have shell access to production servers?

It takes around half an hour (to an hour) for the nodes to pick it up and restart.

It seems to be live on some servers and not yet on others. (And also, Varnish caches the redirects.) I’m running this command:

until ! curl -s -I https://commons.wikimedia.org/data/main/Data:Bundestagswahl2017/wahlkreis46.map?breakCache=$RANDOM | grep -qF RW_PROTO; do sleep 60s; done; notify-send 'redirect fixed (at least when not cached)'

and it occasionally sends out notifications already.

All the appservers are now returning the correct version of the redirect; I think that some of them are still showing up broken due to caching.

elukey@tin:~$ apache-fast-test broken pybal
testing 1 urls on 247 servers, totalling 247 requests
spawning threads...............................

https://commons.wikimedia.org/data/main/Data:Bundestagswahl2017/wahlkreis46.map
 * 301 Moved Permanently https://commons.wikimedia.org/wiki/Special:PageData/main/Data:Bundestagswahl2017/wahlkreis46.map

(broken was my test config for the Data:Bundestagswahl2017 URL)

elukey lowered the priority of this task from Unbreak Now! to Medium. Sep 27 2017, 10:47 AM
elukey added a subscriber: ema.

I believe this is only a matter of cleaning up URLs that show up garbled; @ema just did it for https://commons.wikimedia.org/data/main/Data:Bundestagswahl2017/wahlkreis46.map via https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging#One-off_purge.

Is there any way to find out which URLs are garbled? Can we look for RW_PROTO in all the cached redirects, or something like that?

There is a way (https://wikitech.wikimedia.org/wiki/Varnish#One-off_purges_.28bans.29), but it is going to be risky if we don't get the purge pattern right, since we might end up evicting too many objects from the cache. I would avoid it if possible (and wait for the normal cache expiry workflow), but we can discuss another approach with the Traffic team if you want.

If the TTL isn’t too long (I saw a cap of 1 day in the puppet config, is that correct?), then normal expiry is probably enough.

It doesn't work like that. The time a response can be cached is determined by the headers the response sets. http://book.varnish-software.com/3.0/HTTP.html is a pretty interesting read if you've never read it before. It's also a rabbit hole (albeit not a very big one ;-). In the absence of these headers (like in this case, where only the Age header was set), it's not easy to deduce when the page is going to be removed from all existing caches (some of which we don't really control, like the browser cache).
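
For reference, the cache-relevant response headers can be inspected from the client side (a rough sketch; the exact output varies per cache layer):

curl -s -I 'https://commons.wikimedia.org/data/main/Data:Bundestagswahl2017/wahlkreis46.map' | grep -iE '^(cache-control|expires|age):'
# If only Age is present (no Cache-Control or Expires), the remaining
# lifetime of the object in each cache cannot be deduced from here.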

Anyway, I've purged the caches in order to resolve this faster instead of waiting it out. For the interested, the commands were (in that sequence):

varnishadm  ban "obj.status == 301 && req.http.host ~ commons.wikimedia.org"
varnishadm -n frontend  ban "obj.status == 301 && req.http.host ~ commons.wikimedia.org"

Do remember to force refresh to test it as your browser probably has the result cached as well.

Okay, in that case we can close this issue now, right?

Nope :D We need this for all clients, not just Commons.

Actually, as per the RFC, this is for all Wikimedia wikis. It's independent of Wikibase/Wikidata. Wikidata just happens to be the driving use case.

Change 382163 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[operations/puppet@production] Change /data/ redirect to Special:Pagedata

https://gerrit.wikimedia.org/r/382163

Change 382163 abandoned by Lucas Werkmeister (WMDE):
Change /data/ redirect to Special:Pagedata

Reason:
Abandoning in favor of https://gerrit.wikimedia.org/r/#/c/382172, which makes PageData the proper title.

https://gerrit.wikimedia.org/r/382163