Create a URL rewrite to handle the /data/ path for canonical URLs for machine readable page content
Open, Medium, Public

Description

For all Wikimedia sites, add the following rewrite:

RewriteRule /data/(.*)/(.*)$ /wiki/Special:PageData/$1/$2 [QSA]

See T161527: RFC: Canonical data URLs for machine readable page content for the rationale.
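
A rough sketch of the expected behaviour once the rewrite is deployed (the item Q42 on www.wikidata.org is only an illustrative example, not taken from this task):

# Fetch the headers only and look at the redirect target.
curl -s -I 'https://www.wikidata.org/data/main/Q42' | grep -iE '^(HTTP|Location)'
# Expected once the rewrite is live: a 3xx status line and
# Location: https://www.wikidata.org/wiki/Special:PageData/main/Q42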

Event Timeline

The patch is merged but not deployed, so I think we should wait a little. But given how the RFC is implemented here, I think we need to change the code base to accept the slot too (even though it ignores it for now). Let me clarify: right now, Special:PageData/foo/bar goes to page foo and ignores bar; it should go to bar and ignore foo.
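
To illustrate the intended interpretation (the main slot and Q42 are just example values):

# /data/{slot}/{title} is rewritten to /wiki/Special:PageData/{slot}/{title},
# where the *second* path segment is the page title:
#   /data/main/Q42  ->  /wiki/Special:PageData/main/Q42
# Special:PageData should resolve page Q42 and (for now) ignore the "main" slot.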

@Ladsgroup I agree that the rewrite should only be done once Special:PageData is live.

The fix for the foo/bar problem is now also merged. Thanks for noticing, I completely missed that, even though I wrote the spec :)

daniel triaged this task as Medium priority. Jun 13 2017, 5:03 PM

Change 360887 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] Add /data/ Redirect for commons

https://gerrit.wikimedia.org/r/360887

Change 360891 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] Add /data/ url redirect in beta cluster (Wikipedia only)

https://gerrit.wikimedia.org/r/360891

Mentioned in SAL (#wikimedia-releng) [2017-06-22T19:02:18Z] <Amir1> cherry-picking gerrit:360891/1 (T163922)

Change 360891 merged by Filippo Giunchedi:
[operations/puppet@production] Add /data/ url redirect in beta cluster (Wikipedia only)

https://gerrit.wikimedia.org/r/360891

daniel added a subscriber: Dereckson.

@Dereckson you marked this ticket as blocked on the ops boards - but I don't see what it's blocked on. How do we move forward?

Is this just blocked on the question of HTTP 301 vs. 303, which is still open on Gerrit? Or is there something else? We should really get this redirect in place; we’ve already been exposing /data/ URLs in RDF exports and the query service for a while now.

There are other comments from yours truly in the last review as well, namely about maintaining the status quo of how the redirect is configured, aside from the 303 vs. 301 part, on which I can be convinced with a good enough argument; but I haven't yet seen a reply.

@akosiaris I replied on the patch. Basically: 301, 302, 303 are all wrong. Pick one and give us a redirect :)

Using an absolute target URL for consistency seems like a good idea, even if it's not necessary.

As there is no argument for 303, we only have to choose between 301 and 302, and the main question for that is whether the target URL is stable or not.

If the redirect will be stable and always point to the same resource at the same URL, we can use a 301 (permanent). If not, it should be a 302.
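
A quick, hedged way to check which status code a deployed redirect actually returns (the page name below is purely illustrative):

# Print just the status code of the first response, without following redirects.
curl -s -o /dev/null -w '%{http_code}\n' 'https://commons.wikimedia.org/data/main/Data:Example.map'
# 301 if the redirect is configured as permanent, 302 (or 303) otherwise.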

Dereckson raised the priority of this task from Medium to High. Sep 26 2017, 11:02 AM

[ Setting priority to High per Lucas' statement that the URL is already documented. ]

*sigh* 301 it is, then. I wrote some more on the patch. Let's just ignore the pesky RFC ;)

Change 360887 merged by Alexandros Kosiaris:
[operations/puppet@production] Add /data/ Redirect for commons

https://gerrit.wikimedia.org/r/360887

Change 380774 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[operations/puppet@production] Fix /data/ redirect for commons

https://gerrit.wikimedia.org/r/380774

Change 380774 merged by Elukey:
[operations/puppet@production] Fix /data/ redirect for commons

https://gerrit.wikimedia.org/r/380774

daniel raised the priority of this task from High to Unbreak Now!. Edited Sep 27 2017, 9:53 AM

I see the fix got merged, but it doesn't seem to be live yet.

In general, this raises the question of how to test this kind of patch. Do we have an environment where this would be possible, in particular for people who don't have shell access to production servers?

It takes around half an hour (to an hour) for the nodes to pick it up and restart.

It seems to be live on some servers and not yet on others. (And also, Varnish caches the redirects.) I’m running this command:

until ! curl -s -I https://commons.wikimedia.org/data/main/Data:Bundestagswahl2017/wahlkreis46.map?breakCache=$RANDOM | grep -qF RW_PROTO; do sleep 60s; done; notify-send 'redirect fixed (at least when not cached)'

and it occasionally sends out notifications already.

All the appservers are now returning the correct version of the redirect; I think that some of them are still showing up broken due to caching.

elukey@tin:~$ apache-fast-test broken pybal
testing 1 urls on 247 servers, totalling 247 requests
spawning threads...............................

https://commons.wikimedia.org/data/main/Data:Bundestagswahl2017/wahlkreis46.map
 * 301 Moved Permanently https://commons.wikimedia.org/wiki/Special:PageData/main/Data:Bundestagswahl2017/wahlkreis46.map

(broken was my test config for the Data:Bundestagswahl2017 URL)

elukey lowered the priority of this task from Unbreak Now! to Medium. Sep 27 2017, 10:47 AM
elukey added a subscriber: ema.

I believe this is only a matter of cleaning up URLs that show up garbled; @ema just did it for https://commons.wikimedia.org/data/main/Data:Bundestagswahl2017/wahlkreis46.map via https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging#One-off_purge.

Is there any way to find out which URLs are garbled? Can we look for RW_PROTO in all the cached redirects, or something like that?

There is a way (https://wikitech.wikimedia.org/wiki/Varnish#One-off_purges_.28bans.29), but it is going to be risky if we don't get the purge pattern right, since we might end up evicting too many objects from the cache. I would avoid it if possible (and wait for the normal cache expiry workflow), but we can discuss another approach with the Traffic team if you want.

If the TTL isn’t too long (I saw a cap of 1 day in the puppet config, is that correct?), then normal expiry is probably enough.

It doesn't work like that. The time a response can be cached is determined by the headers the response sets. http://book.varnish-software.com/3.0/HTTP.html is a pretty interesting read if you've never read it before. It's also a rabbit hole (albeit not a very big one ;-). In the absence of these headers (like in this case, where only the Age header was set), it's not easy to deduce when the page is going to be removed from all existing caches (some of which we don't really control, like the browser cache).
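
For reference, the cache-relevant response headers can be inspected from the client side (a rough sketch; the exact output varies per cache layer):

curl -s -I 'https://commons.wikimedia.org/data/main/Data:Bundestagswahl2017/wahlkreis46.map' | grep -iE '^(cache-control|expires|age):'
# If only Age is present (no Cache-Control or Expires), the remaining
# lifetime of the object in each cache cannot be deduced from here.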

Anyway, I've purged the caches in order to resolve this faster instead of waiting it out. For the interested, the commands were (in that sequence):

varnishadm  ban "obj.status == 301 && req.http.host ~ commons.wikimedia.org"
varnishadm -n frontend  ban "obj.status == 301 && req.http.host ~ commons.wikimedia.org"

Do remember to force refresh to test it as your browser probably has the result cached as well.

Okay, in that case we can close this issue now, right?

Nope :D We need this for all clients, not just Commons.

Actually, as per the RFC, this is for all Wikimedia wikis. It's independent of Wikibase/Wikidata. Wikidata just happens to be the driving use case.

Change 382163 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[operations/puppet@production] Change /data/ redirect to Special:Pagedata

https://gerrit.wikimedia.org/r/382163

Change 382163 abandoned by Lucas Werkmeister (WMDE):
Change /data/ redirect to Special:Pagedata

Reason:
Abandoning in favor of https://gerrit.wikimedia.org/r/#/c/382172, which makes PageData the proper title.

https://gerrit.wikimedia.org/r/382163