
Add CORS-enabled, cacheable way to access contents of Data namespace
Open, Low, Public

Description

Currently, accessing data on Commons via the canonical URI eventually ends up at index.php?action=raw, after a chain of redirects.

Special:PageData adds an Access-Control-Allow-Origin: * header to its redirect, but there is no such header on the first redirect, nor on the final response, which means that in a cross-origin scenario (e.g. fetching the data via JavaScript on another website) the request is blocked. (Aside: if I understand correctly, Firefox doesn’t yet support CORS on redirects at all. See Bug 1346749.) To make this data more widely available, there should be a way for other websites to access it dynamically, with a response that can be cached.
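To make the failure mode concrete, here is a minimal TypeScript sketch of a simplified CORS check applied to each hop of the chain. The `Hop` type, the `firstBlockingHop` function, and the hop labels are hypothetical names chosen for illustration. The model deliberately ignores credentials and the rule that the request origin can change after a cross-origin redirect, so it only approximates what a browser does.

```typescript
// One response in a redirect chain, reduced to the header that matters here.
interface Hop {
  label: string;
  // Value of Access-Control-Allow-Origin, or undefined if the header is absent.
  allowOrigin?: string;
}

// Simplified model: every hop must allow the requesting origin.
// Returns the first hop that fails the check, or undefined if all hops pass.
function firstBlockingHop(chain: Hop[], requestOrigin: string): Hop | undefined {
  for (const hop of chain) {
    const allowed = hop.allowOrigin === "*" || hop.allowOrigin === requestOrigin;
    if (!allowed) {
      return hop;
    }
  }
  return undefined;
}

// The chain as described in this task (labels are my reading of the description).
const currentChain: Hop[] = [
  { label: "/data/main redirect" },
  { label: "Special:PageData redirect", allowOrigin: "*" },
  { label: "index.php?action=raw response" },
];
```

In this model the request already fails at the first hop, and would still fail at the last hop if only the first redirect were fixed, which is why the header is needed on every response in the chain.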

There is a workaround to this problem: use the Action API (which supports CORS) with action=query, prop=revisions, and rvprop=content, selecting the pages with titles, pageids, or some generator, and extract the page content from response.query.pages.{pageId}.revisions[0]['*']. I originally wrote here that API responses are never cached, resulting in more work for the API servers as well as lots of unnecessary data transfer; that was wrong. Correction: the response is cached according to the maxage specified in the request; however, I’m not sure if this works if pages are edited or purged (as far as I can tell, the browser doesn’t validate its cached data), so it can be hard to choose the right maxage. (This workaround also requires that you parse and transform the canonical URI, which is ugly.)
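The workaround could look roughly like the following TypeScript sketch. The function names are hypothetical, and several details are assumptions rather than facts from this task: that the canonical URI has the form `https://{host}/data/main/{title}`, that the Action API lives at `/w/api.php` on the same host, and that anonymous CORS is requested with `origin=*`. It uses the default JSON format, in which the content sits under the `'*'` key.

```typescript
// Assumed path prefix of the canonical data URI.
const DATA_PREFIX = "/data/main/";

// Transform a canonical data URI into an Action API query URL.
function buildApiUrl(canonicalUri: string, maxAgeSeconds: number): string {
  const uri = new URL(canonicalUri);
  if (uri.pathname.indexOf(DATA_PREFIX) !== 0) {
    throw new Error(`Not a ${DATA_PREFIX} URI: ${canonicalUri}`);
  }
  const title = decodeURIComponent(uri.pathname.slice(DATA_PREFIX.length));
  const params = new URLSearchParams({
    action: "query",
    prop: "revisions",
    rvprop: "content",
    titles: title,
    format: "json",
    origin: "*", // assumption: anonymous cross-origin request
    maxage: String(maxAgeSeconds),
    smaxage: String(maxAgeSeconds),
  });
  // Assumption: the API entry point is /w/api.php on the same host.
  return `${uri.origin}/w/api.php?${params.toString()}`;
}

// Shape of the part of the response that this sketch reads.
interface QueryResponse {
  query?: {
    pages?: Record<string, { revisions?: Array<{ "*"?: string }> }>;
  };
}

// Return the content of the first page that has any, or undefined.
function extractContent(response: QueryResponse): string | undefined {
  const pages = response.query?.pages ?? {};
  for (const pageId of Object.keys(pages)) {
    const content = pages[pageId]?.revisions?.[0]?.["*"];
    if (content !== undefined) {
      return content;
    }
  }
  return undefined;
}

// Fetch the raw page content for a canonical data URI.
async function fetchData(canonicalUri: string, maxAgeSeconds = 300): Promise<string | undefined> {
  const response = await fetch(buildApiUrl(canonicalUri, maxAgeSeconds));
  return extractContent((await response.json()) as QueryResponse);
}
```

The default of 300 seconds for `maxAgeSeconds` is arbitrary. As noted above, a cached response is not revalidated when the page is edited, so any value is a trade-off between freshness and load.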

As far as I understand, this will definitely require at least an Access-Control-Allow-Origin: * header on the /data/main redirect, which can be added like this. After that, there are different options. We can make index.php send that header as well on any successful GET request to the Data: namespace, but this seems a bit risky. Alternatively, we could make Special:PageData redirect to something other than index.php?action=raw (which isn’t a very nice solution anyway), e.g. an endpoint of the REST API. The REST API already supports the endpoint /page/wikitext/{title}, but only for POST requests. We could add support for GET requests to it, or perhaps add another endpoint (after all, the content model isn’t actually wikitext).

See also the related issue T150290: add CORS to all redirects in chain from https://www.wikidata.org/entity/{Q...}, which also features a redirect chain with partial CORS support.

Event Timeline

Lucas_Werkmeister_WMDE added subscribers: Ladsgroup, hoo.

@hoo or @Ladsgroup could you perhaps take a look? You seem to be more familiar with CORS than I am :)

Correction: the response is cached according to the maxage specified in the request; however, I’m not sure if this works if pages are edited or purged (as far as I can tell, the browser doesn’t validate its cached data), so it can be hard to choose the right maxage. (This workaround also requires that you parse and transform the canonical URI, which is ugly.)

This is also true for action=raw. action=raw is not purged and only gets an (s)maxage if one is set by URL parameter. (Exception: &action=raw&ctype=text/javascript and &action=raw&ctype=text/css are purged, but other action=raw URLs are not. See Title::getCdnUrls.)

Oh, that’s a shame… and the text/x-wiki content type on action=raw is also not ideal, application/json would be better.

Huh, action=raw should really only emit x-wiki for wikitext. The Content-Type header should come from ContentHandler, and should be application/json for JSON content. I wonder why this isn't the case here.

Even with ctype=text/javascript, I get x-wiki. That's bad.