Problem
Wikimedia is managing a growing amount of machine-readable data as wiki page content. The latest addition is the Data namespace on Commons, which hosts tabular data like Data:Dolmens_of_the_Preseli_Hills.tab and geographic data like Data:Avignon_City_Wall.map.
There is currently no canonical URL for referring to and retrieving these data sets. Canonical URLs are needed as stable identifiers (URIs) in linked data.
Concrete need: Wikidata can reference geo-shape data from the Data namespace on Commons. To represent such references in RDF, the data set needs a canonical URI. See T159517: [RFC] RDF mapping for geo-shape / URIs for commons data pages
Proposed Solution
- Use URLs of the form https://commons.wikimedia.org/data/main/Data:Avignon_City_Wall.map to identify and retrieve machine readable page content. "main" refers to the main slot, see T107595.
- The /data/<slot> path is rewritten to a special page, Special:PageData.
- Special:PageData will redirect (with status 303) to an appropriate (and typically cacheable) URL for retrieving the page data. For now, this will use the action=raw interface.
- Special:PageData may apply content negotiation based on the Accept header sent by the client. In the first iteration, it will only check whether the Accept header sent by the client, if any, is compatible with the content model of the requested page.
- The 303 redirects are not cacheable for now, because they depend on the Accept header; complex normalization would be needed to allow the cache to vary on the Accept header without causing massive cache fragmentation.
Note that in contrast to Wikidata entity URIs, the above URIs identify descriptions (data), not the thing described by the data. They also do not identify wiki pages, as the /wiki/ path does.
Also note that the primary purpose of these URLs is to act as canonical stable identifiers (URIs). They should be resolvable, but they are not intended as a full-fledged data access API. They may, however, be implemented to redirect to such an API.
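The resolution steps proposed above can be sketched as follows. This is only an illustration under stated assumptions, not the Special:PageData implementation: the function names are hypothetical, the MIME type of the page content is passed in by the caller rather than derived from a content model, and the 404 and 406 responses for an unknown slot or an incompatible Accept header are assumptions that the proposal does not specify.

```python
from urllib.parse import unquote, urlencode


def parse_data_path(path):
    """Split '/data/<slot>/<title>' into (slot, title)."""
    prefix = "/data/"
    if not path.startswith(prefix):
        raise ValueError("not a /data/ path")
    # Only the first slash separates slot and title; titles may contain slashes.
    slot, sep, title = path[len(prefix):].partition("/")
    if not slot or not sep or not title:
        raise ValueError("expected /data/<slot>/<title>")
    return slot, unquote(title)


def accepts(accept_header, mime_type):
    """Return True if any media range in the Accept header matches mime_type.

    Simplified check: quality values and other parameters are ignored,
    and a missing or empty header is treated as accepting anything.
    """
    if not accept_header:
        return True
    main_type = mime_type.partition("/")[0]
    for item in accept_header.split(","):
        media_range = item.split(";")[0].strip().lower()
        if media_range in ("*/*", mime_type, main_type + "/*"):
            return True
    return False


def resolve(host, path, accept_header, mime_type):
    """Return (status, location) for a request to a /data/ URL."""
    slot, title = parse_data_path(path)
    if slot != "main":
        return 404, None  # assumption: only the main slot is supported so far
    if not accepts(accept_header, mime_type):
        return 406, None  # assumption: the proposal does not name a status here
    query = urlencode({"title": title, "action": "raw"})
    return 303, "https://%s/w/index.php?%s" % (host, query)


status, location = resolve(
    "commons.wikimedia.org",
    "/data/main/Data:Avignon_City_Wall.map",
    "application/json, text/html;q=0.8",
    "application/json",
)
print(status, location)
```

Note that the redirect target depends on the Accept header only through the compatibility check, which is why the redirect itself cannot simply be cached without varying on that header.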
Status Quo
- There is a way to get raw page data for most data types, using action=raw with the "ugly" URL form: https://commons.wikimedia.org/w/index.php?title=Data:Avignon_City_Wall.map&action=raw. However, this is not supported for data types that have "direct editing" disabled. For example, https://www.wikidata.org/w/index.php?title=Q23&action=raw does not work.
- Wikidata uses https://www.wikidata.org/entity/Q23 as the canonical URI of concepts, and https://www.wikidata.org/wiki/Special:EntityData/Q23 as the canonical URI of the description. Both apply content negotiation and trigger a 303 redirect. The canonical URL for a specific serialization has the form https://www.wikidata.org/wiki/Special:EntityData/Q23.ttl.
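For comparison, the existing URL forms listed above can be built with a few small helpers. This is a minimal sketch; the function names are hypothetical and only reproduce the patterns quoted in the list, they are not part of MediaWiki or Wikibase.

```python
from urllib.parse import quote


def raw_url(host, title):
    """'Ugly' action=raw URL for a page title on the given wiki host."""
    # Keep ':' and '/' readable, as in the example URL above.
    return "https://%s/w/index.php?title=%s&action=raw" % (host, quote(title, safe=":/"))


def entity_concept_uri(entity_id):
    """Canonical URI of the concept described by a Wikidata entity."""
    return "https://www.wikidata.org/entity/" + entity_id


def entity_data_url(entity_id, extension=None):
    """Canonical URI of the description, optionally for one serialization."""
    url = "https://www.wikidata.org/wiki/Special:EntityData/" + entity_id
    if extension:
        url += "." + extension
    return url


print(raw_url("commons.wikimedia.org", "Data:Avignon_City_Wall.map"))
print(entity_concept_uri("Q23"))
print(entity_data_url("Q23"))
print(entity_data_url("Q23", "ttl"))
```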
Concerns and Alternatives Considered
- Do not include the namespace after /data/, e.g. https://commons.wikimedia.org/data/Avignon_City_Wall.map
  - That would mean this URL pattern cannot be used as a general mechanism to refer to page content. It would be specific to the Data namespace on Commons.
- Use "raw" instead of "data", e.g. https://commons.wikimedia.org/raw/Data:Avignon_City_Wall.map
  - "raw" is less descriptive, and may not be correct if content negotiation is applied.
- Use REST API URLs
  - The REST API offers fairly clean URLs, but they still expose details about the web application and API version. Even the fact that they expose that this is an API is too specific in a context where URLs are used as identifiers.
- "URLs don't need to be pretty"
  - While URLs do not have to be pretty, they should be stable, especially when they are to be used as stable unique identifiers. Removing all application-specific information from the URL provides more stability by adding a layer of abstraction.
- We could apply content negotiation to the established page URLs using the /wiki/ path. Such URLs are already in use for referring to Wikipedia pages in RDF.
  - The semantics of /wiki is "a wiki page", while the intended semantics of /data is "a machine-readable data set".
  - The /wiki path has no room for addressing individual slots. In fact, it refers to the page as rendered using information from all slots (compare T107595).
  - The /wiki path on Wikimedia sites is well established and heavily used. It's risky to overload it with new semantics and behavior.
- The proposed URL scheme does not have room for slot names. We will not be able to refer to slots other than the main slot.
  - The proposal was amended to use the /data/<slot>/ prefix, for forward compatibility. The intended meaning or semantics of <slot> is not yet fixed, though it is expected to align with slot names (compare T107595).
- The proposed schemes are not stable against page renames. We could use page IDs instead of the title.
  - Page IDs are also brittle: sometimes, a page is moved to an archive-style title, and a new page is created using the old title. In such a case, the intended semantics of the data URLs is unknown.
  - Most entry points, including the REST API, rely on titles, not page IDs.
  - Page IDs will often not be known to the code that constructs the data URL. It may take a database or API request to determine the page ID.
  - Page IDs don't allow for "eyeballing"; they are not self-explanatory.
- The URL pattern should include a versioning mechanism.
  - The idea of versioning is somewhat contrary to the idea of stable canonical identifiers. The canonical identifier should stay canonical, and not be replaced by a new canonical URL. The primary concern is the identity of the object identified, not the format of the data returned when resolving the URL. APIs are the opposite case: there, it is important to know exactly the format of the data returned and how to request which bits of data, so versioning is a good thing.
- The proposed URL pattern introduces a new API for MediaWiki; there is no need for another API beyond the old school action API, the traditional web API and the new REST API.
  - The proposed URL pattern is merely a naming convention; it can act as a front for any of the existing APIs. Its primary aim is to provide stable identifiers, not to allow fine-grained data access.
  - The concerns of identifiers and APIs are related but distinct, as explained above. They can be seen as complementary.