Change Details

**Revised after public discussion, April 1 2017**

== Problem ==

Wikimedia is managing a growing amount of machine readable data as wiki page content. The latest addition is the Data namespace on Commons, which hosts tabular data like [[https://commons.wikimedia.org/wiki/Data:Dolmens_of_the_Preseli_Hills.tab|Data:Dolmens_of_the_Preseli_Hills.tab]] and geographic data like [[https://commons.wikimedia.org/wiki/Data:Avignon_City_Wall.map|Data:Avignon_City_Wall.map]]. There is currently no canonical URL for referring to and retrieving these data sets. Canonical URLs are needed as stable identifiers (URIs) in linked data.

**Concrete need:** Wikidata can reference geo-shape data from the Data namespace on Commons. To represent such references in RDF, the data set needs a canonical URI. See {T159517}

== Proposed Solution ==

* Use URLs of the form https://commons.wikimedia.org/data/Data:Avignon_City_Wall.map to identify and retrieve machine readable page content.
* The `/data/` path is rewritten to a special page, Special:PageData.
* Special:PageData will redirect (with status 303) to an appropriate (and typically cacheable) URL for retrieving the page data. For now, this will use the `action=raw` interface.
* Special:PageData may apply content negotiation based on the Accept header sent by the client. In the first iteration, it will only check whether any Accept header sent by the client is compatible with the content model of the requested page.
* The 303 redirects are not cacheable for now, because they depend on the Accept header; complex normalization would be needed to allow the cache to vary on the Accept header without causing massive cache fragmentation.

Note that in contrast to Wikidata entity URIs, the above URIs identify //descriptions// (data), not the thing described by the data.
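The first-iteration Accept check described in the Proposed Solution could look roughly like the following. This is a minimal sketch in Python, not the Special:PageData implementation: the function names `parse_accept` and `is_acceptable` are hypothetical, the MIME type of the page's content model is assumed to be already known to the caller, and the proposal does not say what response is sent when the check fails, so the sketch only returns a boolean.

```python
from typing import List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Split an Accept header into (media_range, q) pairs.

    Hypothetical helper; parameters other than q are ignored.
    """
    ranges = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        media_range, *params = [p.strip() for p in part.split(";")]
        q = 1.0  # q defaults to 1 when absent
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        ranges.append((media_range.lower(), q))
    return ranges


def is_acceptable(accept_header: Optional[str], mime_type: str) -> bool:
    """Return True if the client's Accept header is compatible with mime_type.

    The most specific matching media range decides; a missing or empty
    header is treated as accepting anything.
    """
    if not accept_header or not accept_header.strip():
        return True
    mime_type = mime_type.lower()
    main_type = mime_type.split("/")[0]
    best_specificity, best_q = -1, 0.0
    for media_range, q in parse_accept(accept_header):
        if media_range == mime_type:
            specificity = 2
        elif media_range == main_type + "/*":
            specificity = 1
        elif media_range == "*/*":
            specificity = 0
        else:
            continue
        if specificity > best_specificity:
            best_specificity, best_q = specificity, q
    return best_q > 0
```

Because the outcome depends on the Accept header, two clients requesting the same `/data/` URL can get different answers, which is why the redirect itself is hard to cache without normalizing the header first.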
== Status Quo ==

* There is a way to get raw page data for most data types, using action=raw with the "ugly" URL form: <https://commons.wikimedia.org/w/index.php?title=Data:Avignon_City_Wall.map&action=raw>. However, this is not supported for data types that have "direct editing" disabled. E.g. <https://www.wikidata.org/w/index.php?title=Q23&action=raw> does not work.
* Wikidata uses <https://www.wikidata.org/entity/Q23> as the canonical URI of concepts, and <https://www.wikidata.org/wiki/Special:EntityData/Q23> as the canonical URI of the description. Both apply content negotiation and trigger a 303 redirect. The canonical URL for a specific serialization has the form <https://www.wikidata.org/wiki/Special:EntityData/Q23.ttl>.

== Concerns and Alternatives Considered ==

* Do not include the namespace after /data/, e.g. https://commons.wikimedia.org/data/Avignon_City_Wall.map
  * That would mean this URL pattern cannot be used as a general mechanism to refer to page content. It would be specific to the Data namespace on Commons.
* Use "raw" instead of "data", e.g. https://commons.wikimedia.org/raw/Data:Avignon_City_Wall.map
  * "raw" is less descriptive, and may not be correct if content negotiation is applied.
* Use REST API URLs
  * The REST API offers fairly clean URLs, but they still expose details about the web application and API version. Even revealing that this is an API is too specific in a context where URLs are used as identifiers.
* "URLs don't need to be pretty"
  * While URLs do not have to be pretty, they should be stable, especially when they are to be used as stable unique identifiers. Removing all application specific information from the URL provides more stability by adding a layer of abstraction.

== Open Questions and Concerns ==

* We could apply content negotiation to the established page URLs using the `/wiki/` path. Such URLs are already in use for referring to Wikipedia pages in RDF (e.g. by DBpedia and also by Wikidata). On the other hand, the `/wiki/` path is really a UI entry point, and it seems like a good idea to keep the UI separate from the data identifiers.
* The proposed URL scheme does not have room for slot names. We will not be able to refer to slots other than the main slot. Possible solution: https://commons.wikimedia.org/data/main/Avignon_City_Wall.map. This is looking more and more like the REST API URLs.
* The proposed schemes are not stable against page renames. We could use page IDs instead of the title. That makes the URLs a lot less intuitive, and requires database access in order to construct them.

== Resources ==

* https://www.w3.org/TR/cooluris/
* https://www.w3.org/TR/dwbp/#UniqueIdentifiers
* https://www.w3.org/TR/ld-bp/#HTTP-URIS
* https://data.gov.uk/resources/uris
* http://philarcher.org/diary/2013/uripersistence/#minimal
* https://www.w3.org/Provider/Style/URI.html

**Revised after public discussion, April 1 2017 and April 13 2017**

NOTE: Last call for comments! If no new pertinent concerns are raised by April 26 2017, this RFC will be approved for implementation!

== Problem ==

Wikimedia is managing a growing amount of machine readable data as wiki page content. The latest addition is the Data namespace on Commons, which hosts tabular data like [[https://commons.wikimedia.org/wiki/Data:Dolmens_of_the_Preseli_Hills.tab|Data:Dolmens_of_the_Preseli_Hills.tab]] and geographic data like [[https://commons.wikimedia.org/wiki/Data:Avignon_City_Wall.map|Data:Avignon_City_Wall.map]]. There is currently no canonical URL for referring to and retrieving these data sets. Canonical URLs are needed as stable identifiers (URIs) in linked data.

**Concrete need:** Wikidata can reference geo-shape data from the Data namespace on Commons. To represent such references in RDF, the data set needs a canonical URI. See {T159517}

== Proposed Solution ==

* Use URLs of the form https://commons.wikimedia.org/data/main/Data:Avignon_City_Wall.map to identify and retrieve machine readable page content. "main" refers to the main slot, see T107595.
* The `/data/<slot>` path is rewritten to a special page, Special:PageData.
* Special:PageData will redirect (with status 303) to an appropriate (and typically cacheable) URL for retrieving the page data. For now, this will use the `action=raw` interface.
* Special:PageData may apply content negotiation based on the Accept header sent by the client. In the first iteration, it will only check whether any Accept header sent by the client is compatible with the content model of the requested page.
* The 303 redirects are not cacheable for now, because they depend on the Accept header; complex normalization would be needed to allow the cache to vary on the Accept header without causing massive cache fragmentation.
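To make the mapping from a `/data/<slot>/<title>` URL to its redirect target concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the actual Special:PageData code: the function names `parse_data_url` and `raw_redirect` are hypothetical, only the `main` slot is handled, and the target is the `action=raw` URL form quoted in the Status Quo section.

```python
from typing import Tuple
from urllib.parse import unquote, urlencode, urlsplit

DATA_PREFIX = "/data/"


def parse_data_url(url: str) -> Tuple[str, str]:
    """Split a /data/<slot>/<title> URL into (slot, title).

    Hypothetical helper. The title may itself contain slashes,
    so only the first path segment after /data/ is read as the slot.
    """
    path = urlsplit(url).path
    if not path.startswith(DATA_PREFIX):
        raise ValueError("not a /data/ URL: " + url)
    slot, sep, title = path[len(DATA_PREFIX):].partition("/")
    if not slot or not sep or not title:
        raise ValueError("expected /data/<slot>/<title>: " + url)
    return slot, unquote(title)


def raw_redirect(url: str) -> Tuple[int, str]:
    """Return (status, Location) for a data URL, using action=raw."""
    slot, title = parse_data_url(url)
    if slot != "main":
        # Assumption: only the main slot can be served in this sketch.
        raise ValueError("unsupported slot: " + slot)
    parts = urlsplit(url)
    # Keep ':' unescaped so the result matches the usual index.php form.
    query = urlencode({"title": title, "action": "raw"}, safe=":")
    location = "{}://{}/w/index.php?{}".format(parts.scheme, parts.netloc, query)
    return 303, location
```

For example, `raw_redirect("https://commons.wikimedia.org/data/main/Data:Avignon_City_Wall.map")` yields status 303 with a `Location` of `https://commons.wikimedia.org/w/index.php?title=Data:Avignon_City_Wall.map&action=raw`. The identifier stays free of application details, while the redirect target can later change to a different backend without affecting it.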
Note that in contrast to Wikidata entity URIs, the above URIs identify //descriptions// (data), not the thing described by the data. They also do not identify wiki pages, as the /wiki/ path does.

Also note that the primary purpose of these URLs is to act as canonical stable identifiers (URIs). They should be resolvable, but they are not intended as a full-fledged data access API. They may however be implemented to redirect to such an API.

== Status Quo ==

* There is a way to get raw page data for most data types, using action=raw with the "ugly" URL form: <https://commons.wikimedia.org/w/index.php?title=Data:Avignon_City_Wall.map&action=raw>. However, this is not supported for data types that have "direct editing" disabled. E.g. <https://www.wikidata.org/w/index.php?title=Q23&action=raw> does not work.
* Wikidata uses <https://www.wikidata.org/entity/Q23> as the canonical URI of concepts, and <https://www.wikidata.org/wiki/Special:EntityData/Q23> as the canonical URI of the description. Both apply content negotiation and trigger a 303 redirect. The canonical URL for a specific serialization has the form <https://www.wikidata.org/wiki/Special:EntityData/Q23.ttl>.

== Concerns and Alternatives Considered ==

* Do not include the namespace after /data/, e.g. https://commons.wikimedia.org/data/Avignon_City_Wall.map
  * That would mean this URL pattern cannot be used as a general mechanism to refer to page content. It would be specific to the Data namespace on Commons.
* Use "raw" instead of "data", e.g. https://commons.wikimedia.org/raw/Data:Avignon_City_Wall.map
  * "raw" is less descriptive, and may not be correct if content negotiation is applied.
* Use REST API URLs
  * The REST API offers fairly clean URLs, but they still expose details about the web application and API version. Even revealing that this is an API is too specific in a context where URLs are used as identifiers.
* "URLs don't need to be pretty"
  * While URLs do not have to be pretty, they should be stable, especially when they are to be used as stable unique identifiers. Removing all application specific information from the URL provides more stability by adding a layer of abstraction.
* We could apply content negotiation to the established page URLs using the `/wiki/` path. Such URLs are already in use for referring to Wikipedia pages in RDF.
  * The semantics of /wiki is "a wiki page", while the intended semantics of /data is "a machine readable data set".
  * The /wiki path has no room for addressing individual slots. In fact, it refers to the page as rendered using information from all slots (compare T107595).
  * The /wiki path on Wikimedia sites is well established and heavily used. It's risky to overload it with new semantics and behavior.
* The proposed URL scheme does not have room for slot names. We will not be able to refer to slots other than the main slot.
  * The proposal was amended to use the /data/<slot>/ prefix, for forward compatibility. The intended meaning or semantics of <slot> is not yet fixed, though it is expected to align with slot names (compare T107595).
* The proposed schemes are not stable against page renames. We could use page IDs instead of the title.
  * Page IDs are also brittle: sometimes, a page is moved to an archive-style title, and a new page is created using the old title. In such a case, the intended semantics of the data URLs is unknown.
  * Most entry points, including the REST API, rely on titles, not page IDs.
  * Page IDs will often not be known to the code that constructs the data URL. It may take a database or API request to determine the page ID.
  * Page IDs don't allow for "eyeballing"; they are not self-explanatory.
* The URL pattern should include a versioning mechanism
  * The idea of versioning is somewhat contrary to the idea of stable canonical identifiers. The canonical identifier should stay canonical, and not be replaced by a new canonical URL. The primary concern is the identity of the object identified, not the format of the data returned when resolving the URL. This is the opposite of the situation for APIs: there, it's important to know exactly the format of the data returned, and how to request which bits of data. There, versioning is a good thing.
* The proposed URL pattern introduces a new API for MediaWiki; there is no need for another API beyond the old school action API, the traditional web API and the new REST API.
  * The proposed URL pattern is merely a naming convention; it can act as a front for any of the existing APIs. Its primary aim is to provide stable identifiers, not to provide fine grained data access.
  * The concerns of identifiers and APIs are related, but dissimilar, as explained above. They can be seen as complementary.

== Resources ==

* https://www.w3.org/TR/cooluris/
* https://www.w3.org/TR/dwbp/#UniqueIdentifiers
* https://www.w3.org/TR/ld-bp/#HTTP-URIS
* https://data.gov.uk/resources/uris
* http://philarcher.org/diary/2013/uripersistence/#minimal
* https://www.w3.org/Provider/Style/URI.html
