
RFC: Canonical data URLs for machine readable page content
Closed, Resolved (Public)

Description

NOTE: Approved after final call on April 26 2017. Implementation is being tracked on T163921: [Epic] Implement canonical data URLs for machine readable page content

Problem

Wikimedia is managing a growing amount of machine readable data as wiki page content. The latest addition is the Data namespace on commons, which hosts tabular data like Data:Dolmens_of_the_Preseli_Hills.tab and geographic data like Data:Avignon_City_Wall.map.

There is currently no canonical URL for referring to and retrieving these data sets. Canonical URLs are needed as stable identifiers (URIs) in linked data.

Concrete need: Wikidata can reference geo-shape data from the Data namespace on Commons. To represent such references in RDF, the data set needs a canonical URI. See T159517: [RFC] RDF mapping for geo-shape / URIs for commons data pages

Proposed Solution

  • Use URLs of the form https://commons.wikimedia.org/data/main/Data:Avignon_City_Wall.map to identify and retrieve machine readable page content. "main" refers to the main slot, see T107595.
  • The /data/<slot> path is rewritten to a special page, Special:PageData
  • Special:PageData will redirect (with status 303) to an appropriate (and typically cacheable) URL for retrieving the page data. For now, this will use the action=raw interface.
  • Special:PageData may apply content negotiation based on the Accept header sent by the client. In the first iteration, it will only check if any accept header sent by the client is compatible with the content model of the requested page.
  • The 303 redirects are not cacheable for now, because they depend on the Accept header; complex normalization would be needed to allow the cache to vary on the Accept header without causing massive cache fragmentation.
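
A minimal sketch of the resolution steps listed above, assuming a hypothetical content-model-to-MIME table and a hypothetical 406 response when the Accept header is incompatible (the RFC does not specify the failure status). The `resolve` and `accepts` helpers are illustrative names, not part of Special:PageData:

```python
from urllib.parse import urlencode

# Hypothetical mapping from content model to the MIME type of its raw form.
# Both the model names and the MIME types are assumptions for illustration.
MODEL_MIME = {
    "map": "application/json",
    "tabular": "application/json",
}

def accepts(accept_header, mime):
    """Return True if any media range in the Accept header matches mime."""
    if not accept_header:
        return True  # no Accept header: treat every type as acceptable
    main_type = mime.split("/")[0]
    for part in accept_header.split(","):
        media_range = part.split(";")[0].strip().lower()
        if media_range in ("*/*", mime, main_type + "/*"):
            return True
    return False

def resolve(path, content_model, accept_header=None, host="commons.wikimedia.org"):
    """Map /data/<slot>/<title> to (status, redirect target or None)."""
    prefix = "/data/"
    if not path.startswith(prefix):
        return 404, None
    slot, sep, title = path[len(prefix):].partition("/")
    if not sep or not title or slot != "main":
        return 404, None  # this sketch only handles the main slot
    if not accepts(accept_header, MODEL_MIME[content_model]):
        return 406, None  # assumed status; not specified in the RFC
    # Redirect to the action=raw interface, keeping ":" readable in the title.
    query = urlencode({"title": title, "action": "raw"}, safe=":/")
    return 303, f"https://{host}/w/index.php?{query}"
```

Because the result depends on the Accept header, a cache in front of such a handler could not store the 303 without varying on that header, which is the fragmentation concern raised in the last bullet.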

Note that in contrast to Wikidata entity URIs, the above URIs identify descriptions (data), not the thing described by the data. They also do not identify wiki pages, as the /wiki/ path does.

Also note that the primary purpose of these URLs is to act as canonical stable identifiers (URIs). They should be resolvable, but they are not intended as a full-fledged data access API. They may however be implemented to redirect to such an API.

Status Quo

Concerns and Alternatives Considered

  • Do not include the namespace after /data/, e.g. https://commons.wikimedia.org/data/Avignon_City_Wall.map
    • That would mean this URL pattern cannot be used as a general mechanism to refer to page content. It would be specific to the Data namespace on Commons.
  • Use "raw" instead of "data", e.g. https://commons.wikimedia.org/raw/Data:Avignon_City_Wall.map
    • "raw" is less descriptive, and may not be correct if content negotiation is applied.
  • Use REST API URLs
    • The REST API offers fairly clean URLs, but they still expose details about the web application and API version. Even the fact that they expose that this is an API is too specific in a context where URLs are used as identifiers.
  • "URLs don't need to be pretty"
    • While URLs do not have to be pretty, they should be stable, especially when they are to be used as stable unique identifiers. Removing all application specific information from the URL provides more stability by adding a layer of abstraction.
  • We could apply content negotiation to the established page URLs using the /wiki/ path. Such URLs are already in use for referring to Wikipedia pages in RDF.
    • The semantics of /wiki is "a wiki page", while the intended semantics of /data is "a machine readable data set".
    • The /wiki path has no room for addressing individual slots - in fact, it refers to the page as rendered using information from all slots (compare T107595).
    • The /wiki path on Wikimedia sites is well established and heavily used. It's risky to overload it with new semantics and behavior.
  • The proposed URL scheme does not have room for slot names. We will not be able to refer to slots other than the main slot.
    • The proposal was amended to use the /data/<slot>/ prefix, for forward compatibility. The intended meaning or semantics of <slot> is not yet fixed, though it is expected to align with slot names (compare T107595).
  • The proposed schemes are not stable against page renames. We could use page IDs instead of the title.
    • Page IDs are also brittle: sometimes, a page is moved to an archive-style title, and a new page is created using the old title. In such a case, the intended semantics of the data URLs is unknown.
    • Most entry points, including the REST API, rely on titles, not page IDs.
    • Page IDs will often not be known to the code that constructs the data URL. It may take a database or API request to determine the page ID.
    • Page IDs don't allow for "eyeballing", they are not self-explanatory.
  • The URL pattern should include a versioning mechanism
    • The idea of versioning is somewhat contrary to the idea of stable canonical identifiers. The canonical identifier should stay canonical, and not be replaced by a new canonical URL. The primary concern is the identity of the object identified, not the format of the data returned when resolving the URL. The situation is different for APIs: there, it is important to know exactly the format of the data returned, and how to request which bits of data, so versioning is a good thing.
  • The proposed URL pattern introduces a new API for MediaWiki; there is no need for another API beyond the old school action API, the traditional web API and the new REST API.
    • The proposed URL pattern is merely a naming convention; it can act as a front for any of the existing APIs. Its primary aim is to provide stable identifiers, to allow fine grained data access.
    • The concerns of identifiers and APIs are related, but dissimilar, as explained above. They can be seen as complementary.

Resources

Event Timeline

The description of the requirements seems to fit the REST API:

  • API versioning & content negotiation.
  • REST URL structure.
  • Integration with CDN layer.
  • Machine and user readable API specs & documentation.

The REST URL hierarchy makes it quite easy to route specific end points directly to specialized backends, while still presenting a consistent & well-documented API to end users. In other words, exposing functionality through the REST API does not imply the use of RESTBase where that does not make sense.

@GWicke REST URLs and canonical URIs are quite different conceptually, though it's nice when they coincide. However, URIs by nature should not include interface version information, because they identify the resource independently of representation.

What URI structure would you propose?

However, URIs by nature should not include interface version information, because they identify the resource independently of representation.

The REST API versioning policy explicitly describes how representation concerns are handled through content negotiation, and not by incrementing major API versions. Changes in major API versions are expected to be extremely rare. The major API version is basically an insurance policy for the case that we'd want to introduce a fundamentally different URL layout, without breaking existing users and resource references.

What URI structure would you propose?

Within the REST API, data associated with pages is typically exposed using the /api/rest_v1/page/{type}/{title}{/revision} pattern. Examples: HTML, summary, data-parsoid, PDF.
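
As a rough illustration of that pattern, a URL builder might look like the sketch below. The function name `rest_page_url` is hypothetical; only the path layout comes from the comment above. Percent-encoding the title matters because titles can contain slashes.

```python
from urllib.parse import quote

def rest_page_url(host, kind, title, revision=None):
    """Build /api/rest_v1/page/{type}/{title}{/revision} for the given host."""
    # Encode the title as a single path segment, so "/" becomes "%2F".
    url = f"https://{host}/api/rest_v1/page/{kind}/{quote(title, safe='')}"
    if revision is not None:
        url += f"/{revision}"
    return url
```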

I think we have several concepts there that need to be refined.

  1. Canonical object URI - this is the URI that uniquely identifies an object in Wikimedia world, and, by extension, in the whole world of linked data. Note that in theory that URI does not have to produce any content when accessed (in fact, it may not even use any accessible scheme like http:). However, it is common and beneficial to link it with:
  2. Data access URI - this is the address one can use to retrieve some representation of the object identified above. The kind of representation varies a lot, sometimes it is a text description, sometimes it is some kind of RDF, sometimes it may be negotiated page, etc. I suggest we use content negotiation as much as we can and choose sane defaults when we can't. I also suggest that we link alternative representations to this data URI.
  3. Human-readable URI - since we are in the wiki world, our content is meant to be edited by humans, and thus have human-readable (at least to certain extent :) representation, where you can interpret and edit it. Not every object would have these (e.g. individual values in Wikidata don't) but many interesting ones would.

I would suggest to design a scheme that supports each of the above, and allows to go between them in automatic way - i.e. having one of them, it is easy for a simple script to get to the others. I'd also suggest to use redirects and content negotiation to reconcile the differences between how we represent things in Wiki and how we want external URLs to look like.

/api/rest_v1/page/{type}/{title}{/revision} pattern

I do not think we should include revisions in data URIs, not unless we intend to represent our revision structure in linked data formats (which I hope we don't, 99.99% of intended usage won't need it). Also, api/rest_v1 part should not be part of the canonical data URI. api part because canonical URI should be the same, however you access it - via specific API or not, it identifies the object, not specific way of retrieving it, and rest_v1 - for the same reason, plus canonical URI should not change if we change our API version. Basically, unless we radically change the whole data structure, the URI should be forever (and even if we do there's an argument for preserving an old one, so even more basically, canonical URIs are forever).

I would propose the following scheme for Commons:

https://commons.wikimedia.org/data/Avignon_City_Wall as canonical URI (we can add .map if it's important, but if we can avoid it, it looks nicer without it). This URI would be redirected to the following places:

  • if accessed with Accept type known to us and having representation, either produces this representation directly or redirects to https://commons.wikimedia.org/wiki/Data:Avignon_City_Wall.map?action=rawdata&format=text/csv
  • if Accept suggests it is a browser asking for HTML, redirected to https://commons.wikimedia.org/wiki/Data:Avignon_City_Wall.map
  • if there is no Accept, choose a sane default for it - e.g. JSON or something - and proceed as if this were the requested type.
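
The three cases above could be sketched roughly as follows. This is only an illustration: `pick_redirect` is a hypothetical helper, the list of supported types is an assumption, q-values are ignored, and `action=rawdata` is the form suggested in this comment rather than an existing MediaWiki action.

```python
def pick_redirect(name, accept_header,
                  supported=("application/json", "text/csv"),
                  default="application/json"):
    """Choose a redirect target for /data/<name> from the Accept header."""
    page = f"https://commons.wikimedia.org/wiki/Data:{name}"
    # Media ranges in the order the client listed them (q-values ignored).
    ranges = [p.split(";")[0].strip().lower()
              for p in (accept_header or "").split(",") if p.strip()]
    if not ranges:
        ranges = [default]  # no Accept header: fall back to the default type
    for media_range in ranges:
        if media_range == "text/html":
            return page  # browsers go to the regular wiki page
        if media_range in supported:
            # Hypothetical target, following the URL form proposed above.
            return f"{page}?action=rawdata&format={media_range}"
    return None  # nothing matched; the caller decides how to report this
```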

Relying on redirects for canonical URI should solve the caching problem, at the (supposedly minimal) cost of extra HTTP request in some cases. Of course, tools that care about it could access target URLs directly.

@GWicke:

  • I'd rather not use a REST URL as the URI. A good REST API exposes version, api method, etc. It's as explicit as possible. A good URI is minimal. Its purpose is to identify a resource on an abstract level, not (in general) a particular version or serialization, not a particular access API. A URI should be short and mnemonic.
  • We can resolve the URI to a REST URL instead of action=raw. I'd actually like that better. Maybe we can kill action=raw some day.

@Smalyshev:
I agree with what you said. I'd like to mention a few things:

  • for commons data, we may not need a canonical object (concept) URI. At least, we don't need it for the Wikidata use case. Wikidata wants to reference a data set, not the thing described by that data set.
  • we cannot drop the file extension. In some cases, both Data:Foo.map and Data:Foo.tab exist.
  • csv should not be the default format, since we lose information when mapping to csv, and csv itself is under-specified.

for commons data, we may not need a canonical object (concept) URI.

Well, if we plan to refer to it in RDF, we need some URI. RDF does not make distinction between real concept URIs and "just" URIs which don't mean anything special, and we could kind of pretend that this is "just" URI. But why do it if we do have an identifiable dataset which can be (and in many contexts is) treated as its own entity? We might as well have proper concept URI for it. May come handy later.

we cannot drop the file extension. In some cases, both Data:Foo.map and Data:Foo.tab exist.

OK, it's not a problem to keep the extension. However, it won't be efficient to make all redirects in PHP, so redirect should happen before content handlers come into play, which means redirects should be uniform across all data types.

I'm not sure what should happen if we ask, say, for text/csv and this content handler has no CSV representation. Error page?

csv should not be the default format,

I assume all handlers would support at least HTML (as redirect to wiki page) and some default representation, probably type-dependent, does not have to be CSV. They can support CSV additionally if the type is appropriate.

Well, if we plan to refer to it in RDF, we need some URI. RDF does not make distinction between real concept URIs and "just" URIs which don't mean anything special, and we could kind of pretend that this is "just" URI. But why do it if we do have an identifiable dataset which can be (and in many contexts is) treated as its own entity? We might as well have proper concept URI for it. May come handy later.

We do not plan to refer to the thing that is described by the data set. We plan to refer to the data set.

Maybe we are misunderstanding each other. In Wikidata, each entity has two URIs associated with it: the concept URI (.../data/Q12345) for the actual thing and the data URI (.../wiki/Special:EntityData/Q12345) for the description of the thing. When referencing datasets on commons, we really mean the data, so we should use the URI of the data. We don't know what thing the data describes (maybe the item that contains the reference? maybe not?), and we have no reason to define a URI for the thing that the data describes.

In Wikidata, each entity has two URIs associated with it: the concept URI (.../data/Q12345) for the actual thing and the data URI (.../wiki/Special:EntityData/Q12345)

ITYM .../entity/Q12345.
Yes, the first one is the canonical entity URI and the second is data access URI and also entity dataset URI. We probably don't need both for Commons, since we do not plan to represent this kind of commons data in RDF, we only need to refer to it.

When referencing datasets on commons, we really mean the data, so we should use the URI of the data.

Again, since we don't have both wd:Q1234 and wdata:Q1234, it could be the same URI (or, alternatively, the only URI is the "URI of the data"). In Wikidata, it's different URIs because we have two different things. But for Commons, we don't need to represent these things, so we can just have one URI, since we don't really define what this URI is - it's never a subject for any RDF predicate as I understand, at least for .tab/.map data.
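
For reference, the relationship between the two Wikidata URIs can be sketched as a simple prefix swap. The helper name `data_uri_for` is hypothetical, and the ID pattern is a simplification that only covers plain IDs such as Q12345.

```python
import re

# Prefixes commonly abbreviated as wd: and wdata: in Wikidata RDF.
ENTITY_PREFIX = "http://www.wikidata.org/entity/"
DATA_PREFIX = "http://www.wikidata.org/wiki/Special:EntityData/"

def data_uri_for(concept_uri):
    """Derive the data (description) URI from a concept URI."""
    if not concept_uri.startswith(ENTITY_PREFIX):
        raise ValueError("not a Wikidata concept URI")
    entity_id = concept_uri[len(ENTITY_PREFIX):]
    # Simplified check: one uppercase letter followed by digits.
    if not re.fullmatch(r"[A-Z]\d+", entity_id):
        raise ValueError("unexpected entity ID")
    return DATA_PREFIX + entity_id
```

For Commons data pages, the point made above is that no such pair is needed: a single URI for the data set is enough.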

I understand you are arguing for having something like wdata: URI to be used. I am not against it, though as I said above practically there is not much difference. But I'd prefer nicer URI than wdata: - we didn't have a choice in Wikidata since we had to have two URIs and only one of them got to be "nice", but here we have only one, it may as well look like https://commons.wikimedia.org/data/Avignon_City_Wall.map. We could also use direct link but that IMHO is a) less nice and b) creates unnecessary binding to namespaces, which are also internationalized and it becomes kinda messy. I'd rather have clean static URL on the front and have it resolve to different things as needed on the back. And I'm very much for the URI in RDF not having any query params, etc. - it looks less nice and creates unnecessary dependency IMO.

@Smalyshev Oh, I was just trying to clarify the semantics of the URI. I wasn't trying to argue against your point. In fact, I agree with everything you wrote above :)

This RFC is scheduled for discussion on IRC tonight, at 21:00 UTC (2pm PDT, 23:00 CEST), on freenode, in the channel #wikimedia-office.

I think we have several concepts there that need to be refined.

  1. Canonical object URI - this is the URI that uniquely identifies an object in Wikimedia world, and, by extension, in the whole world of linked data.

I think there is a spectrum here between reserving the object domain only for abstract concepts & treating most things as representations, and making some of those representations first-class objects, related to the underlying concept via an is-representation-of edge. For addressable resources in the REST sense, I think that usability concerns should play a prominent role in finding the right balance between URLs and headers. For example, I don't think that using a single URL for all resources related to the concept of "Barack Obama" would make a lot of sense to API users.

@GWicke yes, indeed. This ticket is not about concept URIs, but about URIs for page content. Page content may be a description of a real world concept, or be otherwise related to such concepts. In order to express such a relationship, both (the concept and the description) need canonical URIs. At least the URI for the description should be resolvable.

On the next level, we may want URIs/URLs for specific serializations/formats/projections as well as for specific revisions. In my mind, RESTbase URLs are suitable for that. But for the "abstract" description URI (basically, page title level), RESTbase URLs expose too much technical detail for my taste.

I think the URL most users would consider canonical is /wiki/{title}. Wouldn't this already provide a reasonable URL for the concept of the page?

/wiki/{title} assumes HTML representation. If we could make it content-negotiate to another representation (JSON, CSV or whatever is appropriate for map data - geoJSON?) it'd be fine. But if it's hard to do, we'd better use URI that makes it easier.

/wiki/{title} is the user interface URL. I think it makes sense to have a separate URL for the data. This will also make it much easier to introduce a URL that supports content negotiation via 303 redirects.

From the current task description:

Problem:
There is currently no canonical URI/URL for referring to and retrieving these data sets.

Can someone please elaborate on this? This task has a whole lot of text about potential solutions and it's not clear to me why a "nice" URL is needed at all. I don't follow the use-cases mentioned in T161527#3135095:

https://commons.wikimedia.org/data/Avignon_City_Wall as canonical URI (we can add .map if it's important, but if we can avoid it, it looks nicer without it). This URI would be redirected to the following places [...]

Who's going to be typing this URL or using it? Actual readers and users? Why does the URL need to look nice?

If the idea is for computer code to access these URLs, who cares if the URL is "ugly"? I don't ever give any thought to whether an api.php URL looks clean or not, for example. Plus api.php URLs, with their standardized and supported query strings, can support arbitrary additions like &format=json that make the request more explicit than "data" about what's going to come back to the client from the server. What's wrong with query strings?

If the idea really is for human beings and real-life users to be using these /data/ URLs, personally I would think that you would want to support the simplest URL transformation possible, as I often am forced to mangle a URL by hand. (You should see me trying to add 2000 to a Phabricator Maniphest URL that's using a Bugzilla ID.) Replacing "wiki" in the URL with "data" could be somewhat convenient if the page title is used. But again, it's really unclear to me who the audience is, given that most users don't pay any attention to pretty URLs at all. Most users also rarely type URLs. And it's not clear to me why a regular user would want to access the raw data about an entity anyway.

(It feels like we're discussing programmatic access and computers really do not care about pretty URLs. As I said to Tim after office hours, the last resort argument of people to defend pretty URLs is then to say, well the URL is used for caching!)

@MZMcBride It's not about the URLs being "pretty" as much as it's about the URLs being stable. That is much easier to achieve if the URL is "clean": it should not expose technical details about the underlying web application or access mechanism. Clean URLs provide a layer of abstraction. This improves stability of data that uses the URLs as identifiers.

Software that uses URLs can be updated. Data that uses URLs cannot. Ideally, URLs used as URIs should stay valid forever. So it's important to think about what they should look like, so we can still happily serve them when we have replaced MediaWiki with a hive of genetically engineered cyborg termites in 100 years.

The basic idea is that stable URIs should expose a minimum of information. Have a look at https://www.w3.org/TR/cooluris/ and the other resources I linked in the description.

This RFC was discussed in a public meeting on IRC on March 29th. Full log: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-03-29-21.01.log.html

The outcome was that @daniel will revise the RFC based on the discussion, and then put it on last call. If no new pertinent concerns are raised, the revised RFC is due for final review and possible approval on April 12th.

daniel renamed this task from Canonical data URIs and URLs for machine readable page content to Canonical data URLs for machine readable page content. Apr 1 2017, 10:05 AM
daniel updated the task description.

"E.g. https://www.wikidata.org/w/index.php?title=Q23&action=raw does not work." seems completely orthogonal to this task.

We already have /w/index.php, /w/api.php, and /wiki/ entry points. These are stable paths and we have more than ten years of evidence of this.

Using /raw/ would essentially be implementing action paths (cf. T19981: Short URLs for page actions such as /history/Title (Enable wgActionPaths at WMF))? Part of the beauty of using /wiki/ is that it's vaguely more internationalized. During discussions of proposals to change /wiki/ to use action paths such as /view/ or /history/, using only English-language verbs in the URL has been criticized for its lack of localization.

Maybe I'm still just missing the obvious, but with a move toward "multi-content revisions," I would think the concept of wanting to get "all the data from the English Wikipedia's Barack Obama article" (i.e., /data/Barack_Obama) would become even murkier and ill-defined. I still don't understand who the audience is and what their use-cases are for this proposed new entry point. I guess I need to focus on T159517: [RFC] RDF mapping for geo-shape / URIs for commons data pages.

@MZMcBride good point about MCR slots. When designing a scheme like this, we should account for them. I of all people should know. I suppose I need to amend my proposal.

With respect to /w/index.php, /w/api.php: we are not treating them as stable identifier prefixes anywhere. They are API entry points. Interfaces for interacting with the data, not identifiers for the data.

Tim asked me to not use the term URI, since they are the same as URLs in this context. But perhaps it's a useful distinction after all: not everything that is a decent URL is an acceptable URI. URI should be really stable, not for 10 years but for 100. It should be easy and straight forward to make them work with a completely different technology. Having ".php" anywhere in there is a no-go.

Possible solution: https://commons.wikimedia.org/data/main/Avignon_City_Wall.map.

We don't need it for main slots, as it should be the default. We may want it if we ever need to have URLs for non-main slots. I'm not sure yet if we even have to have it- usually only the main data set has addressable URLs, and the parts of it - such as individual values or substructures - do not.

@Smalyshev the problem is that page titles can contain slashes. If we don't always provide the slot name, how do we handle https://en.wikipedia.org/data/AC/DC? Is "AC" a slot name, or part of the title? What if it is both?
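
To make the ambiguity concrete, here is a small sketch comparing a mandatory slot segment with an optional one. Both parser functions and the set of slot names are hypothetical; "AC" is treated as a slot name only to reproduce the worst case described above.

```python
PREFIX = "/data/"

def parse_mandatory_slot(path):
    """Split /data/<slot>/<title>; everything after the slot is the title."""
    if not path.startswith(PREFIX):
        raise ValueError("not a data path")
    slot, _, title = path[len(PREFIX):].partition("/")
    return slot, title

def parse_optional_slot(path, known_slots=("main", "mediainfo")):
    """List every possible reading when the slot segment may be omitted."""
    if not path.startswith(PREFIX):
        raise ValueError("not a data path")
    rest = path[len(PREFIX):]
    # Reading 1: no slot given, so the whole remainder is the title.
    readings = [("main", rest)]
    first, sep, tail = rest.partition("/")
    # Reading 2: the first segment happens to be a slot name.
    if sep and first in known_slots:
        readings.append((first, tail))
    return readings
```

With a mandatory slot, `/data/main/AC/DC` has exactly one reading. With an optional slot and a slot named "AC", `/data/AC/DC` has two, and nothing in the URL says which one was meant.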

One option is to not use another slash, but something like https://commons.wikimedia.org/data-main/Avignon_City_Wall.map.

One example where we may need slot-specific data URLs is MediaInfo. What URL/URI should we use for the machine readable description of Foo.jpg? Perhaps https://commons.wikimedia.org/data-mediainfo/File:Foo.jpg? On the other hand, for Wikibase entities, we can always use something like https://commons.wikimedia.org/wiki/Special:EntityData/M26743276. But I'd really like to get rid of the Special:EntityData stuff in URIs...

Perhaps https://commons.wikimedia.org/data-mediainfo/File:Foo.jpg

So what would be the content of https://commons.wikimedia.org/data/File:Foo.jpg then? I'm still not clear on why would we need several data slots on the same page.

Special:EntityData doesn't look good for me as canonical URL. As internal implementation, sure, why not, but we need something cleaner for the external canonical URL. Dash may not be that bad, it's frequently used to connect similar elements as part of the name.

Perhaps https://commons.wikimedia.org/data-mediainfo/File:Foo.jpg

So what would be the content of https://commons.wikimedia.org/data/File:Foo.jpg then? I'm still not clear on why would we need several data slots on the same page.

The main slot data URL would refer to what action=raw currently gives you: the wikitext of the file description page (not the structured meta-data): https://commons.wikimedia.org/w/index.php?title=File:Foo.jpg&action=raw

Special:EntityData doesn't look good for me as canonical URL. As internal implementation, sure, why not, but we need something cleaner for the external canonical URL.

It is what we currently use in the canonical data URLs for Wikidata. Yes, I don't like it either. http://www.wikidata.org/data/Q12345 would be much nicer.

Dash may not be that bad, it's frequently used to connect similar elements as part of the name.

I'm also leaning toward using the dash. data would be equivalent to main-data; we can then also have mediainfo-data, etc.

I'm also leaning toward using the dash. data would be equivalent to main-data

I like the idea of no dash meaning the same as main-. Good defaults rule :)

I'm late to the party, but I'd like to make a couple of points below.

  1. Would it make sense to separate .map and .tab URLs into something like https://commons.wikimedia.org/map/Avignon_City_Wall and https://commons.wikimedia.org/data/Dolmens_of_the_Preseli_Hills respectively? This would allow us to have both map and data using the same name without an extension. This also helps avoid confusion that .map and .tab are specific representations of some canonical thing. We'd be able to redirect to, say, an HTML representation by appending .html or XML representation by appending .xml.
  2. Having read the comments above, I'm in favor of making URLs pretty, simply because our use case in the future may include users directly sharing these endpoints rather than only machines reading these URLs. If we can create pretty URLs at no extra cost, why not do it.
  3. Could someone explain the downside of removing the "Data:" namespace from the description:

That would mean this URL pattern cannot be used as a general mechanism to refer to page content. It would be specific to the Data namespace on Commons.

An example would help clarify the situation. Also, is the above quote still true given the first point?

  4. Regarding page renames:

The proposed schemes are not stable against page renames. We could use page IDs instead of the title. That makes the URLs a lot less intuitive, and requires database access in order to construct them.

Is the idea behind using IDs to prevent future uses of an ID? If that's the case, can we find a middle ground where the date the page is created or moved is also included in the URL? Something like https://commons.wikimedia.org/map/01-01-2017/Avignon_City_Wall? If some other piece of data wants to use the title 'Avignon_City_Wall' at a later date, then the URL would look something like: https://commons.wikimedia.org/map/04-11-2017/Avignon_City_Wall.

This RFC was discussed in a public RFC meeting on the wikimedia-office channel on April 12. It was agreed that the RFC be put on Final Call: if no new pertinent concerns are raised by April 26th, the RFC will be approved for implementation.

Meeting log: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-04-12-21.07.html

daniel renamed this task from Canonical data URLs for machine readable page content to RFC: Canonical data URLs for machine readable page content. Apr 26 2017, 4:29 PM
daniel claimed this task.
daniel moved this task from Inbox to To Do on the User-Daniel board.
daniel removed a project: TechCom-RFC.

This RFC has been approved after final call for comment on April 26. Implementation is tracked on T163921: [Epic] Implement canonical data URLs for machine readable page content.