Decide how the WD items should be sourced
Closed, ResolvedPublic

Description

When adding claims to Wikidata items, it would be great to source (ideally) each and every one of them. As I see it, there are two options:

  1. Use Wikipedia as a source. This is less than ideal in terms of being a reliable source, but it's straightforward to do (and while it's not recommended, it's common practice, for better or worse). Every item in the WLM database contains information about the Wikipedia page it was fetched from, for example //sv.wikipedia.org/w/index.php?title=Lista_%C3%B6ver_arbetslivsmuseer_i_Blekinge_l%C3%A4n&oldid=30834404. As it includes the page revision id, it's easy to link to the correct version.
  2. Use the registrant_url value. For example, in the Norwegian building data, each item has a URL pointing to the Kulturminnesøk service. The advantage is that it's a reliable, official data source. There are two disadvantages:
    1. The WLM database comprises data downloaded from Wikipedia pages, which have been edited by the community. There's no way of knowing which information is supported by the registrant_url and which was added manually by someone.
    2. Many of the data sets don't even have a registrant_url.
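To illustrate option 1, here is a minimal sketch of how a permalink like the one above could be assembled from a page title and a revision id. The function name and its parameters are hypothetical; only the URL pattern is taken from the example in this task.

```python
from urllib.parse import urlencode


def build_permalink(lang: str, title: str, revision_id: int) -> str:
    """Return a protocol-relative permalink to one revision of a Wikipedia page.

    Hypothetical helper: the URL pattern follows the example in the task
    description; the name and signature are assumptions.
    """
    # urlencode percent-encodes non-ASCII characters as UTF-8 bytes
    # and leaves underscores untouched.
    query = urlencode({"title": title, "oldid": revision_id})
    return f"//{lang}.wikipedia.org/w/index.php?{query}"


permalink = build_permalink(
    "sv", "Lista_över_arbetslivsmuseer_i_Blekinge_län", 30834404
)
print(permalink)
```

Because the revision id is part of the URL, the reference keeps pointing at the version of the list that the data was actually read from, even after the page is edited.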

In the end, pointing back to the Wikipedia page certainly seems better than nothing. In some cases we do know where the data on Wikipedia came from _originally_ -- such as the Swedish museum dataset -- but again, there's always a possibility that the Wikipedia page contains info added by someone manually. The WLM database is updated continuously, so it contains the freshest dump of whatever is included in the Wikipedia page. This makes it tricky to guess which statements are supported by the "official" sources.

Event Timeline

There is a third alternative, which is to source it to the monuments database itself (imported from: WLM database), with country+id+date as extra info. This ends up being largely equivalent to imported from: Wikipedia.

I would say that as a default we use:
imported from: Wikipedia
url: permalink.
(this would of course be omitted as soon as we have a better reference)

For individual datasets we could spend a little time investigating (a separate task per batch), or asking relevant communities to investigate, which properties could be sourced better (be it through registrar_url or some other type of reference). True, they may have been edited since the initial import, but that is also true for normal data on Wikidata.

An outcome of this task would be to create a flow chart of the different steps.

Note that as per https://www.wikidata.org/wiki/Wikidata:Bots, each and every statement added using a bot account will have to be sourced.

First step: created an item for the Monuments database https://www.wikidata.org/wiki/Q28563569

Is this to be used together with P143? (when no better sources are available)
If so it should be used (as a compared part of the reference) together with:

  • [[ https://www.wikidata.org/wiki/P813 | P813 ]] = today (as a non-compared part of the reference)
  • [[ https://www.wikidata.org/wiki/P854 | P854 ]] = https://tools.wmflabs.org/heritage/api/api.php?action=search&format=json&srcountry=<namespace>&srlanguage=<lang>&srid=<unique_id> where e.g. <namespace> = se-bbr, <lang> = sv and <unique_id> = 21000001001755 (as a compared part of the reference)
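A rough sketch of how such a reference could be put together, using only the values mentioned in this comment. The function names and the dict layout with "compared" and "non_compared" keys are hypothetical and are not taken from any actual bot code; the property ids, the item id and the API URL pattern come from the comment above.

```python
from datetime import date
from urllib.parse import urlencode

API_BASE = "https://tools.wmflabs.org/heritage/api/api.php"
MONUMENTS_DB_ITEM = "Q28563569"  # item created for the Monuments database


def build_api_url(namespace: str, lang: str, unique_id: str) -> str:
    """Build the monuments database API URL for a single entry."""
    params = {
        "action": "search",
        "format": "json",
        "srcountry": namespace,
        "srlanguage": lang,
        "srid": unique_id,
    }
    return f"{API_BASE}?{urlencode(params)}"


def build_reference(namespace, lang, unique_id, retrieved=None):
    """Return the reference as plain data (hypothetical layout).

    "compared" parts would be used when checking whether an equivalent
    reference already exists; "non_compared" parts would be ignored.
    """
    retrieved = retrieved or date.today()
    return {
        "compared": {
            "P143": MONUMENTS_DB_ITEM,  # imported from
            "P854": build_api_url(namespace, lang, unique_id),  # reference URL
        },
        "non_compared": {
            "P813": retrieved.isoformat(),  # retrieved
        },
    }


reference = build_reference("se-bbr", "sv", "21000001001755")
print(reference["compared"]["P854"])
```

Keeping the retrieval date out of the compared part means that re-running an import on a later day would not be treated as a new, different reference.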

The links in your comment seem to be broken?

Repaired

So to try and summarise this (@Alicia_Fagerving_WMSE please point out if any of this is not what we are actually doing):

What we are de facto doing today is:

  • Image/commonscat claims:
    • No source is added
  • All other claims:
    • Are sourced as imported from the monuments database

The se-arbetsl dataset did not have any registrar_url, and for se-fmis and se-bbr it coincides with the URL generated by the id property.
Note, however, that for se-ship (and similar future datasets) we should have made sure to include registrar_url somehow, likely via a P973 claim.
Since the se-ship links were all broken at some point in the last month or so, we should not import these now.

In addition to the info in T155241#3234671:

When registrar_url is identical to the link created by the id property, it is left out. Otherwise it is used together with P973 as a qualifier on the heritage status property.
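The rule above could be sketched roughly as follows. The function name and parameters are hypothetical, the example URLs are placeholders, and the sketch assumes that "identical" means an exact string match.

```python
from typing import Optional


def registrar_url_for_qualifier(
    registrar_url: Optional[str], id_url: str
) -> Optional[str]:
    """Return the URL to add as a qualifier, or None if it should be left out.

    Hypothetical helper. `id_url` stands for the link generated from the
    id property. Exact string comparison is an assumption.
    """
    if not registrar_url:
        # The dataset has no registrar_url for this entry.
        return None
    if registrar_url == id_url:
        # Redundant: the id property already yields the same link.
        return None
    return registrar_url


# Placeholder URLs, for illustration only.
same = registrar_url_for_qualifier(
    "https://example.org/id/123", "https://example.org/id/123"
)
different = registrar_url_for_qualifier(
    "https://example.org/register/123", "https://example.org/id/123"
)
print(same, different)
```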