
Language handling for adding citations with citoid in wikidata
Open, Normal, Public

Description

Zotero gives us back an unvalidated language value, which may refer to the language of the metadata, of the source, or both. The value is typically a language code of some sort, although it can also be free text, e.g. "English" or "Francais". We need to resolve this language value in three ways:

  1. language -> Wikidata item (for statement)

     A generated table covers a limited set of these, as the full space of possibilities is larger than the number of valid MediaWiki language codes: https://www.wikidata.org/wiki/Help:Wikimedia_language_codes/lists/all

     We may want to pre-compute these, à la: https://gerrit.wikimedia.org/g/mediawiki/extensions/Wikibase/+/master/repo/maintenance/updateUnits.php

  2. language -> content language (for label)

     Valid content languages: https://www.wikidata.org/wiki/Special:ApiSandbox#action=query&format=json&meta=wbcontentlanguages&formatversion=2&wbclcontext=term

  3. language -> monolingual snak language (for metadata)

     Valid monolingual snak languages: https://www.wikidata.org/wiki/Special:ApiSandbox#action=query&format=json&meta=wbcontentlanguages&formatversion=2&wbclcontext=monolingualtext
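As a rough illustration of how the two lists of valid codes could be fetched, here is a minimal Python sketch. The query parameters are the ones in the ApiSandbox links above. The helper names (`content_languages_url`, `extract_codes`) are made up for this example, and the response shape (a mapping keyed by language code under `query.wbcontentlanguages`) is an assumption that should be checked against a live response.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def content_languages_url(context):
    """Build the query URL for meta=wbcontentlanguages.

    `context` is "term" (labels) or "monolingualtext" (monolingual snaks),
    matching the wbclcontext values in the ApiSandbox links.
    """
    params = {
        "action": "query",
        "format": "json",
        "meta": "wbcontentlanguages",
        "formatversion": "2",
        "wbclcontext": context,
    }
    return WIKIDATA_API + "?" + urlencode(params)


def extract_codes(response):
    """Return the set of language codes from a parsed JSON response.

    Assumption: the codes are the keys of response["query"]["wbcontentlanguages"].
    """
    return set(response.get("query", {}).get("wbcontentlanguages", {}))
```

The two sets could be fetched once and cached, since they only change when Wikibase adds or removes a language.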

At present, these lookups do not give us a complete way to handle languages. If the code is not valid as a content language or as a monolingual snak language, we need to set a fallback language, and I'm not sure how to do this.

At present we are using OpenRefine to get the Wikidata item for statements, which is slow. Example: https://tools.wmflabs.org/openrefine-wikidata/en/api?query=%7B%22query%22%3A%22hu%22%2C%22limit%22%3A1%2C%22type%22%3A%5B%22Q1288568%22%2C%22Q33742%22%2C%22Q951873%22%2C%22Q33384%22%2C%22Q34770%22%2C%22Q1002697%22%5D%2C%22type_strict%22%3A%22any%22%7D

Where a wiki is associated with a language code, some fallbacks are available from that wiki's API, for example:

https://pt.wikipedia.org/wiki/Especial:ApiSandbox#action=query&format=json&meta=siteinfo&formatversion=2&siprop=general

But this does not handle the most common direction of having a more specific language code like en-US or fcr and wanting to fall back on en and fr respectively.
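One possible way to handle that direction is to strip subtags from the right until a valid code is found, with a small hand-maintained alias table for codes that cannot be reached by truncation. The sketch below is only an illustration: `fallback_chain`, `resolve` and the `aliases` table are hypothetical names, and the `fcr` -> `fr` entry is taken from the example above rather than from any existing mapping.

```python
def fallback_chain(code):
    """Return candidate codes from most to least specific.

    The input is lowercased and underscores are treated as hyphens, so
    "en-US" and "en_us" both yield ["en-us", "en"].
    """
    parts = code.strip().lower().replace("_", "-").split("-")
    return ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]


def resolve(code, valid_codes, aliases=None):
    """Return the first candidate that is a valid code, or None.

    `aliases` is a hypothetical hand-maintained mapping for codes whose
    fallback cannot be derived by dropping subtags.
    """
    aliases = aliases or {}
    for candidate in fallback_chain(code):
        candidate = aliases.get(candidate, candidate)
        if candidate in valid_codes:
            return candidate
    return None
```

For example, `resolve("en-US", {"en", "fr"})` gives `"en"`, and `resolve("fcr", {"en", "fr"}, {"fcr": "fr"})` gives `"fr"`. A `None` result means one of the default fallbacks still has to be applied.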

See also: T217239

ISO to Wikimedia language codes gist: https://gist.github.com/mvolz/1e99234373833838581e558e99904201

Event Timeline

Mvolz created this task. Feb 27 2019, 4:52 PM
Restricted Application added a subscriber: Aklapper. Feb 27 2019, 4:52 PM
Mvolz triaged this task as Normal priority. Feb 27 2019, 4:52 PM

If you need a mapping from ISO language codes to Wikimedia ones, Wikidata-Toolkit has such a mapping: https://github.com/Wikidata/Wikidata-Toolkit/blob/3e62f93b137c25961c5a12172c7f213a720ecb67/wdtk-datamodel/src/main/java/org/wikidata/wdtk/datamodel/interfaces/WikimediaLanguageCodes.java

It might not be the most up to date mapping you can get though (I would be interested in updating it but I haven't investigated how to do that)

For monolingual text values, you can fall back to und (undetermined) if the language is not recognized.

For the label, I think it would be acceptable to use the English label – it’s fairly common for works (especially works of art) to have the label in a different language (for example, Q1061035’s German label is La trahison des images, the French title), and if I understand correctly this would be the label of an item for the citation, which should qualify as a work.

For the Wikidata item, it’s probably best to omit the statement altogether (or use unknown value? but that usually means that the value is known to be unknown, which doesn’t really fit here) if no corresponding Wikidata item is found.
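The three fallbacks suggested in this comment could be combined roughly as follows. This is a sketch under stated assumptions, not how citoid does it: `apply_fallbacks` and its parameters are hypothetical, the two code sets are assumed to come from the wbcontentlanguages lookups, and `items_by_code` stands in for whatever pre-computed language-to-item table ends up being used.

```python
UNDETERMINED = "und"           # fallback for monolingual text values
DEFAULT_LABEL_LANGUAGE = "en"  # fallback for labels


def apply_fallbacks(code, term_codes, monolingual_codes, items_by_code):
    """Resolve one language code into the three targets, with fallbacks.

    - label language: the code if it is a valid term language, else "en"
    - monolingual language: the code if valid, else "und"
    - language item: the mapped item ID, or None, meaning the language
      statement should be omitted altogether
    """
    return {
        "label_language": code if code in term_codes else DEFAULT_LABEL_LANGUAGE,
        "monolingual_language": code if code in monolingual_codes else UNDETERMINED,
        "language_item": items_by_code.get(code),
    }
```

Returning `None` for the item rather than a placeholder keeps "no match found" distinct from an explicit unknown value, which this comment argues does not really fit here.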

Mvolz added a comment (edited). Feb 27 2019, 5:26 PM

Default fallbacks:

monolingual snak -> und code?
label -> fall back on en or user language - or both?

Mvolz added a comment. Feb 27 2019, 6:01 PM

> If you need a mapping from ISO language codes to Wikimedia ones, Wikidata-Toolkit has such a mapping: https://github.com/Wikidata/Wikidata-Toolkit/blob/3e62f93b137c25961c5a12172c7f213a720ecb67/wdtk-datamodel/src/main/java/org/wikidata/wdtk/datamodel/interfaces/WikimediaLanguageCodes.java
> It might not be the most up to date mapping you can get though (I would be interested in updating it but I haven't investigated how to do that)

This is great, thanks!

> For monolingual text values, you can fall back to und (undetermined) if the language is not recognized.
> For the label, I think it would be acceptable to use the English label – it’s fairly common for works (especially works of art) to have the label in a different language (for example, Q1061035’s German label is La trahison des images, the French title), and if I understand correctly this would be the label of an item for the citation, which should qualify as a work.
> For the Wikidata item, it’s probably best to omit the statement altogether (or use unknown value? but that usually means that the value is known to be unknown, which doesn’t really fit here) if no corresponding Wikidata item is found.

👍

Mvolz added a comment. Feb 28 2019, 9:29 AM

On thinking about this more, I think we can be quite greedy with the label: we can set it at least in the guessed text language, the user language, and English. This is because, when we create items, we're dealing with publications published in a particular language, like books and journal articles, where an untranslated label wouldn't necessarily be a bad thing and may even be the most correct thing. For example, for edition or translation items you might actually want the title as published, not the translated title (as opposed to a book item, where you might want the name of the book as it was published in each given language).
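A minimal sketch of that greedy approach, assuming the title is reused untranslated for every label. The function names are hypothetical, and `valid_term_codes` is assumed to be the set of valid content languages for terms.

```python
def label_languages(guessed, user, valid_term_codes):
    """Return the languages to set a label in, in priority order.

    Candidates are the guessed text language, the user language and English.
    Missing, invalid and duplicate codes are skipped.
    """
    ordered = []
    for code in (guessed, user, "en"):
        if code and code in valid_term_codes and code not in ordered:
            ordered.append(code)
    return ordered


def build_labels(title, guessed, user, valid_term_codes):
    """Map each chosen language to the same untranslated title."""
    return {code: title for code in label_languages(guessed, user, valid_term_codes)}
```

For a French title cited by a user whose interface language is German, this would set the same title as the `fr`, `de` and `en` labels.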

Mvolz updated the task description. Feb 28 2019, 10:45 AM