
[2.5] Study how to integrate Citoid in Wikidata
Closed, Resolved (Public)

Description

Output
We will conduct research on integrating Citoid in Wikidata, aiming to drastically reduce the number of unsourced statements in Wikidata at risk of deletion and to facilitate their reuse across other Wikimedia projects. (dependent on Segment 4 (WMDE) • Output 5)

Target
End-to-end integration of Citoid in Wikidata.

Event Timeline

DarTar lowered the priority of this task from High to Medium.

There's some info about this already on T196353:

Process:

Create a preliminary gadget to do the same.
Decide how mappings will be handled: will they be hard-coded, as in a gadget, or will there be some user-editable way to do it, as in TemplateData + VisualEditor, via JSON, or perhaps by modelling it in Wikibase itself using Wikidata items/properties?
General way:
Pros: Can be used with any Wikibase instance. Hard-coding to specific properties is a bit problematic because the properties are theoretically quite changeable, and a hard-coded mapping would be much less useful on non-Wikidata installations of Wikibase.

Cons: As we discovered with TemplateData + VisualEditor, users are sometimes frustrated by the inflexibility of having only a JSON block to work with when inserting data, and want the Citoid extension to be able to make more fine-tuned adjustments. This definitely still works for VisualEditor, because there are many different languages, all with different templates, but the argument is weaker for Wikidata, because we really only have one Wikibase instance and the possibility of using it on other instances is more theoretical than actual.
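
To make the "general way" concrete, here is a rough sketch of a data-driven mapping, written in Python only to show the logic (an actual gadget would be JavaScript). The JSON content, the choice of Citoid fields and the helper names `load_mapping` and `map_citation` are assumptions for illustration, not an agreed format.

```python
import json

# Hypothetical mapping from Citoid field names to Wikidata property IDs.
# The field names follow Citoid's Zotero-style output; which fields are
# mapped, and to which properties, is an assumption made for illustration.
MAPPING_JSON = """
{
  "title": "P1476",
  "date": "P577",
  "DOI": "P356",
  "volume": "P478",
  "issue": "P433",
  "pages": "P304"
}
"""


def load_mapping(raw):
    """Parse the JSON mapping, e.g. after reading it from an interface message."""
    return json.loads(raw)


def map_citation(citation, mapping):
    """Return (property, value) pairs for every Citoid field the mapping knows.

    Fields without a mapping entry, or with an empty value, are skipped
    rather than guessed.
    """
    return [
        (mapping[field], value)
        for field, value in citation.items()
        if field in mapping and value
    ]


citation = {"title": "An example article", "volume": "12", "libraryCatalog": "Example"}
print(map_citation(citation, load_mapping(MAPPING_JSON)))
```

Because the mapping is plain data, another Wikibase instance could in principle swap in its own property IDs without touching the code, which is the main argument on the "pros" side above.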

Conversation from IRC about modelling citoid ontology in wikidata and using that:

(11:14:34) mvolz: Can a wikidata dev give me a sanity check on something? Would it be feasible to write a gadget that uses sparql to query data in wikibase and then use that information to insert items? Or would it be too slow/better to use an interface message with some JSON in it, if it's something that isn't expected to change from query to query (i.e. you'd expect to get the same single result each time, so querying wikidata isn't strictly necessary.)
(11:21:55) sjoerddebruin: I think the constraints system uses queries...
(11:27:04) mvolz: sjoerddebruin: thanks, that is very helpful, I will look and see how they're doing it :)
(11:28:04) sjoerddebruin: I don't see Lucas around here, but I suggest talking with him. :)
(11:29:55) DanielK_WMDE: mvolz: it depends on the query, and if it runs always, or only when the user presses a button, whether it's just one query or many...
(11:30:21) DanielK_WMDE: generally speaking, having the user press a button, then run a query, then do something, should be fine by the system
(11:30:37) DanielK_WMDE: depending on the query, it may be annoyingly slow to the user. expectation management is key
(11:40:50) mvolz: DanielK_WMDE: so I'm thinking about how we want to insert items as references, and I was contemplating putting in the entire ontology for which citoid properties are equivalent to the wikidata ones. So the queries would occur every time an item was being created as reference, and probably maybe 10 queries for something like SELECT equivalent wikidata property WHERE instance of citoid property AND name equals "bookTitle".
(11:40:56) mvolz: There are some things we absolutely have to do queries for though, like the item which corresponds to the journal, for instance, or even author names, so theoretically that could get really bogged down adding those as well...
(11:41:27) mvolz: And so the list of properties could just go into an interface message to cut down on that.
(11:42:43) mvolz: But if all the queries are done in parallel, these ones are all quite short.
(11:42:53) DanielK_WMDE: Lookup by name is always a bit problematic with wikidata. But talk to Lucas, he's the expert.
(11:42:56) mvolz: shorter than probably the longer ones we'd be waiting for anyway?
(11:42:59) mvolz: Ok :).
(11:43:14) DanielK_WMDE: but the mapping between citoid and wikidata sounds like it should be rather small, and cacheable
(11:43:20) mvolz: Yes
(11:43:36) DanielK_WMDE: generating that from a sparql query is sensible, but re-running that query over and over... less so.
(11:43:56) DanielK_WMDE: otoh, we do have varnish caching for queries, that may be enough
(11:44:10) DanielK_WMDE: I still recommend caching that on the client.
(11:44:21) DanielK_WMDE: people who add one reference are likely to add more soon
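
The client-side caching suggested at the end of the conversation could look roughly like the sketch below. It is Python for illustration only; in a gadget the same idea would be written in JavaScript. The class name `MappingCache`, the one-hour default and the injected `fetch_mapping` callable (standing in for whatever runs the SPARQL query) are all assumptions.

```python
import time


class MappingCache:
    """Keep the Citoid-to-Wikidata mapping in memory for a limited time."""

    def __init__(self, fetch_mapping, ttl_seconds=3600.0, clock=time.monotonic):
        self._fetch_mapping = fetch_mapping  # callable that runs the (slow) query
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._value = None
        self._fetched_at = None

    def get(self):
        """Return the cached mapping, re-running the query only when it expired."""
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            self._value = self._fetch_mapping()
            self._fetched_at = now
        return self._value


calls = []


def fake_fetch():
    # Stand-in for a SPARQL request; records each call so we can count them.
    calls.append(1)
    return {"title": "P1476"}


cache = MappingCache(fake_fetch, ttl_seconds=60)
cache.get()
cache.get()
print(len(calls))  # the second get() is served from the cache
```

This matches the point that people who add one reference are likely to add more soon: only the first reference pays for the query.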

And more about the extension from Lucas: "This situation, if I understand it correctly, reminds me of the mapping for normalized units – to convert a statement “1 km” into a normalized value of “1000 m”, Wikibase uses a JSON configuration file describing the mappings between units. That configuration file is generated from data on Wikidata (using the updateUnits.php maintenance script), but that only happens periodically (the last time was over a year ago). Do you think a similar approach might make sense for Citoid integration?"
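
A minimal sketch of what a similar approach might involve: take the results of a SPARQL query in the standard SPARQL JSON results format and write them out as a static JSON mapping that is regenerated only occasionally. The variable names `citoidField` and `property`, and the query that would produce them, are hypothetical; this is not how updateUnits.php itself works.

```python
import json

ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def bindings_to_mapping(sparql_json, field_var="citoidField", property_var="property"):
    """Turn SPARQL JSON results into {citoid field: property ID}."""
    mapping = {}
    for row in sparql_json["results"]["bindings"]:
        uri = row[property_var]["value"]
        if not uri.startswith(ENTITY_PREFIX):
            continue  # skip anything that is not a Wikidata entity URI
        mapping[row[field_var]["value"]] = uri[len(ENTITY_PREFIX):]
    return mapping


def render_config(mapping):
    """Serialise with sorted keys so periodic regeneration gives stable output."""
    return json.dumps(mapping, indent=2, sort_keys=True)


# Hand-written sample in the SPARQL JSON results shape, not real query output.
sample = {
    "head": {"vars": ["citoidField", "property"]},
    "results": {"bindings": [
        {"citoidField": {"type": "literal", "value": "title"},
         "property": {"type": "uri", "value": "http://www.wikidata.org/entity/P1476"}},
    ]},
}
print(render_config(bindings_to_mapping(sample)))
```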

Also sample interface message:

https://test.wikidata.org/wiki/MediaWiki:Citoid_qid_type_map.json

I have a very WIP gadget as well: P7368

Something that also needs further research: how are we going to resolve items, in particular "published in" and author? Potentially we will have to create multiple items in the service of creating a single new reference item, which raises the question of how much should be automated and how much should be left to the user to decide. If we ask a user to resolve 20 authors, it becomes too tedious for them to complete, so we have to be conscientious about whether we want to offer this at all and, if we do, make sure it's optional.
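
One possible policy, sketched purely for discussion: link an author automatically only when there is exactly one candidate item, ask the user about a small number of ambiguous names, and fall back to a plain name string for the rest (on Wikidata that would be author, P50, versus author name string, P2093). The function name, the `max_prompts` limit and the shape of `candidates` are assumptions, and auto-linking on a single match would itself need review because of name collisions.

```python
def plan_author_statements(authors, candidates, max_prompts=3):
    """Decide how each author is recorded on a new reference item.

    authors: list of name strings from Citoid, in citation order.
    candidates: dict mapping a name to the list of matching item IDs
        (how those matches are found is out of scope here).
    Returns (action, name, item_id) tuples where action is "link"
    (use the item), "ask" (let the user choose) or "string" (store
    the plain name).
    """
    plan = []
    prompts = 0
    for name in authors:
        matches = candidates.get(name, [])
        if len(matches) == 1:
            plan.append(("link", name, matches[0]))
        elif len(matches) > 1 and prompts < max_prompts:
            prompts += 1  # cap how many questions one reference can trigger
            plan.append(("ask", name, None))
        else:
            plan.append(("string", name, None))
    return plan


# Placeholder IDs, not real items.
example = plan_author_statements(
    ["A. Author", "B. Writer", "C. Scribe"],
    {"A. Author": ["Q-example-1"], "B. Writer": ["Q-example-2", "Q-example-3"]},
)
for step in example:
    print(step)
```

Capping the number of prompts keeps the 20-author case from turning into 20 questions, while still leaving every prompt optional.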

I'm not sure whether to resolve or decline this as Research wasn't involved here :).

Mvolz changed the task status from Declined to Resolved. Jun 27 2019, 8:56 AM
Mvolz claimed this task.
Mvolz updated the task description.