
Wikidata: Investigate bibliographic bot
Closed, Resolved · Public

Description

I want engineers to explore whether there is an existing solution for references on Wikidata that consist of only a URL.

Acceptance Criteria:

  • If there is a solution, share findings with the team as well as the estimated scope of work.
    • Citoid for Wikidata might already cover this.
  • Share whether there is a prototype (e.g. User:Aude/citoid.js).
  • If there is no existing solution, share a possible solution and the estimated scope of work with the team.

Related Tickets:
T165782, T99046

Wishlist proposal: https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2021/Bots_and_gadgets/Bibliographic_bot_for_Wikidata

Event Timeline

Preliminary investigation

What follows is my preliminary investigation of this wish. It is not complete yet, mostly because I think we need to refine the scope first, and then perhaps focus additional effort in one direction.

Problem statement, AIUI

Many references on Wikidata consist only of a bare URL. This is subpar because it doesn't expose details about the source, at least not in a structured format (note: I'm making this statement based on my experience as a Wikipedia editor, with little knowledge about Wikidata). We want a way to reduce the number of such references.

Possible solutions

I can see four ways in which we can act in order to mitigate the issue:

  1. Provide visual guidance in VisualEditor, so users add structured references right from the start (or use this to autofill details on existing bare URLs while editing).
  2. Provide an on-wiki gadget that does the same thing as 1.
  3. Provide an OAuth-based external tool with a nice interface that editors can use to turn the existing references on a given page into structured ones.
  4. Create a bot (without an OAuth-based interface) that periodically scans for bare URLs and replaces them with structured data.

Below is an initial investigation of each of these, but we might want to focus on only some of them. Note that options 3. and 4. are not mutually exclusive; in fact, we might merge them into a single tool that scans on its own but can also be manually controlled (like IABot).

1. While editing

This approach is essentially what citoid already does outside of Wikidata. When you click the button to add a reference and the dialog appears, you have an option to autogenerate data from a URL. So what we would need here is citoid integration with Wikidata. There is already a task for this: T199197: [2.11] Integrate Citoid in Wikidata (see also the subtasks). That task is stalled, but TTBOMK, some teams were (are?) already working on this. We might want to get in touch with them before exploring this direction or deciding that we don't want to go there.

Also, I think it's worth noting that according to T199197, the integration should already be enabled on beta Wikidata. I went to try this at https://wikidata.beta.wmflabs.org/wiki/Q381012 (a test page linked in that same task), but I didn't get any option to generate structured data from a URL. So either it's been disabled there, or I'm missing something.

2. Gadget

There's already a user script for this: User:Aude/citoid.js (mentioned above), plus the fork mentioned by Sam. AIUI, this is just a stopgap until option 1 is implemented, and IMHO it's not worth putting effort into implementing/refining a stopgap. Also, if we were to go this route, there wouldn't be much to do beyond making sure that the code works and has all the functionality we need. Note that these scripts use citoid as a backend.

3. Manual OAuth tool

There is already a tool for this that works on Wikipedia: https://refill.toolforge.org/. The tool is written in Python, has a nice localized interface, and works well. The problem, as implied before, is that it only works on Wikipedia, not on Wikidata. The tool also uses citoid as a backend.

What to do in this direction depends on the approach we want to follow. If a web interface is enough, we could get in touch with the maintainer and add Wikidata support to this tool. If we also want a dedicated backend (i.e. a bot), then building on top of refill may or may not be the best thing to do.

4. Bot

I couldn't find an existing bot that performs this task (I grepped bot user pages on Wikidata for a few keywords like P854). We can architect such a bot pretty much however we see fit. I believe the only certain thing is that we'd need to use citoid as the backend. It exposes a REST API which apparently supports a "wikibase" output format. Essentially, we'd have to put together:

  • A part which queries Wikidata for bare URLs. I don't have a solution for this due to my lack of knowledge about Wikidata, but I'm confident that there's a way to do this with high precision.
  • A part which queries the citoid API to obtain structured information about a URL. This should be easy; a sketch follows this list.
  • A part which parses the citoid output and puts it back on the wiki. I am not familiar with the edit APIs for wikibase, but again, I'm confident that a solution exists (see the sketch after the next paragraph).
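
To make the second part concrete, here is a minimal sketch of the citoid lookup. The endpoint path and the "wikibase" format string are assumptions based on the public citoid REST API; I haven't verified the wikibase format beyond what's mentioned above:

```python
# Minimal sketch of the citoid lookup step. The endpoint and the
# "wikibase" output format are assumptions and should be verified.
import requests
from urllib.parse import quote

CITOID_ENDPOINT = "https://en.wikipedia.org/api/rest_v1/data/citation"

def fetch_citation(url: str, fmt: str = "wikibase") -> list:
    """Ask citoid for structured metadata about `url`."""
    resp = requests.get(
        f"{CITOID_ENDPOINT}/{fmt}/{quote(url, safe='')}",
        headers={"accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # citoid returns a list of citation objects
```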

I cannot properly evaluate the technical complexity because I don't know Wikibase/Wikidata well enough, but it seems like we'd only need a search API and an edit API, both of which seem like reasonable requirements.
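
On the edit side, pywikibot already wraps the wikibase editing APIs (wbsetreference and friends), so the write-back step could look roughly like the sketch below. The property IDs (P854 reference URL, P1476 title) are real, but the flow is only illustrative: a real bot would have to replace or merge the existing bare reference rather than add a richer one next to it.

```python
# Illustrative sketch of the write-back step using pywikibot.
# Assumption: we enrich the first claim of `claim_prop` on `item_id`;
# a real bot would locate the exact bare-URL reference and replace it.
import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

def enrich_reference(item_id: str, claim_prop: str, url: str, title: str):
    item = pywikibot.ItemPage(repo, item_id)
    claim = item.get()["claims"][claim_prop][0]

    ref_url = pywikibot.Claim(repo, "P854")      # reference URL
    ref_url.setTarget(url)
    ref_title = pywikibot.Claim(repo, "P1476")   # title (monolingual text)
    ref_title.setTarget(pywikibot.WbMonolingualText(title, "en"))

    claim.addSources([ref_url, ref_title],
                     summary="Add structured data to bare-URL reference")
```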

Conclusions

I suggest looking into 3. and 4., in particular whether we want to do both, and how hard that would be compared to doing just one. If we do just one, I think 4. makes the most sense, although it's probably the one that requires the most work. I also suggest ignoring 1. and 2. altogether. Once we have decided which route(s) we want to take, we can collect additional information and/or get in touch with relevant people in order to better estimate this.

A part which queries Wikidata for bare URLs.

A SPARQL query can do this, and perhaps we'd want to query based on missing reference properties, not just on lone reference URLs (e.g. a URL and title might be set, but no author or publication date).
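
For what it's worth, an untested sketch of such a query against WDQS, narrowed to references that carry nothing but P854, might look like this. The FILTER relies on my understanding of the WDQS reference data model (reference snaks living under the pr: namespace), and a full-scale run would certainly need paging to avoid query timeouts:

```python
# Untested sketch: find statements whose reference consists of nothing
# but a reference URL (P854). The namespace filter is an assumption
# about the WDQS data model; LIMIT keeps the query from timing out.
import requests

QUERY = """
SELECT ?statement ?ref ?url WHERE {
  ?statement prov:wasDerivedFrom ?ref .
  ?ref pr:P854 ?url .
  FILTER NOT EXISTS {
    ?ref ?p ?v .
    FILTER(STRSTARTS(STR(?p), "http://www.wikidata.org/prop/reference/")
           && ?p != pr:P854)
  }
}
LIMIT 100
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "bare-url-ref-survey/0.1 (example)"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["statement"]["value"], row["url"]["value"])
```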

A part which parses the citoid output and puts it back on the wiki.

It would certainly be possible to do this, given the data returned by Citoid, but I'm wondering if this is reliable enough for an unsupervised process. For example, adding a reference while editing gives the user a chance to check that the title, author, date, etc. are correct and what they want before they save the data. If a bot comes along later, finds a bare reference URL, and fills in the missing data, there doesn't seem to be any process by which the user can easily confirm that it's correct. This feels like it might be a concern, but it didn't come up during the Wishlist Survey, so perhaps I'm misunderstanding things.

It seems to me that the first thing is to find out if anyone else is working on the while-editing tool: there's a 'Manual' tab in the references section on Test Wikidata, which looks like it's meant to have an 'Automatic' section next to it... but I guess it's not functioning yet (the code looks like it's missing any other tab, and it doesn't look like $wbRefTabsEnabled is configurable)? So maybe we could help with that work, if it's been stalled?

Thanks for looking into this! I think before you move forward we should have a call to talk through what the Wikidata team has already done with Marielle in this area. It's a deceptively simple problem on the surface.

A part which queries Wikidata for bare URLs.

A SPARQL query can do this, and perhaps we'd want to query based on missing reference properties, not just on lone reference URLs (e.g. a URL and title might be set, but no author or publication date).

Yeah, I was also thinking about SPARQL, but I've never queried Wikidata so I don't really know whether it's the right tool. I think retrieving a list of pages that use reference URLs would be trivial; the non-trivial part is making sure that a given reference has only the reference URL and nothing else.

A part which parses the citoid output and puts it back on the wiki.

It would certainly be possible to do this, given the data returned by Citoid, but I'm wondering if this is reliable enough for an unsupervised process. For example, adding a reference while editing gives the user a chance to check that the title, author, date, etc. are correct and what they want before they save the data. If a bot comes along later, finds a bare reference URL, and fills in the missing data, there doesn't seem to be any process by which the user can easily confirm that it's correct. This feels like it might be a concern, but it didn't come up during the Wishlist Survey, so perhaps I'm misunderstanding things.

Right, this is a valid concern. I think it all boils down to how accurate citoid is, and also whether it prefers false positives or false negatives when generating data (i.e. whether it knows how likely the auto-generated data for a given field is to be correct, and whether, based on that probability, it generates the data anyway or generates nothing). If it's not accuracy-focused, an entirely automatic process might indeed be problematic, and we might want to implement option 1 or 3 instead.
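
One way to bias an unsupervised bot toward false negatives would be a per-field allow-list with cheap sanity checks, skipping the edit entirely when nothing survives. A purely illustrative sketch (the field names and checks are hypothetical placeholders, not citoid's actual schema):

```python
# Purely illustrative: copy only fields that citoid returned and that
# pass a basic sanity check; prefer skipping an edit over guessing.
# Field names here are hypothetical, not citoid's actual schema.
def select_safe_fields(citation: dict) -> dict:
    checks = {
        "title": lambda v: isinstance(v, str) and 0 < len(v) < 400,
        "date": lambda v: isinstance(v, str) and v[:4].isdigit(),
        "author": lambda v: bool(v),
    }
    return {k: citation[k] for k, ok in checks.items()
            if k in citation and ok(citation[k])}

def should_edit(citation: dict) -> bool:
    return bool(select_safe_fields(citation))
```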

It seems to me that the first thing is to find out if anyone else is working on the while-editing tool

+1

there's a 'Manual' tab in the references section on Test Wikidata, which looks like it's meant to have an 'Automatic' section next to it... but I guess it's not functioning yet (the code looks like it's missing any other tab, and it doesn't look like $wbRefTabsEnabled is configurable)? So maybe we could help with that work, if it's been stalled?

I think it used to work on the beta cluster, given T228411 and the comments at T199197, but maybe it has been disabled since then.

Thanks for looking into this! I think before you move forward we should have a call to talk through what the Wikidata team has already done with Marielle in this area. It's a deceptively simple problem on the surface.

Thank you @Lydia_Pintscher, this would be fantastic!


This investigation provided all the information necessary to understand the scope of work and re-estimate the original expectations of the wish. Thanks for your work!

Thanks for evaluating this. Can you suggest a path forward? Just making Aude's citoid.js or @MichaelSchoenitzer's fork work again? Cc @JeanFred