
Future of the OpenRefine Wikidata reconciliation interface
Open, Needs Triage, Public

Description

This ticket documents the development needs around the OpenRefine Wikidata reconciliation interface.

Background

When using OpenRefine reconciliation to match table cells to Wikidata items, OpenRefine communicates with Wikidata via a protocol called the reconciliation API. This is a generic API for data matching which is implemented by other data providers too.

For Wikidata, this API is implemented by a Toollabs project: https://tools.wmflabs.org/openrefine-wikidata/. This is a Python web service which implements the API on top of Wikidata's existing APIs. It is developed on GitHub. The same web service is also used by other Wikibase instances to provide a reconciliation service for them.
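For illustration, here is a minimal sketch of what a reconciliation query against such an endpoint looks like on the wire, in Python. The endpoint path (/en/api), the example values and the exact response fields are assumptions to be checked against the protocol specification, not an authoritative description of the service.

```
import json
import requests

# Assumed endpoint path of the Wikidata reconciliation wrapper (to be verified).
ENDPOINT = "https://wikidata.reconci.link/en/api"

# A batch of reconciliation queries, keyed by arbitrary identifiers.
queries = {
    "q0": {
        "query": "Douglas Adams",  # the cell value to match
        "type": "Q5",              # optional: restrict candidates to humans
        "limit": 3,                # optional: number of candidates to return
    }
}

# The protocol sends the batch as a form-encoded "queries" parameter.
response = requests.post(ENDPOINT, data={"queries": json.dumps(queries)})
for key, result in response.json().items():
    for candidate in result["result"]:
        print(key, candidate["id"], candidate["name"],
              candidate["score"], candidate["match"])
```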

Where should this service live, and in which language should it be written?

Since the current wrapper is relatively lightweight, it could well be rewritten in any other language (for instance in the interest of using a more suitable web framework, or just because the people developing it are more familiar with another stack).

I have suggested to WMDE that this could live closer to Wikibase itself, either as a MediaWiki extension or as a microservice, but I have not sensed much interest in these discussions (for very good reasons: they have bigger fish to fry, and rightfully so). Therefore we should not count on this API being implemented in Wikibase itself any time soon.

It is also worth noting that we can perfectly well have different implementations of the reconciliation API for Wikidata running at the same time, with users picking the one they prefer to run their reconciliation queries against. The current wrapper can be run locally (for instance with Docker), which is something I regularly recommend to users when the main endpoint is overloaded.

Things that should be improved

  • Performance

    The current reconciliation service is slow and unreliable. At the moment, processing a single reconciliation query can take about one second, which is extremely long. This is because many MediaWiki API calls are needed to answer a query: we search for the entity using two search endpoints (action=query&list=search and action=wbsearchentities) and we retrieve the items' JSON serialization via wbgetentities. We also need to perform occasional SPARQL queries to retrieve the list of subclasses of a given target type, to support type filtering; the sketch after this item illustrates these round-trips. This could be improved in the following ways:
    • Analyze and optimize the way queries are made, by processing queries in parallel and/or using asynchronous data loaders. Done: we cannot make huge gains with the current architecture;
    • Eliminate one of the search API calls, either by working with WMF to have them expose a single search endpoint which fits our needs, or by running our own search index on top of Wikidata (which requires running an updater to stay in sync, just like WDQS or EditGroups, so this is quite involved);
    • Migrate away from the Toollabs project infrastructure, which is not designed to provide high service availability. Done: https://wikidata.reconci.link/
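To make the cost described above more concrete, here is an illustrative sketch (not the wrapper's actual code) of the round-trips needed to answer a single reconciliation query, using the MediaWiki and SPARQL endpoints named in the Performance item; the parameter choices (language, result handling) are simplifications.

```
import requests

WD_API = "https://www.wikidata.org/w/api.php"
SPARQL = "https://query.wikidata.org/sparql"

def candidate_entities(text):
    # 1. Full-text search (action=query&list=search)
    r1 = requests.get(WD_API, params={
        "action": "query", "list": "search", "srsearch": text, "format": "json"})
    # 2. Entity search (action=wbsearchentities)
    r2 = requests.get(WD_API, params={
        "action": "wbsearchentities", "search": text, "language": "en",
        "type": "item", "format": "json"})
    ids = [hit["title"] for hit in r1.json()["query"]["search"]]
    ids += [hit["id"] for hit in r2.json()["search"]]
    # 3. Retrieve the candidates' full JSON (wbgetentities) to compute scores
    r3 = requests.get(WD_API, params={
        "action": "wbgetentities", "ids": "|".join(dict.fromkeys(ids)),
        "format": "json"})
    return r3.json().get("entities", {})

def subclasses(target_type):
    # 4. Occasional SPARQL query for the subclass closure of the target type
    query = "SELECT ?c WHERE { ?c wdt:P279* wd:%s }" % target_type
    r = requests.get(SPARQL, params={"query": query, "format": "json"})
    return [b["c"]["value"] for b in r.json()["results"]["bindings"]]
```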
  • Flexibility of the scoring mechanism

    Scoring of reconciliation candidates is opaque. It is hard for users to understand why a certain candidate received a given score. It is hard for reconciliation service maintainers to change the scoring mechanism, as users rely on the current behaviour for their reconciliation workflows. This could be improved in the following ways:
    • Explain the current reconciliation scoring mechanism publicly;
    • Add support for returning individual reconciliation features which are exposed to the user; this requires adaptations to the protocol and to the client, OpenRefine (a possible candidate format is sketched after this item);
    • Run our own search index, which would also help with exposing more useful reconciliation scores (we do not have access to term frequencies, or search scores in general, via the MediaWiki API).
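As a sketch of what the second bullet could mean in practice, a candidate in the reconciliation response might carry a per-feature breakdown alongside the aggregate score. The field and feature names below are purely illustrative; the actual format would have to be fixed in the protocol (via the W3C community group) and supported by OpenRefine.

```
# Hypothetical candidate with exposed matching features (illustrative only).
candidate = {
    "id": "Q42",
    "name": "Douglas Adams",
    "score": 71.3,            # aggregate score, as returned today
    "match": False,
    "features": [             # hypothetical per-feature breakdown
        {"id": "name_levenshtein", "value": 0.93},
        {"id": "type_match", "value": 1.0},
        {"id": "property_overlap", "value": 0.25},
    ],
}
```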
  • Documentation

    The reconciliation interface is essentially undocumented. Current documentation is mostly written up in OpenRefine's wiki, but that is not appropriate: the reconciliation endpoint should have documentation of its own (and then OpenRefine can document it as "the default reconciliation endpoint"). This is especially important for property paths, scoring and other similar features which cannot be easily discovered from OpenRefine's UI; a property-path example is sketched after this item. We now have some technical documentation here: https://openrefine-wikibase.readthedocs.io/en/latest/
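As an example of the kind of feature that needs endpoint-side documentation: a reconciliation query can match a value against a property path rather than a single property. The sketch below assumes the slash-separated path syntax described in the documentation linked above ("P17/P297" standing for "the ISO 3166-1 alpha-2 code of the country"); the documentation, not this sketch, is authoritative for the exact syntax.

```
# Hedged example of a reconciliation query using a property path.
query_with_path = {
    "query": "Berlin",
    "type": "Q515",                      # city
    "properties": [
        # Match the value "DE" against the country's ISO code, reached via
        # the path P17 (country) / P297 (ISO 3166-1 alpha-2 code).
        {"pid": "P17/P297", "v": "DE"},
    ],
}
```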
  • Robustness of the web API

    The web API does not react well to malformed input. This is mostly because it is written with a web framework that does not come with much input validation on its own. This could be improved in the following ways:
    • Now that we have JSON schemas for reconciliation queries (via the W3C community group), these could be used in the service itself to validate incoming requests (see the sketch after this list);
    • The service could be rewritten in a web framework that provides better validation for APIs (Swagger-based, for instance).
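A minimal sketch of the first option, assuming the jsonschema Python library and a stripped-down stand-in schema (the real schemas live in the W3C community group's specification repository):

```
import json
from jsonschema import validate, ValidationError

# Stand-in schema for a single reconciliation query; the real one is richer.
QUERY_SCHEMA = {
    "type": "object",
    "required": ["query"],
    "properties": {
        "query": {"type": "string"},
        "type": {"type": "string"},
        "limit": {"type": "integer", "minimum": 1},
        "properties": {"type": "array"},
    },
}

def parse_queries(raw):
    """Parse and validate the 'queries' form parameter, rejecting malformed input early."""
    try:
        queries = json.loads(raw)
    except ValueError as e:
        raise ValueError("queries parameter is not valid JSON: %s" % e)
    for key, query in queries.items():
        try:
            validate(instance=query, schema=QUERY_SCHEMA)
        except ValidationError as e:
            raise ValueError("query %s is malformed: %s" % (key, e.message))
    return queries
```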

Event Timeline

Concerning the choice of language to make it easier to maintain / deploy in a Wikimedia context:

  • PHP seems like a pretty widespread choice, and is mandatory if the API is to be implemented as a MediaWiki extension. My understanding from our meeting with @Lokal_Profil is that it would generally be helpful to other Wikimedia organizations who are familiar with this stack, even if the service is not directly integrated in MediaWiki;
  • @Mvolz told me that Node.js is also used to run services at WMF (but Parsoid is moving from Node.js to PHP-only, perhaps a sign that Node.js is not a good long-term choice).

The service is down at the moment because of T247501. We should consider hosting the service outside Toollabs, as we have had a range of similar issues in the past (unavailability due to devops bugs outside our control).

I am unable to maintain the reconciliation endpoint hosted on toolforge (affected by T257405, which I do not know how to solve).

So I am now hosting an instance outside of the Toolforge infrastructure, on my own server:
https://wikidata.reconci.link/


I'm surprised that a private third party proxying such a significant segment of the traffic to Wikidata hasn't prompted the Wikidata Engineering team to take this more seriously.

Wikidata and OpenRefine users need an officially supported, production-quality reconciliation endpoint. It's a key tool for getting high-quality data into Wikidata.

@Lydia_Pintscher What is required to get some action on this? Even getting it triaged would be a start.

About traffic: 17.6 million reconciliation queries were processed by the Toollabs endpoint in 2019 (that excludes other calls such as preview, suggest and extension). I don't know if that counts as "significant" in comparison to what the Wikidata API is exposed to, but I agree it deserves to live closer to Wikibase. I will try to follow up with a more detailed analysis of the usage of the service soon.

> I'm surprised that a private third party proxying such a significant segment of the traffic to Wikidata hasn't prompted the Wikidata Engineering team to take this more seriously.
>
> Wikidata and OpenRefine users need an officially supported, production-quality reconciliation endpoint. It's a key tool for getting high-quality data into Wikidata.

As @Pintoch has written there are a ton of things we have to work on with a small team so we unfortunately can't do all the things. My general impression is that the team around @Pintoch is doing a fine job and I'm keeping in mind that none of us has experience in this area so we'd have to start from a fairly basic level.

> @Lydia_Pintscher What is required to get some action on this? Even getting it triaged would be a start.

This ticket wasn't on our radar because it didn't have the Wikidata tag or anything else that would have brought it to my attention. I've added it now.

Thanks @Lydia_Pintscher and sorry for forgetting to tag the issue for Wikidata.

On my side, I would be keen to transform the wrapper into a Wikibase extension, but lack the skills to develop in the MediaWiki environment. It is also a significant risk to invest energy into this without being reasonably confident that it will end up being deployed on Wikidata and Commons.

Would you know of contractors who you would trust to develop an extension to the quality and security standards required in Wikimedia production settings?

I think it would be interesting to start a preliminary discussion about the overall feasibility of this: there might be aspects of the API specs that do not fit well with the sort of APIs that MediaWiki exposes (URL format, for instance), or with what your production settings require (CORS headers, constraints on the timing of HTTP requests…).
If we can identify such hurdles early, we could adapt the specs accordingly (assuming the rest of the community agrees), or convince ourselves that we will really have to stick to the wrapper-based system.

That makes sense. Let me get back to you on that.

There is also the simpler idea of hosting the existing wrapper as an official Wikidata service (just like query.wikidata.org is an official Wikidata service, even though Blazegraph is not part of the MediaWiki instance, nor is it maintained by WMDE). Comparing my tiny wrapper to Blazegraph is perhaps a bit bold (in terms of maturity), but I am just trying to explore all the ways in which Wikidata could officially offer a reconciliation service.

Another outage today: T262553 (Toolforge-wide ticket: T262550).

I should probably also point out that earlier versions of the Citoid tool on Wikidata (for Wikidata references) used OpenRefine and produced better results.

However, when we went to convert the user script into an extension, we had to drop the use of OpenRefine and fall back to exact string matches.

That project was paused, but with it restarting soon I would really like to see OpenRefine be available for it.

The reality is that no matter how well we try to clean up the data, we're not going to get exact matches for all the fields we want to fill in, and matching text with items is something OpenRefine does well.

It seems silly for Citoid to do this directly and reinvent the wheel when we already have a tool that does this; it just needs to be a bit more stable!

Thanks a lot @Mvolz! The new endpoint running at https://wikidata.reconci.link/ is much more stable but I assume that it is even harder for you to rely on something that is outside the WMF ecosystem (compared to a toolforge project).

> On my side, I would be keen to transform the wrapper into a Wikibase extension, but lack the skills to develop in the MediaWiki environment. It is also a significant risk to invest energy into this without being reasonably confident that it will end up being deployed on Wikidata and Commons.

Hi @Pintoch - wondering if you've given more thought to this since it was last surfaced. Also, @SandraF_WMF or @Spinster may have some insight here, since she is working with OpenRefine and its modernization.

Here is the current status of this issue:

The team at TIB (headed by @Loz.ross) is now maintaining the current wrapper and has contributed improvements to deploy it on other Wikibase instances:
https://gitlab.com/nfdi4culture/ta1-data-enrichment/openrefine-wikibase
Thank you to them!

The endpoint "https://wikidata.reconci.link/" is now served by a Wikimedia Cloud VPS (although it is running behind a proxy to retain the original domain name), which improves performance slightly by virtue of being closer to the Wikidata infrastructure. Thank you @Tarrow for suggesting it! The Cloud VPS project is wikidata-reconciliation. Getting rid of the proxy should be doable too, if I could find the time.

Prompted by the bad performance of this endpoint, Ontotext offers alternate endpoints for certain entity types, powered by their own search index over Wikidata. This video explains their approach. The endpoints are:

Thank you Ontotext (@VladimirAlexiev among others).

That's all I am aware of for now.

@Pintoch What do you see as the remaining issues to resolve here?

The Ontotext example is interesting. Might they be interested in also providing something closer to a WDQS interface?

> Here is the current status of this issue:
>
> The team at TIB (headed by @Loz.ross) is now maintaining the current wrapper and has contributed improvements to deploy it on other Wikibase instances:
> https://gitlab.com/nfdi4culture/ta1-data-enrichment/openrefine-wikibase
> Thank you to them!

So the Wikidata reconciliation service currently falls under the responsibility of a German research group... I know TIB does amazing work for the Wikibase ecosystem (thank you indeed @Loz.ross and colleagues), but for Wikidata this surely is a weird situation. The frequently requested Wikidata-related features are probably now less likely to be picked up and acted upon, including by the occasional volunteer who might pop up to help, because of the unusual place where this repository now lives.

This is a broadly public service; ideally I'd want it to be in a broadly public place, not in specific institutional hands/stewardship/ownership (however deeply I value and appreciate the institution here).

If I had the money and the people, I'd jump in to help, but I don't. I echo @Lydia_Pintscher's feedback above; outsiders seem to think 'Wikidata' and 'Wikimedia' are these big tech players with limitless resources that are just blind and unwilling to divert money to where it's needed, but in reality all Wikimedia technical infrastructure runs on shoestrings with tiny teams, and our fundraising efforts give us less in annual resources *as an entire movement* than, say, a single mid-sized university. Just to put things in perspective...

> Prompted by the bad performance of this endpoint, Ontotext offers alternate endpoints for certain entity types, powered by their own search index over Wikidata. This video explains their approach. The endpoints are:
>
> Thank you Ontotext (@VladimirAlexiev among others).

I, too, was very enthusiastic to see this initiative. However, I've tried the Ontotext reconciliation service several times, and I've had quite a few issues with it:

  • it produces *a lot* of wrong matches for me. Up to 30 to 40% of the "100% confident" matches are just plain wrong (see screenshot below - I just tried it again)
  • the service provides only 3 suggested matches when it's uncertain, and these tend to be wrong as well, even if a correct value is present (see second screenshot)
  • many datasets (at least the ones I work with) are pretty dirty and messy, and there is usually a mix of people and organizations, or a mix of places and organizations

image.png (703 KB)

image.png (130 KB)

I would currently not use the Ontotext alternatives for any real world application for that reason, and I have discouraged trainees from the GLAM sector from using them. (The 'old' Wikidata reconciliation service works much better and is still extremely workable when it's up, in my experience.)
@VladimirAlexiev can such issues be reported somewhere, and if so where? Is Ontotext willing and able to improve upon the above services?

@Addshore wrote a blog post summarizing the options around this problem and I think it's a very worthy read:
https://addshore.com/2023/07/wikibase-and-reconciliation/

@Addshore thank you for writing that and resurfacing your old post. Agreed that recon should be core to Wikibase, both for it to reach new users and to make setting up new small projects easier.

After hearing from reconciliation service (re)users, hearing emerging ideas, and seeing the reality of the Wikimedia movement's capacities, I decided to also create this ticket: T362149: Alternative, affordable, lower-barrier approach(es) to reconciliation