
Intermittent errors from Wikidata reconciliation service (wikidata.reconci.link) used by OpenRefine
Open, Needs Triage, Public

Description

Service endpoint

The reconciliation service being used is reachable at: https://wikidata.reconci.link/en/api.

Errors returned to OpenRefine

The service sometimes returns errors during reconciliation operations. The following messages are observed:

  • HTTP error 403 : Forbidden for URL /en/api
  • No. of recon objects was less than no. of jobs
  • The reconciliation service returned an invalid response

These messages appear in the OpenRefine user interface.

The OpenRefine server log records the following exception when sending reconciliation queries:

java.io.IOException: HTTP error 403 : Forbidden for URL /en/api

This occurs during:

com.google.refine.commands.recon.GuessTypesOfColumnCommand

which is part of the column type suggestion step in reconciliation.

Example queries sent
The request triggering the error includes queries such as:

University of Birmingham
The Ohio State University
University Health Network
University of Strasbourg

These are standard reconciliation queries generated by OpenRefine.

During reconciliation:

  • some rows reconcile successfully
  • some rows show the errors listed above
  • the errors appear intermittently during the same reconciliation run

This behavior is visible in the OpenRefine UI.

reconciliation-error.png (702×430 px, 93 KB)

Related user reports

OpenRefine forum threads report similar issues when reconciling with Wikidata.

Environment used for testing: OpenRefine versions 3.9.3 and 3.10.0.

Event Timeline

At the moment it seems to be functional, but over the past 2 days I have observed long periods during which no queries were processed, even though the monitoring script never detected the service as unavailable:

image.png (474×1 px, 74 KB)

I don't know where the source repo for the deployed service lives (there seem to be three different candidates), but I had a brief look at this yesterday while it was erroring, and I think there are bugs in both the reconciliation service and OpenRefine.

The reconciliation service:

  • doesn't consistently send User-Agent headers for all backend calls
  • uses different Wikidata calls in a data-dependent way that isn't clear to me, which interacts with the above
  • never sends a User-Agent when handling property suggest requests, making these a hard fail when rate limiting is in effect
  • doesn't handle Retry-After headers or back off on 429s (see the sketch after this list)
  • converts all errors to 403s, masking the actual errors
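
For illustration, here is roughly what the missing retry behavior would look like (a minimal synchronous Python sketch using requests; the User-Agent string and retry budget are made up, and the deployed service is async, so only the shape of the logic matters):

```
import time
import requests

# Placeholder identity; a real deployment would use its own name and contact.
USER_AGENT = "wikidata-reconci-link/0.1 (https://example.org; admin@example.org)"
MAX_RETRIES = 4

def fetch_backend(url: str, params: dict) -> dict:
    """GET a Wikidata backend URL: always send a User-Agent, honor
    Retry-After on 429s, and surface the real error instead of a blanket 403."""
    for attempt in range(MAX_RETRIES):
        resp = requests.get(url, params=params,
                            headers={"User-Agent": USER_AGENT}, timeout=30)
        if resp.status_code == 429:
            # Assumes a numeric Retry-After; a full implementation would also
            # handle HTTP-date values. Falls back to exponential backoff.
            delay = float(resp.headers.get("Retry-After", 2 ** attempt))
            time.sleep(delay)
            continue
        resp.raise_for_status()  # propagate the actual status, don't mask it as 403
        return resp.json()
    raise RuntimeError(f"still rate-limited after {MAX_RETRIES} attempts: {url}")
```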

Wikidata applies different rate limits depending on whether a User-Agent header is sent and whether it considers the client a "good" application. This has different effects on different types of calls from the reconciliation service, which confuses matters.

OpenRefine incorrectly handles errors for batches that fail, so a batch of 10 will get one 403, one "invalid response", and "No. of recon objects was less than no. of jobs" for the remainder of the batch (see the sketch below).
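
For illustration, a saner client-side treatment would attach one consistent error to every job in a failed batch (a Python sketch of the idea, not OpenRefine's actual Java code; send_batch is an assumed helper that POSTs one batch):

```
def reconcile_batch(jobs, send_batch):
    """Reconcile one batch of jobs; if the batch request fails, give every
    job the same honest error instead of a mix of unrelated messages."""
    try:
        results = send_batch(jobs)  # one HTTP call for the whole batch
    except IOError as e:  # e.g. the HTTP 403 seen in the logs
        return [{"job": job, "error": str(e)} for job in jobs]
    if len(results) != len(jobs):
        # The "recon objects < jobs" case: report it uniformly.
        return [{"job": job, "error": "incomplete response from service"}
                for job in jobs]
    return [{"job": job, "result": r} for job, r in zip(jobs, results)]
```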

All of the above, plus varying load through a shared choke point (plus, perhaps, some evil devop in the back room turning the rate limits up and down), interact in a way that makes for a very confusing user experience.

Is it still the case that Wikidata refuses to offer a production reconciliation service?

Who runs the current service and where does its source code live?

Looking longer term, I wonder if a shared multi-user unauthenticated service will continue to be feasible as Wikidata tightens the screws on rate limits. We may need to have users authenticate in some way to get more permissive rate limits, which would require changes to both the client (OpenRefine) and the reconciliation service.

The original incident, which had been going on for "several days" according to users, resolved itself shortly after this ticket was created, but Wikidata appears to have turned off the spigot again yesterday, so things are broken again (and no longer in an intermittent fashion).

I still run the https://wikidata.reconci.link/ service. It is hosted on a personal server, serving as a load balancer which forwards queries to https://wikidata-reconciliation.wmcloud.org/ by default, and to its own local instance of the reconciliation service when https://wikidata-reconciliation.wmcloud.org/ is not reachable. The reasoning behind this setup was that https://wikidata-reconciliation.wmcloud.org/ is generally pretty fast, being hosted on a Cloud VPS server that sits close to Wikidata's production servers (both for the Wikidata actions API and the query service). But in the past, this Cloud VPS hosting regularly suffered from devops issues beyond my control, hence the introduction of this fallback server to take over when the Cloud VPS has a hiccup. This setup could be reconsidered - perhaps the Cloud VPS infrastructure is now reliable enough that this additional front node brings more instability on its own.
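
For illustration, the failover behavior described above amounts to something like the following (a minimal Python sketch using requests; the fallback address is hypothetical, and the real front node presumably proxies at the HTTP level rather than re-issuing requests like this):

```
import requests

PRIMARY = "https://wikidata-reconciliation.wmcloud.org"
FALLBACK = "http://127.0.0.1:8000"  # hypothetical local instance of the service

def forward(path: str, params: dict) -> requests.Response:
    """Send the query to the Cloud VPS backend by default; fall back to
    the local instance only when the primary cannot be reached."""
    try:
        return requests.get(PRIMARY + path, params=params, timeout=10)
    except (requests.ConnectionError, requests.Timeout):
        return requests.get(FALLBACK + path, params=params, timeout=10)

# Example: forward("/en/api", {"queries": "..."})
```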

I've been thinking about attempting again to hand over this service to a group who would maintain it. Any takers?
There has been a long-standing position in OpenRefine that this reconciliation service is not in the scope of the OpenRefine project. The reasons I can think of are:

  • Not wanting to expand the scope of the project, to avoid being responsible for too many things
  • Hoping that WMF/WMDE would eventually offer an official reconciliation service, which would be less likely to happen if OpenRefine offered one
  • Not being familiar with the tech stack, not being used to maintaining web services
  • Taking responsibility for this service could mean that other authority databases would expect OpenRefine to do the same for them

That being said, I think it's worth reconsidering this position. I see a couple of reasons why this service would be quite a natural fit in OpenRefine (if that's not already de facto the case):

  • Users don't really perceive the difference between OpenRefine and the reconciliation service. When the service breaks, they perceive it as a problem in OpenRefine and go to OpenRefine's forum to report it.
  • There is a big overlap between the people involved in OpenRefine and those involved in the service (as shown by this discussion, which is happening on Phabricator yet only involves (former) OpenRefine members)
  • The initial development of the service was coordinated in OpenRefine's issue tracker.
  • The Wikidata service is registered in OpenRefine by default, unlike other services which the users need to discover first.
  • Efforts to encourage WMDE to offer an official version of this service haven't led anywhere, and I think that's unlikely to change.
  • Although the service can be used with other reconciliation clients, OpenRefine remains the dominant client for this recon endpoint

So to me, continuing to claim that this service is outside the scope of OpenRefine doesn't feel that strategic. It's a position that puts architectural principles before the needs of end users, in my opinion.

@Pintoch as a quick stop-gap solution, you could enroll your service as a "known client". That would remove all rate limits.

It is hosted on a personal server, serving as a load balancer which forwards queries to https://wikidata-reconciliation.wmcloud.org/ by default, and to its own local instance of the reconciliation service when https://wikidata-reconciliation.wmcloud.org/ is not reachable.

That's useful information on the topology/architecture, and it adds an extra level of indirection when trying to decipher symptoms from the outside, since the rate limits for Cloud VPS-hosted and privately hosted services are likely to differ.

Is there observability into which direction the load balancer is sending traffic? Does the load balancer expose any other stats (e.g. error counts)?

Is there a repo associated with the configuration for the production instance(s)?

Is this the correct place to report service issues, or is there a better place? The home page for the service points to https://github.com/wetneb/openrefine-wikibase/issues, but that, in turn, points to two other repos, both of which seem more software-focused than service-focused.

I totally understand not wanting to be the middle man between demanding OpenRefine users and unreliable Wikidata services. As far as the OpenRefine team taking on additional responsibilities goes, I think that would be a stretch given current resources. It looks like we're going to have to invest in adapting OpenRefine (again!) to Wikidata's shifting requirements.

The repo of the code deployed is still https://github.com/wetneb/openrefine-wikibase, with deployment documented here: https://openrefine-wikibase.readthedocs.io/en/latest/install.html#deploying-in-production.

Is there observability into which direction the load balancer is sending traffic? Does the load balancer expose any other stats (e.g. error counts)?

Not that I am aware of.

Is this the correct place to report service issues, or is there a better place? The home page for the service points to https://github.com/wetneb/openrefine-wikibase/issues, but that, in turn, points to two other repos, both of which seem more software-focused than service-focused.

There is no particular place to report service issues.

I totally understand not wanting to be the middle man between demanding OpenRefine users and unreliable Wikidata services. As far as the OpenRefine team taking on additional responsibilities goes, I think that would be a stretch given current resources. It looks like we're going to have to invest in adapting OpenRefine (again!) to Wikidata's shifting requirements.

As far as I am concerned, I don't find OpenRefine users particularly demanding, nor Wikidata services particularly unreliable. It's only fair that WMF adapts the rate limiting of its deployment to combat traffic spikes. It's to be expected that OpenRefine users report when the Wikidata reconciliation service is unavailable.

I get the impression that you are reluctant to make what I see as pretty straightforward adaptations on OpenRefine's side to improve the experience of the Wikimedia community, which worries me a bit. When I left the OpenRefine project, I had the impression that you still saw the value in serving Wikimedia users.

@Pintoch as a quick stop-gap solution, you could enroll your service as a "known client". That would remove all rate limits.

Thanks for the pointer, will do once I've moved my server to other hardware (which I need to do independently of this issue). But as you point out, the main backend https://wikidata-reconciliation.wmcloud.org/ is already hosted on WMCS, so it shouldn't need to be enrolled as a known client, right? So users will likely get a more reliable service if they switch from https://wikidata.reconci.link/ to https://wikidata-reconciliation.wmcloud.org/ (as it can be used directly, without going through the load balancer).

That seems like a good option for now, yes.

Thanks to all for engaging in this conversation and for looking for a resolution that satisfies all parties. As this issue spans multiple repositories (OpenRefine, Wikidata Toolkit, the reconciliation service's configuration and hosting, and finally Wikidata), I created this document to

  1. Provide a high-level cross-system overview and coordination, and
  2. Discuss long-term ownership and maintenance of the reconciliation code and hosting

I invite you to correct and complete it, as it will help all parties better understand the different parts involved.

That's a nice initiative! Given that there are multiple entities and various kinds of requests flying between them, I thought it would be clearer with a diagram (attached).
From this diagram there are four kinds of requests that can reach Wikidata when someone uses OpenRefine:

  • Number 1 on the diagram (from the web frontend to MediaWiki): the User-Agent is set by the user's browser; there is no control over this as things stand (but these requests could be proxied through OpenRefine's server)
  • Number 3 on the diagram, from OpenRefine's server to MediaWiki: the OpenRefine team has full control over what user agent they want to send there
  • Numbers 4 and 5 on the diagram, from the reconciliation service to Wikidata (either MediaWiki or WDQS): the User-Agent is set in the code of the reconciliation service, which can be changed, but this will not impact rate limiting since the recon service runs on a Wikimedia Cloud VPS.

I may have forgotten some links; I just made this off the top of my head.
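
As an aside, for the hops where the User-Agent can be controlled, the Wikimedia User-Agent policy asks clients to identify themselves and provide contact information. A minimal sketch of a compliant header (tool name, URL and email are placeholders):

```
import requests

# Placeholder values; a real deployment would use its own name and contact.
USER_AGENT = (
    "wikidata-reconciliation-service/1.0 "
    "(https://example.org/recon; admin@example.org) "
    f"python-requests/{requests.__version__}"
)

session = requests.Session()
session.headers["User-Agent"] = USER_AGENT  # sent on every request via this session
```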

That's a useful diagram. One thing it doesn't include is the current topology of two different reconciliation services, on different hosts, one of which is outside of WMCS, both fronted by a load balancer on the private server at https://wikidata.reconci.link/

Also, it's a nit, but there are two different client libraries involved in calls from the OpenRefine server, Wikidata Toolkit and okhttp, though the latter is only used to fetch language codes. The Apache HTTP client is also used for generic "fetch from the web" calls, but that would only come into play if a user decided to roll their own access to the Wikidata APIs (i.e. unlikely). So there are three places where we need to set the User-Agent, only one of which has the Wikidata username easily accessible.

@Pintoch if you are planning to continue running the Wikidata reconciliation service as a production service, I'm happy to offer the fixes I have in hand if you let me know what repo the pull requests should be opened against. Off the top of my head, they are:

  • retry support including honoring 429s and Retry-After headers
  • consistent User-Agent setting
  • limiting simultaneous connections to 3 instead of 10 (although depending on which Wikidata page you read, perhaps the limit is 1, not 3) - sketched after this list
  • error counters (incomplete currently)
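
For illustration, the connection-limiting fix could be as simple as wrapping backend calls in a semaphore (an asyncio/aiohttp sketch; the reconciliation service's actual structure may differ):

```
import asyncio
import aiohttp

# Cap simultaneous backend connections at 3 (per the discussion above;
# some Wikidata documentation suggests the limit may even be 1).
MAX_CONCURRENT = 3
semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def limited_get(session: aiohttp.ClientSession, url: str):
    """Fetch a URL while never exceeding MAX_CONCURRENT in-flight requests."""
    async with semaphore:
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.json()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(limited_get(session, u) for u in urls))
```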

There is no particular place to report service issues.

That's obviously suboptimal for a production service, so it's something we'll need to figure out a solution for.

During the Wikibase Stakeholder Group Meeting last week, @Loz.ross expressed her openness to using the reconciliation server codebase hosted by NFDI4Culture at https://gitlab.com/nfdi4culture/openrefine-reconciliation-services/openrefine-wikibase, including the issue tracker and accepting pull/merge requests (PRs). Given that this setup is already in place, should we keep this as the reference repository?

During the Wikibase Stakeholder Group Meeting last week, @Loz.ross expressed her openness to using the reconciliation server codebase hosted by NFDI4Culture at https://gitlab.com/nfdi4culture/openrefine-reconciliation-services/openrefine-wikibase, including the issue tracker and accepting pull/merge requests (PRs). Given that this setup is already in place, should we keep this as the reference repository?

It's kind to offer your help, @Loz.ross. Overall, hosting a git repository somewhere is cheap and could be done anywhere. What would be more important is that someone feels responsible and competent to review merge requests (or PRs, depending on the forge), to fix the continuous integration when it breaks, and probably to make releases regularly. Looking at the list of MRs in the NFDI4Culture repository, I do not get the impression that anyone from your team is available to do this work at the moment - or is that likely to change soon? There are a couple of good contributions that have been waiting there for more than a year without being accepted or rejected. So I would be reluctant to encourage @tfmorris to open new merge requests there, if they are just going to gather dust in the same way.

If there currently is a developer on your team whose responsibility it is to take care of this repository, I would be very happy to have a meeting with them to ensure that they are equipped for the task (for instance, making sure that they receive email notifications when a new issue or merge request is opened, that they occasionally check the MR backlog even in the absence of a notification, discussing how to evaluate the impact of the changes, and so on). We could also do follow-up meetings (say, every 6 months) to check how they are doing with this task.

That being said, I need to apologize again about this situation. When I archived my repository and promoted your fork instead, the move wasn't very well thought through. The reason for it was that I was tired of maintaining this repository in a personal capacity, and was sad to discover that Paul had made this fork without trying to contribute his Dockerization and documentation improvements to my repository. I should have reached out to him, asking him whether he would consider making a PR for it (maybe he just wasn't so familiar with open source contributions, or didn't feel like his work was good enough to be integrated upstream). By creating this fork, Paul probably didn't intend to maintain this code base for the use of other institutions, nor to accept external contributions: it was probably just meant to be internal documentation about your own deployment(s) of the service. So pointing to your repository as the new official home of this project was bound to fail, in my opinion.

That being said, I need to apologize again about this situation. When I archived my repository and promoted your fork instead, the move wasn't very well thought through. The reason for it was that I was tired of maintaining this repository in a personal capacity, and was sad to discover that Paul had made this fork without trying to contribute his Dockerization and documentation improvements to my repository. I should have reached out to him, asking him whether he would consider making a PR for it (maybe he just wasn't so familiar with open source contributions, or didn't feel like his work was good enough to be integrated upstream). By creating this fork, Paul probably didn't intend to maintain this code base for the use of other institutions, nor to accept external contributions: it was probably just meant to be internal documentation about your own deployment(s) of the service. So pointing to your repository as the new official home of this project was bound to fail, in my opinion.

"Paul" here.

@Pintoch, in an interesting turn of events I was preparing to deliver a workshop on using this service for GLAM data at the Australian WikiCon, and exploring the issues described above led me to this thread.

First up, I'm extremely sorry for any hurt I caused you by cloning your repo and not pushing changes back to source. To clarify, I haven't checked the commit history, but I expect my contributions were also historically minimal compared to the work you have done on this project.

Based on this experience specifically, maybe there needs to be explicit clarity - when an organisation takes on maintenance of open-source code and assigns a developer - about whether it is the organisation or the developer who is taking on the long-term commitment to provide support.

That being said, I think the tool you created is incredible and I would be interested in supporting and/or maintaining it, possibly following the model you have outlined, if you would be open to an off-thread conversation on how this could work.

Hi all, at the request of @Magdmartin I am joining here to add some background info from the NFDI4C perspective.

First of all, I think neither @Pintoch nor @Pxxlhxslxn have anything to apologize for. There is perhaps some level of lingering miscommunication, but I think bad intentions were never in play here. Based on what I remember: about 3-4 years ago, Antonin was ready to 'retire' from being lead maintainer of the reconciliation service code repo. In the meantime, the NFDI4C dev team at TIB, which included Paul at the time, had already set up a fork for our own internal purposes - mostly deployment-related, as we were deploying OR recon services for our various WB projects - but we didn't think this was particularly relevant for others given the specificity of our setup. We didn't have the ambition to become lead maintainers, but seeing that no one else wanted to step up at the time, we agreed that we could act as maintainers on a short- to medium-term basis (which was promoted by Antonin too, as far as I remember).

Over the past 3 years we intended to hold to this commitment as maintainers, but it is increasingly unrealistic for us.

Our dev (Paul), who originally worked on this, moved away from the country and the job; we then tried to hire two more devs in succession - eventually unsuccessfully due to immigration complications (both were from abroad and, after lengthy processes, it didn't work out under our government regulations - we have limits as a government institute). We may have a chance of having someone start this summer, but we can't confirm yet.

We can suggest two potential solutions:

  • Someone (1-2 people) from the OR community is elected to get maintainer rights to this repo, and we are happy for them to merge / approve PRs etc.
  • Someone else takes on the hat of lead maintainer and hosts the repo elsewhere; we don't necessarily want to be the only ones hosting it, but if no one steps up, we can keep doing it provided at least 1-2 community members volunteer to also get maintainer rights.

In the meantime, our current devs (disclaimer: they are not experts in the codebase) had a look and reviewed the merge requests, and this is the outcome:

  • Most of them were just documentation issues or broken links, which got fixed. All are now up to date and merged into the codebase.
  • The only MR we are afraid of merging is this one. As far as we understand from Albin, who wrote it, it's only a draft and not meant for production: it introduces hardcoded variables. We will not merge it for now, but the idea behind it is good, so we should keep it on the radar.

Lastly, regarding the open issue concerning a non-compliant User-Agent header, which I understand needs to be fixed in the code repo first and then in the hosted service: if someone else makes a PR to fix it, we will approve it. We could also try to fix it ourselves, but that might take us 1-2 weeks. I hope this clarifies the situation somewhat. For any questions, feel free to reach out here, in the GitLab issue tracker, or via email.

Hey all!

I'd be very much interested in maintaining the tool. I have ties to both the Wikimedia and OpenRefine communities, so I think I can be of service here. I've deployed a temporary version if you want to test that it works correctly: https://reconcile-temp.daxserver.com/ - I'll remove it once we have clarity about the future.

I can also contribute my time to the development of the tool. I haven't looked into the internals yet, but I'm confident it works out of the box for Wikidata, to start with. I'm very happy to split my time between my primary tool and the reconciliation tool!

Let me know your thoughts!

@Loz.ross and @Pxxlhxslxn thank you for the background information.

@Pintoch, can you update your load balancer to route all calls from https://wikidata.reconci.link/en/api to https://wikidata-reconciliation.wmcloud.org/en/api? Currently, by default, OpenRefine is set up to call https://wikidata.reconci.link/en/api, and we still have users reporting issues (see here and here). Additionally, should we

  1. Advertise the new endpoint to the community (social media, telegram, forum...)?
  2. Release a patch version of OpenRefine (3.10.2?) to have https://wikidata-reconciliation.wmcloud.org/en/api as the default service?

@DaxServer, @Pxxlhxslxn thanks for offering your help. In case you missed it, this document provides an overview of the state of Wikidata reconciliation across the different repositories. It should offer pointers on where to contribute and what needs to be done (with links to relevant issues).