Wikidata Query Service nodes out of sync
Closed, Duplicate (Public)

Description

I'm working on mySociety's Democratic Commons project and the EveryPolitician WikiProject, and we're seeing inconsistencies in the data we get back from the Wikidata Query Service.

We're running queries and storing the results in GitHub repositories. To illustrate the problem, here's a commit in which we refresh the data we hold from Wikidata:

https://github.com/everypolitician/proto-commons-canada/commit/15c0ae530ede1df864e551659322962672cfe4c6

The data maintained in Wikidata will have barely changed, yet we're seeing large swathes of results coming and going, depending on which WDQS node each query happens to get routed to. By looking at the X-Served-By header on results, I've previously noticed that wdqs1003 had missing data (see T199916#4555158; now fixed), but I haven't yet checked whether other nodes are similarly missing data.
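
A minimal sketch of how such a per-node comparison might be automated: run the same query several times and group the row counts by the X-Served-By header. The endpoint URL, the User-Agent value, and the function names `fetch_from_wdqs`, `row_counts_by_node` and `nodes_disagree` are illustrative assumptions, not something taken from our actual tooling.

```python
import json
import urllib.parse
import urllib.request
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Assumed public SPARQL endpoint for the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# A fetcher runs the query once and returns (serving node, row count).
Fetch = Callable[[str], Tuple[str, int]]


def fetch_from_wdqs(query: str) -> Tuple[str, int]:
    """Run one query and report the X-Served-By node and the row count."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(url, headers={
        "Accept": "application/sparql-results+json",
        "User-Agent": "wdqs-node-check-example/0.1",  # placeholder value
    })
    with urllib.request.urlopen(request) as response:
        node = response.headers.get("X-Served-By", "unknown")
        rows = len(json.load(response)["results"]["bindings"])
    return node, rows


def row_counts_by_node(query: str, fetch: Fetch,
                       attempts: int = 10) -> Dict[str, List[int]]:
    """Repeat the query and collect the row counts seen from each node."""
    counts: Dict[str, List[int]] = defaultdict(list)
    for _ in range(attempts):
        node, rows = fetch(query)
        counts[node].append(rows)
    return dict(counts)


def nodes_disagree(counts: Dict[str, List[int]]) -> bool:
    """True if more than one distinct row count was observed overall."""
    return len({n for rows in counts.values() for n in rows}) > 1
```

Calling `row_counts_by_node(query, fetch_from_wdqs)` would then show which nodes return which counts. Differences can also come from edits landing between requests, so a mismatch is a hint rather than proof.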

This is causing huge trouble for us, as we can't reliably get stable and consistent data out of Wikidata.

Would it be possible to have the nodes' data reloaded from scratch to ensure that they're consistent? Would you also be able to investigate the cause of the inconsistency?

My colleague @mhl20 made a previous report about this in T199916 (which I then reopened). I think the solution then was to refresh the data for the entities mentioned, but the problem appears to be much wider than that.

Event Timeline

We've been reloading wdqs1003 due to a hard disk upgrade. It should now be fully caught up. If you still notice discrepancies, please tell me which entries are incorrect, and I'll investigate.

@Smalyshev, not sure if this is related, but a query of DESCRIBE <http://www.wikidata.org/entity/statement/Q54356755-13A869AC-9E98-4D75-9809-613418E23405> is currently only returning the pq:P768 and pq:P4100, and is missing the ps:P39 and pq:P2937 (and thus not being returned in queries for all members) — see some discussion at https://twitter.com/tmtm/status/1039048340812517377

http://tinyurl.com/y8l62tav is another example: this is currently only returning 40 rows for me, when it should be returning over 50.
A DESCRIBE on one of the missing results (Q48622226-55106E3A-52C1-4927-B004-8C97A6D25365) shows only the p:P39, with no qualifiers at all.
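
The check described in these two comments could be sketched roughly as follows: given the triples returned by a DESCRIBE on a statement node, report whether the ps: value triple or the rank is absent. The function name and the plain-tuple triple representation are assumptions made for illustration. Statements with wdno: (no value) claims have no ps: triple by design, so they would show up here without being broken.

```python
from typing import Iterable, List, Tuple

# Namespace URIs used in the Wikidata RDF export.
PS = "http://www.wikidata.org/prop/statement/"
RANK = "http://wikiba.se/ontology#rank"

# A triple is (subject URI, predicate URI, object value as a string).
Triple = Tuple[str, str, str]


def missing_statement_parts(statement_uri: str,
                            triples: Iterable[Triple],
                            prop: str) -> List[str]:
    """List the expected parts that the statement node lacks."""
    # Only predicates attached directly to the statement node count.
    predicates = {p for s, p, _ in triples if s == statement_uri}
    missing = []
    if PS + prop not in predicates:
        missing.append("ps:" + prop)
    if RANK not in predicates:
        missing.append("wikibase:rank")
    return missing
```

For the first statement above, which reportedly had only pq:P768 and pq:P4100, this would report both `ps:P39` and `wikibase:rank` as missing, assuming the rank was absent as well.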

This query finds other cases where the ps: triple is missing; it's not limited to P39.

Smalyshev triaged this task as High priority.

Something weird is definitely going on: out of 582769 statements with P39, we have 2334 that are missing rank (and possibly other clauses). Since a statement should never be missing its rank, this is clearly a bug. I'll dig into it and see how it could happen.
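
One way a count like this might be obtained is a query that selects statements for a property and filters out those that have a wikibase:rank triple. The helper below only builds the query text. It is an illustrative sketch, not necessarily the query used for the figures above, and the function name is hypothetical. On the full dataset such a query may be slow or time out.

```python
import re


def missing_rank_count_query(prop: str) -> str:
    """Build a SPARQL query counting statements of `prop` without a rank."""
    # Accept only well-formed property IDs such as "P39".
    if not re.fullmatch(r"P[1-9][0-9]*", prop):
        raise ValueError("not a property ID: %r" % prop)
    return (
        "PREFIX p: <http://www.wikidata.org/prop/>\n"
        "PREFIX wikibase: <http://wikiba.se/ontology#>\n"
        "SELECT (COUNT(DISTINCT ?statement) AS ?missing) WHERE {\n"
        "  ?item p:" + prop + " ?statement .\n"
        "  FILTER NOT EXISTS { ?statement wikibase:rank ?rank }\n"
        "}\n"
    )
```

`print(missing_rank_count_query("P39"))` gives query text that can be pasted into the query service UI.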

@Lucas_Werkmeister_WMDE your query finds a lot of statements with wdno: claims, which do not have ps:.

Mentioned in SAL (#wikimedia-operations) [2018-09-12T23:36:22Z] <smalyshev@deploy1001> Started deploy [wdqs/wdqs@7e5e537]: Deploy Blazegraph & Updater for T202765 and T203646 handling

Mentioned in SAL (#wikimedia-operations) [2018-09-13T00:00:07Z] <smalyshev@deploy1001> Finished deploy [wdqs/wdqs@7e5e537]: Deploy Blazegraph & Updater for T202765 and T203646 handling (duration: 23m 45s)