Page MenuHomePhabricator

SPARQL-Query shows entries, which should be filter out; number of entries in result set might change when executed repeatedly (possible caching/indexing problem)
Closed, ResolvedPublic

Description

Hello, the following query should return all german streets, which have a Commons-sitelink, but no Commonscat-Property (P373):

SELECT ?item ?commonscat ?sitelink  WHERE {
  ?item wdt:P31 wd:Q79007. # Innerortsstraße
  ?item wdt:P17 wd:Q183.   # Deutschland
  ?sitelink schema:about ?item .
  ?sitelink schema:isPartOf <https://commons.wikimedia.org/> .
  OPTIONAL {?item wdt:P373 ?commonscat }  
  FILTER (!bound(?commonscat))   # nur jene OHNE commonscat-Property (P373)
}

The query currently returns sometimes 22/23, sometimes 30 entries, depening when and how often the query is executed, allthough the objects have not been changed inbetween. Although the objects actually have a commonscat-Property, so they should not be listed at all in the result set, i.e. the result set actually should be empty.

For just one city (e.g. ?item wdt:P131 wd:Q61724. ) the query returns 2, 3 or 4 entries if the query is repeatedly executed.

Could this be a caching/indexing problem? Why are entries with commonscat-properties listed, while they should be filtered out?

According to this discussion

https://www.wikidata.org/wiki/Wikidata:Project_chat#Query_shows_entries,_which_should_be_filter_out;_number_of_entries_in_result_set_changes_when_executed_repeatedly

another user has spotted a similar issue with a different query.

Also see

https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team/Query_Service_and_search#Query_shows_entries,_which_should_be_filter_out;_number_of_entries_in_result_set_changes_when_executed_repeatedly

Thanks a lot!

Event Timeline

Probably an issue with the query service updater, yes. You’re getting different results if you hit different backend servers.

action=purge only purges within MediaWiki, it does not affect the query service.

The “cache busting” part is the random comment, not the (MD5(CONCAT(STR(?item), STR(RAND()))) as ?random).

That tool shows the Cloud VPS database replication lag, which is unrelated to the query service update lag. (You can see the WDQS lag on Grafana.)

Gehel triaged this task as High priority.Nov 9 2020, 4:22 PM
Gehel moved this task from All WDQS-related tasks to Current work on the Wikidata-Query-Service board.
Gehel added a subscriber: Gehel.

We'll need to do some investigation to see what kind of underlying bug is hiding here. We are planning a full data reload anyway, which should solve this.

I flagged up the other case mentioned. As an example of the issue -

At the time of writing, https://w.wiki/kxw returns four items, three edited on 12 October and one on 27 October. All have removing Q99671172 (which is now no longer linked from any items) as their most recent edit.

  • Q186841
  • Q1026588
  • Q9006
  • Q28176

https://www.wikidata.org/wiki/Special:WhatLinksHere/Q99671172 shows nine items, however - two of the query items are listed on it, and the other seven are new. All of those have the removal of Q99671172 as their last edit, and dates either 6, 12 or 27 October.

  • electromagnetism (Q11406)
  • Altai (Q28176)
  • Lake Ohrid (Q156258)
  • Guadalcanal (Q192767)
  • Minigun (Q864060)
  • Caledonian Stadium (Q1026588)
  • A85 road (Q4649651)
  • Blairquhan Castle (Q4924231)
  • Judge Rinder (Q18160407)

So it does look like it's affecting the links table as well as the query service, but not quite in the same way for both. I don't know if that helps track the issue down or not!

The same problem with https://de.wikipedia.org/wiki/Benutzer:Z_thomas/DE-BB-Stra%C3%9Fen-Barnim and the connected sites. I created a lot of streets in Wikidata during the last days, but no changes in the listeria-list refering to streets in this area.

Timeboxing investigation to 1 day.

This issue is still occurring. We still get different answers depending on which report server we hit. That is clearly suboptimal. Could we please have an update on action being taken, and/or the timetable for proposed action, presuming that action will be taken. @Lydia_Pintscher

This issue is still occurring. We still get different answers depending on which report server we hit. That is clearly suboptimal. Could we please have an update on action being taken, and/or the timetable for proposed action, presuming that action will be taken. @Lydia_Pintscher

Hey @Tagishsimon :) I unfortunately don't know but given @Zbyszko has claimed the task I'm confident it's in good hands. Maybe he can tell us where we stand currently.

I started to investigate the issue, but had to get back to some previous issue - I should have some update before the end of the week, though.

Change 645313 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[wikidata/query/rdf@master] Add processing delay to RC Poller

https://gerrit.wikimedia.org/r/645313

Few details on the issue:

  • For extensively modified entities (e.g. by bots) log show that information provided by RecentChanges API isn't always up to date. This can lead to lost revisions - if the last change was among the ones not yet provided by the API, the next check will omit it, since it thinks that previous time period was already handled.
  • Our workaround for the issue is to add the delay to processing time periods of changes. This isn't an ideal solution for two reasons - it adds a lag to an already substantial process and it doesn't provide guarantees that the delay length we set will help in all cases.
  • We won't be pursuing a better solution for this issue with current updater - we're closing in on having a new, streaming based, updater on production. It uses Kafka events and Flink to reconcile the updates which allows for a much less fragile process.
  • This doesn't fix the currently inconsistent entities, but that will be done with the data reload - T267927.

Change 645313 abandoned by ZPapierski:
[wikidata/query/rdf@master] Add processing delay to RC Poller

Reason:
Adding the delay was deemed to risky of a workaround.

https://gerrit.wikimedia.org/r/645313

Change 646631 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/puppet@production] [wdqs] re-enable polling kafka for updates on wdqs1010

https://gerrit.wikimedia.org/r/646631

Change 646632 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/puppet@production] Revert "wdqs: use RecentChanges API for updates on all WDQS servers"

https://gerrit.wikimedia.org/r/646632

dcausse moved this task from Needs review to In Progress on the Discovery-Search (Current work) board.
dcausse added subscribers: RKemper, Zbyszko.

We will re-enable the kafka poller that was disabled for security reasons back in january. The plan is as follow:

Moving back to in progress an re-assigning to @RKemper for merging the test patch.

I've observed quite a lot of inconsistencies over the past two weeks. I haven't looked very extensively at it, but I'm getting the impression that blocks of edits are missed and the timestamps are around a spike at https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&viewPanel=8 . Order of magnitude is several 100s of edits.

Change 646631 merged by Ryan Kemper:
[operations/puppet@production] [wdqs] re-enable polling kafka for updates on wdqs1010

https://gerrit.wikimedia.org/r/646631

Change 646632 merged by Ryan Kemper:
[operations/puppet@production] Revert "wdqs: use RecentChanges API for updates on all WDQS servers"

https://gerrit.wikimedia.org/r/646632

Change 654914 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] wdqs: default to using kafka for updates

https://gerrit.wikimedia.org/r/654914

Change 654914 merged by Ryan Kemper:
[operations/puppet@production] wdqs: default to using kafka for updates

https://gerrit.wikimedia.org/r/654914

@RKemper so what's the status of this? I see a lot of cases where the last edit didn't get processed so the data in SPARQL is not consistent. See https://w.wiki/ugf for some examples.

Checked a couple of these inconsistencies and they appear to all be out of order in the kafka topics. I suggest to disable async imports as I believe it might be possible cause of these inconsistencies.

Change 656833 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/puppet@production] [wdqs] disable async imports

https://gerrit.wikimedia.org/r/656833

Checked a couple of these inconsistencies and they appear to all be out of order in the kafka topics. I suggest to disable async imports as I believe it might be possible cause of these inconsistencies.

That means that data from the latest right revision might get overwritten with data from an older revision? That would explain the pattern.

Change 656833 merged by Ryan Kemper:
[operations/puppet@production] [wdqs] disable async imports

https://gerrit.wikimedia.org/r/656833

(Puppet run successful on wdqs nodes following deploy)

Post deploy check:

wdqs-updater.service is now running without the --import-async flag as expected:

ExecStart=/bin/bash /srv/deployment/wdqs/wdqs/runUpdate.sh -n wdq -- --kafka kafka-main1001.eqiad.wmnet:9092,kafka-main1002.eqiad.wmnet:9092,kafka-main1003.eqiad.wmnet:9092 --consumer wdqs1003 -b 700 --clusters eqiad,codfw --constraints --oldRevision 3 --entityNamespaces 0,120,146

I think now we just wait and see if the issue in the ticket description pops up again.

Tried to find more inconsistencies using the query provided by @Multichill (https://w.wiki/ugf) and could not spot any while it was very easy to find one previously. I'm assuming the problem is resolved and that we can proceed with the full reload based on new units (https://phabricator.wikimedia.org/T267644#6758238).
If no new inconsistencies are reported since then here is the expected schedule:

  • units are reloaded on Friday 29 (2021-01-29)
  • full data-reload of one machine starts next Friday (2021-02-05)
  • depending on the time needed and in best case scenario (no import failures) the reimport of all the wdqs machines should be finished by the end of February

I am still seeing a handful of inconsistencies stemming from the same period of edits. For example this edit to Q5540133 on 23 December has not propagated to all databases, and so https://w.wiki/w66 returns a missing value for one of the "partyLabel" fields about one time in four.

(Not sure from the comments above if this is the sort of thing that will be fixed by the reload or if it indicates there are still underlying issues, but wanted to flag it up in case it was a concern)

Q104982840 which was deleted January 27th 08:48 CET still shows up in my query for invalid P2397 values

Thanks for the comments.
Inconsistencies for edits prior to Jan 20 (time when the last fix was deployed) are expected and will be fixed by the reload.
Inconsistency on Q104982840 is more troubling as the delete was done after this date. I'll re-open this specific ticket as I wonder if there is not something specific to deletes happening here.

Mitigation for deletes will be made using a script that polls for the deletion log and resync the items, ref: https://people.wikimedia.org/~dcausse/wdqs_manual_deletes/ .