SPARQL-Query shows entries, which should be filtered out; number of entries in result set might change when executed repeatedly (possible caching/indexing problem)
Closed, Resolved

Authored By

	M2k_dewiki
	Nov 3 2020, 11:17 PM

Description

Hello, the following query should return all German streets that have a Commons sitelink but no Commons category property (P373):

SELECT ?item ?commonscat ?sitelink  WHERE {
  ?item wdt:P31 wd:Q79007. # street (Innerortsstraße)
  ?item wdt:P17 wd:Q183.   # Germany
  ?sitelink schema:about ?item .
  ?sitelink schema:isPartOf <https://commons.wikimedia.org/> .
  OPTIONAL {?item wdt:P373 ?commonscat }
  FILTER (!bound(?commonscat))   # only those WITHOUT a commonscat property (P373)
}

The query currently returns sometimes 22 or 23 and sometimes 30 entries, depending on when and how often it is executed, although the items have not been changed in between. Moreover, the listed items actually have a commonscat property, so they should not appear in the result set at all; the result set should be empty.

For just one city (e.g. ?item wdt:P131 wd:Q61724. ) the query returns 2, 3 or 4 entries if the query is repeatedly executed.

Could this be a caching/indexing problem? Why are entries with commonscat-properties listed, while they should be filtered out?

A purge of some items (https://www.wikidata.org/wiki/Q100417069?action=purge) did not change anything.
Using cache busting (https://byabbe.se/2018/01/26/cache-busting-wikidata-sparql-queries), i.e. SELECT ?item ?commonscat ?sitelink (MD5(CONCAT(STR(?item), STR(RAND()))) as ?random) WHERE { , does not change anything.
Replag (https://replag.toolforge.org/) currently shows no problems.

According to this discussion

https://www.wikidata.org/wiki/Wikidata:Project_chat#Query_shows_entries,_which_should_be_filter_out;_number_of_entries_in_result_set_changes_when_executed_repeatedly

another user has spotted a similar issue with a different query.

Also see

https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team/Query_Service_and_search#Query_shows_entries,_which_should_be_filter_out;_number_of_entries_in_result_set_changes_when_executed_repeatedly

Thanks a lot!

Details

Subject	Repo	Branch	Lines +/-
[wdqs] disable async imports	operations/puppet	production	+1 -1
wdqs: default to using kafka for updates	operations/puppet	production	+1 -1
Revert "wdqs: use RecentChanges API for updates on all WDQS servers"	operations/puppet	production	+0 -1
[wdqs] re-enable polling kafka for updates on wdqs1010	operations/puppet	production	+1 -0
Add processing delay to RC Poller	wikidata/query/rdf	master	+62 -33

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		RKemper	T267927 Reload wikidata journal from fresh dumps
		Resolved		RKemper	T267175 SPARQL-Query shows entries, which should be filtered out; number of entries in result set might change when executed repeatedly (possible caching/indexing problem)

Event Timeline

M2k_dewiki created this task. Nov 3 2020, 11:17 PM

Restricted Application added a subscriber: Aklapper. Nov 3 2020, 11:17 PM

Reedy added a project: Wikidata-Query-Service. Nov 4 2020, 2:06 AM

Reedy updated the task description.

Restricted Application added a project: Wikidata. Nov 4 2020, 2:06 AM

Probably an issue with the query service updater, yes. You’re getting different results if you hit different backend servers.
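
One way to make this variation visible is to run the same query several times and tally the result sizes. The following is a minimal Python sketch, not something posted in this task; `tally_result_sizes` and the `run_query` callable are hypothetical names, and the fake backends below merely stand in for real query service requests.

```python
from collections import Counter
from typing import Callable, Sequence
import itertools


def tally_result_sizes(run_query: Callable[[], Sequence], attempts: int = 10) -> Counter:
    """Run the same query `attempts` times and count how often each result size occurs."""
    sizes = Counter()
    for _ in range(attempts):
        sizes[len(run_query())] += 1
    return sizes


if __name__ == "__main__":
    # Made-up stand-in for backends that disagree: result sets of 22, 30 and 23 rows.
    fake_backends = itertools.cycle([["row"] * 22, ["row"] * 30, ["row"] * 23])
    print(tally_result_sizes(lambda: next(fake_backends), attempts=6))
```

If all backends held the same data, the tally would contain a single result size; more than one key suggests that the servers answering the requests are out of sync.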

> A purge of some items (https://www.wikidata.org/wiki/Q100417069?action=purge) did not change anything.

action=purge only purges within MediaWiki, it does not affect the query service.

> Using cache busting (https://byabbe.se/2018/01/26/cache-busting-wikidata-sparql-queries), i.e. SELECT ?item ?commonscat ?sitelink (MD5(CONCAT(STR(?item), STR(RAND()))) as ?random) WHERE { , does not change anything.

The “cache busting” part is the random comment, not the (MD5(CONCAT(STR(?item), STR(RAND()))) as ?random).
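
As a rough illustration of that point, here is a minimal Python sketch that appends a unique comment to a query. It assumes the cache is keyed on the exact query text; `bust_cache` is a hypothetical helper name and the `# cache-buster:` label is arbitrary.

```python
import uuid


def bust_cache(query: str) -> str:
    """Return the query with a unique trailing comment appended.

    The comment changes the query text (assumed to be the cache key),
    while SPARQL itself ignores everything after the '#'.
    """
    return f"{query.rstrip()}\n# cache-buster: {uuid.uuid4().hex}\n"


QUERY = """SELECT ?item WHERE {
  ?item wdt:P31 wd:Q79007 .
} LIMIT 5"""

if __name__ == "__main__":
    print(bust_cache(QUERY))
```

Note that this only sidesteps caching; it would not help when the backends themselves hold different data.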

> Replag (https://replag.toolforge.org/) currently shows no problems.

That tool shows the Cloud VPS database replication lag, which is unrelated to the query service update lag. (You can see the WDQS lag on Grafana.)

We'll need to do some investigation to see what kind of underlying bug is hiding here. We are planning a full data reload anyway, which should solve this.

Lucas_Werkmeister_WMDE mentioned this in T267644: Update Wikidata unit conversion config (normalized quantities). Nov 10 2020, 2:48 PM

I flagged up the other case mentioned. As an example of the issue -

At the time of writing, https://w.wiki/kxw returns four items, three edited on 12 October and one on 27 October. For all of them, the most recent edit removed Q99671172 (which is no longer linked from any item).

Q186841
Q1026588
Q9006
Q28176

https://www.wikidata.org/wiki/Special:WhatLinksHere/Q99671172 shows nine items, however - two of the query items are listed on it, and the other seven are new. All of those have the removal of Q99671172 as their last edit, and dates either 6, 12 or 27 October.

electromagnetism (Q11406)
Altai (Q28176)
Lake Ohrid (Q156258)
Guadalcanal (Q192767)
Minigun (Q864060)
Caledonian Stadium (Q1026588)
A85 road (Q4649651)
Blairquhan Castle (Q4924231)
Judge Rinder (Q18160407)

So it does look like it's affecting the links table as well as the query service, but not quite in the same way for both. I don't know if that helps track the issue down or not!

The same problem occurs with https://de.wikipedia.org/wiki/Benutzer:Z_thomas/DE-BB-Stra%C3%9Fen-Barnim and the connected pages. I created many streets in Wikidata over the last few days, but the Listeria list of streets in this area shows no changes.

• dcausse mentioned this in T267927: Reload wikidata journal from fresh dumps. Nov 16 2020, 5:43 PM

Timeboxing investigation to 1 day.

CBogen moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board. Nov 16 2020, 6:27 PM

This issue is still occurring. We still get different answers depending on which report server we hit. That is clearly suboptimal. Could we please have an update on action being taken, and/or the timetable for proposed action, presuming that action will be taken. @Lydia_Pintscher

• Zbyszko claimed this task. Dec 1 2020, 11:12 AM

• Zbyszko moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

• dcausse merged a task: T267924: WDQS Updater (based on recent changes) missed some updates. Dec 1 2020, 6:16 PM

• dcausse merged a task: T268408: Query returns outdated results.

• dcausse added a subscriber: Epidosis.

• dcausse subscribed.

In T267175#6647957, @Tagishsimon wrote:

This issue is still occurring. We still get different answers depending on which report server we hit. That is clearly suboptimal. Could we please have an update on action being taken, and/or the timetable for proposed action, presuming that action will be taken. @Lydia_Pintscher

Hey @Tagishsimon :) I unfortunately don't know but given @Zbyszko has claimed the task I'm confident it's in good hands. Maybe he can tell us where we stand currently.

I started to investigate the issue, but had to get back to some previous issue - I should have some update before the end of the week, though.

Change 645313 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[wikidata/query/rdf@master] Add processing delay to RC Poller

https://gerrit.wikimedia.org/r/645313

gerritbot added a project: Patch-For-Review. Dec 4 2020, 10:52 AM

A few details on the issue:

For extensively modified entities (e.g. those edited by bots), the logs show that the information provided by the RecentChanges API isn't always up to date. This can lead to lost revisions: if the last change was among those not yet provided by the API, the next check will omit it, because the updater considers the previous time period already handled.
Our workaround is to add a delay before processing each time period of changes. This isn't an ideal solution for two reasons: it adds lag to an already lengthy process, and it doesn't guarantee that the delay we set will help in all cases.
We won't be pursuing a better solution for this issue with the current updater, since we're close to having a new, streaming-based updater in production. It uses Kafka events and Flink to reconcile the updates, which allows for a much less fragile process.
This doesn't fix the currently inconsistent entities, but that will be done with the data reload (T267927).
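
The lost-revision scenario described above can be reproduced with a toy model. The following Python sketch is only an illustration under simplifying assumptions (integer timestamps, a made-up `visible_at` field for when the API starts returning a change); `Change` and `poll` are hypothetical names and do not mirror the real updater code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    revision: int
    timestamp: int   # when the edit happened
    visible_at: int  # when the (simulated) API starts returning it


def poll(changes, poll_times, delay=0):
    """Return the revisions a window-based poller would pick up.

    Each poll covers the window (previous_end, now - delay] and only sees
    changes that the simulated API already exposes at time `now`.
    """
    seen = []
    window_start = 0
    for now in poll_times:
        window_end = now - delay
        for change in changes:
            in_window = window_start < change.timestamp <= window_end
            if in_window and change.visible_at <= now:
                seen.append(change.revision)
        window_start = window_end  # this period now counts as handled
    return seen


CHANGES = [
    Change(revision=1, timestamp=3, visible_at=3),
    Change(revision=2, timestamp=9, visible_at=12),  # exposed late by the API
]

if __name__ == "__main__":
    print(poll(CHANGES, poll_times=[10, 20]))           # no delay
    print(poll(CHANGES, poll_times=[10, 20], delay=5))  # with a processing delay
    print(poll(CHANGES, poll_times=[10, 20], delay=1))  # delay shorter than the API lag
```

Without a delay, revision 2 falls inside a window that is marked as handled before the API exposes it, so it is never picked up. A delay of 5 recovers it, while a delay of 1 does not, which matches the caveat that no fixed delay is guaranteed to be long enough.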

• Zbyszko moved this task from In Progress to Needs review on the Discovery-Search (Current work) board. Dec 4 2020, 11:25 AM

Change 645313 abandoned by ZPapierski:
[wikidata/query/rdf@master] Add processing delay to RC Poller

Reason:
Adding the delay was deemed too risky a workaround.

https://gerrit.wikimedia.org/r/645313

Change 646631 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/puppet@production] [wdqs] re-enable polling kafka for updates on wdqs1010

https://gerrit.wikimedia.org/r/646631

Change 646632 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/puppet@production] Revert "wdqs: use RecentChanges API for updates on all WDQS servers"

https://gerrit.wikimedia.org/r/646632

We will re-enable the Kafka poller that was disabled for security reasons back in January. The plan is as follows:

merge the test patch https://gerrit.wikimedia.org/r/646631 and verify for one day that wdqs1010 behaves correctly
merge https://gerrit.wikimedia.org/r/646632 to enable the kafka poller on all the machines

Moving back to In Progress and reassigning to @RKemper for merging the test patch.

I've observed quite a lot of inconsistencies over the past two weeks. I haven't looked into it very extensively, but I'm getting the impression that blocks of edits are missed and that the timestamps cluster around a spike at https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&viewPanel=8 . The order of magnitude is several hundred edits.

• dcausse merged a task: T270975: Some lexemes cannot be obtained by SPARQL query. Jan 5 2021, 9:57 AM

• dcausse added subscribers: Strepon, Skim.

Change 646631 merged by Ryan Kemper:
[operations/puppet@production] [wdqs] re-enable polling kafka for updates on wdqs1010

https://gerrit.wikimedia.org/r/646631

Change 646632 merged by Ryan Kemper:
[operations/puppet@production] Revert "wdqs: use RecentChanges API for updates on all WDQS servers"

https://gerrit.wikimedia.org/r/646632

Change 654914 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] wdqs: default to using kafka for updates

https://gerrit.wikimedia.org/r/654914

Change 654914 merged by Ryan Kemper:
[operations/puppet@production] wdqs: default to using kafka for updates

https://gerrit.wikimedia.org/r/654914

@RKemper so what's the status of this? I see a lot of cases where the last edit didn't get processed so the data in SPARQL is not consistent. See https://w.wiki/ugf for some examples.

I checked a couple of these inconsistencies, and the affected events all appear to be out of order in the Kafka topics. I suggest disabling async imports, as I believe they might be a possible cause of these inconsistencies.

Change 656833 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/puppet@production] [wdqs] disable async imports

https://gerrit.wikimedia.org/r/656833

In T267175#6754527, @dcausse wrote:

I checked a couple of these inconsistencies, and the affected events all appear to be out of order in the Kafka topics. I suggest disabling async imports, as I believe they might be a possible cause of these inconsistencies.

Does that mean that data from the latest (correct) revision might get overwritten with data from an older revision? That would explain the pattern.
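
To illustrate the suspected overwrite pattern, here is a small Python sketch. It is a toy model, not the WDQS updater: `apply_events`, the `guard_revisions` switch and the event tuples are all hypothetical, and the revision check is shown only as one generic way to make out-of-order delivery harmless (the fix discussed in this task is disabling async imports).

```python
def apply_events(events, guard_revisions):
    """Apply (entity, revision, data) events in arrival order.

    With guard_revisions=False every event overwrites the stored state.
    With guard_revisions=True an event is skipped when the store already
    holds an equal or newer revision of that entity.
    """
    store = {}
    for entity, revision, data in events:
        current = store.get(entity)
        if guard_revisions and current is not None and current[0] >= revision:
            continue  # stale event: keep the newer data
        store[entity] = (revision, data)
    return store


# Made-up example: revision 11 arrives before revision 10 (out of order).
EVENTS = [
    ("Q1", 11, "P373 present"),
    ("Q1", 10, "P373 missing"),
]

if __name__ == "__main__":
    print(apply_events(EVENTS, guard_revisions=False))
    print(apply_events(EVENTS, guard_revisions=True))
```

In the unguarded run the store ends up with the older revision 10, so a query would see the item without P373 even though the latest revision has it.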

Change 656833 merged by Ryan Kemper:
[operations/puppet@production] [wdqs] disable async imports

https://gerrit.wikimedia.org/r/656833

Deploying https://gerrit.wikimedia.org/r/656833 now

(Puppet run successful on wdqs nodes following deploy)

Post deploy check:

wdqs-updater.service is now running without the --import-async flag as expected:

ExecStart=/bin/bash /srv/deployment/wdqs/wdqs/runUpdate.sh -n wdq -- --kafka kafka-main1001.eqiad.wmnet:9092,kafka-main1002.eqiad.wmnet:9092,kafka-main1003.eqiad.wmnet:9092 --consumer wdqs1003 -b 700 --clusters eqiad,codfw --constraints --oldRevision 3 --entityNamespaces 0,120,146

I think now we just wait and see if the issue in the ticket description pops up again.

I tried to find more inconsistencies using the query provided by @Multichill (https://w.wiki/ugf) and could not spot any, whereas it was very easy to find one previously. I'm assuming the problem is resolved and that we can proceed with the full reload based on new units (https://phabricator.wikimedia.org/T267644#6758238).
If no new inconsistencies are reported in the meantime, here is the expected schedule:

units are reloaded on Friday, January 29 (2021-01-29)
full data reload of one machine starts the following Friday (2021-02-05)
depending on the time needed, and in the best-case scenario (no import failures), the reimport of all the wdqs machines should be finished by the end of February

• dcausse mentioned this in T272120: Deleted item still gets shown in WDQS query results. Jan 26 2021, 4:42 PM

• dcausse merged a task: T272120: Deleted item still gets shown in WDQS query results.

• dcausse added a subscriber: Mbch331.

Gehel closed this task as Resolved. Jan 27 2021, 10:29 AM

I am still seeing a handful of inconsistencies stemming from the same period of edits. For example this edit to Q5540133 on 23 December has not propagated to all databases, and so https://w.wiki/w66 returns a missing value for one of the "partyLabel" fields about one time in four.

(Not sure from the comments above if this is the sort of thing that will be fixed by the reload or if it indicates there are still underlying issues, but wanted to flag it up in case it was a concern)

Q104982840, which was deleted on January 27th at 08:48 CET, still shows up in my query for invalid P2397 values.

Thanks for the comments.
Inconsistencies for edits prior to Jan 20 (when the last fix was deployed) are expected and will be fixed by the reload.
The inconsistency on Q104982840 is more troubling, as the delete was done after this date. I'll reopen this specific ticket, as I wonder whether something specific to deletes is happening here.

Gehel added a parent task: T267927: Reload wikidata journal from fresh dumps. Feb 4 2021, 9:12 AM

Please also see

https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team/Query_Service_and_search#Q104776498_deleted_but_still_on_WQS_(2021-02-14)

https://www.wikidata.org/w/index.php?title=Wikidata%3AContact_the_development_team%2FQuery_Service_and_search&type=revision&diff=1361328260&oldid=1360558001

https://phabricator.wikimedia.org/T272120

Deletes will be mitigated with a script that polls the deletion log and resyncs the affected items; ref: https://people.wikimedia.org/~dcausse/wdqs_manual_deletes/ .
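
The general idea of such a mitigation can be sketched as follows. This Python example is an assumption-laden illustration and not the script referenced above: it takes a response shaped like the output of the MediaWiki `list=logevents` API module (assumed here to carry `title` and `action` fields per entry) and a set of entity IDs assumed to still be present in the query service, and returns the deleted IDs that would need a resync. `entity_id`, `stale_deletes` and the sample data are hypothetical, apart from the two item IDs reported in this task.

```python
def entity_id(title: str) -> str:
    """Strip a namespace prefix such as 'Property:' or 'Lexeme:' from a page title."""
    return title.rsplit(":", 1)[-1]


def stale_deletes(log_response: dict, ids_in_store: set) -> list:
    """Return deleted entity IDs that the store still contains, in log order."""
    events = log_response.get("query", {}).get("logevents", [])
    deleted = [entity_id(e["title"]) for e in events if e.get("action") == "delete"]
    return [eid for eid in deleted if eid in ids_in_store]


# Hand-written sample in the assumed response shape; the restore entry and
# the property ID are made up for illustration.
SAMPLE_RESPONSE = {
    "query": {
        "logevents": [
            {"title": "Q104982840", "type": "delete", "action": "delete"},
            {"title": "Q104776498", "type": "delete", "action": "delete"},
            {"title": "Q42", "type": "delete", "action": "restore"},
            {"title": "Property:P999999", "type": "delete", "action": "delete"},
        ]
    }
}

if __name__ == "__main__":
    still_present = {"Q104982840", "Q42", "P999999"}
    print(stale_deletes(SAMPLE_RESPONSE, still_present))
```

Restores are skipped on purpose, since a restored entity should remain in the store.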

Please also see

https://www.wikidata.org/wiki/Wikidata:Project_chat#Unique_constraint_violation_with_deleted_object
https://www.wikidata.org/w/index.php?title=Wikidata%3AProject_chat&type=revision&diff=1475543373&oldid=1475352852
