
WDQS returns current AND old data
Closed, Resolved · Public


Hello everyone!
I created a SPARQL query which should return all ChEMBL IDs (P592) for all values of significant drug interaction (P769) on item Q179996. It works and returns all the expected values. Unfortunately, it also returns values that my drug bot replaced about a month ago. From the results alone it is impossible to tell the current values from the old ones; worse, the user gets the impression that both are valid. Another user has seen the same behaviour when executing different queries.

If this is a feature rather than a bug (to allow queries on the revision history of items), I think the documentation should state that clearly and explain how to filter for only the current values. I could not find anything on that. Thank you!

Executed on:

PREFIX wd: <>
PREFIX wdt: <>
PREFIX rdfs: <>
PREFIX p: <>
PREFIX v: <>

SELECT ?compound ?label ?chembl WHERE {
    ?compound wdt:P769 wd:Q179996 .
    ?compound wdt:P592 ?chembl .
    OPTIONAL {
        ?compound rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
}


Event Timeline

Sebotic raised the priority of this task to High.
Sebotic updated the task description.
Sebotic added a subscriber: Sebotic.
Restricted Application added a subscriber: Aklapper.

@Sebotic does it look fine now? If so, it may just be a glitch with a skipped update; we recently had some networking glitches that may be related. Do you have any more examples of non-updated data?

Bene and I ran into an article today that was missing badges. An edit to the article made them show up in SPARQL.

@Lydia_Pintscher if an edit fixes it, that is definitely a missed update. Which article is that? These should be going away soon, but as I mentioned there were a couple of glitches during the initial setup and we are still running on the same dataset. Once we've figured out the remaining format issues, we'll reload the dataset, which should eliminate those.

If an edit does not fix it (within a reasonable timeframe, checked against the timestamp on the homepage), then it may be a serious updater issue which needs deeper digging.

@Smalyshev I just tested the query once again. Some of the old data is gone now, but one item still comes up. I currently do not have other queries to execute, but I will think of some.

Deskana claimed this task.
Deskana added a subscriber: Deskana.

We believe this is fixed now. Please reopen if the problem persists.

I believe this issue is still present.

As an example:

Which doesn't show up using the sparql endpoint:

There are 3 with the same issue:

The IPR005128 one seems to work occasionally, however, which may mean it loaded correctly on one server but not on another.
Please take a look. Thank you

I have a quick follow-up on this. I made two slightly different SPARQL queries, one accessing the values directly and one indirectly. They should return the same values, but it seems that if each query is executed on a different server, the two result sets differ: one gives back 54320 values, the other 54315. Irrespective of the counts, some values differ. See my code here:


r1 54320
r2 54315
r1 to r2 diff {'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''}
r2 to r1 diff {'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''}

As far as I can see, the result sets do not differ because of rank or other differences in the data. If both queries are run on the same server, there is no difference between the result sets. (This is an assumption, as I do not know which server is really executing a query.)

Thanks for looking at this issue!
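The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the code that was originally linked: the helper names `extract_values` and `diff_result_sets` are hypothetical, and the input is assumed to be in the standard SPARQL JSON results layout (`results` → `bindings` → variable → `value`).

```python
def extract_values(sparql_json, var):
    """Collect the values bound to `var` from a SPARQL JSON results document."""
    bindings = sparql_json["results"]["bindings"]
    # A variable can be unbound in a row (e.g. inside OPTIONAL), so skip those rows.
    return [row[var]["value"] for row in bindings if var in row]


def diff_result_sets(r1, r2):
    """Return (values only in r1, values only in r2)."""
    s1, s2 = set(r1), set(r2)
    return s1 - s2, s2 - s1


# Hypothetical sample data standing in for the two query responses.
response_1 = {"results": {"bindings": [
    {"item": {"type": "literal", "value": "A"}},
    {"item": {"type": "literal", "value": "B"}},
    {"item": {"type": "literal", "value": "C"}},
]}}
response_2 = {"results": {"bindings": [
    {"item": {"type": "literal", "value": "B"}},
    {"item": {"type": "literal", "value": "C"}},
    {"item": {"type": "literal", "value": "D"}},
]}}

r1 = extract_values(response_1, "item")
r2 = extract_values(response_2, "item")
only_in_r1, only_in_r2 = diff_result_sets(r1, r2)
print("r1", len(r1))
print("r2", len(r2))
print("r1 to r2 diff", only_in_r1)
print("r2 to r1 diff", only_in_r2)
```

Looking at both directions of the difference matters here: two result sets can have similar counts and still disagree on many individual values.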


You can tell which server ran the query by looking at the network trace (e.g. in Chrome DevTools) for the x-served-by response header, e.g. x-served-by:wdqs1002.
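When the query is sent from a script rather than a browser, the same header can be read from the response. The sketch below assumes the response headers are available as a plain mapping; `served_by` is a hypothetical helper name, and the sample header values are made up for illustration.

```python
def served_by(headers):
    """Return the value of the X-Served-By header, or None if it is absent.

    HTTP header names are case-insensitive, so compare in lower case.
    """
    for name, value in headers.items():
        if name.lower() == "x-served-by":
            return value
    return None


# Hypothetical header mappings, as an HTTP client might expose them.
headers_a = {"Content-Type": "application/sparql-results+json", "X-Served-By": "wdqs1001"}
headers_b = {"content-type": "application/sparql-results+json", "x-served-by": "wdqs1002"}

print(served_by(headers_a))
print(served_by(headers_b))
```

Recording this value next to each result set makes it possible to check whether differing results really came from different backends.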

Thanks, here are the headers for r1 and r2, respectively:

{'Connection': 'keep-alive', 'Access-Control-Allow-Origin': '*', 'Content-Type': 'application/sparql-results+json', 'Via': '1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4', 'X-Cache': 'cp1058 miss, cp2006 miss, cp4002 miss, cp4001 miss', 'Accept-Ranges': 'bytes', 'X-Served-By': 'wdqs1001', 'X-Client-IP': '', 'Date': 'Thu, 14 Jul 2016 22:39:31 GMT', 'Vary': 'Accept, Accept-Encoding', 'Server': 'nginx/1.11.1', 'Set-Cookie': 'WMF-Last-Access=14-Jul-2016;Path=/;HttpOnly;secure;Expires=Mon, 15 Aug 2016 12:00:00 GMT', 'Cache-Control': 'public, max-age=300', 'X-Varnish': '63828730, 50083124, 32483470, 1448128', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Age': '2', 'Content-Encoding': 'gzip', 'X-Analytics': 'https=1;nocookies=1', 'Content-Length': '232492'}
{'Connection': 'keep-alive', 'Access-Control-Allow-Origin': '*', 'Content-Type': 'application/sparql-results+json', 'Via': '1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4', 'X-Cache': 'cp1045 miss, cp2025 miss, cp4004 miss, cp4001 miss', 'Accept-Ranges': 'bytes', 'X-Served-By': 'wdqs1002', 'X-Client-IP': '', 'Date': 'Thu, 14 Jul 2016 22:39:35 GMT', 'Vary': 'Accept, Accept-Encoding', 'Server': 'nginx/1.11.1', 'Set-Cookie': 'WMF-Last-Access=14-Jul-2016;Path=/;HttpOnly;secure;Expires=Mon, 15 Aug 2016 12:00:00 GMT', 'Cache-Control': 'public, max-age=300', 'X-Varnish': '64294654, 50298994, 32690474, 1448131', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Age': '3', 'Content-Encoding': 'gzip', 'X-Analytics': 'https=1;nocookies=1', 'Content-Length': '233261'}

So my assumption that the two servers return different results for these queries seems correct.

The three items above are missing from wdq2, but I don't see anything anomalous in the logs around the time they were supposed to be created. All three were created by a bot on 24 June 2016, but many others created by the same bot in the same timeframe do not have problems. So the query does provide evidence of lost updates, yet the logs show nothing unusual. Looks like more logging is needed.

I think I found the culprit. If you look at the first two entries at|ids|timestamp&rcnamespace=0|120&rclimit=100&rccontinue=20160720152523|372669870, the first one has a timestamp later than the second one. So while the rcid order is right, the timestamp order is not. This means that if changes are retrieved in timestamp order, the second one is retrieved, then the first one gets added, but the marker has already moved past it...
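The general shape of this race can be illustrated with a small simulation. This is a simplified sketch, not the updater's actual code, and it does not try to reproduce which of the two entries was which: it only shows how a poller that tracks a timestamp marker can skip a change that becomes visible after the marker has already advanced past its timestamp. The `overlap` parameter (re-reading a window behind the marker and de-duplicating by rcid) is a hypothetical mitigation added for contrast; it is not claimed to be the fix that was deployed.

```python
def poll(feed, marker, seen, overlap=0):
    """Fetch unseen changes with timestamp >= marker - overlap, oldest first."""
    batch = sorted(
        (c for c in feed
         if c["timestamp"] >= marker - overlap and c["rcid"] not in seen),
        key=lambda c: c["timestamp"],
    )
    seen.update(c["rcid"] for c in batch)
    # The marker only ever moves forward, to the newest timestamp seen so far.
    return batch, max([marker] + [c["timestamp"] for c in batch])


def simulate(overlap):
    """Return the rcids processed when a change appears behind the marker."""
    feed, seen, marker, processed = [], set(), 0, []
    arrivals = [
        {"rcid": 1, "timestamp": 24},  # visible first, later timestamp
        {"rcid": 2, "timestamp": 23},  # visible second, earlier timestamp
    ]
    for change in arrivals:
        feed.append(change)  # the change becomes visible, then we poll
        batch, marker = poll(feed, marker, seen, overlap)
        processed.extend(c["rcid"] for c in batch)
    return processed


print("no overlap:", simulate(0))
print("overlap of 2:", simulate(2))
```

With no overlap, the second change is never processed because its timestamp lies behind the marker; with a small overlap window it is picked up on the next poll, and the `seen` set prevents the first change from being processed twice.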

I've deployed the fix so it should not happen anymore. If you see any skipped updates dated August 2 or later, please reopen with specific examples.

I think there is still an issue here.
I ran the following query:

select * where {
  wd:Q52839992 p:P5114 ?s .
  ?s ?a ?b .
}

From wdqs2003 I get no results, but from wdqs2002 I get the expected results. Q52839992 was last updated ~48 hours ago.
see also:

Looks like the servers are in sync now. Please open new tasks if it happens again; otherwise it's a bit hard to keep track of what exactly is wrong. If it turns out to be the same cause, I'll merge them.