
Fewer results from wdqs nodes running in codfw than eqiad
Closed, Resolved · Public

Description

Reported by User:Oravrattas via https://www.wikidata.org/wiki/Wikidata:Report_a_technical_problem/WDQS_and_Search#Fewer_results_from_wdqs20*_than_wdqs10*

Running a simple query for all historic Dutch Senators consistently returns different results from the wdqs10* servers than from the wdqs20* servers. I've been noticing odd results from this for many months, but it was remarkably difficult to track down, because it turns out that any queries I run locally only ever go to wdqs10*; the only way I have been able to replicate the behaviour is to run the query via GitHub Actions, which hits both sets of servers.

https://github.com/tmtmtmtm/ghatest2 shows this in action: the query in holders.js (including a run-time-specific comment to bypass caching) produces the outputs in results/ (named after the server handling the request and the date/time).

The results are consistent across the servers in each cluster: today each wdqs10* server returns 1361 rows, whereas each wdqs20* server returns only 1357. The differences break down as:

  • 8 people not returned at all from wdqs20* (e.g. Q2053506 and Q2770742)
  • 4 people returned twice from wdqs20* but only once from wdqs10* (e.g. Q2257961 and Q3469514)
  • 1 person with the same P39 data returned from each set of servers, but with a different schema:dateModified on each: Q1845502 was last modified on 2022-05-05 in the wdqs10* results, but on 2022-04-15 in the wdqs20* results
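The per-server comparison above can be sketched as follows. This is a hedged reconstruction, not the actual holders.js code: the SPARQL text is a hypothetical stand-in (it checks schema:dateModified for Q1845502, one of the discrepancies noted), and the `x-served-by` response header used to identify the backend host is an assumption about the WDQS front end.

```python
import time
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Hypothetical stand-in for the holders.js query: fetch the
# schema:dateModified timestamp of Q1845502, which differed between clusters.
QUERY = """
SELECT ?modified WHERE {
  wd:Q1845502 schema:dateModified ?modified .
}
"""

def cache_busted(query: str) -> str:
    """Prepend a run-time-specific comment so each request bypasses caching,
    mirroring the trick described for holders.js."""
    return f"# cache-bust {time.time()}\n{query}"

def run(query: str) -> tuple[str, bytes]:
    """Send the query and return (backend host, raw response body).
    Assumption: the serving host appears in an 'x-served-by' header."""
    url = ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": cache_busted(query), "format": "json"})
    req = urllib.request.Request(url, headers={"User-Agent": "wdqs-repro-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("x-served-by", "unknown"), resp.read()

# Usage (requires network): repeatedly call run(QUERY) and bucket the
# results by backend host to see whether wdqs10* and wdqs20* disagree.
```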

Event Timeline

Recording a few findings: I checked 3 examples and they all relate to edits happening around 2022-05-05T18:03:00:

These edits all share the same tags #quickstatements; #temporary_batch_1651767610540.
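The tag check above can be reproduced against the MediaWiki API, which exposes revision tags via `prop=revisions` with `rvprop=tags`. A minimal sketch, using Q2053506 (one of the missing items) as the example; the helper names are my own:

```python
import json
import urllib.parse
import urllib.request

API = "https://www.wikidata.org/w/api.php"

def revision_tags_params(title: str, limit: int = 10) -> dict:
    """Build MediaWiki API parameters listing recent revisions with their tags."""
    return {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|tags",
        "rvlimit": str(limit),
        "format": "json",
    }

def fetch_revisions(title: str) -> dict:
    """Fetch the revision list for one item (requires network)."""
    url = API + "?" + urllib.parse.urlencode(revision_tags_params(title))
    req = urllib.request.Request(url, headers={"User-Agent": "wdqs-repro-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (network): fetch_revisions("Q2053506"), then look for revisions
# around 2022-05-05T18:03 carrying the "temporary_batch_1651767610540" tag.
```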

MPhamWMF moved this task from Incoming to Blazegraph on the Wikidata-Query-Service board.

For the past couple of days this has gotten significantly worse. Running the same script as before, the wdqs10* servers are returning the expected 1360 results¹, but the wdqs20* servers have degraded even more:

  • wdqs2003 and 2004 are four short, as before, with 1356
  • wdqs2009 is returning only 155
  • wdqs2010 is returning only 154
  • wdqs2012 is returning zero

Previously I was only experiencing this with a handful of queries; now it's widespread, making the query service close to unusable for me from GitHub Actions.


¹ In the previous report the numbers were one higher, as they included the header row. This time I'm only counting data rows.

Mentioned in SAL (#wikimedia-operations) [2023-01-26T07:25:17Z] <dcausse> T322869: depooling wdqs2009 wdqs2010 wdqs2011 wdqs2012 these hosts should not serve user traffic yet they don't have the database loaded

@Oravrattas thanks for the report, I've depooled these machines because these are new hosts that are not ready to serve user traffic yet (they have an empty/partially loaded database).

Note that we are trying to reload the full dataset from dumps, which would correct all those discrepancies. This can be tracked in T323096. We are running into multiple issues, the main one being that the Blazegraph journal gets corrupted during the import (see T263110). The only reasonable option we have at this time is to try again (and again) to import the dumps and hope that it eventually works. Obviously this isn't a long-term solution and we need to come up with an alternative (probably either splitting the graph to work with more reasonable data sizes or moving to a different RDF backend).

bking subscribed.

The WDQS data reload is complete (ref T323096). At this point, we believe codfw and eqiad should consistently return the same results for the same query. @Oravrattas are you able to re-run your query and confirm?

Thanks for your patience during this lengthy endeavour.

Gehel claimed this task.
Gehel closed subtask T323096: WDQS Data Reload as Resolved.