
WDQS Data Reload
Closed, Resolved · Public · 13 Estimated Story Points

Description

To address data discrepancies in WDQS, we need to reload data on all WDQS servers. In particular, this will address T322869. As part of this data reload, we might want to improve the automation to save time on this reload and future ones. The exact scope of those improvements still needs to be defined.

AC:

  • Data has been reloaded from the dumps on all WDQS servers
  • All servers return the same answers to queries
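As a lightweight proxy for the second criterion, something like the sketch below could spot-check that hosts agree, by comparing each host's last-update timestamp. This is an illustration, not the official verification procedure; the host names and the /sparql endpoint path are assumptions.

```lang=python
# Hypothetical spot check: ask each host for its last-update timestamp
# and compare. Host names and the /sparql endpoint path are assumptions.
import requests

HOSTS = ["wdqs1009.eqiad.wmnet", "wdqs2009.codfw.wmnet"]  # example subset
QUERY = """
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX schema: <http://schema.org/>
SELECT ?updated WHERE { wikibase:Dump schema:dateModified ?updated }
"""

def last_updated(host: str) -> str:
    resp = requests.get(
        f"http://{host}/sparql",
        params={"query": QUERY, "format": "json"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"][0]["updated"]["value"]

for host in HOSTS:
    print(host, last_updated(host))
```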

Note:

  • We might want to improve the resiliency of data transfer while working on this ticket

Event Timeline

Gehel triaged this task as High priority. Nov 21 2022, 4:29 PM
Gehel updated the task description.
Gehel moved this task from Incoming to Operations/SRE on the Wikidata-Query-Service board.

@RKemper since we are doing data reloads on everything, I'm thinking we should revisit this PR, as it discusses a few ways to speed up the process.

Change 867646 had a related patch set uploaded (by Bking; author: Ebernhardson):

[operations/puppet@production] Mount labstore to wcqs/wdqs instance for dumps reload

https://gerrit.wikimedia.org/r/867646

Change 867646 merged by Bking:

[operations/puppet@production] Mount labstore to wcqs/wdqs instance for dumps reload

https://gerrit.wikimedia.org/r/867646

The following hosts now have the clouddumps NFS share mounted:

  • wcqs2001.codfw.wmnet
  • wdqs1009.eqiad.wmnet
  • wdqs2001.codfw.wmnet
  • wdqs2009.codfw.wmnet

The TTL files we need are in /mnt/nfs/dumps-clouddumps100[12].wikimedia.org/other/wikibase/commonswiki for commons and /mnt/nfs/dumps-clouddumps100[12].wikimedia.org/other/wikibase/wikidatawiki for wdqs.
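For reference, a quick way to enumerate what's available on the share (a sketch; the dated-subdirectory layout and file extensions are assumptions):

```lang=python
# Illustrative listing of the TTL dumps on the clouddumps share; the
# dated-subdirectory layout and file names are assumptions.
from pathlib import Path

BASE = Path("/mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/wikibase")

for wiki in ("wikidatawiki", "commonswiki"):
    root = BASE / wiki
    if root.is_dir():
        for dump in sorted(root.glob("*/*.ttl.*")):
            print(dump)
```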

The next step will be to update the cookbook; I've created T325114 for that work.

Change 868440 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: allow mounting clouddumps share from wdqs2009

https://gerrit.wikimedia.org/r/868440

Change 868440 merged by Bking:

[operations/puppet@production] wdqs: allow mounting clouddumps share from wdqs2009

https://gerrit.wikimedia.org/r/868440

Change 869828 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] query_service: add wdqs/wcqs hosts as rsync clients to clouddumps

https://gerrit.wikimedia.org/r/869828

Change 869828 merged by Bking:

[operations/puppet@production] query_service: add wdqs/wcqs hosts as rsync clients to clouddumps

https://gerrit.wikimedia.org/r/869828

Change 873791 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] wdqs: make depool the default behavior

https://gerrit.wikimedia.org/r/873791

Data reload is currently running on wdqs2009, in a tmux window ("wdqs-reload") on cumin1001 under my user. I'm using a custom version of the cookbook that loads via NFS. It's not a permanent solution, but it should allow us to complete our immediate task.

Change 876217 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] [WIP] wdqs-data-reload: use NFS for data reloads

https://gerrit.wikimedia.org/r/876217
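For a sense of what "use NFS" means here: instead of downloading the dump over HTTP, the reload stages it from the clouddumps mount. A rough sketch of the idea, not the actual cookbook code (the dump file name and the loader invocation are illustrative):

```lang=python
# Rough sketch of the NFS-based reload: stage the dump from the
# clouddumps mount to local disk, then hand it to the loader. The dump
# file name and the loader script are illustrative, not the real code.
import shutil
import subprocess
from pathlib import Path

NFS_DUMP = Path(
    "/mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/wikibase/"
    "wikidatawiki/latest-all.ttl.bz2"  # illustrative file name
)
LOCAL_DIR = Path("/srv/wdqs/dumps")

def stage_dump() -> Path:
    """Copy the dump from NFS to local disk before loading."""
    LOCAL_DIR.mkdir(parents=True, exist_ok=True)
    local = LOCAL_DIR / NFS_DUMP.name
    shutil.copyfile(NFS_DUMP, local)
    return local

def reload_from(dump: Path) -> None:
    """Invoke the (illustrative) loader with the staged dump."""
    subprocess.run(["/srv/wdqs/loadData.sh", str(dump)], check=True)

reload_from(stage_dump())
```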

Sorry if this is the wrong ticket, but several services on wdqs2010, wdqs2011 and wdqs2012 are alerting. The service is returning HTTP 400 responses. My guess is that this is due to the ongoing data reload (no issue). If that is the case, could the "WDQS SPARQL" alerts and other failing checks be acknowledged in Icinga, to prevent alert spam? Thank you!

I was told by @Gehel that it was unrelated to this, but related to T301167. Sorry for the confusion.

Change 879634 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[wikidata/query/deploy@master] wdqs: log realpath of dump file

https://gerrit.wikimedia.org/r/879634

Change 879636 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[wikidata/query/rdf@master] wdqs: log realpath of dump file

https://gerrit.wikimedia.org/r/879636

Change 879634 abandoned by Ryan Kemper:

[wikidata/query/deploy@master] wdqs: log realpath of dump file

Reason:

wrong repo, see https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/879636

https://gerrit.wikimedia.org/r/879634

Change 879636 merged by jenkins-bot:

[wikidata/query/rdf@master] wdqs: log realpath of dump file

https://gerrit.wikimedia.org/r/879636

Change 876217 merged by Ryan Kemper:

[operations/cookbooks@master] wdqs: use NFS for data reloads

https://gerrit.wikimedia.org/r/876217

Update: wdqs2009 has failed to reload 3 times. Each time it seems to go into an OOM state where the server is pingable, but logging in (even from the management console) is impossible.

Change 882664 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: mount NFS to new hosts

https://gerrit.wikimedia.org/r/882664

Change 882664 merged by Bking:

[operations/puppet@production] wdqs: mount NFS to new hosts

https://gerrit.wikimedia.org/r/882664

In order to avoid the OOM situation mentioned above, I've created a large swapfile on wdqs2010 at /srv/.swapfile. Making a note to myself to disable/remove this file once the reload is done.
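For the record, the swapfile setup amounts to the standard sequence below (a sketch; the size is illustrative, and the cleanup half is the note-to-self above):

```lang=python
# The swapfile setup, scripted; the 64G size is illustrative.
import subprocess

SWAPFILE = "/srv/.swapfile"

def add_swap(size: str = "64G") -> None:
    subprocess.run(["fallocate", "-l", size, SWAPFILE], check=True)
    subprocess.run(["chmod", "600", SWAPFILE], check=True)
    subprocess.run(["mkswap", SWAPFILE], check=True)
    subprocess.run(["swapon", SWAPFILE], check=True)

def remove_swap() -> None:
    """Cleanup to run once the reload is done, per the note above."""
    subprocess.run(["swapoff", SWAPFILE], check=True)
    subprocess.run(["rm", SWAPFILE], check=True)
```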

wdqs2010 locked up as well, despite the added swap, so at this point we can rule out OOM. The most likely culprit is NFS, which is already scheduled to be replaced by rsync.

More concerning is the failure of wdqs1009, which ran for 17 days and failed when Blazegraph corrupted its own journal (NFS was not a factor). We are starting the reload process again on wdqs1009 and wdqs1010.

MPhamWMF set the point value for this task to 13. Jan 30 2023, 4:27 PM

Another update: wdqs1009 corrupted itself over the weekend, so we have restarted the reload process again. wdqs1010 is still in progress. Going by individual TTL files, it is at 56%, though I'm not sure each TTL file takes the same amount of processing time. If it does, this run is significantly faster than the previous one, and we should be finished in ~5 days, or ~10 days from when we started.
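The 56% figure comes from counting TTL files; a back-of-the-envelope version of that estimate (the directory, naming pattern and total count are all assumptions):

```lang=python
# Back-of-the-envelope progress estimate by counting loaded TTL files.
# The directory, the wikidump-*.ttl.gz naming and the total are
# assumptions, and the estimate only holds if files take similar time.
from pathlib import Path

def progress(done_dir: str, total_files: int) -> float:
    done = len(list(Path(done_dir).glob("wikidump-*.ttl.gz")))
    return 100.0 * done / total_files

print(f"{progress('/srv/wdqs/munged', 2000):.0f}% of TTL files loaded")
```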

Update: we're at 80% on wdqs1009, 60% on wdqs1010 (unsure why 1009 is moving so much faster).

wdqs1009 corrupted itself over the weekend and we had to restart it. wdqs1010 is at ~80%; wdqs1009 is still in the munging state.

For context, the journal corruptions we are seeing are similar to T263110 (some investigation was done at that time). Our only reasonable option is to try again until it succeeds.
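"Try again until it succeeds" in sketch form; the cookbook invocation and its arguments are illustrative, and in practice the retries were driven by hand:

```lang=python
# "Try again until it succeeds" as a loop; the cookbook invocation and
# its arguments are illustrative.
import subprocess
import time

def reload_once(host: str) -> bool:
    result = subprocess.run(
        ["cookbook", "sre.wdqs.data-reload", "--host", host]  # illustrative
    )
    return result.returncode == 0

def reload_until_success(host: str, max_attempts: int = 5) -> None:
    for attempt in range(1, max_attempts + 1):
        if reload_once(host):
            return
        print(f"attempt {attempt} failed (e.g. corrupted journal), retrying")
        time.sleep(60)
    raise RuntimeError(f"reload of {host} failed {max_attempts} times")
```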

The reload for wdqs1010 completed successfully. We are using this host to seed the other hosts with our data-transfer cookbook (a sketch of the fan-out follows the host list below). We will update the ticket with progress:

  • wdqs1003
  • wdqs1004
  • wdqs1005
  • wdqs1006
  • wdqs1007
  • wdqs1008
  • wdqs1009
  • wdqs1010
  • wdqs1011
  • wdqs1012
  • wdqs1013
  • wdqs1014
  • wdqs1015
  • wdqs1016
  • wdqs2001
  • wdqs2002
  • wdqs2003
  • wdqs2004
  • wdqs2005
  • wdqs2006
  • wdqs2007
  • wdqs2008
  • wdqs2009
  • wdqs2010
  • wdqs2011
  • wdqs2012
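The fan-out itself is roughly the loop below (a sketch; the cookbook name and flags are illustrative):

```lang=python
# Seeding the fleet from wdqs1010; cookbook name and flags are
# illustrative. Each transfer depools the target (the new default
# behavior), copies the Blazegraph journal, then repools.
import subprocess

SOURCE = "wdqs1010.eqiad.wmnet"
TARGETS = ["wdqs1003.eqiad.wmnet", "wdqs1004.eqiad.wmnet"]  # and so on

for target in TARGETS:
    subprocess.run(
        ["cookbook", "sre.wdqs.data-transfer",
         "--source", SOURCE, "--dest", target],  # illustrative flags
        check=True,
    )
```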

Change 873791 merged by jenkins-bot:

[operations/cookbooks@master] wdqs: make depool the default behavior

https://gerrit.wikimedia.org/r/873791

Change 891899 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] wdqs.data-transfer: completely remove defunct argument

https://gerrit.wikimedia.org/r/891899

Change 891899 merged by jenkins-bot:

[operations/cookbooks@master] wdqs.data-transfer: completely remove defunct argument

https://gerrit.wikimedia.org/r/891899

At long last, the data reload is complete!

At this point, we believe the data discrepancies mentioned in the initial ticket message have been resolved. However, please let us know if this is not the case.