
WDQS Data Reload
Closed, Resolved · Public · 13 Estimated Story Points

Description

To address data discrepancies in WDQS, we need to reload data on all WDQS servers. In particular, this will address T322869. As part of this data reload, we might want to improve the automation to save time on this reload and future ones. The exact scope of those improvements still needs to be defined.

AC:

  • Data has been reloaded from the dumps on all WDQS servers
  • All servers return the same answers to queries
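As a lightweight proxy for the second criterion, something like the sketch below could spot-check that hosts agree, by comparing each host's last-update timestamp. This is an illustration, not the official verification procedure; the host names and the /sparql endpoint path are assumptions.

```lang=python
# Hypothetical spot check: ask each host for its last-update timestamp
# and compare. Host names and the /sparql endpoint path are assumptions.
import requests

HOSTS = ["wdqs1009.eqiad.wmnet", "wdqs2009.codfw.wmnet"]  # example subset
QUERY = """
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX schema: <http://schema.org/>
SELECT ?updated WHERE { wikibase:Dump schema:dateModified ?updated }
"""

def last_updated(host: str) -> str:
    resp = requests.get(
        f"http://{host}/sparql",
        params={"query": QUERY, "format": "json"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"][0]["updated"]["value"]

for host in HOSTS:
    print(host, last_updated(host))
```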

Note:

  • We might want to improve the resiliency of data transfer while working on this ticket

Event Timeline

Gehel triaged this task as High priority. Nov 21 2022, 4:29 PM
Gehel updated the task description.
Gehel moved this task from Incoming to Operations/SRE on the Wikidata-Query-Service board.

@RKemper since we are doing data reloads on everything, I'm thinking we should revisit this PR, as it discusses a few ways to speed up the process.

Change 867646 had a related patch set uploaded (by Bking; author: Ebernhardson):

[operations/puppet@production] Mount labstore to wcqs/wdqs instance for dumps reload

https://gerrit.wikimedia.org/r/867646

Change 867646 merged by Bking:

[operations/puppet@production] Mount labstore to wcqs/wdqs instance for dumps reload

https://gerrit.wikimedia.org/r/867646

The following hosts now have the clouddumps NFS share mounted:

  • wcqs2001.codfw.wmnet
  • wdqs1009.eqiad.wmnet
  • wdqs2001.codfw.wmnet
  • wdqs2009.codfw.wmnet

The TTL files we need are in /mnt/nfs/dumps-clouddumps100[12].wikimedia.org/other/wikibase/commonswiki for commons and /mnt/nfs/dumps-clouddumps100[12].wikimedia.org/other/wikibase/wikidatawiki for wdqs.
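For reference, a quick way to enumerate what's available on the share (a sketch; the dated-subdirectory layout and file extensions are assumptions):

```lang=python
# Illustrative listing of the TTL dumps on the clouddumps share; the
# dated-subdirectory layout and file names are assumptions.
from pathlib import Path

BASE = Path("/mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/wikibase")

for wiki in ("wikidatawiki", "commonswiki"):
    root = BASE / wiki
    if root.is_dir():
        for dump in sorted(root.glob("*/*.ttl.*")):
            print(dump)
```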

The next step will be to update the cookbook; I've created T325114 for that work.

Change 868440 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: allow mounting clouddumps share from wdqs2009

https://gerrit.wikimedia.org/r/868440

Change 868440 merged by Bking:

[operations/puppet@production] wdqs: allow mounting clouddumps share from wdqs2009

https://gerrit.wikimedia.org/r/868440

Change 869828 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] query_service: add wdqs/wcqs hosts as rsync clients to clouddumps

https://gerrit.wikimedia.org/r/869828

Change 869828 merged by Bking:

[operations/puppet@production] query_service: add wdqs/wcqs hosts as rsync clients to clouddumps

https://gerrit.wikimedia.org/r/869828

Change 873791 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] wdqs: make depool the default behavior

https://gerrit.wikimedia.org/r/873791

Data reload is currently running on wdqs2009, in a tmux window ("wdqs-reload") on cumin1001 under my user. I'm using a custom version of the cookbook that loads via NFS. It's not a permanent solution, but it should allow us to complete our immediate task.

Change 876217 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] [WIP] wdqs-data-reload: use NFS for data reloads

https://gerrit.wikimedia.org/r/876217
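For a sense of what "use NFS" means here: instead of downloading the dump over HTTP, the reload stages it from the clouddumps mount. A rough sketch of the idea, not the actual cookbook code (the dump file name and the loader invocation are illustrative):

```lang=python
# Rough sketch of the NFS-based reload: stage the dump from the
# clouddumps mount to local disk, then hand it to the loader. The dump
# file name and the loader script are illustrative, not the real code.
import shutil
import subprocess
from pathlib import Path

NFS_DUMP = Path(
    "/mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/wikibase/"
    "wikidatawiki/latest-all.ttl.bz2"  # illustrative file name
)
LOCAL_DIR = Path("/srv/wdqs/dumps")

def stage_dump() -> Path:
    """Copy the dump from NFS to local disk before loading."""
    LOCAL_DIR.mkdir(parents=True, exist_ok=True)
    local = LOCAL_DIR / NFS_DUMP.name
    shutil.copyfile(NFS_DUMP, local)
    return local

def reload_from(dump: Path) -> None:
    """Invoke the (illustrative) loader with the staged dump."""
    subprocess.run(["/srv/wdqs/loadData.sh", str(dump)], check=True)

reload_from(stage_dump())
```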

Sorry if this is the wrong ticket, but several services on wdqs2010, wdqs2011 and wdqs2012 are alerting. The service is returning HTTP 400 responses. My guess is that this is due to the ongoing data reload (no issue). If that is the case, could the "WDQS SPARQL" alerts and other failing checks be acknowledged in Icinga, to prevent alert spam? Thank you!

I was told by @Gehel that it was unrelated to this, but related to T301167. Sorry for the confusion.

Change 879634 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[wikidata/query/deploy@master] wdqs: log realpath of dump file

https://gerrit.wikimedia.org/r/879634

Change 879636 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[wikidata/query/rdf@master] wdqs: log realpath of dump file

https://gerrit.wikimedia.org/r/879636

Change 879634 abandoned by Ryan Kemper:

[wikidata/query/deploy@master] wdqs: log realpath of dump file

Reason:

wrong repo, see https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/879636

https://gerrit.wikimedia.org/r/879634

Change 879636 merged by jenkins-bot:

[wikidata/query/rdf@master] wdqs: log realpath of dump file

https://gerrit.wikimedia.org/r/879636

Change 876217 merged by Ryan Kemper:

[operations/cookbooks@master] wdqs: use NFS for data reloads

https://gerrit.wikimedia.org/r/876217

Update: wdqs2009 has failed to reload 3 times. Each time it seems to go into an OOM state where the server is pingable, but logging in (even from the management console) is impossible.

Change 882664 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: mount NFS to new hosts

https://gerrit.wikimedia.org/r/882664

Change 882664 merged by Bking:

[operations/puppet@production] wdqs: mount NFS to new hosts

https://gerrit.wikimedia.org/r/882664

In order to avoid the OOM situation mentioned above, I've created a large swapfile on wdqs2010 at /srv/.swapfile. Making a note to myself to disable/remove this file once the reload is done.
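For the record, the swapfile setup amounts to the standard sequence below (a sketch; the size is illustrative, and the cleanup half is the note-to-self above):

```lang=python
# The swapfile setup, scripted; the 64G size is illustrative.
import subprocess

SWAPFILE = "/srv/.swapfile"

def add_swap(size: str = "64G") -> None:
    subprocess.run(["fallocate", "-l", size, SWAPFILE], check=True)
    subprocess.run(["chmod", "600", SWAPFILE], check=True)
    subprocess.run(["mkswap", SWAPFILE], check=True)
    subprocess.run(["swapon", SWAPFILE], check=True)

def remove_swap() -> None:
    """Cleanup to run once the reload is done, per the note above."""
    subprocess.run(["swapoff", SWAPFILE], check=True)
    subprocess.run(["rm", SWAPFILE], check=True)
```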

wdqs2010 locked up as well, despite the added swap, so at this point we can rule out OOM. The most likely culprit is NFS, which is already scheduled to be replaced by rsync.

More concerning is the failure of wdqs1009, which ran for 17 days and failed when Blazegraph corrupted its own journal (NFS was not a factor). We are starting the reload process again on wdqs1009 and wdqs1010.

MPhamWMF set the point value for this task to 13. Jan 30 2023, 4:27 PM

Another update: wdqs1009 corrupted itself over the weekend, so we have restarted the reload process again. wdqs1010 is still in progress. Going by individual TTL files, it is at 56%, though I'm not sure each TTL file takes the same amount of processing time. If it does, this run is significantly faster than the previous one, and we should be finished in ~5 days, or ~10 days from when we started.
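The 56% figure comes from counting TTL files; a back-of-the-envelope version of that estimate (the directory, naming pattern and total count are all assumptions):

```lang=python
# Back-of-the-envelope progress estimate by counting loaded TTL files.
# The directory, the wikidump-*.ttl.gz naming and the total are
# assumptions, and the estimate only holds if files take similar time.
from pathlib import Path

def progress(done_dir: str, total_files: int) -> float:
    done = len(list(Path(done_dir).glob("wikidump-*.ttl.gz")))
    return 100.0 * done / total_files

print(f"{progress('/srv/wdqs/munged', 2000):.0f}% of TTL files loaded")
```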

Update: we're at 80% on wdqs1009, 60% on wdqs1010 (unsure why 1009 is moving so much faster).

wdqs1009 corrupted itself over the weekend and we had to restart it. wdqs1010 is at ~80%; wdqs1009 is still in the munging state.

For context, the journal corruptions we are seeing are similar to T263110 (some investigation was done at that time). Our only reasonable option is to try again until it succeeds.
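"Try again until it succeeds" in sketch form; the cookbook invocation and its arguments are illustrative, and in practice the retries were driven by hand:

```lang=python
# "Try again until it succeeds" as a loop; the cookbook invocation and
# its arguments are illustrative.
import subprocess
import time

def reload_once(host: str) -> bool:
    result = subprocess.run(
        ["cookbook", "sre.wdqs.data-reload", "--host", host]  # illustrative
    )
    return result.returncode == 0

def reload_until_success(host: str, max_attempts: int = 5) -> None:
    for attempt in range(1, max_attempts + 1):
        if reload_once(host):
            return
        print(f"attempt {attempt} failed (e.g. corrupted journal), retrying")
        time.sleep(60)
    raise RuntimeError(f"reload of {host} failed {max_attempts} times")
```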

The reload for wdqs1010 completed successfully. We are using this host to seed the other hosts with our data-transfer cookbook (a sketch of the fan-out follows the host list below). We will update the ticket with progress:

  • wdqs1003
  • wdqs1004
  • wdqs1005
  • wdqs1006
  • wdqs1007
  • wdqs1008
  • wdqs1009
  • wdqs1010
  • wdqs1011
  • wdqs1012
  • wdqs1013
  • wdqs1014
  • wdqs1015
  • wdqs1016
  • wdqs2001
  • wdqs2002
  • wdqs2003
  • wdqs2004
  • wdqs2005
  • wdqs2006
  • wdqs2007
  • wdqs2008
  • wdqs2009
  • wdqs2010
  • wdqs2011
  • wdqs2012
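The fan-out itself is roughly the loop below (a sketch; the cookbook name and flags are illustrative):

```lang=python
# Seeding the fleet from wdqs1010; cookbook name and flags are
# illustrative. Each transfer depools the target (the new default
# behavior), copies the Blazegraph journal, then repools.
import subprocess

SOURCE = "wdqs1010.eqiad.wmnet"
TARGETS = ["wdqs1003.eqiad.wmnet", "wdqs1004.eqiad.wmnet"]  # and so on

for target in TARGETS:
    subprocess.run(
        ["cookbook", "sre.wdqs.data-transfer",
         "--source", SOURCE, "--dest", target],  # illustrative flags
        check=True,
    )
```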

Change 873791 merged by jenkins-bot:

[operations/cookbooks@master] wdqs: make depool the default behavior

https://gerrit.wikimedia.org/r/873791

Change 891899 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] wdqs.data-transfer: completely remove defunct argument

https://gerrit.wikimedia.org/r/891899

Change 891899 merged by jenkins-bot:

[operations/cookbooks@master] wdqs.data-transfer: completely remove defunct argument

https://gerrit.wikimedia.org/r/891899

At long last, the data reload is complete!

At this point, we believe the data discrepancies mentioned in the initial ticket message have been resolved. However, please let us know if this is not the case.