Reload WCQS from dumps
Closed, Resolved · Public · 5 Estimated Story Points

Assigned To

Authored By

	dcausse
	Aug 25 2022, 3:51 PM

Description

Follow-up to T314703, where the consumer side of the updater was misconfigured and caused all deletes to be ignored.
Replaying the delete log might be tricky to schedule so we should probably reload the full DB from a fresh dump.

Procedure (to be clarified):

depool one host (wcqs2001)
run the reload script
note the date of the dump and position the offset of the wcqs2001 consumer according to this date
start the updater
wait for the lag to catch up
propagate the journal to other nodes using the data-transfer cookbook

NOTE: T314703 must be fixed before running this runbook.

AC:

All wcqs are reloaded from fresh dumps
Adapt documentation with a Runbook at https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater

Details

Subject	Repo	Branch	Lines +/-
Remove deprecated wcqs-data-reload.sh script	wikidata/query/deploy	master	+0 -45
Mount labstore to wcqs/wdqs instance for dumps reload	operations/puppet	production	+49 -49
Mount labstore to wcqs/wdqs instance for dumps reload	operations/puppet	production	+66 -1
Mount labstore to wcqs/wdqs instance for dumps reload	operations/puppet	production	+58 -1


Related Objects

		Status	Subtype	Assigned	Task
		Resolved	BUG REPORT	dcausse	T314703 Structured data for deleted files on Commons still visible in SPARQL engine after deletion
		Resolved		bking	T316236 Reload WCQS from dumps

Event Timeline

dcausse created this task.Aug 25 2022, 3:51 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 25 2022, 3:51 PM

dcausse added a parent task: T314703: Structured data for deleted files on Commons still visible in SPARQL engine after deletion.Aug 25 2022, 3:54 PM

Maintenance_bot added a project: Wikidata.Aug 25 2022, 4:29 PM

HenkvD subscribed.Aug 29 2022, 3:21 PM

Gehel moved this task from Incoming to Current work on the Wikidata-Query-Service board.Aug 29 2022, 3:22 PM

Gehel added a project: Discovery-Search (Current work).

dcausse updated the task description. (Show Details)Aug 29 2022, 3:31 PM

Gehel set the point value for this task to 5.Aug 29 2022, 3:39 PM

Gehel moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

Mentioned in SAL (#wikimedia-operations) [2022-09-15T18:15:50Z] <ebernhardson> depool wcqs2001 for T316236

Started looking into this. The first problem is that dumps.wikimedia.your.org has changed its path layout, so a minor change to the data reload script will be necessary to pull from the correct paths and avoid 404s. As long as we are revisiting this script, though, it seems worthwhile to reconsider T222349. It looks like we should be able to NFS-mount the appropriate data on specific instances and run the data reloads fully within our own network.

Change 832543 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] Mount labstore to wcqs/wdqs instance for dumps reload

https://gerrit.wikimedia.org/r/832543

gerritbot added a project: Patch-For-Review.Sep 15 2022, 7:25 PM

Mentioned in SAL (#wikimedia-operations) [2022-09-15T19:26:29Z] <ebernhardson> pool'd wdqs2001, some blockers before reload can start T316236

Started download/munge on wcqs2001 using the internal dumps.wikimedia.org; we can't use dumps.wikimedia.your.org as its dumps are two weeks out of date.

The dumps are dated 20220911

Mentioned in SAL (#wikimedia-operations) [2022-09-15T21:30:12Z] <ebernhardson> depool wcqs2001 for T316236

Also stopped wcqs-updater.service on wcqs2001, and disabled puppet so it won't be restarted.

EBernhardson claimed this task.Sep 15 2022, 9:32 PM

EBernhardson moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

Mentioned in SAL (#wikimedia-operations) [2022-09-15T22:01:20Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wcqs2001.codfw.wmnet with reason: T316236

Mentioned in SAL (#wikimedia-operations) [2022-09-15T22:01:45Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wcqs2001.codfw.wmnet with reason: T316236

The reload that was started on wcqs2001 didn't quite go right. We need to drop the reload scripts from the rdf deploy repo and only use the cookbooks going forward.

Change 833025 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikidata/query/deploy@master] Remove deprecated wcqs-data-reload.sh script

https://gerrit.wikimedia.org/r/833025

To move this forward, one of our SREs will need to run the following and let it go for a couple of days. After that, the sre.wdqs.data-transfer cookbook will need to be used.

cookbook sre.wdqs.data-reload wcqs2001.codfw.wmnet \
    --task-id T316236 \
    --reason 'reloading data' \
    --reuse-downloaded-dump \
    --depool \
    --reload-data=commons \
    --kafka-timestamp=1662854400000

This is currently running in tmux window T316236 on cumin2002.

Change 832543 merged by Ryan Kemper:

[operations/puppet@production] Mount labstore to wcqs/wdqs instance for dumps reload

https://gerrit.wikimedia.org/r/832543

Change 835596 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] Mount labstore to wcqs/wdqs instance for dumps reload

https://gerrit.wikimedia.org/r/835596

@bking what is the status of this ticket?

EBernhardson updated the task description. (Show Details)Oct 18 2022, 6:24 PM

Apologies, as I lost track of this ticket.

It looks like we're stalled waiting for the NFS puppet config to be merged, at which point we should start the reload again. Let me check with @dcaro and WMCS again just to make sure everything looks OK.

bking added a subscriber: dcaro.Oct 18 2022, 6:56 PM

Upon further discussion with @EBernhardson, we'll hold off on the NFS changes for the time being and just load the dumps from HTTP.

For the Kafka timestamp, we use the prior Sunday at midnight, either as a Unix timestamp in milliseconds (1665878400000) or as an ISO 8601 timestamp (--kafka-timestamp=2022-10-16T00:00:00).
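The two values quoted above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper names (`prior_sunday`, `midnight_utc_ms`, `midnight_iso`) are hypothetical and not part of any cookbook. It also assumes "midnight" means midnight UTC, which is consistent with 1665878400000 being 2022-10-16T00:00:00 UTC, and that a date which is already a Sunday maps to itself.

```python
from datetime import date, datetime, timedelta, timezone


def prior_sunday(day: date) -> date:
    """Return the most recent Sunday on or before `day` (assumption: a Sunday maps to itself)."""
    # date.weekday(): Monday == 0 ... Sunday == 6
    return day - timedelta(days=(day.weekday() + 1) % 7)


def midnight_utc_ms(day: date) -> int:
    """Midnight UTC of `day` as a Unix timestamp in milliseconds (assumption: UTC)."""
    dt = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    return int(dt.timestamp()) * 1000


def midnight_iso(day: date) -> str:
    """Midnight of `day` as an ISO 8601 string without a timezone suffix."""
    return datetime(day.year, day.month, day.day).strftime("%Y-%m-%dT%H:%M:%S")


if __name__ == "__main__":
    sunday = prior_sunday(date(2022, 10, 18))
    print(midnight_utc_ms(sunday))  # 1665878400000
    print(midnight_iso(sunday))     # 2022-10-16T00:00:00
```

The same conversion applied to the earlier dump date (2022-09-11) yields 1662854400000, the value passed to --kafka-timestamp in the first reload attempt.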

Change 835596 merged by Ryan Kemper:

[operations/puppet@production] Mount labstore to wcqs/wdqs instance for dumps reload

https://gerrit.wikimedia.org/r/835596

The data reload is complete, but there were errors.

The complete gzipped output of the cookbook is available here.

@dcausse, let us know if these errors are serious, or if we can move ahead with the following steps:

note the date of the dump and position the offset of the wcqs2001 consumer according to this date
start the updater
wait for the lag to catch up
propagate the journal to other nodes using the data-transfer cookbook

@bking I did not spot any errors; the "not found, terminating" line is expected, I guess.

The reload cookbook seems to have taken care of the first three steps you mention, so I think the only remaining step is to propagate the journal using the data-transfer cookbook.

bking mentioned this in T321605: Make WCQS/WDQS data transfer cookbook more reliable .Oct 25 2022, 6:41 PM

As of this writing, the following hosts have the updated data:

wcqs2001
wcqs1001
wcqs1002

We'll investigate the data transfer failures further tomorrow.

HenkvD mentioned this in T314703: Structured data for deleted files on Commons still visible in SPARQL engine after deletion.Oct 26 2022, 2:19 PM

Still not sure exactly why they're failing. wcqs2002 also has the new data.
wcqs2003 and wcqs1003 still need it.

Icinga downtime and Alertmanager silence (ID=b3b0c7c7-a37b-4722-8303-1e26dc50d1a3) set by bking@cumin2002 for 5 days, 0:00:00 on 1 host(s) and their services with reason: data reload

wcqs2002.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=c3c6135a-4c18-4dff-b3d6-1cea09c7e58c) set by bking@cumin2002 for 5 days, 0:00:00 on 1 host(s) and their services with reason: data reload

wcqs2003.codfw.wmnet

bking added a subtask: T322037: Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service.Oct 31 2022, 4:45 PM

RKemper removed a subtask: T322037: Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service.Oct 31 2022, 4:45 PM

Icinga downtime and Alertmanager silence (ID=47a1387a-5ea2-4086-addb-7dc9cb66fddf) set by bking@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: data reload

wcqs1003.eqiad.wmnet

wcqs1003 is the only host left that needs a reload:

wcqs2003.codfw.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T  470G  2.1T  19% /srv
wcqs2002.codfw.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T  470G  2.1T  19% /srv
wcqs2001.codfw.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv             2.7T  487G  2.1T  19% /srv
wcqs1002.eqiad.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T  470G  2.1T  19% /srv
wcqs1001.eqiad.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T  470G  2.1T  19% /srv
wcqs1003.eqiad.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T   57G  2.5T   3% /srv
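The comparison above was done by eye: a host whose /srv usage is far below its peers has not received the journal yet. As a rough sketch of the same check, the Python below parses output in the format shown above and flags outliers. The function names and the "less than half the median" threshold are assumptions made for illustration, not an established rule.

```python
import re
from statistics import median

# Multipliers to convert df -h style sizes to GiB.
UNITS = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}

SAMPLE = """\
wcqs2003.codfw.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T  470G  2.1T  19% /srv
wcqs2002.codfw.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T  470G  2.1T  19% /srv
wcqs2001.codfw.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv             2.7T  487G  2.1T  19% /srv
wcqs1002.eqiad.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T  470G  2.1T  19% /srv
wcqs1001.eqiad.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T  470G  2.1T  19% /srv
wcqs1003.eqiad.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T   57G  2.5T   3% /srv
"""


def parse_used_gib(report: str) -> dict[str, float]:
    """Map each host to the 'used' column of the df line that follows its header."""
    used: dict[str, float] = {}
    host = None
    for line in report.splitlines():
        header = re.match(r"^(\S+)\s+\|\s+\w+\s+\|\s+rc=\d+\s+>>", line)
        if header:
            host = header.group(1)
            continue
        fields = line.split()
        if host and len(fields) >= 6:
            # Columns: filesystem, size, used, avail, use%, mount point
            size = re.fullmatch(r"([\d.]+)([KMGT])", fields[2])
            if size:
                used[host] = float(size.group(1)) * UNITS[size.group(2)]
            host = None
    return used


def hosts_needing_reload(used: dict[str, float], ratio: float = 0.5) -> list[str]:
    """Hosts whose usage is below `ratio` times the median (threshold is an assumption)."""
    threshold = median(used.values()) * ratio
    return sorted(h for h, u in used.items() if u < threshold)


if __name__ == "__main__":
    print(hosts_needing_reload(parse_used_gib(SAMPLE)))  # ['wcqs1003.eqiad.wmnet']
```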

We have delayed fixing this so we can test changes to the data transfer cookbook in T321605. We hope to be finished with these changes by the end of the week.

The reload is complete. However, we had to reboot wcqs1003.eqiad.wmnet several times before it would actually load the OS, and the BIOS displayed disk errors every time:

Initializing Serial ATA devices...
 Port A: Device initialization error
 Port B: MTFDDAK1T9TDT
 Port C: MTFDDAK1T9TDT
 Port D: MTFDDAK1T9TDT

I'll open a separate ticket to troubleshoot this further.

bking mentioned this in T323380: Investigate disk errors on wcqs1003.eqiad.wmnet.Nov 18 2022, 2:59 PM

Opened T323380 for the disk errors, closing this one for now.

bking closed this task as Resolved.Nov 18 2022, 3:42 PM

bking moved this task from In Progress to Needs review on the Discovery-Search (Current work) board.

Moved to Needs Reporting. Usually we leave tickets open but move them to needs reporting and then gehel closes as resolved after he reviews them, but I'll leave this task resolved for now to avoid re-opening it.

In T316236#8415706, @RKemper wrote:

Moved to Needs Reporting. Usually we leave tickets open but move them to needs reporting and then gehel closes as resolved after he reviews them, but I'll leave this task resolved for now to avoid re-opening it.

Changing back to Open because it occurred to me that gehel probably only sees tickets that are in Needs Reporting and also Open (not Resolved).

bking added a subtask: T305818: Perform a data transfer to wdqs2004 & wdqs1004 to reclaim burnt allocators.Dec 9 2022, 2:38 PM

bking removed a subtask: T305818: Perform a data transfer to wdqs2004 & wdqs1004 to reclaim burnt allocators.

Gehel closed this task as Resolved.Dec 9 2022, 4:00 PM

Change 867646 had a related patch set uploaded (by Bking; author: Ebernhardson):

[operations/puppet@production] Mount labstore to wcqs/wdqs instance for dumps reload

https://gerrit.wikimedia.org/r/867646

Change 867646 merged by Bking:

[operations/puppet@production] Mount labstore to wcqs/wdqs instance for dumps reload

https://gerrit.wikimedia.org/r/867646

bking mentioned this in T325114: Update wdqs/wcqs data reload cookbook to use NFS mounts instead of external site and autodetect kafka timestamp from dumps.Dec 13 2022, 10:52 PM

Change 833025 abandoned by DCausse: