Reload WCQS from dumps
Closed, Resolved | Public | 5 Estimated Story Points

Description

Followup on T314703 where the consumer side of the updater was misconfigured and caused all deletes to be ignored.
Replaying the delete log might be tricky to schedule so we should probably reload the full DB from a fresh dump.

Procedure (to be clarified):

  • depool one host (wcqs2001)
  • run the reload script
  • note the date of the dump and position the offset of the wcqs2001 consumer according to this date
  • start the updater
  • wait for the lag to catch up
  • propagate the journal to other nodes using the data-transfer cookbook
NOTE: T314703 must be fixed before running this runbook.

AC:

Event Timeline

Gehel set the point value for this task to 5. (Aug 29 2022, 3:39 PM)

Started looking into this. The first problem is that dumps.wikimedia.your.org has changed its path layout, so a minor change to the data reload script will be necessary to pull from the correct paths and avoid 404s. While we are revisiting this script, it seems worthwhile to reconsider T222349: it looks like we should be able to NFS-mount the appropriate data on specific instances and run the data reloads fully within our own network.

Change 832543 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] Mount labstore to wcqs/wdqs instance for dumps reload

https://gerrit.wikimedia.org/r/832543

Mentioned in SAL (#wikimedia-operations) [2022-09-15T19:26:29Z] <ebernhardson> pool'd wdqs2001, some blockers before reload can start T316236

Started download/munge on wcqs2001 using the internal dumps.wikimedia.org. We can't use dumps.wikimedia.your.org because its dumps are two weeks out of date.

The dumps are dated 20220911

Also stopped wcqs-updater.service on wcqs2001 and disabled puppet so it won't be restarted.

Mentioned in SAL (#wikimedia-operations) [2022-09-15T22:01:20Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wcqs2001.codfw.wmnet with reason: T316236

Mentioned in SAL (#wikimedia-operations) [2022-09-15T22:01:45Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wcqs2001.codfw.wmnet with reason: T316236

The reload that was started on wcqs2001 didn't quite go right. We need to drop the reload scripts from the rdf deploy repo and only use the cookbooks going forward.

Change 833025 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikidata/query/deploy@master] Remove deprecated wcqs-data-reload.sh script

https://gerrit.wikimedia.org/r/833025

To move this forward, one of our SREs will need to run the following and let it go for a couple of days. After that, the sre.wdqs.data-transfer cookbook will need to be used.

cookbook sre.wdqs.data-reload wcqs2001.codfw.wmnet \
    --task-id T316236 \
    --reason 'reloading data' \
    --reuse-downloaded-dump \
    --depool \
    --reload-data=commons \
    --kafka-timestamp=1662854400000

This is currently running in tmux window T316236 on cumin2002.

Change 832543 merged by Ryan Kemper:

[operations/puppet@production] Mount labstore to wcqs/wdqs instance for dumps reload

https://gerrit.wikimedia.org/r/832543

Change 835596 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] Mount labstore to wcqs/wdqs instance for dumps reload

https://gerrit.wikimedia.org/r/835596

@bking what is the status of this ticket?

Apologies, as I lost track of this ticket.

It looks like we're stalled waiting for the NFS puppet config to be merged, at which point we should start the reload again. Let me check with @dcaro and WMCS again just to make sure everything looks OK.

Upon further discussion with @EBernhardson, we'll hold off on the NFS changes for the time being and just load the dumps from HTTP.

For the Kafka timestamp, we use the prior Sunday at midnight, either as a Unix timestamp * 1000 (1665878400000) or as an ISO 8601 timestamp (--kafka-timestamp=2022-10-16T00:00:00).
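As a rough illustration of that arithmetic, here is a minimal Python sketch. It assumes the midnight in question is UTC (which matches 1665878400000 for 2022-10-16) and that "prior Sunday" includes the given day itself when it is already a Sunday. The function names are hypothetical and not part of the cookbook.

```python
from datetime import datetime, timedelta, timezone


def prior_sunday_midnight_utc(day: datetime) -> datetime:
    """Return 00:00 UTC of the most recent Sunday on or before `day`.

    Assumption: `day` is timezone-aware, and a Sunday maps to itself.
    """
    day = day.astimezone(timezone.utc)
    # weekday(): Monday is 0 ... Sunday is 6, so this is days since Sunday.
    days_back = (day.weekday() + 1) % 7
    sunday = day - timedelta(days=days_back)
    return sunday.replace(hour=0, minute=0, second=0, microsecond=0)


def kafka_timestamp_ms(moment: datetime) -> int:
    """Convert an aware datetime to milliseconds since the Unix epoch."""
    return int(moment.timestamp()) * 1000


if __name__ == "__main__":
    # Hypothetical "today", a Tuesday during the week of the second reload.
    today = datetime(2022, 10, 18, 12, 0, tzinfo=timezone.utc)
    sunday = prior_sunday_midnight_utc(today)
    print(kafka_timestamp_ms(sunday))           # epoch-milliseconds form
    print(sunday.strftime("%Y-%m-%dT%H:%M:%S"))  # ISO 8601 form
```

The same helper applied to a date in the week of the first attempt gives 1662854400000, the value passed to the cookbook earlier, which corresponds to the 20220911 dump date.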

Change 835596 merged by Ryan Kemper:

[operations/puppet@production] Mount labstore to wcqs/wdqs instance for dumps reload

https://gerrit.wikimedia.org/r/835596

The data reload is complete, but there were errors.

The complete gzipped output of the cookbook is available here.

@dcausse, let us know if these errors are serious, or if we can move ahead with the following steps:

  • note the date of the dump and position the offset of the wcqs2001 consumer according to this date
  • start the updater
  • wait for the lag to catch up
  • propagate the journal to other nodes using the data-transfer cookbook

@bking I did not spot any errors; the "not found, terminating" line is expected, I guess.

The reload cookbook seems to have taken care of the first three steps you mention, so I think the only remaining step is to propagate the journal using the data-transfer cookbook.

As of this writing, the following hosts have the updated data:

  • wcqs2001
  • wcqs1001
  • wcqs1002

We'll investigate the data transfer failures further tomorrow.

Still not sure exactly why they're failing, but wcqs2002 now also has the new data.
wcqs2003 and wcqs1003 still need it.

Icinga downtime and Alertmanager silence (ID=b3b0c7c7-a37b-4722-8303-1e26dc50d1a3) set by bking@cumin2002 for 5 days, 0:00:00 on 1 host(s) and their services with reason: data reload

wcqs2002.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=c3c6135a-4c18-4dff-b3d6-1cea09c7e58c) set by bking@cumin2002 for 5 days, 0:00:00 on 1 host(s) and their services with reason: data reload

wcqs2003.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=47a1387a-5ea2-4086-addb-7dc9cb66fddf) set by bking@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: data reload

wcqs1003.eqiad.wmnet

wcqs1003 is the only host left that needs a reload:

wcqs2003.codfw.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T  470G  2.1T  19% /srv
wcqs2002.codfw.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T  470G  2.1T  19% /srv
wcqs2001.codfw.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T  487G  2.1T  19% /srv
wcqs1002.eqiad.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T  470G  2.1T  19% /srv
wcqs1001.eqiad.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T  470G  2.1T  19% /srv
wcqs1003.eqiad.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T   57G  2.5T   3% /srv
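
The comparison above was done by eye. Purely as an illustration, the following Python sketch parses output in that format and flags hosts whose /srv usage is well below the typical value. The 50%-of-median threshold and all function names are assumptions made for the example, not something the cookbooks do.

```python
import re
from statistics import median

# Sample copied from the output above (host header line, then one df line).
SAMPLE = """\
wcqs2003.codfw.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T  470G  2.1T  19% /srv
wcqs2002.codfw.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T  470G  2.1T  19% /srv
wcqs2001.codfw.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T  487G  2.1T  19% /srv
wcqs1002.eqiad.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T  470G  2.1T  19% /srv
wcqs1001.eqiad.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T  470G  2.1T  19% /srv
wcqs1003.eqiad.wmnet | CHANGED | rc=0 >>
/dev/mapper/vg0-srv   2.7T   57G  2.5T   3% /srv
"""

HOST_LINE = re.compile(r"^(\S+) \| \w+ \| rc=\d+ >>$")
# Human-readable df suffixes, converted to GiB.
UNITS_GIB = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}


def to_gib(size):
    """Convert a value such as '470G' or '2.7T' to GiB."""
    return float(size[:-1]) * UNITS_GIB[size[-1]]


def parse_used(output):
    """Map each host to the 'Used' column of the df line that follows it."""
    used = {}
    host = None
    for raw in output.splitlines():
        line = raw.strip()
        match = HOST_LINE.match(line)
        if match:
            host = match.group(1)
        elif host and line:
            # df columns: filesystem, size, used, available, use%, mount point
            used[host] = to_gib(line.split()[2])
            host = None
    return used


def hosts_needing_reload(used, ratio=0.5):
    """Hosts using less than `ratio` times the median (hypothetical threshold)."""
    typical = median(used.values())
    return sorted(h for h, gib in used.items() if gib < ratio * typical)


if __name__ == "__main__":
    print(hosts_needing_reload(parse_used(SAMPLE)))
```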

We have delayed fixing this so we can test changes to the data transfer cookbook in T321605. We hope to be finished with these changes by the end of the week.

The reload is complete. However, we had to reboot wcqs1003.eqiad.wmnet several times before it would actually load the OS, and the BIOS displayed disk errors every time:

Initializing Serial ATA devices...
 Port A: Device initialization error
 Port B: MTFDDAK1T9TDT
 Port C: MTFDDAK1T9TDT
 Port D: MTFDDAK1T9TDT

I'll open a separate ticket to troubleshoot this further.

Opened T323380 for the disk errors, closing this one for now.

RKemper subscribed.

Moved to Needs Reporting. Usually we leave tickets open but move them to needs reporting and then gehel closes as resolved after he reviews them, but I'll leave this task resolved for now to avoid re-opening it.

Changing back to Open because it occurred to me that gehel probably only sees tickets that are in Needs Reporting and also Open (not resolved).

Change 867646 had a related patch set uploaded (by Bking; author: Ebernhardson):

[operations/puppet@production] Mount labstore to wcqs/wdqs instance for dumps reload

https://gerrit.wikimedia.org/r/867646

Change 867646 merged by Bking:

[operations/puppet@production] Mount labstore to wcqs/wdqs instance for dumps reload

https://gerrit.wikimedia.org/r/867646

Change 833025 abandoned by DCausse:

[wikidata/query/deploy@master] Remove deprecated wcqs-data-reload.sh script

Reason:

should be automatically removed during the next deploy

https://gerrit.wikimedia.org/r/833025