
Set up backups of tools and misc data from labstore1004/5 in labstore2003/4
Closed, ResolvedPublic

Description

labstore2001 is not in a good state and is being taken apart tomorrow to inspect the hardware weirdness with the RAID controllers and disks. See T102626 and T149567.

labstore2003/4 have about 22TB between them, and the plan is now to back up (using bdsync) the tools and misc volumes from the new labstore secondary cluster in eqiad (labstore1004/5) to these boxes.

We are also planning to reimage labstore2003/4 before doing this.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 319500 had a related patch set uploaded (by Madhuvishy):
new labstore partman recipe

https://gerrit.wikimedia.org/r/319500

Change 319500 merged by Madhuvishy:
new labstore partman recipe

https://gerrit.wikimedia.org/r/319500

Script wmf_auto_reimage was launched by madhuvishy on neodymium.eqiad.wmnet for hosts:

['labstore2003.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201611030437_madhuvishy_13787.log.

Script wmf_auto_reimage was launched by madhuvishy on neodymium.eqiad.wmnet for hosts:

['labstore2004.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201611030607_madhuvishy_15305.log.

Change 319518 had a related patch set uploaded (by Madhuvishy):
labstore: Add mountpoint at srv for labstore-lvm-noraid partman recipe

https://gerrit.wikimedia.org/r/319518

Change 319518 merged by Madhuvishy:
labstore: Add mountpoint at srv for labstore-lvm-noraid partman recipe

https://gerrit.wikimedia.org/r/319518

The WMF auto reimage script didn't work out due to a puppet issue we discovered in this process (ticket coming soon). Manually reimaged labstore2004 so far.

Change 319530 had a related patch set uploaded (by Madhuvishy):
labstore: Setup secondary backups of tools and misc on labstore2003/4

https://gerrit.wikimedia.org/r/319530

Change 319530 merged by Madhuvishy:
labstore: Setup secondary backups of tools and misc on labstore2003/4

https://gerrit.wikimedia.org/r/319530

Mentioned in SAL (#wikimedia-operations) [2016-11-04T01:54:23Z] <madhuvishy> Manually reimaging labstore2003 (T149870)

Change 319781 had a related patch set uploaded (by Madhuvishy):
labstore: Apply role secondary::backup::tools-project to labstore2003

https://gerrit.wikimedia.org/r/319781

Change 319781 merged by Madhuvishy:
labstore: Apply role secondary::backup::tools-project to labstore2003

https://gerrit.wikimedia.org/r/319781

labstore2003 and labstore2004 are reimaged and set up with weekly bdsync backups for tools and misc respectively. I have a manual initial backup job running in a screen session on each server.

Two things we have noticed that need to happen to close this:

  1. we need to use flock or something to ensure one job is running at a time
  2. we should use snapshot-manager to keep a history of backup-host-local snapshots, both to have a known good version during backups and to have some history, even if limited, for cheap
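
For point 1, a minimal sketch of what "one job at a time" could look like with an exclusive, non-blocking flock. This is an illustration only, not the change that was deployed: the helper name `single_instance` and the lock path in the usage note are hypothetical, and the real job may just as well be wrapped with the `flock` command-line tool.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path):
    """Hold an exclusive flock on lock_path for the duration of the block.

    Raises BlockingIOError immediately if another job already holds the lock.
    (Hypothetical helper; the name and the lock path are assumptions.)
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # LOCK_NB makes the call fail right away instead of queueing behind
        # the job that is already running.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        # Closing the descriptor releases the lock, including on errors.
        os.close(fd)
```

A weekly job would wrap its sync step in `with single_instance("/var/lock/example-backup.lock"):` (path made up for the example) and treat `BlockingIOError` as "previous run still in progress, skip this one".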

@madhuvishy can you update status here? iiuc a flock like thing is in play but we are still keeping only 1 historical version. Could be wrong though :)

Change 334692 had a related patch set uploaded (by Madhuvishy):
nfs: Snapshot backup device on secondary DC before replicating latest from remote

https://gerrit.wikimedia.org/r/334692

Change 334692 merged by Madhuvishy:
[operations/puppet] nfs: Snapshot backup device on secondary DC before replicating latest from remote

https://gerrit.wikimedia.org/r/334692
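
The ordering in that change title (snapshot first, replicate second) can be sketched as a small planning function. This is an assumption-laden illustration, not the merged puppet code or snapshot-manager itself: `plan_backup_run`, the action names, and the `keep` retention parameter are all hypothetical.

```python
def plan_backup_run(existing, new_snapshot, keep):
    """Return the ordered (action, target) steps for one backup run.

    existing:     names of local snapshots, oldest first (hypothetical naming)
    new_snapshot: name for the snapshot taken at the start of this run
    keep:         number of snapshots to retain once the run is finished
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")

    # Freeze the current, known-good state of the backup device first ...
    steps = [("snapshot", new_snapshot)]
    # ... and only then let the sync overwrite the device with remote data.
    steps.append(("replicate", "remote"))

    # Retention: drop everything except the newest `keep` snapshots.
    retained = list(existing) + [new_snapshot]
    for old in retained[:-keep]:
        steps.append(("remove", old))
    return steps
```

Taking the snapshot before the sync means an interrupted or corrupted replication still leaves the previous consistent copy on the backup host.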

Two things we have noticed that need to happen to close this:

  • we need to use flock or something to ensure one job is running at a time
  • we should use snapshot-manager to keep a history of backup-host-local snapshots, both to have a known good version during backups and to have some history, even if limited, for cheap

Both of these are done now. I'll follow up tomorrow once the backup jobs have run, and update status and close this ticket.

Looks like the backup jobs are running fine. Closing this.