
Move maps and scratch on cloudstore1008/9 to a DRBD failover similar to labstore1004/5
Closed, ResolvedPublic

Description

On reviewing the setup obsessively, I realized that a locking issue will prevent a smooth failover for maps in the event of an NFS server failover. As mentioned in T203469, NFS failover is not very good without shared cluster filesystems, so this isn't the only place where that is true. For now, I will document the failover with the recommendation that the maps servers be rebooted afterward to clean up.

Because failover of the primary cluster now works, this task is now about moving this cluster to that model. The steps are:

  1. Evaluate and fix the issue of having a DRBD network interface/IP (significant work here)
  2. Get all the monitoring and scripts in place for DRBD
  3. Set up the standby server as a standalone DRBD primary to act as an rsync target. Once the puppet code and the interface are all set up (see the consolidated shell sketch after this list):
    • stop NFS and unmount the volume (scratch is our example)
    • destroy the volume just enough to get DRBD to create metadata: dd if=/dev/zero of=/dev/srv/scratch bs=1M count=128
    • drbdadm create-md scratch
    • since this is standalone for now, drbdadm up scratch and then drbdadm disconnect scratch just to be safe
    • drbdadm primary scratch --force is needed to be able to use/mount the volume
    • you should now be able to create a filesystem (remember, you destroyed it): mkfs.ext4 /dev/drbd2
    • remove the volume from /etc/fstab and mount it by hand: mount -o noatime /dev/drbd2 /srv/scratch
  4. Run the initial rsync
  5. Stop the constant rsync after success
  6. Move scratch to use the cluster IP
  7. Fail over to the DRBD primary node
  8. Stop/remove the rsync process
  9. Set up the DRBD secondary (destructive step) and connect the cluster
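
As a rough consolidation of step 3, the standalone conversion on the standby might look like the sketch below. This is only a sketch, assuming the backing volume is /dev/srv/scratch, that the DRBD resource is named scratch and maps to /dev/drbd2, that the puppet-managed resource definition is already in place, and that nfs-kernel-server is the NFS unit name.

  # Sketch: convert the standby's scratch volume to a standalone DRBD primary.
  systemctl stop nfs-kernel-server        # stop NFS before touching the volume
  umount /srv/scratch                     # unmount the old filesystem

  # Destroy just enough of the old data that create-md will proceed.
  dd if=/dev/zero of=/dev/srv/scratch bs=1M count=128
  drbdadm create-md scratch

  # Standalone for now: bring it up but keep it disconnected from any peer.
  drbdadm up scratch
  drbdadm disconnect scratch
  drbdadm primary --force scratch         # no peer yet, so force primary

  # Recreate the filesystem (it was destroyed above) and mount it by hand,
  # after removing the old /etc/fstab entry.
  mkfs.ext4 /dev/drbd2
  mount -o noatime /dev/drbd2 /srv/scratch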

Related Objects

Event Timeline


Removing myself because cookie-licking is bad when I'm not working on it.

Bstorm changed the task status from Stalled to Open. Jun 11 2020, 11:17 PM

I think we know two things about this right now:

  1. This isn't going on cephfs directly.
  2. We should convert this cluster to using DRBD like the primary NFS system. It will work so much better.

Change 607142 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloud nfs: clean up some of the secondary cluster materials

https://gerrit.wikimedia.org/r/607142

Bstorm renamed this task from Improve the failover mechanism for maps on cloudstore1008/9 to Move maps and scratch on cloudstore1008/9 to a DRBD failover similar to labstore1004/5. Jun 22 2020, 11:08 PM
Bstorm updated the task description. (Show Details)

Change 607142 merged by Bstorm:
[operations/puppet@production] cloud nfs: clean up some of the secondary cluster materials

https://gerrit.wikimedia.org/r/607142

@ayounsi I was wondering if you could help me with a mystery on this task. Are the second ports on labstore1004 and labstore1005 connected to each other over a non-routed network, or are they directly linked via a crossover cable?

Looks directly connected.

ayounsi@labstore1004:~$ sudo lldpcli 
[lldpcli] # show neighbors
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
[...]
-------------------------------------------------------------------------------
Interface:    eth1, via: LLDP, RID: 3, Time: 12 days, 11:11:56
  Chassis:     
    ChassisID:    mac 14:18:77:5d:2d:d2
    SysName:      labstore1005.eqiad.wmnet
    SysDescr:     Debian GNU/Linux 9 (stretch) Linux 4.9.0-0.bpo.12-amd64 #1 SMP Debian 4.9.210-1~deb8u1 (2020-02-21) x86_64
    TTL:          120
    MgmtIP:       10.64.37.20
    MgmtIP:       2620:0:861:119:10:64:37:20
    Capability:   Bridge, off
    Capability:   Router, off
    Capability:   Wlan, off
    Capability:   Station, on
  Port:        
    PortID:       mac 14:18:77:5d:2d:d3
    PortDescr:    eth1
-------------------------------------------------------------------------------

Ok, there should be a cable now. This is unblocked and should proceed.

Change 681800 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloudstore: set up secondary_drbd classes

https://gerrit.wikimedia.org/r/681800

Change 681800 merged by Bstorm:

[operations/puppet@production] cloudstore: set up secondary_drbd classes

https://gerrit.wikimedia.org/r/681800

Change 683445 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloudstore: Collapse drbd vs symlinks into one profile

https://gerrit.wikimedia.org/r/683445

Change 683445 merged by Bstorm:

[operations/puppet@production] cloudstore: Collapse drbd vs symlinks into one profile

https://gerrit.wikimedia.org/r/683445

Change 683737 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloudstore: enable drbd on cloudstore1008

https://gerrit.wikimedia.org/r/683737

Change 683737 merged by Bstorm:

[operations/puppet@production] cloudstore: enable drbd on cloudstore1008

https://gerrit.wikimedia.org/r/683737

@Cmjohnson I'm trying to bring up the interface for the cable added on T266192, and I'm just getting no link at all. ethtool reports Link detected: no and ip link shows NO-CARRIER.

I've tried to bring it up on both sides, cloudstore1008 and 1009, but no dice (presuming this is the next port, eno2). All ports other than the main one appear dead on the console. I find it odd that cat6 was mentioned, since I think that's a fibre port. I'm not quite sure what's up with the device, and I need some eyes on the system to help me understand what's going on.

For all I know, I'm just trying the wrong port? I just figure the next port in line seems safe.
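
For reference, the per-port checks look roughly like this (a sketch only; eno2 is just my guess at the right interface, and the commands need root):

  ip link set eno2 up                      # try to bring the candidate port up
  ip -br link show eno2                    # currently reports NO-CARRIER
  ethtool eno2 | grep 'Link detected'      # currently "Link detected: no"
  lldpcli show neighbors ports eno2        # would show the peer host if the cable were live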

Bstorm added a subscriber: wiki_willy.

Assigning to @wiki_willy so it can be prioritized and routed as appropriate, and assigned to someone with time to poke at it.

Hi @Bstorm - @Jclark-ctr and @Andrew typically work together once a week (usually Tuesdays) on any WMCS-related hardware tasks. Are you ok if we include this task in that regular cadence?

Thanks,
Willy

I was hoping it would be discussed this week, but I somehow missed that. I commented on IRC during the specific time window for it, but I didn't ping @Jclark-ctr directly. This is rework from T266192, since something there isn't working quite right. Did that discussion not take place this week?

Hey @Bstorm - both Chris and John are off for the day (EST), but I'll check with them tomorrow and get you an update. Thanks, Willy

Hi @Bstorm - @Jclark-ctr is going to check it out and possibly reach out to you if there are any questions. Thanks, Willy

Change 690043 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloud drbd: change default for this cluster to eno3

https://gerrit.wikimedia.org/r/690043

Change 690043 merged by Bstorm:

[operations/puppet@production] cloud drbd: change default for this cluster to eno3

https://gerrit.wikimedia.org/r/690043

Updated brook with the correct ports being used. The interface now has link.

Assigning back to myself to finish up the task. Thanks @Jclark-ctr !

Bstorm removed subscribers: Andrew, Jclark-ctr, wiki_willy and 2 others.

So in the course of this, I've found that the rsync process is completely busted. It's pointed at the old bindmount /exp. We removed that. So that needs to be fixed, but perhaps I can fix that after the DRBD is configured.

Confirmed that this won't create the volume metadata without either external metadata volumes or zeroing out the volumes. Since they are completely out of sync anyway, zeroing out seems like the way to go.

Once the volumes are repaired, then it makes sense to fix the rsync process.

Mentioned in SAL (#wikimedia-cloud) [2021-05-13T21:25:15Z] <bstorm> converted the maps and scratch volumes on cloudstore1008 (standby) to drbd T224747

Change 690706 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloudstore: fix the sync path for the secondary cluster

https://gerrit.wikimedia.org/r/690706

Change 690706 merged by Bstorm:

[operations/puppet@production] cloudstore: fix the sync path for the secondary cluster

https://gerrit.wikimedia.org/r/690706

Change 690783 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloudstore: set syncserver to only be run with puppet disabled

https://gerrit.wikimedia.org/r/690783

Data is flowing to cloudstore1008. Switching the syncserver process to a manual one so that it can be used for the final migrations.

Change 690795 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloudstore: fix some more settings on the syncserver mess

https://gerrit.wikimedia.org/r/690795

Change 690795 merged by Bstorm:

[operations/puppet@production] cloudstore: fix some more settings on the syncserver mess

https://gerrit.wikimedia.org/r/690795

Change 692426 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloudstore: remove the delete-after bit in the rsync code

https://gerrit.wikimedia.org/r/692426

Change 692426 merged by Bstorm:

[operations/puppet@production] cloudstore: remove the delete-after bit in the rsync code

https://gerrit.wikimedia.org/r/692426

Change 690783 merged by Bstorm:

[operations/puppet@production] cloudstore: set syncserver to only be run with puppet disabled

https://gerrit.wikimedia.org/r/690783

Change 695447 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] nfs: fix the scratch mount setup

https://gerrit.wikimedia.org/r/695447

Change 695447 merged by Bstorm:

[operations/puppet@production] nfs: fix the scratch mount setup

https://gerrit.wikimedia.org/r/695447

@Chippyy, @Multichill and @TheDJ, I've got things set up on the maps and scratch NFS cluster so that it should be possible to fail over far more gracefully than it currently can (the current process is extremely bad). Unfortunately, switching to the new setup requires a failover using the worst of the old process: I'll have to shut down NFS services, sync the files, and then bring up services on the other side.

While the /data/scratch mounts will affect a broad user base, that's not nearly as disruptive as the maps project NFS going offline for a while for you folks. I can proceed with a straightforward plan to reboot all servers afterward or coordinate with your group as people are available.

Let me know how available you are, or whether just counting on rebooting the maps servers is good enough. If rebooting them is good enough, I will announce a schedule for the change and proceed from there, which is probably easiest for everyone. Once this is up on the standalone DRBD node, I should be able to make all further changes without user-facing impact. This is partly in preparation for the Debian Buster upgrades; I'd like those to be a lot smoother than this mess.
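
For what it's worth, the one-time cutover itself would look roughly like the sketch below. This is only an outline under assumptions (unit names, paths, and pushing the final sync over rsync/ssh are illustrative, not the documented procedure), and it leaves out moving the NFS service address:

  # On cloudstore1009 (currently active): stop serving, then do a final catch-up sync.
  systemctl stop nfs-kernel-server
  rsync -a --delete /srv/scratch/ cloudstore1008:/srv/scratch/
  rsync -a --delete /srv/maps/ cloudstore1008:/srv/maps/

  # On cloudstore1008 (new active, already a standalone DRBD primary): start serving.
  systemctl start nfs-kernel-server

  # Then reboot the maps project VMs and remount /data/scratch on other clients as needed.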

@Bstorm the maps-warper3 server should recover fine from a reboot and from having NFS offline for a bit. Many thanks in advance for your work!

Sent an email scheduling this for tomorrow at 1600 UTC.

Mentioned in SAL (#wikimedia-cloud) [2021-07-01T16:18:01Z] <bstorm> downtimed cloudstore1008 and cloudstore1009 to fail over T224747

Mentioned in SAL (#wikimedia-cloud) [2021-07-01T16:27:11Z] <bstorm> failed over cloudstore1009 to cloudstore1008 T224747

Mentioned in SAL (#wikimedia-cloud) [2021-07-01T16:46:53Z] <bstorm> rebooted entire project of VMs and things appear mounted T224747

Mentioned in SAL (#wikimedia-cloud) [2021-07-01T16:47:22Z] <bstorm> remounted scratch everywhere...but mostly tools T224747

Change 702701 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloud nfs: set up cloudstore1009 for DRBD

https://gerrit.wikimedia.org/r/702701

Change 702701 merged by Bstorm:

[operations/puppet@production] cloud nfs: set up cloudstore1009 for DRBD

https://gerrit.wikimedia.org/r/702701

We are replicating!

 1:maps/0     SyncSource Primary/Secondary UpToDate/Inconsistent /srv/maps    ext4 8.0T 5.1T 2.5T 68%
	[>....................] sync'ed:  0.1% (8388304/8388348)M
 2:scratch/0  SyncSource Primary/Secondary UpToDate/Inconsistent /srv/scratch ext4 4.0T 3.2T 570G 86%
	[>...................] sync'ed:  5.4% (3968704/4194172)M
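
To keep an eye on the initial sync until both resources report UpToDate/UpToDate, something like this works (a sketch; drbd-overview is what produced the output above):

  sudo drbd-overview                  # the summary view shown above
  cat /proc/drbd                      # kernel-side view of the same sync progress
  watch -n 60 'sudo drbd-overview'    # poll until the Inconsistent side reaches UpToDate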

Change 702733 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloud nfs: commit solidly to the drbd setup step 1

https://gerrit.wikimedia.org/r/702733

Change 702733 merged by Bstorm:

[operations/puppet@production] cloud nfs: commit solidly to the drbd setup step 1

https://gerrit.wikimedia.org/r/702733

Change 702738 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloud nfs: cleaning up the non-drbd setup

https://gerrit.wikimedia.org/r/702738

Sync is still going:

[bstorm@cloudstore1008]:scratch $ sudo drbd-overview
 1:maps/0     SyncSource Primary/Secondary UpToDate/Inconsistent /srv/maps    ext4 8.0T 5.1T 2.5T 68%
	[>....................] sync'ed:  2.0% (8220948/8388348)M
 2:scratch/0  SyncSource Primary/Secondary UpToDate/Inconsistent /srv/scratch ext4 4.0T 3.2T 566G 86%
	[==>.................] sync'ed: 17.1% (3478396/4194172)M

Unfortunately, running rsync without the delete option left scratch usage a bit too high, and it is now alerting. I'll see if I can get help from the big users to run a cleanup on their folders.
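
To figure out whom to ask, a quick size survey of the top-level scratch directories should be enough (a sketch, run on the NFS server against the mounted volume):

  # List the largest top-level directories on scratch, biggest last.
  sudo du -sh /srv/scratch/* 2>/dev/null | sort -h | tail -n 20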

Things are fully synced at this point.

Change 702738 merged by Bstorm:

[operations/puppet@production] cloud nfs: cleaning up the non-drbd setup

https://gerrit.wikimedia.org/r/702738

Further refactoring might be possible, but that doesn't affect this task, which is done.