
Migrate labstore1004/labstore1005 to Stretch/Buster
Closed, Resolved, Public

Description

These are currently running jessie:

  • labstore1004.eqiad.wmnet
  • labstore1005.eqiad.wmnet

Event Timeline

Bstorm added a subscriber: Bstorm.
ArielGlenn triaged this task as Medium priority. Jun 11 2019, 7:53 AM

Change 566873 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] nfs: puppetize a cloud-vps nfs testbed

https://gerrit.wikimedia.org/r/566873

Change 566873 merged by Bstorm:
[operations/puppet@production] nfs: puppetize a cloud-vps nfs testbed

https://gerrit.wikimedia.org/r/566873

Change 567116 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: remove profile from top-level module

https://gerrit.wikimedia.org/r/567116

Change 567116 merged by Bstorm:
[operations/puppet@production] labstore: remove profile from top-level module

https://gerrit.wikimedia.org/r/567116

Change 567142 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: finish up making this class work on VMs

https://gerrit.wikimedia.org/r/567142

Change 567142 merged by Bstorm:
[operations/puppet@production] labstore: finish up making this class work on VMs

https://gerrit.wikimedia.org/r/567142

Change 567160 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore test: add the last couple ferm rules to let drbd work

https://gerrit.wikimedia.org/r/567160

Change 567160 merged by Bstorm:
[operations/puppet@production] cloudstore test: add the last couple ferm rules to let drbd work

https://gerrit.wikimedia.org/r/567160

Change 571821 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: remove dependency on bind mounts

https://gerrit.wikimedia.org/r/571821

Change 573422 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: Update the nfs_hostlist script

https://gerrit.wikimedia.org/r/573422

Change 573422 merged by Bstorm:
[operations/puppet@production] cloudstore: Update the nfs_hostlist script

https://gerrit.wikimedia.org/r/573422

Change 571821 merged by Bstorm:
[operations/puppet@production] cloudstore: remove dependency on bind mounts

https://gerrit.wikimedia.org/r/571821

Bstorm changed the task status from Open to Stalled. Mar 16 2020, 7:31 PM

While we now have an improved failover experience with these systems, there are concerns whenever they are rebooted. Both have had various warnings and issues in the past (such as T169286: labstore1005 A PCIe link training failure error on boot), and they are quite old machines. I do not think we should move forward with the upgrade to Debian Stretch, which will require reboots, until we can be sure that we have datacenter support for the process.

Bstorm changed the task status from Stalled to Open. May 19 2020, 9:10 PM

Upgrading labstore1005 on Thursday this week.

Dzahn removed a subscriber: Dzahn. May 20 2020, 10:40 AM

Mentioned in SAL (#wikimedia-operations) [2020-05-21T17:04:24Z] <bstorm_> starting labstore1005 upgrades T224582

Mentioned in SAL (#wikimedia-operations) [2020-05-21T20:44:39Z] <bstorm_> labstore1005 is now running stretch and drbd devices are resyncing after several reboots and some significant effort T224582

Ok, so labstore1005 upgrade notes.

  • Downtimed the server.
  • sudo puppet agent --disable "upgrading to stretch [bstorm]"
  • sudo apt update
  • sudo apt upgrade
  • sudo apt dist-upgrade
  • reboot

At this point, all is good except some odd hangs around DRBD and disk IO.

There was also a conflict with the odd redirect done by nfsd-ldap (our WMF-custom package for applying LDAP only to the NFS daemon).
This was fixed by:

  • uninstalling nfsd-ldap (failing)
  • rm /usr/sbin/rpc.mountd.real and rm /usr/sbin/rpc.mountd
  • apt-get install --reinstall nfs-kernel-server
  • apt-get install nfsd-ldap
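
For anyone repeating this, the workaround above can be sketched as a small dry-run script. This is only a sketch under assumptions: the `run` helper, the `fix_nfsd_ldap` function name and the `APPLY` variable are made up for illustration, and the exact uninstall command is not recorded in the notes, so `apt-get remove` is a guess. By default it only prints what it would do.

```shell
#!/bin/sh
# Dry-run sketch of the nfsd-ldap workaround from the notes above.
# Prints each command unless APPLY=1 is set (hypothetical switch, run as root).

run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

fix_nfsd_ldap() {
    # Assumption: the notes do not say which uninstall command was used.
    # They do say it failed, so a failure here is tolerated.
    run apt-get remove nfsd-ldap || true
    # Remove both rpc.mountd paths named in the notes.
    run rm /usr/sbin/rpc.mountd.real
    run rm /usr/sbin/rpc.mountd
    # Put the stock rpc.mountd back, then reinstall the LDAP package.
    run apt-get install --reinstall nfs-kernel-server
    run apt-get install nfsd-ldap
}
```

Calling `fix_nfsd_ldap` with no environment set prints the five commands in order, which makes it easy to review before running anything for real.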

Continued with a dist-upgrade and reboot.

  • sudo apt dist-upgrade
  • sudo rm /opt/puppetlabs/facter/cache/cached_facts/operating\ system
  • reboot
  • enable puppet and run puppet

Now, the device links for LVM volumes were missing, which meant DRBD was broken. I tried many things including invalidating the volumes to force a full resync. The system thought it was "diskless".

The solution here was:

  1. reboots to get udev to recreate links correctly (which worked for misc, but not for tools)
  2. Deleting the tools snapshot, which was at 44% data (there must have been a large deletion or something from tools recently), and then another reboot.
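
A quick way to see whether the device links are back after a reboot is to check for the symlinks directly. The following is a minimal sketch, not what was actually run here: the `missing_links` function is made up for illustration, and the directory and volume names passed to it are placeholders, since the real volume group and LV names are not given in these notes.

```shell
#!/bin/sh
# missing_links DIR NAME...
# Prints every NAME that has no usable symlink under DIR.
missing_links() {
    dir=$1
    shift
    for name in "$@"; do
        # -L is true only for a symlink; -e follows the link, so a
        # dangling link is reported as missing too.
        if [ ! -L "$dir/$name" ] || [ ! -e "$dir/$name" ]; then
            echo "$name"
        fi
    done
}
```

For example, `missing_links /dev/vg_example misc tools` (with `vg_example` standing in for the real volume group) prints nothing when both links exist and lists whichever are still missing otherwise.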

DRBD is syncing and all is well except that I managed to get the PCI-E link training failure from T169286 on one of the boots, which seems to have no significant impact other than being scary.

Bstorm updated the task description. May 21 2020, 8:53 PM

Change 597868 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: the monitor_systemd_service module doesn't work with drbd

https://gerrit.wikimedia.org/r/597868

Change 597873 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: current setup doesn't allow check_call against exportfs

https://gerrit.wikimedia.org/r/597873

Change 597868 merged by Bstorm:
[operations/puppet@production] labstore: the monitor_systemd_service module doesn't work with drbd

https://gerrit.wikimedia.org/r/597868

Change 597873 merged by Bstorm:
[operations/puppet@production] labstore: current setup doesn't allow check_call against exportfs

https://gerrit.wikimedia.org/r/597873

Mentioned in SAL (#wikimedia-operations) [2020-05-25T07:36:25Z] <moritzm> installed linux-imageamd64 on labstore (current meta package for kernels following the Stretch update) T224582

Mentioned in SAL (#wikimedia-operations) [2020-05-25T07:36:39Z] <moritzm> installed linux-image-amd64 on labstore1005 (current meta package for kernels following the Stretch update) T224582

One note for labstore1004: the kernel meta package changed between jessie and stretch. Jessie by default has 3.16, but we were using a custom 4.9 kernel backport which used an internal meta package called "linux-meta". With Stretch we're just using the default kernel shipped in Debian, along with the default Debian meta packages. I have installed linux-image-amd64 on labstore1005. It's almost the same kernel: both use 4.9.210, and the only difference is that the ~bpo8 kernel is built with an older GCC. We don't need to reboot 1005 again; this can simply align with the next maintenance on it.

Mentioned in SAL (#wikimedia-operations) [2020-06-11T16:10:03Z] <bstorm_> downtimed labstore1004 for upgrades T224582

Ah ok. Good to know :)

Mentioned in SAL (#wikimedia-operations) [2020-06-11T16:12:52Z] <bstorm_> downtimed labstore1005 for upgrades on 1004 since that will alert as well T224582

Mentioned in SAL (#wikimedia-cloud) [2020-06-11T16:15:54Z] <bstorm_> failing over NFS for labstore1004 to labstore1005 T224582

Mentioned in SAL (#wikimedia-operations) [2020-06-11T16:36:36Z] <bstorm_> rebooting labstore1004 for upgrades T224582

Mentioned in SAL (#wikimedia-operations) [2020-06-11T16:49:06Z] <bstorm_> doing stretch upgrade for labstore1004 T224582

Mentioned in SAL (#wikimedia-operations) [2020-06-11T17:12:26Z] <bstorm_> reboot for stretch upgrade on labstore1004 T224582

Mentioned in SAL (#wikimedia-cloud) [2020-06-11T17:17:36Z] <bstorm_> failing NFS back to labstore1004 to complete the upgrade process T224582

Mentioned in SAL (#wikimedia-cloud) [2020-06-11T17:22:15Z] <bstorm_> delaying failback labstore1004 for drive syncs T224582

Bstorm updated the task description. Jun 11 2020, 6:14 PM

Marking off the server since it is now running stretch (and the right kernel). Just finishing up work to get the cluster back to rights.

[bstorm@labstore1004]:~ $ sudo /usr/sbin/drbd-overview
 1:test/0   Connected  Secondary/Primary UpToDate/UpToDate
 3:misc/0   Connected  Secondary/Primary UpToDate/UpToDate
 4:tools/0  SyncTarget Secondary/Primary Inconsistent/UpToDate
	[=========>..........] sync'ed: 50.2% (2520/5056)M

Once that is resynced, I can fail it all back.
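
The "is it safe to fail back yet" check can be sketched as a small filter over the drbd-overview output shown above. This is a rough sketch that assumes the column layout stays as in that paste (resource, connection state, roles, disk states); the `drbd_all_consistent` name is made up for illustration.

```shell
#!/bin/sh
# Reads drbd-overview output on stdin.
# Exits 0 only if every resource line is Connected and UpToDate/UpToDate.
drbd_all_consistent() {
    awk '
        # Resource lines start with "<minor>:<name>/<volume>";
        # the progress-bar line does not, so it is skipped.
        $1 ~ /^[0-9]+:/ {
            seen = 1
            if ($2 != "Connected" || $4 != "UpToDate/UpToDate") {
                print "not ready: " $1 " (" $2 ", " $4 ")"
                bad = 1
            }
        }
        # No resource lines at all is treated as not ready.
        END { if (bad || !seen) exit 1 }
    '
}
```

Usage would be something like `sudo /usr/sbin/drbd-overview | drbd_all_consistent && echo "ok to fail back"`. Against the output above it would report tools/0 as not ready and exit non-zero.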

Mentioned in SAL (#wikimedia-cloud) [2020-06-11T19:19:29Z] <bstorm_> proceeding with failback to labstore1004 now that DRBD devices are consistent T224582

Bstorm closed this task as Resolved. Jun 11 2020, 7:35 PM
Bstorm claimed this task.

Done.

Change 604857 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: fix the failover process

https://gerrit.wikimedia.org/r/604857

Change 604857 merged by Bstorm:
[operations/puppet@production] labstore: fix the failover process

https://gerrit.wikimedia.org/r/604857