Migrate labstore1004/labstore1005 to Stretch/Buster
Open, Medium, Public

Description

These are currently running jessie:

  • labstore1004.eqiad.wmnet
  • labstore1005.eqiad.wmnet

Event Timeline

Bstorm added a subscriber: Bstorm.
ArielGlenn triaged this task as Medium priority. Jun 11 2019, 7:53 AM

Change 566873 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] nfs: puppetize a cloud-vps nfs testbed

https://gerrit.wikimedia.org/r/566873

Change 566873 merged by Bstorm:
[operations/puppet@production] nfs: puppetize a cloud-vps nfs testbed

https://gerrit.wikimedia.org/r/566873

Change 567116 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: remove profile from top-level module

https://gerrit.wikimedia.org/r/567116

Change 567116 merged by Bstorm:
[operations/puppet@production] labstore: remove profile from top-level module

https://gerrit.wikimedia.org/r/567116

Change 567142 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: finish up making this class work on VMs

https://gerrit.wikimedia.org/r/567142

Change 567142 merged by Bstorm:
[operations/puppet@production] labstore: finish up making this class work on VMs

https://gerrit.wikimedia.org/r/567142

Change 567160 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore test: add the last couple ferm rules to let drbd work

https://gerrit.wikimedia.org/r/567160

Change 567160 merged by Bstorm:
[operations/puppet@production] cloudstore test: add the last couple ferm rules to let drbd work

https://gerrit.wikimedia.org/r/567160

Change 571821 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: remove dependency on bind mounts

https://gerrit.wikimedia.org/r/571821

Change 573422 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: Update the nfs_hostlist script

https://gerrit.wikimedia.org/r/573422

Change 573422 merged by Bstorm:
[operations/puppet@production] cloudstore: Update the nfs_hostlist script

https://gerrit.wikimedia.org/r/573422

Change 571821 merged by Bstorm:
[operations/puppet@production] cloudstore: remove dependency on bind mounts

https://gerrit.wikimedia.org/r/571821

Bstorm changed the task status from Open to Stalled. Mar 16 2020, 7:31 PM

While we now have an improved failover experience with these systems, there are concerns whenever they are rebooted. Both have had warnings and issues in the past (such as T169286: labstore1005 A PCIe link training failure error on boot), and they are quite old machines. I do not think we should move forward with the upgrade to Debian Stretch, which will require reboots, until we can be sure that we have datacenter support for the process.

Bstorm changed the task status from Stalled to Open. Tue, May 19, 9:10 PM

Upgrading labstore1005 on Thursday this week.

Dzahn removed a subscriber: Dzahn. Wed, May 20, 10:40 AM

Mentioned in SAL (#wikimedia-operations) [2020-05-21T17:04:24Z] <bstorm_> starting labstore1005 upgrades T224582

Mentioned in SAL (#wikimedia-operations) [2020-05-21T20:44:39Z] <bstorm_> labstore1005 is now running stretch and drbd devices are resyncing after several reboots and some significant effort T224582

Ok, so labstore1005 upgrade notes.

  • Downtimed the server.
  • sudo puppet agent --disable "upgrading to stretch [bstorm]"
  • sudo apt update
  • sudo apt upgrade
  • sudo apt dist-upgrade
  • reboot
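
The first pass above could be wrapped in a dry-run helper so the sequence can be reviewed before anything is touched. This is only a sketch: the `run` and `first_pass` function names and the `DRY_RUN` variable are illustrative and not part of the original notes, the downtime step is left out because no command for it was recorded, and the step that points APT at stretch is not recorded in the notes and is not guessed here. Running `reboot` through `sudo` is also an assumption.

```shell
#!/bin/sh
# Hypothetical helper: prints each step instead of running it unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf 'would run: %s\n' "$*"
    else
        "$@"
    fi
}

# The commands are the ones listed in the notes, in the same order.
first_pass() {
    run sudo puppet agent --disable "upgrading to stretch [bstorm]"
    run sudo apt update
    run sudo apt upgrade
    run sudo apt dist-upgrade
    run sudo reboot
}

# Default is a dry run; set DRY_RUN=0 to execute for real.
first_pass
```

Keeping the default at dry-run means that sourcing or running the file by accident only prints the plan.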

At this point, all is good except some odd hangs around DRBD and disk IO.

At this point there's a conflict with the odd redirect done by nfsd-ldap (which is our WMF-custom package for applying LDAP only to the nfs daemon).
This was fixed by:

  • uninstalling nfsd-ldap (failing)
  • rm /usr/sbin/rpc.mountd.real and rm /usr/sbin/rpc.mountd
  • apt-get install --reinstall nfs-kernel-server
  • apt-get install nfsd-ldap

Continued with a dist-upgrade and reboot.

  • sudo apt dist-upgrade
  • sudo rm /opt/puppetlabs/facter/cache/cached_facts/operating\ system
  • reboot
  • enable puppet and run puppet

Now, the device links for LVM volumes were missing, which meant DRBD was broken. I tried many things including invalidating the volumes to force a full resync. The system thought it was "diskless".

The solution here was:

  1. reboots to get udev to recreate links correctly (which worked for misc, but not for tools)
  2. Delete the 44% data tools snapshot (there must have been a large deletion or something from tools recently), and then another reboot.
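
A quick way to see which LVM device links are missing after a reboot might look like the sketch below. It assumes the usual `/dev/<vg>/<lv>` link layout. The `missing_lv_links` function and the `DEV_ROOT` variable are illustrative names, not something used in the original work, and the volume group and logical volume names are deliberately not filled in because the notes do not give them.

```shell
#!/bin/sh
# Reads "vg lv" pairs on stdin and prints each expected device link that is missing.
# DEV_ROOT is an illustrative knob for testing; on a real host it would stay /dev.
DEV_ROOT="${DEV_ROOT:-/dev}"

missing_lv_links() {
    while read -r vg lv; do
        # Skip blank or incomplete lines.
        [ -n "$vg" ] && [ -n "$lv" ] || continue
        # -e follows symlinks, so a dangling link is reported as missing too.
        if [ ! -e "$DEV_ROOT/$vg/$lv" ]; then
            printf '%s\n' "$DEV_ROOT/$vg/$lv"
        fi
    done
}

# On a host with LVM installed, the pairs could come from lvs, for example:
#   sudo lvs --noheadings -o vg_name,lv_name | missing_lv_links
```

Empty output would mean every link is present, so running it after each reboot shows whether udev recreated them.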

DRBD is syncing and all is well except that I managed to get the PCI-E link training failure from T169286 on one of the boots, which seems to have no significant impact other than being scary.
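
To confirm that no device still reports the "diskless" state mentioned earlier, the DRBD status text can be filtered with a small helper. This is a sketch that assumes the DRBD 8.x `/proc/drbd` layout, where each device line carries a `ds:<local>/<peer>` field. The notes do not show the actual output, and `diskless_minors` is a hypothetical name.

```shell
#!/bin/sh
# Prints the minor number of every device whose local disk state is Diskless.
# Input is text in the DRBD 8.x /proc/drbd layout (an assumption; not shown in the notes).
diskless_minors() {
    # Device lines look like " 0: cs:... ro:.../... ds:<local>/<peer> ..."
    awk '$1 ~ /^[0-9]+:$/ {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^ds:Diskless\//) {
                sub(/:$/, "", $1)
                print $1
            }
        }
    }'
}

# On a real host:  diskless_minors < /proc/drbd
```

No output means no device is locally diskless. It says nothing about resync progress, which still has to be read from the status text itself.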

Bstorm updated the task description. Thu, May 21, 8:53 PM

Change 597868 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: the monitor_systemd_service module doesn't work with drbd

https://gerrit.wikimedia.org/r/597868

Change 597873 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: current setup doesn't allow check_call against exportfs

https://gerrit.wikimedia.org/r/597873

Change 597868 merged by Bstorm:
[operations/puppet@production] labstore: the monitor_systemd_service module doesn't work with drbd

https://gerrit.wikimedia.org/r/597868

Change 597873 merged by Bstorm:
[operations/puppet@production] labstore: current setup doesn't allow check_call against exportfs

https://gerrit.wikimedia.org/r/597873

Mentioned in SAL (#wikimedia-operations) [2020-05-25T07:36:25Z] <moritzm> installed linux-imageamd64 on labstore (current meta package for kernels following the Stretch update) T224582

Mentioned in SAL (#wikimedia-operations) [2020-05-25T07:36:39Z] <moritzm> installed linux-image-amd64 on labstore1005 (current meta package for kernels following the Stretch update) T224582

One note for labstore1004: the meta package changed between jessie and stretch. Jessie by default has 3.16, but we were using a custom 4.9 kernel backport, which used an internal meta package called "linux-meta". With Stretch we're just using the default kernel shipped in Debian, along with the default Debian meta packages. I have installed linux-image-amd64 on labstore1005. It's almost the same kernel: both use 4.9.210, and the only difference is that the ~bpo8 kernel is built with an older GCC. We don't need to reboot 1005 again; this can simply align with the next maintenance on it.