These are currently running jessie:
- labstore1004.eqiad.wmnet
- labstore1005.eqiad.wmnet
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | None | | T197804 Puppet: forbid new Python2 code |
| Open | None | | T218426 Upgrade various Cloud VPS Python 2 scripts to Python 3 |
| Resolved | BUG REPORT | Bstorm | T218423 Add python 3 packages to openstack::clientpackages::common |
| Resolved | | MoritzMuehlenhoff | T232677 Remove support for Debian Jessie in Cloud Services |
| Restricted Task | | | |
| Restricted Task | | | |
| Resolved | | MoritzMuehlenhoff | T224549 Track remaining jessie systems in production |
| Resolved | | Bstorm | T169289 Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues |
| Resolved | | Bstorm | T169286 labstore1005 A PCIe link training failure error on boot |
| Resolved | | MoritzMuehlenhoff | T169290 New anti-stackclash (4.9.25-1~bpo8+3) kernel super bad for NFS |
| Resolved | | Bstorm | T203254 labstore1004 and labstore1005 high load issues following upgrades |
| Resolved | | Bstorm | T224582 Migrate labstore1004/labstore1005 to Stretch/Buster |
| Resolved | | Bstorm | T253353 Add cluster-awareness to nfs-exportd |
Change 566873 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] nfs: puppetize a cloud-vps nfs testbed
Change 566873 merged by Bstorm:
[operations/puppet@production] nfs: puppetize a cloud-vps nfs testbed
Change 567116 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: remove profile from top-level module
Change 567116 merged by Bstorm:
[operations/puppet@production] labstore: remove profile from top-level module
Change 567142 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: finish up making this class work on VMs
Change 567142 merged by Bstorm:
[operations/puppet@production] labstore: finish up making this class work on VMs
Change 567160 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore test: add the last couple ferm rules to let drbd work
Change 567160 merged by Bstorm:
[operations/puppet@production] cloudstore test: add the last couple ferm rules to let drbd work
Change 571821 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: remove dependency on bind mounts
Change 573422 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: Update the nfs_hostlist script
Change 573422 merged by Bstorm:
[operations/puppet@production] cloudstore: Update the nfs_hostlist script
Change 571821 merged by Bstorm:
[operations/puppet@production] cloudstore: remove dependency on bind mounts
While we now have an improved failover experience with these systems, there are concerns whenever they are rebooted. Both of them have had warnings and issues in the past (such as T169286: labstore1005 A PCIe link training failure error on boot), and they are quite old machines. I do not think we should move forward with the upgrade to Debian Stretch, which will require reboots, until we can be sure that we have datacenter support for the process.
Mentioned in SAL (#wikimedia-operations) [2020-05-21T17:04:24Z] <bstorm_> starting labstore1005 upgrades T224582
Mentioned in SAL (#wikimedia-operations) [2020-05-21T20:44:39Z] <bstorm_> labstore1005 is now running stretch and drbd devices are resyncing after several reboots and some significant effort T224582
Ok, so labstore1005 upgrade notes.
At this point, all is good except some odd hangs around DRBD and disk IO.
At this point there's a conflict with the odd redirect done by nfsd-ldap (which is our WMF-custom package for applying LDAP only to the nfs daemon).
This was fixed by:
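The exact fix isn't captured in these notes; as a hedged sketch (assuming the conflict is a package-level clash between the WMF-custom nfsd-ldap package and the incoming stretch packages), the workaround would look roughly like:

```bash
# Hedged sketch only: nfsd-ldap is the real WMF-custom package named above, but
# the precise conflict resolution used during this upgrade was not recorded.

# See what the package ships and whether it diverts any files.
dpkg -L nfsd-ldap
dpkg-divert --list | grep -i nfs || true

# Remove the custom package so the stretch NFS packages can configure cleanly,
# then reinstall the (stretch-built) version once the dist-upgrade is done.
sudo apt-get remove nfsd-ldap
# ... dist-upgrade and reboot happen here ...
sudo apt-get install nfsd-ldap
```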
Continued with a dist-upgrade and reboot.
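For context, a minimal sketch of a jessie-to-stretch dist-upgrade on a generic Debian host (stock Debian repo names; the actual hosts use WMF's internal apt configuration):

```bash
# Switch the apt sources from jessie to stretch (adjust for internal mirrors;
# the sources.list.d glob assumes at least one extra list file exists).
sudo sed -i 's/jessie/stretch/g' /etc/apt/sources.list /etc/apt/sources.list.d/*.list

# Refresh indexes, upgrade in two stages, then reboot into the new release.
sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade
sudo reboot
```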
Now, the device links for LVM volumes were missing, which meant DRBD was broken. I tried many things including invalidating the volumes to force a full resync. The system thought it was "diskless".
The solution here was:
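The exact commands aren't recorded here either; a hedged sketch of the general recovery pattern (re-activate the LVM volumes so their device links come back, then re-attach the DRBD backing devices) would be something like:

```bash
# Hedged sketch: "Diskless" means DRBD lost sight of its local backing device,
# not that the replicated data is gone; the real steps used were not recorded.

# Rescan and re-activate all volume groups so the /dev/<vg>/<lv> and
# /dev/mapper links are recreated after the upgrade/reboot.
sudo vgscan
sudo vgchange -ay

# Re-attach the backing devices to the DRBD resources and check status.
sudo drbdadm adjust all
sudo /usr/sbin/drbd-overview
```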
DRBD is syncing and all is well except that I managed to get the PCI-E link training failure from T169286 on one of the boots, which seems to have no significant impact other than being scary.
Change 597868 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: the monitor_systemd_service module doesn't work with drbd
Change 597873 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: current setup doesn't allow check_call against exportfs
Change 597868 merged by Bstorm:
[operations/puppet@production] labstore: the monitor_systemd_service module doesn't work with drbd
Change 597873 merged by Bstorm:
[operations/puppet@production] labstore: current setup doesn't allow check_call against exportfs
Mentioned in SAL (#wikimedia-operations) [2020-05-25T07:36:25Z] <moritzm> installed linux-image-amd64 on labstore1004 (current meta package for kernels following the Stretch update) T224582
Mentioned in SAL (#wikimedia-operations) [2020-05-25T07:36:39Z] <moritzm> installed linux-image-amd64 on labstore1005 (current meta package for kernels following the Stretch update) T224582
One note for labstore1004: the kernel meta package changed between jessie and stretch. Jessie ships 3.16 by default, but we were using a custom 4.9 kernel backport pulled in by an internal meta package called "linux-meta". With stretch we are simply using the default kernel shipped in Debian, along with the default Debian meta packages. I have installed linux-image-amd64 on labstore1005 (it's almost the same kernel; both use 4.9.210, and the only difference is that the ~bpo8 kernel is built with an older GCC). We don't need to reboot 1005 again; this can simply align with the next maintenance on it.
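In other words, the jessie-era internal "linux-meta" package gives way to Debian's stock meta package on stretch; the equivalent manual step, as described above, is simply:

```bash
# Install Debian's standard kernel meta package; on stretch this tracks the
# stock 4.9.x kernel. Both the old ~bpo8 backport and the stretch kernel are
# 4.9.210, so the reboot can wait for the next scheduled maintenance window.
sudo apt-get install linux-image-amd64

# Confirm the running kernel and the installed kernel packages.
uname -r
dpkg -l 'linux-image*' | awk '/^ii/ {print $2, $3}'
```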
Mentioned in SAL (#wikimedia-operations) [2020-06-11T16:10:03Z] <bstorm_> downtimed labstore1004 for upgrades T224582
Mentioned in SAL (#wikimedia-operations) [2020-06-11T16:12:52Z] <bstorm_> downtimed labstore1005 for upgrades on 1004 since that will alert as well T224582
Mentioned in SAL (#wikimedia-cloud) [2020-06-11T16:15:54Z] <bstorm_> failing over NFS for labstore1004 to labstore1005 T224582
Mentioned in SAL (#wikimedia-operations) [2020-06-11T16:36:36Z] <bstorm_> rebooting labstore1004 for upgrades T224582
Mentioned in SAL (#wikimedia-operations) [2020-06-11T16:49:06Z] <bstorm_> doing stretch upgrade for labstore1004 T224582
Mentioned in SAL (#wikimedia-operations) [2020-06-11T17:12:26Z] <bstorm_> reboot for stretch upgrade on labstore1004 T224582
Mentioned in SAL (#wikimedia-cloud) [2020-06-11T17:17:36Z] <bstorm_> failing NFS back to labstore1004 to complete the upgrade process T224582
Mentioned in SAL (#wikimedia-cloud) [2020-06-11T17:22:15Z] <bstorm_> delaying failback labstore1004 for drive syncs T224582
Marking off the server since it is now running stretch (and the right kernel). Just finishing up work to get the cluster back to rights.
[bstorm@labstore1004]:~ $ sudo /usr/sbin/drbd-overview
 1:test/0   Connected   Secondary/Primary  UpToDate/UpToDate
 3:misc/0   Connected   Secondary/Primary  UpToDate/UpToDate
 4:tools/0  SyncTarget  Secondary/Primary  Inconsistent/UpToDate
    [=========>..........] sync'ed: 50.2% (2520/5056)M
Once that is resynced, I can fail it all back.
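A hedged sketch of the check gating that failback (resource states as in the drbd-overview output above; the real failover/failback is driven by the puppetized cluster scripts touched in the patches on this task, not by hand):

```bash
# Poll until no DRBD resource is still syncing or inconsistent; only then is it
# safe to fail services back to labstore1004. Sketch only, not the actual
# failover tooling.
while sudo /usr/sbin/drbd-overview | grep -qE 'SyncTarget|Inconsistent'; do
    sleep 60
done
echo "All DRBD resources UpToDate; safe to fail back to labstore1004."
```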
Mentioned in SAL (#wikimedia-cloud) [2020-06-11T19:19:29Z] <bstorm_> proceeding with failback to labstore1004 now that DRBD devices are consistent T224582
Change 604857 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: fix the failover process
Change 604857 merged by Bstorm:
[operations/puppet@production] labstore: fix the failover process