These are currently running jessie:
|operations/puppet||production||+2 -2||labstore: fix the failover process|
|operations/puppet||production||+1 -1||labstore: current setup doesn't allow check_call against exportfs|
|operations/puppet||production||+0 -8||labstore: the monitor_systemd_service module doesn't work with drbd|
|operations/puppet||production||+113 -59||cloudstore: remove dependency on bind mounts|
|operations/puppet||production||+127 -45||cloudstore: Update the nfs_hostlist script|
|operations/puppet||production||+19 -0||cloudstore test: add the last couple ferm rules to let drbd work|
|operations/puppet||production||+8 -8||labstore: finish up making this class work on VMs|
|operations/puppet||production||+4 -3||labstore: remove profile from top-level module|
|operations/puppet||production||+177 -0||nfs: puppetize a cloud-vps nfs testbed|
|Open||None||T197804 Puppet: forbid new Python2 code|
|Open||None||T218426 Upgrade various Cloud VPS Python 2 scripts to Python 3|
|Resolved||BUG REPORT||Bstorm||T218423 Add python 3 packages to openstack::clientpackages::common|
|Resolved||MoritzMuehlenhoff||T232677 Remove support for Debian Jessie in Cloud Services|
|Resolved||MoritzMuehlenhoff||T224549 Track remaining jessie systems in production|
|Resolved||Bstorm||T169289 Tool Labs 2017-06-29 Labstore100 kernel upgrade issues|
|Stalled||Bstorm||T169286 labstore1005 A PCIe link training failure error on boot|
|Resolved||MoritzMuehlenhoff||T169290 New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS|
|Resolved||Bstorm||T203254 labstore1004 and labstore1005 high load issues following upgrades|
|Resolved||Bstorm||T224582 Migrate labstore1004/labstore1005 to Stretch/Buster|
|Resolved||Bstorm||T253353 Add cluster-awareness to nfs-exportd|
While failover between these systems is now much improved, rebooting them is still a concern. Both have had warnings and issues in the past (such as T169286: labstore1005 A PCIe link training failure error on boot), and they are quite old machines. I do not think we should move forward with the upgrade to Debian Stretch, which will require reboots, until we are sure that we have datacenter support for the process.
Ok, so labstore1005 upgrade notes.
At this point, all is good except some odd hangs around DRBD and disk IO.
At this point there's a conflict with the odd redirect done by nfsd-ldap (which is our WMF-custom package for applying LDAP only to the nfs daemon).
This was fixed by:
Continued with a dist-upgrade and reboot.
Now the device links for the LVM volumes were missing, which meant DRBD was broken. I tried many things, including invalidating the volumes to force a full resync. The system thought it was "diskless".
The solution here was:
DRBD is syncing and all is well except that I managed to get the PCI-E link training failure from T169286 on one of the boots, which seems to have no significant impact other than being scary.
One note for labstore1004: the kernel meta package changed between jessie and stretch. Jessie ships 3.16 by default, but we were using a custom 4.9 kernel backport which used an internal meta package called "linux-meta". With Stretch we use the default kernel shipped in Debian along with the default Debian meta packages, so I have installed linux-image-amd64 on labstore1005. It is almost the same kernel: both are 4.9.210, and the only difference is that the ~bpo8 kernel is built with an older GCC. We don't need to reboot 1005 again for this; it can simply align with the next maintenance on it.
[bstorm@labstore1004]:~ $ sudo /usr/sbin/drbd-overview
 1:test/0   Connected  Secondary/Primary UpToDate/UpToDate
 3:misc/0   Connected  Secondary/Primary UpToDate/UpToDate
 4:tools/0  SyncTarget Secondary/Primary Inconsistent/UpToDate
    [=========>..........] sync'ed: 50.2% (2520/5056)M
Once that is resynced, I can fail it all back.
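For anyone who wants to script the "is it resynced yet?" check before failing back, here is a minimal sketch in Python. It only parses text shaped like the drbd-overview output pasted above. The column layout is assumed from that paste and may differ between DRBD versions. The names `parse_drbd_overview` and `safe_to_fail_back` are hypothetical helpers, not part of any existing tooling.

```python
import re
from typing import List, NamedTuple


class DrbdResource(NamedTuple):
    name: str     # e.g. "tools/0"
    cstate: str   # connection state, e.g. "Connected" or "SyncTarget"
    roles: str    # local/peer roles, e.g. "Secondary/Primary"
    dstates: str  # local/peer disk states, e.g. "UpToDate/UpToDate"


# Assumed layout: "<minor>:<resource>/<volume> <cstate> <roles> <disk states>"
LINE_RE = re.compile(
    r"^\s*\d+:(?P<name>\S+)\s+(?P<cstate>\S+)"
    r"\s+(?P<roles>\S+/\S+)\s+(?P<dstates>\S+/\S+)"
)

# Sample text copied from the output in this comment thread.
SAMPLE = """\
 1:test/0   Connected  Secondary/Primary UpToDate/UpToDate
 3:misc/0   Connected  Secondary/Primary UpToDate/UpToDate
 4:tools/0  SyncTarget Secondary/Primary Inconsistent/UpToDate
    [=========>..........] sync'ed: 50.2% (2520/5056)M
"""


def parse_drbd_overview(text: str) -> List[DrbdResource]:
    """Return one DrbdResource per resource line; other lines are skipped."""
    resources = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            resources.append(DrbdResource(**match.groupdict()))
    return resources


def safe_to_fail_back(resources: List[DrbdResource]) -> bool:
    """True only if every resource is connected and fully in sync on both sides."""
    return bool(resources) and all(
        r.cstate == "Connected" and r.dstates == "UpToDate/UpToDate"
        for r in resources
    )


if __name__ == "__main__":
    parsed = parse_drbd_overview(SAMPLE)
    for res in parsed:
        print(res.name, res.cstate, res.dstates)
    print("safe to fail back:", safe_to_fail_back(parsed))
```

With the sample above, `tools/0` is still `SyncTarget` with an `Inconsistent` local disk, so the check reports that it is not yet safe to fail back. An empty parse result is also treated as unsafe, so a change in output format does not silently pass.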