Migrate labstore1004/labstore1005 to Stretch/Buster
Closed, Resolved (Public)

Assigned To

Authored By

	MoritzMuehlenhoff
	May 29 2019, 1:31 PM

Description

These are currently running jessie:

labstore1004.eqiad.wmnet
labstore1005.eqiad.wmnet

Details

Subject	Repo	Branch	Lines +/-
labstore: fix the failover process	operations/puppet	production	+2 -2
labstore: current setup doesn't allow check_call against exportfs	operations/puppet	production	+1 -1
labstore: the monitor_systemd_service module doesn't work with drbd	operations/puppet	production	+0 -8
cloudstore: remove dependency on bind mounts	operations/puppet	production	+113 -59
cloudstore: Update the nfs_hostlist script	operations/puppet	production	+127 -45
cloudstore test: add the last couple ferm rules to let drbd work	operations/puppet	production	+19 -0
labstore: finish up making this class work on VMs	operations/puppet	production	+8 -8
labstore: remove profile from top-level module	operations/puppet	production	+4 -3
nfs: puppetize a cloud-vps nfs testbed	operations/puppet	production	+177 -0


Related Objects

Status	Subtype	Assigned	Task
Invalid		None	T197804 Puppet: forbid new Python2 code
Open		None	T218426 Upgrade various Cloud VPS Python 2 scripts to Python 3
Resolved	BUG REPORT	• Bstorm	T218423 Add python 3 packages to openstack::clientpackages::common
Resolved		MoritzMuehlenhoff	T232677 Remove support for Debian Jessie in Cloud Services
			Restricted Task
			Restricted Task
Resolved		MoritzMuehlenhoff	T224549 Track remaining jessie systems in production
Resolved		• Bstorm	T169289 Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues
Resolved		• Bstorm	T169286 labstore1005 A PCIe link training failure error on boot
Resolved		MoritzMuehlenhoff	T169290 New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS
Resolved		• Bstorm	T203254 labstore1004 and labstore1005 high load issues following upgrades
Resolved		• Bstorm	T224582 Migrate labstore1004/labstore1005 to Stretch/Buster
Resolved		• Bstorm	T253353 Add cluster-awareness to nfs-exportd

Event Timeline

MoritzMuehlenhoff created this task. May 29 2019, 1:31 PM

MoritzMuehlenhoff mentioned this in T224549: Track remaining jessie systems in production.

• Bstorm edited projects, added cloud-services-team (Kanban); removed cloud-services-team. May 31 2019, 9:08 PM

• Bstorm subscribed.

• Bstorm merged a task: T184290: Upgrade labstore servers in eqiad to Stretch. May 31 2019, 9:10 PM

• Bstorm added subscribers: • madhuvishy, Dzahn.

ArielGlenn triaged this task as Medium priority. Jun 11 2019, 7:53 AM

• bd808 added a parent task: T232677: Remove support for Debian Jessie in Cloud Services. Oct 2 2019, 9:14 PM

• bd808 added a project: Cloud-VPS (Debian Jessie Deprecation). Oct 24 2019, 2:51 AM

• bd808 moved this task from Backlog to Hardware on the Cloud-VPS (Debian Jessie Deprecation) board. Oct 24 2019, 2:52 AM

• Bstorm added a parent task: T169289: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues. Nov 27 2019, 4:59 PM

• Bstorm added a parent task: T203254: labstore1004 and labstore1005 high load issues following upgrades. Jan 23 2020, 9:24 PM

Change 566873 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] nfs: puppetize a cloud-vps nfs testbed

https://gerrit.wikimedia.org/r/566873

gerritbot added a project: Patch-For-Review. Jan 23 2020, 9:26 PM

Change 566873 merged by Bstorm:
[operations/puppet@production] nfs: puppetize a cloud-vps nfs testbed

https://gerrit.wikimedia.org/r/566873

Maintenance_bot removed a project: Patch-For-Review. Jan 23 2020, 10:10 PM

Change 567116 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: remove profile from top-level module

https://gerrit.wikimedia.org/r/567116

gerritbot added a project: Patch-For-Review. Jan 24 2020, 6:44 PM

Change 567116 merged by Bstorm:
[operations/puppet@production] labstore: remove profile from top-level module

https://gerrit.wikimedia.org/r/567116

Maintenance_bot removed a project: Patch-For-Review. Jan 24 2020, 8:10 PM

Change 567142 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: finish up making this class work on VMs

https://gerrit.wikimedia.org/r/567142

gerritbot added a project: Patch-For-Review. Jan 24 2020, 8:59 PM

Change 567142 merged by Bstorm:
[operations/puppet@production] labstore: finish up making this class work on VMs

https://gerrit.wikimedia.org/r/567142

Maintenance_bot removed a project: Patch-For-Review. Jan 24 2020, 10:11 PM

Change 567160 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore test: add the last couple ferm rules to let drbd work

https://gerrit.wikimedia.org/r/567160

gerritbot added a project: Patch-For-Review. Jan 24 2020, 11:02 PM

Change 567160 merged by Bstorm:
[operations/puppet@production] cloudstore test: add the last couple ferm rules to let drbd work

https://gerrit.wikimedia.org/r/567160

Maintenance_bot removed a project: Patch-For-Review. Jan 25 2020, 12:10 AM

Change 571821 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: remove dependency on bind mounts

https://gerrit.wikimedia.org/r/571821

gerritbot added a project: Patch-For-Review. Feb 12 2020, 11:03 PM

Change 573422 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: Update the nfs_hostlist script

https://gerrit.wikimedia.org/r/573422

Change 573422 merged by Bstorm:
[operations/puppet@production] cloudstore: Update the nfs_hostlist script

https://gerrit.wikimedia.org/r/573422

• Bstorm mentioned this in T169286: labstore1005 A PCIe link training failure error on boot. Feb 25 2020, 8:08 PM

• Bstorm mentioned this in T112388: Static server returns HTTP 403 Forbidden for valid files in some cases.

Change 571821 merged by Bstorm:
[operations/puppet@production] cloudstore: remove dependency on bind mounts

https://gerrit.wikimedia.org/r/571821

While we now have an improved failover experience with these systems, there are concerns whenever they are rebooted. Both machines have had various warnings and issues in the past (such as T169286: labstore1005 A PCIe link training failure error on boot), and they are quite old. I do not think we should move forward with the upgrade to Debian Stretch, which will require reboots, until we can be sure that we have datacenter support for the process.

Maintenance_bot removed a project: Patch-For-Review. Mar 16 2020, 8:10 PM

• bd808 added a parent task: T169286: labstore1005 A PCIe link training failure error on boot. May 5 2020, 4:36 PM

Upgrading labstore1005 on Thursday this week.

Dzahn unsubscribed. May 20 2020, 10:40 AM

Mentioned in SAL (#wikimedia-operations) [2020-05-21T17:04:24Z] <bstorm_> starting labstore1005 upgrades T224582

Mentioned in SAL (#wikimedia-operations) [2020-05-21T20:44:39Z] <bstorm_> labstore1005 is now running stretch and drbd devices are resyncing after several reboots and some significant effort T224582

Ok, so labstore1005 upgrade notes.

Downtimed the server.
sudo puppet agent --disable "upgrading to stretch [bstorm]"
sudo apt update
sudo apt upgrade
sudo apt dist-upgrade
reboot

At this point, all is good except some odd hangs around DRBD and disk IO.

edit sources.list to point at stretch
wget -O wikimedia-apt-key "https://wikitech.wikimedia.org/w/index.php?title=APT_repository/Stretch-Key&action=raw"
sudo apt-key add wikimedia-apt-key
sudo apt update
sudo apt upgrade
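
The sources.list edit above could be scripted roughly as follows. This is only a sketch: `retarget_sources` is a hypothetical helper, not something from the notes, and it assumes every occurrence of the string "jessie" in the file is a suite name that should become "stretch". Any files under /etc/apt/sources.list.d/ would need the same treatment.

```shell
#!/bin/sh
# Hypothetical helper: point an APT sources file at stretch instead of jessie.
retarget_sources() {
    file="$1"
    # Keep a copy of the jessie configuration in case a rollback is needed.
    cp "$file" "$file.jessie.bak" || return 1
    # Plain substring replacement, so suites such as jessie-updates
    # become stretch-updates as well.
    sed -i 's/jessie/stretch/g' "$file"
}

# Example (as root):
#   retarget_sources /etc/apt/sources.list
```

Review the result by hand before running apt update, since a blanket replacement can also touch comments or unrelated strings.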

At this point there's a conflict with the odd redirect done by nfsd-ldap (which is our WMF-custom package for applying LDAP only to the nfs daemon).
This was fixed by:

uninstalling nfsd-ldap (failing)
rm /usr/sbin/rpc.mountd.real and rm /usr/sbin/rpc.mountd
apt-get install --reinstall nfs-kernel-server
apt-get install nfsd-ldap

Continued with a dist-upgrade and reboot.

sudo apt dist-upgrade
sudo rm /opt/puppetlabs/facter/cache/cached_facts/operating\ system
reboot
enable puppet and run puppet

Now, the device links for LVM volumes were missing, which meant DRBD was broken. I tried many things including invalidating the volumes to force a full resync. The system thought it was "diskless".

The solution here was:

Reboots to get udev to recreate the links correctly (which worked for misc, but not for tools).
Delete the 44% data tools snapshot (there must have been a large deletion or something from tools recently), and then another reboot.

DRBD is syncing and all is well except that I managed to get the PCI-E link training failure from T169286 on one of the boots, which seems to have no significant impact other than being scary.
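
A small check like the one below could confirm after each reboot whether the LVM device links came back before looking at DRBD itself. It is a sketch only: `missing_links` is a hypothetical helper, and the volume paths in the example are placeholders rather than the real names on these hosts.

```shell
#!/bin/sh
# Hypothetical helper: report every expected device path that does not
# exist. A dangling symlink also counts as missing, because -e follows links.
missing_links() {
    status=0
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "missing: $path"
            status=1
        fi
    done
    return $status
}

# Example with placeholder volume paths (not the real ones):
#   missing_links /dev/misc/misc-project /dev/tools/tools-project
```

The function returns non-zero if anything is missing, so it can gate later steps in a script.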

• Bstorm updated the task description. May 21 2020, 8:53 PM

Change 597868 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: the monitor_systemd_service module doesn't work with drbd

https://gerrit.wikimedia.org/r/597868

gerritbot added a project: Patch-For-Review. May 21 2020, 9:41 PM

Change 597873 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: current setup doesn't allow check_call against exportfs

https://gerrit.wikimedia.org/r/597873

Change 597868 merged by Bstorm:
[operations/puppet@production] labstore: the monitor_systemd_service module doesn't work with drbd

https://gerrit.wikimedia.org/r/597868

Change 597873 merged by Bstorm:
[operations/puppet@production] labstore: current setup doesn't allow check_call against exportfs

https://gerrit.wikimedia.org/r/597873

Maintenance_bot removed a project: Patch-For-Review. May 21 2020, 10:10 PM

Mentioned in SAL (#wikimedia-operations) [2020-05-25T07:36:25Z] <moritzm> installed linux-imageamd64 on labstore (current meta package for kernels following the Stretch update) T224582

Mentioned in SAL (#wikimedia-operations) [2020-05-25T07:36:39Z] <moritzm> installed linux-image-amd64 on labstore1005 (current meta package for kernels following the Stretch update) T224582

One note for labstore1004: the kernel meta package changed between jessie and stretch. Jessie ships 3.16 by default, but we were using a custom 4.9 kernel backport, which used an internal meta package called "linux-meta". With Stretch we just use the default kernel shipped in Debian, along with the default Debian meta packages. I have installed linux-image-amd64 on labstore1005. It's almost the same kernel: both are 4.9.210, and the only difference is that the ~bpo8 kernel is built with an older GCC. We don't need to reboot 1005 again; this can simply align with the next maintenance on it.
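
To tell which build a host is running, one option is to look for the ~bpo8 marker in the installed kernel package version. The sketch below assumes the marker appears in the version string as described above; `is_backport_kernel` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: succeed when a package version string carries the
# jessie-backports marker (~bpo8).
is_backport_kernel() {
    case "$1" in
        *"~bpo8"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: check the package that provides the running kernel.
#   ver=$(dpkg-query -W -f='${Version}' "linux-image-$(uname -r)")
#   if is_backport_kernel "$ver"; then echo "still on the ~bpo8 build"; fi
```

The dpkg-query call reads the version from the package database; the helper itself is just a string match.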

Mentioned in SAL (#wikimedia-operations) [2020-06-11T16:10:03Z] <bstorm_> downtimed labstore1004 for upgrades T224582

Ah ok. Good to know :)

Mentioned in SAL (#wikimedia-operations) [2020-06-11T16:12:52Z] <bstorm_> downtimed labstore1005 for upgrades on 1004 since that will alert as well T224582

Mentioned in SAL (#wikimedia-cloud) [2020-06-11T16:15:54Z] <bstorm_> failing over NFS for labstore1004 to labstore1005 T224582

Mentioned in SAL (#wikimedia-operations) [2020-06-11T16:36:36Z] <bstorm_> rebooting labstore1004 for upgrades T224582

Mentioned in SAL (#wikimedia-operations) [2020-06-11T16:49:06Z] <bstorm_> doing stretch upgrade for labstore1004 T224582

Mentioned in SAL (#wikimedia-operations) [2020-06-11T17:12:26Z] <bstorm_> reboot for stretch upgrade on labstore1004 T224582

Mentioned in SAL (#wikimedia-cloud) [2020-06-11T17:17:36Z] <bstorm_> failing NFS back to labstore1004 to complete the upgrade process T224582

Mentioned in SAL (#wikimedia-cloud) [2020-06-11T17:22:15Z] <bstorm_> delaying failback labstore1004 for drive syncs T224582

Marking off the server since it is now running stretch (and the right kernel). Just finishing up work to get the cluster back to rights.

[bstorm@labstore1004]:~ $ sudo /usr/sbin/drbd-overview
 1:test/0   Connected  Secondary/Primary UpToDate/UpToDate
 3:misc/0   Connected  Secondary/Primary UpToDate/UpToDate
 4:tools/0  SyncTarget Secondary/Primary Inconsistent/UpToDate
	[=========>..........] sync'ed: 50.2% (2520/5056)M

Once that is resynced, I can fail it all back.
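
Waiting for the resync could be automated by parsing output in the format shown above. This is a sketch, assuming the disk state is always the fourth whitespace-separated column of each resource line; `all_uptodate` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: read drbd-overview output on stdin and succeed only
# when every resource line reports UpToDate/UpToDate as its disk state.
all_uptodate() {
    awk '
        # Resource lines start with "<minor>:<name>/<volume>"; this skips
        # the progress-bar lines printed during a resync.
        $1 ~ /^[0-9]+:/ {
            seen = 1
            if ($4 != "UpToDate/UpToDate") {
                print "not in sync: " $1
                bad = 1
            }
        }
        # Fail if any resource is out of sync or no resources were listed.
        END { if (bad || !seen) exit 1 }
    '
}

# Example: poll once a minute until the failback is safe.
#   until sudo /usr/sbin/drbd-overview | all_uptodate; do sleep 60; done
```

Treating empty input as a failure avoids reporting "in sync" when drbd-overview itself prints nothing.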

Mentioned in SAL (#wikimedia-cloud) [2020-06-11T19:19:29Z] <bstorm_> proceeding with failback to labstore1004 now that DRBD devices are consistent T224582

Done.

Change 604857 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: fix the failover process

https://gerrit.wikimedia.org/r/604857

gerritbot added a project: Patch-For-Review. Jun 11 2020, 8:40 PM

Change 604857 merged by Bstorm:
[operations/puppet@production] labstore: fix the failover process

https://gerrit.wikimedia.org/r/604857

Maintenance_bot removed a project: Patch-For-Review. Jun 11 2020, 10:11 PM

• Bstorm closed subtask T253353: Add cluster-awareness to nfs-exportd as Resolved. Jun 22 2020, 10:55 PM

• Bstorm removed a subtask: T283385: Upgrade labstore1004, labstore1005, cloudstore1008 and cloudstore1009 to Debian Buster. May 21 2021, 5:44 PM
