T207377 Reboot WMCS servers for L1TF

	Subject	Repo	Branch	Lines +/-
	dumps distribution: fail over dumps to labstore1007 for upgrades	operations/dns	master	+1 -1

		Status	Subtype	Assigned	Task
		Resolved		aborrero	T207377 Reboot WMCS servers for L1TF
		Resolved		aborrero	T209517 Upgrade/reboot labsdb* servers

MoritzMuehlenhoff created this task. Oct 18 2018, 11:14 AM

Restricted Application added a subscriber: Aklapper. Oct 18 2018, 11:14 AM

MoritzMuehlenhoff added projects: SRE, cloud-services-team. Oct 18 2018, 11:15 AM

aborrero subscribed. Oct 18 2018, 11:16 AM

MoritzMuehlenhoff mentioned this in T207378: trusty servers: purge old kernel packages. Oct 18 2018, 11:24 AM

Krenair subscribed. Oct 18 2018, 11:26 AM

Paladox subscribed. Oct 18 2018, 12:02 PM

GTirloni subscribed. Oct 21 2018, 12:16 PM

MoritzMuehlenhoff triaged this task as High priority. Oct 23 2018, 3:10 PM

aborrero edited projects, added cloud-services-team (Kanban); removed cloud-services-team. Nov 12 2018, 5:05 PM

aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

Who in Cloud Services is able to look at this?

In T207377#4741857, @ArielGlenn wrote:

Who in Cloud Services is able to look at this?

We will discuss this in our team meeting today.

aborrero claimed this task. Nov 13 2018, 4:48 PM

aborrero moved this task from Needs discussion to Graveyard on the cloud-services-team (Kanban) board.

Some servers may have newer packages still:

labservices1002:

linux-image-generic/trusty-updates,trusty-security 3.13.0.161.171 amd64 [upgradable from: 3.13.0.160.170]

Should we update all hosts to the latest kernel at the time of reboots? I would think so but I want to confirm if that's against any policies we may have.

It's fine to upgrade the kernel. When I created the task I installed what was recent at the time, and those versions are sufficient to fix L1TF, but it's good to move to a newer kernel for additional bugfixes in any case:

jessie has the most recent 4.9 kernel, no update needed
stretch (package name linux-image-4.9.0-8-amd64) can be upgraded to 4.9.130-2
trusty systems running 3.13 can upgrade linux-image-generic, which will pull in the latest kernel (Ubuntu has a separate kernel package for every kernel update); trusty systems running 4.4 can upgrade linux-image-generic-lts-xenial instead
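
The per-distribution rules above can be written down as a small lookup, for example to drive a reboot checklist. This is only a sketch: the table layout and the function names are hypothetical, and using `apt-get install` to pull in the newer kernel is an assumption about how the upgrade would be applied on each host.

```python
from typing import Optional

# Rules copied from the comment above. The dictionary layout and the
# function names are hypothetical, chosen only for illustration.
KERNEL_PACKAGE = {
    ("jessie", "4.9"): None,  # already on the most recent 4.9 kernel
    ("stretch", "4.9"): "linux-image-4.9.0-8-amd64",
    ("trusty", "3.13"): "linux-image-generic",
    ("trusty", "4.4"): "linux-image-generic-lts-xenial",
}


def kernel_package(distro: str, series: str) -> Optional[str]:
    """Return the package to upgrade, or None if no update is needed."""
    key = (distro, series)
    if key not in KERNEL_PACKAGE:
        raise KeyError(f"no upgrade rule recorded for {distro} {series}")
    return KERNEL_PACKAGE[key]


def upgrade_command(distro: str, series: str) -> Optional[str]:
    """Build an apt-get command line for a host, or None if nothing to do."""
    package = kernel_package(distro, series)
    if package is None:
        return None
    # Assumption: installing the (meta)package by name pulls in the new kernel.
    return f"apt-get install {package}"
```

For a trusty host on the 4.4 series, `upgrade_command("trusty", "4.4")` yields the `linux-image-generic-lts-xenial` install line; for jessie it returns `None` because no update is needed.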

We could start with spare systems and standby servers:

cloudelastic1001.wikimedia.org (stretch) (spare)
cloudelastic1002.wikimedia.org (stretch) (spare)
cloudelastic1003.wikimedia.org (stretch) (spare)
cloudelastic1004.wikimedia.org (stretch) (spare)
cloudcontrol1004.wikimedia.org (jessie)
labcontrol1002.wikimedia.org (trusty 3.13)
labmon1002.eqiad.wmnet (jessie)
labnet1002.eqiad.wmnet (trusty 4.4)

(only adding here those I know for sure)

GTirloni updated the task description. Nov 13 2018, 7:50 PM

GTirloni updated the task description. Nov 14 2018, 12:15 PM

aborrero mentioned this in T209480: labnet1001/labstore1004 combined alert on 2018-11-14. Nov 14 2018, 12:46 PM

Mentioned in SAL (#wikimedia-cloud) [2018-11-14T13:20:36Z] <gtirloni> rebooted labstore2001/labstore2003 after package upgrades (T207377)

Mentioned in SAL (#wikimedia-cloud) [2018-11-14T13:20:50Z] <gtirloni> rebooted labstore2003 after package upgrades (T207377)

Mentioned in SAL (#wikimedia-cloud) [2018-11-14T13:22:59Z] <gtirloni> rebooted labstore2004 after package upgrades (T207377)

GTirloni updated the task description. Nov 14 2018, 1:39 PM

Mentioned in SAL (#wikimedia-cloud) [2018-11-14T13:52:53Z] <gtirloni> rebooted labservices1002 after package upgrades (T207377)

GTirloni updated the task description. Nov 14 2018, 1:58 PM

Mentioned in SAL (#wikimedia-operations) [2018-11-14T17:37:33Z] <arturo> T207377 downtime and reboot cloudnet1004 (cloudnet1003 is the active one already)

Mentioned in SAL (#wikimedia-operations) [2018-11-14T17:53:27Z] <arturo> T207377 downtime and reboot cloudnet1003 (cloudnet1004 is the active one already)

aborrero updated the task description. Nov 14 2018, 6:01 PM

aborrero updated the task description. Nov 15 2018, 11:50 AM

aborrero updated the task description.

I think that by rebooting labmon1001.eqiad.wmnet we will have just a brief gap in metrics/graphs, which is not a big deal. I will reboot it now.

Mentioned in SAL (#wikimedia-operations) [2018-11-15T12:40:44Z] <arturo> T207377 downtime and reboot labmon1001

aborrero updated the task description. Nov 15 2018, 12:41 PM

Scheduled reboot for cloudcontrol1003, cloudservices1003, labcontrol1001 and labservices1001 for next Monday 2018-11-19 at 13:00 UTC. Email sent to cloud-announce.

aborrero updated the task description. Nov 15 2018, 5:04 PM

Mentioned in SAL (#wikimedia-cloud) [2018-11-16T11:57:03Z] <gtirloni> rebooted labpuppetmaster1002 (T207377)

Mentioned in SAL (#wikimedia-cloud) [2018-11-16T12:08:13Z] <gtirloni> rebooted labpuppetmaster1001 (T207377)

GTirloni updated the task description. Nov 16 2018, 12:13 PM

Mentioned in SAL (#wikimedia-operations) [2018-11-19T13:08:08Z] <arturo> T207377 icinga downtime and reboot of cloudcontrol1003 and cloudservices1003

aborrero updated the task description. Nov 19 2018, 1:13 PM

Mentioned in SAL (#wikimedia-operations) [2018-11-19T13:21:20Z] <gtirloni> T207377 icinga downtime and reboot of labcontrol1001 and labservices1001

aborrero updated the task description. Nov 19 2018, 1:25 PM

aborrero updated the task description. Nov 19 2018, 1:33 PM

GTirloni updated the task description. Nov 19 2018, 1:55 PM

Mentioned in SAL (#wikimedia-operations) [2018-11-19T13:55:53Z] <gtirloni> T207377 reboot cloudcontrol1004

aborrero updated the task description. Nov 19 2018, 4:50 PM

hey @MoritzMuehlenhoff aren't cloudvirts affected by this?

See the task description, "For labvirt/cloudvirt I'll create a separate ticket as more steps are necessary." This needs a backport of SSBD support for the qemu version cloudvirt is using and some tests for the level of L1TF mitigation required.
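
As a rough way to see what mitigation level a rebooted host reports, the kernel's own status lines can be read from sysfs. The sketch below assumes a Linux kernel recent enough to expose `/sys/devices/system/cpu/vulnerabilities/`; the function names are hypothetical, and a "Mitigation" line for L1TF can still mention vulnerable SMT, so the full text should be read rather than only the prefix.

```python
from pathlib import Path
from typing import Dict, Iterable

# Standard sysfs location on kernels that report CPU vulnerability status.
SYSFS_VULNERABILITIES = Path("/sys/devices/system/cpu/vulnerabilities")


def mitigation_status(names: Iterable[str] = ("l1tf", "spec_store_bypass"),
                      base: Path = SYSFS_VULNERABILITIES) -> Dict[str, str]:
    """Map each vulnerability name to the kernel's one-line status string."""
    status = {}
    for name in names:
        try:
            status[name] = (base / name).read_text().strip()
        except OSError:
            # Older kernels (or non-Linux systems) do not expose the file.
            status[name] = "unknown (not reported by this kernel)"
    return status


def is_mitigated(status_line: str) -> bool:
    """True for 'Not affected' or 'Mitigation: ...' lines.

    This is a coarse check: details after the prefix (for example the SMT
    state reported for L1TF) are not interpreted here.
    """
    return status_line.startswith(("Not affected", "Mitigation"))
```

Calling `mitigation_status()` on a host returns one entry per file, for example the `l1tf` and `spec_store_bypass` (SSBD) status lines.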

aborrero updated the task description. Nov 21 2018, 9:54 AM

hey @Bstorm any suggestions to handle labstore1006 and labstore1007 reboots? Those are for Dumps, right? cc @ArielGlenn

They are indeed for dumps; one is set up to be the web server, and a reboot of that means that the broader public will notice. The other provides NFS service of dumps to stat100* and to labs vms (includes toolforge I think), and this is a bit finicky. Definitely @Bstorm will want to weigh in on these.

So far, we haven't had a smooth maintenance on those two. They will break toolforge for at least a bit. Otherwise, there is a failover mechanism. They are both NFS and web servers (only one gets the DNS). We may just want to fail over the web functions and reboot in place for the NFS. The failover story is only so useful for the NFS (since it is based on symlinks) for a very short outage. It's intended for taking one down for a longer time.

That should require a simultaneous change to the hiera to refresh the cert and to DNS to change servers. It takes a little bit for a DNS-based failover, naturally, so the schedule should be staggered by a couple days for that.

chasemp subscribed. Nov 26 2018, 5:38 PM

Bstorm updated the task description. Nov 27 2018, 3:27 PM

I'll set the DNS switchover to run over the weekend so all the DNS caches in the world clear up. Once it is on 07, I'll reboot 06 and switch it back for the reboot of 07 on Friday. That should be pretty smooth for the web. The NFS should be ok as long as the reboot goes smoothly and it is announced that it will flake out a bit. We can do the failover, but I don't see much benefit on a straight reboot because the load numbers on clients will climb anyway until it comes back up.

I'll drop "pending" when I actually send the announcement.
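
The web failover order described above, written out as an ordered checklist. This is a sketch only: the function name and the step wording are hypothetical, and it covers just the DNS-based web side, not the NFS service, which is rebooted in place.

```python
from typing import List


def rolling_reboot_plan(primary: str, standby: str) -> List[str]:
    """Ordered steps for rebooting a two-host web pair behind one DNS name."""
    return [
        f"point DNS at {standby}",
        "wait for DNS caches to expire",
        f"reboot {primary}",
        f"point DNS back at {primary}",
        "wait for DNS caches to expire",
        f"reboot {standby}",
    ]


for step in rolling_reboot_plan("labstore1006", "labstore1007"):
    print(step)
```

The point of the ordering is that each host is rebooted only while the DNS name points at its partner.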

Mentioned in SAL (#wikimedia-operations) [2018-11-27T17:23:41Z] <arturo> T207377 icinga downtime labnet1001

aborrero updated the task description. Nov 27 2018, 5:38 PM

Bstorm updated the task description. Nov 28 2018, 10:42 PM

aborrero closed subtask T209517: Upgrade/reboot labsdb* servers as Resolved. Nov 29 2018, 12:35 PM

So labstore1007 has a current certificate for dumps.wikimedia.org:

Validity
    Not Before: Nov 28 13:59:40 2018 GMT
    Not After : Feb 26 13:59:40 2019 GMT
Subject: CN=dumps.wikimedia.org

It looks like that's because it's running some new certcentral stuff. That appears to be the same cert its partner has, which makes failover really easy. I somehow wasn't sure if that new cert management stuff was done.
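
To double-check that a certificate like the one quoted above is still valid on the day of a failover, the two validity dates can be parsed and compared. This is a minimal sketch, assuming the dates are in the text format shown above (as printed by tools such as `openssl x509 -text`) and an English/C locale for month names; the helper names are hypothetical.

```python
from datetime import datetime, timezone

# Date layout of the "Not Before" / "Not After" lines quoted above.
OPENSSL_DATE_FORMAT = "%b %d %H:%M:%S %Y GMT"

NOT_BEFORE = "Nov 28 13:59:40 2018 GMT"
NOT_AFTER = "Feb 26 13:59:40 2019 GMT"


def parse_openssl_date(text: str) -> datetime:
    """Parse a validity date into a timezone-aware UTC datetime."""
    parsed = datetime.strptime(text.strip(), OPENSSL_DATE_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def is_valid_at(not_before: str, not_after: str, when: datetime) -> bool:
    """True if `when` falls inside the certificate's validity window."""
    return parse_openssl_date(not_before) <= when <= parse_openssl_date(not_after)


# Example: check the window against the current time.
print(is_valid_at(NOT_BEFORE, NOT_AFTER, datetime.now(timezone.utc)))
```

The same comparison can be run against a planned maintenance date instead of the current time by passing a different `when`.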

So, I'll just fail over the DNS then.

Change 476903 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] dumps distribution: fail over dumps to labstore1007 for upgrades

https://gerrit.wikimedia.org/r/476903

gerritbot added a project: Patch-For-Review. Nov 30 2018, 6:19 PM

Change 476903 merged by Bstorm:
[operations/dns@master] dumps distribution: fail over dumps to labstore1007 for upgrades

https://gerrit.wikimedia.org/r/476903

Mentioned in SAL (#wikimedia-operations) [2018-12-03T17:09:41Z] <bstorm_> T207377 reboot labstore1006 for upgrades

Bstorm updated the task description. Dec 3 2018, 5:22 PM

Mentioned in SAL (#wikimedia-operations) [2018-12-07T17:17:09Z] <bstorm_> T207377 rebooted labstore1007 for kernel upgrades

Bstorm updated the task description. Dec 7 2018, 5:17 PM

Thanks @Bstorm and @GTirloni, you both did most of the heavy work :-)

I'm closing the task now as done, since labstore1003.eqiad.wmnet is going to be handled differently.