Reboot WMCS servers for L1TF
Closed, Resolved · Public

Description

These servers need a reboot for the L1TF vulnerabilities; they've all been upgraded to fixed kernels along with updated microcode. For labvirt/cloudvirt I'll create a separate ticket as more steps are necessary.

  • cloudcontrol1003.wikimedia.org (jessie) scheduled 2018-11-19 13:00 UTC with announcement
  • cloudcontrol1004.wikimedia.org (jessie)
  • cloudnet1003.eqiad.wmnet (jessie)
  • cloudnet1004.eqiad.wmnet (jessie)
  • cloudservices1003.wikimedia.org (jessie) scheduled 2018-11-19 13:00 UTC with announcement
  • cloudservices1004.wikimedia.org (jessie)
  • cloudelastic1001.wikimedia.org (stretch) (spare)
  • cloudelastic1002.wikimedia.org (stretch) (spare)
  • cloudelastic1003.wikimedia.org (stretch) (spare)
  • cloudelastic1004.wikimedia.org (stretch) (spare)
  • labstore2001.codfw.wmnet (jessie) (spare)
  • labstore2002.codfw.wmnet (jessie) (spare)
  • labstore2003.codfw.wmnet (jessie)
  • labstore2004.codfw.wmnet (jessie)
  • labcontrol1001.wikimedia.org (trusty 3.13) scheduled 2018-11-19 13:00 UTC with announcement
  • labcontrol1002.wikimedia.org (trusty 3.13)
  • labmon1001.eqiad.wmnet (jessie)
  • labmon1002.eqiad.wmnet (jessie)
  • labnet1001.eqiad.wmnet (trusty 4.4) scheduled 2018-11-27 17:30 UTC with announcement
  • labnet1002.eqiad.wmnet (trusty 4.4)
  • labpuppetmaster1001.wikimedia.org (jessie)
  • labpuppetmaster1002.wikimedia.org (jessie)
  • labservices1001.wikimedia.org (trusty 3.13) scheduled 2018-11-19 13:00 UTC with announcement
  • labservices1002.wikimedia.org (trusty 3.13)
  • labstore1003.eqiad.wmnet (trusty 3.13) (replacement WIP, probably best to skip the reboot and directly move to the new replacement)
  • labstore1006.wikimedia.org (jessie) scheduled 2018-12-03 17:00 UTC with announcement
  • labstore1007.wikimedia.org (jessie) scheduled 2018-12-07 17:00 UTC with announcement

(labsdbXXXX servers in subtasks for additional coordination with other folks)
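
After a reboot, one way to confirm that the running kernel reports an L1TF mitigation is to read the upstream sysfs entry for the vulnerability. The following is a minimal sketch, not a procedure used in this task; the function names are hypothetical, and older or distro-patched kernels may word the status differently or lack the file entirely.

```python
from pathlib import Path

# Sysfs file exposed by kernels that carry the L1TF fixes.
L1TF_SYSFS = Path("/sys/devices/system/cpu/vulnerabilities/l1tf")


def classify_l1tf(status: str) -> str:
    """Map the text of the l1tf sysfs file to a coarse state (hypothetical helper)."""
    status = status.strip()
    if status.startswith("Not affected"):
        return "not-affected"
    if status.startswith("Mitigation"):
        return "mitigated"
    if status.startswith("Vulnerable"):
        return "vulnerable"
    return "unknown"


def l1tf_state(path: Path = L1TF_SYSFS) -> str:
    """Return the state for this host, or "no-sysfs-entry" if the file is missing."""
    try:
        return classify_l1tf(path.read_text())
    except FileNotFoundError:
        # Kernels without the L1TF patches do not expose this file.
        return "no-sysfs-entry"


if __name__ == "__main__":
    print(l1tf_state())
```

A host that still prints "no-sysfs-entry" or "vulnerable" after its reboot is probably still running an old kernel and worth a second look.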

Restricted Application added a subscriber: Aklapper. Oct 18 2018, 11:14 AM
MoritzMuehlenhoff triaged this task as High priority. Oct 23 2018, 3:10 PM

Who in Cloud Services is able to look at this?

We will discuss this in our team meeting today.

aborrero claimed this task. Nov 13 2018, 4:48 PM
aborrero moved this task from Needs discussion to Graveyard on the cloud-services-team (Kanban) board.

Some servers may have newer packages still:

labservices1002:

linux-image-generic/trusty-updates,trusty-security 3.13.0.161.171 amd64 [upgradable from: 3.13.0.160.170]

Should we update all hosts to the latest kernel at the time of reboots? I would think so but I want to confirm if that's against any policies we may have.
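
To spot pending kernel updates like the one above across many hosts, the "upgradable from" line can be parsed mechanically. This is a rough sketch that assumes the line format shown for labservices1002; `Upgradable` and `parse_upgradable` are hypothetical names, not part of any existing tooling.

```python
import re
from typing import NamedTuple, Optional


class Upgradable(NamedTuple):
    package: str
    candidate: str   # version that would be installed
    arch: str
    installed: str   # version currently installed


# Matches lines of the form:
#   name/suites candidate-version arch [upgradable from: installed-version]
_LINE = re.compile(
    r"^(?P<package>[^/\s]+)/\S+\s+"
    r"(?P<candidate>\S+)\s+"
    r"(?P<arch>\S+)\s+"
    r"\[upgradable from: (?P<installed>[^\]]+)\]$"
)


def parse_upgradable(line: str) -> Optional[Upgradable]:
    """Parse one 'upgradable from' line; return None if it does not match."""
    match = _LINE.match(line.strip())
    return Upgradable(**match.groupdict()) if match else None


if __name__ == "__main__":
    sample = (
        "linux-image-generic/trusty-updates,trusty-security "
        "3.13.0.161.171 amd64 [upgradable from: 3.13.0.160.170]"
    )
    print(parse_upgradable(sample))
```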

It's fine to upgrade the kernel. I installed what was the most recent version when I created the task, and those versions are sufficient to fix L1TF, but it's good to move to a newer kernel for additional bugfixes in any case:

  • jessie has the most recent 4.9 kernel, no update needed
  • stretch (package name linux-image-4.9.0-8-amd64) can be upgraded to 4.9.130-2
  • trusty systems running 3.13 can upgrade linux-image-generic, which will pull in the latest kernel (Ubuntu ships a separate kernel package for every kernel update); systems running 4.4 can upgrade linux-image-generic-lts-xenial
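
The per-distribution guidance above can be captured as a small lookup table. This is only an illustrative sketch: the package names come from the list above, while `KERNEL_PACKAGE` and `upgrade_command` are hypothetical names and the `apt-get` invocation is one assumed way to apply the upgrade.

```python
from typing import List, Optional

# (distro, running kernel series) -> package to upgrade before the reboot.
# None means the host already has the most recent kernel for its series.
KERNEL_PACKAGE = {
    ("jessie", "4.9"): None,
    ("stretch", "4.9"): "linux-image-4.9.0-8-amd64",
    ("trusty", "3.13"): "linux-image-generic",
    ("trusty", "4.4"): "linux-image-generic-lts-xenial",
}


def upgrade_command(distro: str, series: str) -> Optional[List[str]]:
    """Return an apt-get command for the host, or None if no update is needed."""
    key = (distro, series)
    if key not in KERNEL_PACKAGE:
        raise ValueError(f"no kernel guidance for {distro} running {series}")
    package = KERNEL_PACKAGE[key]
    if package is None:
        return None
    return ["apt-get", "install", "--yes", package]


if __name__ == "__main__":
    print(upgrade_command("trusty", "4.4"))
```

Raising on an unknown combination, rather than guessing, keeps a host with an unexpected kernel from being silently skipped.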

We could start with spare systems and standby servers:

  • cloudelastic1001.wikimedia.org (stretch) (spare)
  • cloudelastic1002.wikimedia.org (stretch) (spare)
  • cloudelastic1003.wikimedia.org (stretch) (spare)
  • cloudelastic1004.wikimedia.org (stretch) (spare)
  • cloudcontrol1004.wikimedia.org (jessie)
  • labcontrol1002.wikimedia.org (trusty 3.13)
  • labmon1002.eqiad.wmnet (jessie)
  • labnet1002.eqiad.wmnet (trusty 4.4)

(only adding here those I know for sure)

GTirloni updated the task description. Nov 13 2018, 7:50 PM
GTirloni updated the task description. Nov 14 2018, 12:15 PM

Mentioned in SAL (#wikimedia-cloud) [2018-11-14T13:20:36Z] <gtirloni> rebooted labstore2001/labstore2003 after package upgrades (T207377)

Mentioned in SAL (#wikimedia-cloud) [2018-11-14T13:20:50Z] <gtirloni> rebooted labstore2003 after package upgrades (T207377)

Mentioned in SAL (#wikimedia-cloud) [2018-11-14T13:22:59Z] <gtirloni> rebooted labstore2004 after package upgrades (T207377)

GTirloni updated the task description. Nov 14 2018, 1:39 PM

Mentioned in SAL (#wikimedia-cloud) [2018-11-14T13:52:53Z] <gtirloni> rebooted labservices1002 after package upgrades (T207377)

GTirloni updated the task description. Nov 14 2018, 1:58 PM

Mentioned in SAL (#wikimedia-operations) [2018-11-14T17:37:33Z] <arturo> T207377 downtime and reboot cloudnet1004 (cloudnet1003 is the active one already)

Mentioned in SAL (#wikimedia-operations) [2018-11-14T17:53:27Z] <arturo> T207377 downtime and reboot cloudnet1003 (cloudnet1004 is the active one already)

aborrero updated the task description. Nov 14 2018, 6:01 PM
aborrero updated the task description. Thu, Nov 15, 11:50 AM
aborrero updated the task description.

I think that by rebooting labmon1001.eqiad.wmnet we will have just a brief gap in metrics/graphs, which is not a big deal. I will reboot it now.

Mentioned in SAL (#wikimedia-operations) [2018-11-15T12:40:44Z] <arturo> T207377 downtime and reboot labmon1001

aborrero updated the task description. Thu, Nov 15, 12:41 PM

Scheduled reboot for cloudcontrol1003, cloudservices1003, labcontrol1001 and labservices1001 for next Monday 2018-11-19 at 13:00 UTC. Email sent to cloud-announce.

aborrero updated the task description. Thu, Nov 15, 5:04 PM

Mentioned in SAL (#wikimedia-cloud) [2018-11-16T11:57:03Z] <gtirloni> rebooted labpuppetmaster1002 (T207377)

Mentioned in SAL (#wikimedia-cloud) [2018-11-16T12:08:13Z] <gtirloni> rebooted labpuppetmaster1001 (T207377)

GTirloni updated the task description. Fri, Nov 16, 12:13 PM

Mentioned in SAL (#wikimedia-operations) [2018-11-19T13:08:08Z] <arturo> T207377 icinga downtime and reboot of cloudcontrol1003 and cloudservices1003

aborrero updated the task description. Mon, Nov 19, 1:13 PM

Mentioned in SAL (#wikimedia-operations) [2018-11-19T13:21:20Z] <gtirloni> T207377 icinga downtime and reboot of labcontrol1001 and labservices1001

aborrero updated the task description. Mon, Nov 19, 1:25 PM
aborrero updated the task description. Mon, Nov 19, 1:33 PM
GTirloni updated the task description. Mon, Nov 19, 1:55 PM

Mentioned in SAL (#wikimedia-operations) [2018-11-19T13:55:53Z] <gtirloni> T207377 reboot cloudcontrol1004

aborrero updated the task description. Mon, Nov 19, 4:50 PM

hey @MoritzMuehlenhoff aren't cloudvirts affected by this?

See the task description, "For labvirt/cloudvirt I'll create a separate ticket as more steps are necessary." This needs a backport of SSBD support for the qemu version cloudvirt is using and some tests for the level of L1TF mitigation required.

aborrero updated the task description. Wed, Nov 21, 9:54 AM

hey @Bstorm any suggestions to handle labstore1006 and labstore1007 reboots? Those are for Dumps, right? cc @ArielGlenn

They are indeed for dumps; one is set up to be the web server, and a reboot of that means that the broader public will notice. The other provides NFS service of dumps to stat100* and to labs vms (includes toolforge I think), and this is a bit finicky. Definitely @Bstorm will want to weigh in on these.

So far, we haven't had a smooth maintenance on those two. They will break toolforge for at least a bit. Otherwise, there is a failover mechanism. They are both NFS and web servers (only one gets the DNS). We may just want to fail over the web functions and reboot in place for the NFS. For the NFS, the failover (which is based on symlinks) is of limited use for a very short outage; it's intended for taking one server down for a longer time.

That should require a simultaneous change to the hiera to refresh the cert and to DNS to change servers. It takes a little bit for a DNS-based failover, naturally, so the schedule should be staggered by a couple days for that.

Bstorm updated the task description. Tue, Nov 27, 3:27 PM

I'll set the DNS switchover to run over the weekend so all the DNS caches in the world clear up. Once it is on 07, I'll reboot 06 and switch it back for the reboot of 07 on Friday. That should be pretty smooth for the web. The NFS should be ok as long as the reboot goes smoothly and it is announced that it will flake out a bit. We can do the failover, but I don't see much benefit on a straight reboot because the load numbers on clients will climb anyway until it comes back up.

I'll drop "pending" when I actually send the announcement.

Mentioned in SAL (#wikimedia-operations) [2018-11-27T17:23:41Z] <arturo> T207377 icinga downtime labnet1001

aborrero updated the task description. Tue, Nov 27, 5:38 PM
Bstorm updated the task description. Wed, Nov 28, 10:42 PM

So labstore1007 has a current certificate for dumps.wikimedia.org:

Validity
    Not Before: Nov 28 13:59:40 2018 GMT
    Not After : Feb 26 13:59:40 2019 GMT
Subject: CN=dumps.wikimedia.org

It looks like that's because it's running some new certcentral stuff. That appears to be the same cert that the partner has, which makes failover really easy. I somehow wasn't sure if that new cert management stuff was done.

So, I'll just fail over the DNS then.
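
As a sanity check on the certificate dates quoted above, the validity window can be computed and compared against the planned reboot dates. This is a minimal sketch; the helper names are hypothetical, and the timestamp format is assumed to be the one shown in the certificate dump (month names are parsed with the process locale, so an English or C locale is assumed).

```python
from datetime import datetime, timedelta, timezone

# Format of the "Not Before" / "Not After" fields as quoted above.
CERT_TIME_FORMAT = "%b %d %H:%M:%S %Y GMT"


def parse_cert_time(text: str) -> datetime:
    """Parse a certificate validity timestamp into an aware UTC datetime."""
    parsed = datetime.strptime(text.strip(), CERT_TIME_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def validity_window(not_before: str, not_after: str) -> timedelta:
    """Return the length of the certificate's validity period."""
    return parse_cert_time(not_after) - parse_cert_time(not_before)


def is_valid_at(when: datetime, not_before: str, not_after: str) -> bool:
    """Check whether the certificate is valid at the given moment."""
    return parse_cert_time(not_before) <= when <= parse_cert_time(not_after)


if __name__ == "__main__":
    nb = "Nov 28 13:59:40 2018 GMT"
    na = "Feb 26 13:59:40 2019 GMT"
    print(validity_window(nb, na).days)  # 90
    reboot = datetime(2018, 12, 7, 17, 0, tzinfo=timezone.utc)
    print(is_valid_at(reboot, nb, na))   # True
```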

Change 476903 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] dumps distribution: fail over dumps to labstore1007 for upgrades

https://gerrit.wikimedia.org/r/476903

Change 476903 merged by Bstorm:
[operations/dns@master] dumps distribution: fail over dumps to labstore1007 for upgrades

https://gerrit.wikimedia.org/r/476903

Mentioned in SAL (#wikimedia-operations) [2018-12-03T17:09:41Z] <bstorm_> T207377 reboot labstore1006 for upgrades

Bstorm updated the task description. Mon, Dec 3, 5:22 PM

Mentioned in SAL (#wikimedia-operations) [2018-12-07T17:17:09Z] <bstorm_> T207377 rebooted labstore1007 for kernel upgrades

Bstorm updated the task description. Fri, Dec 7, 5:17 PM
aborrero closed this task as Resolved. Mon, Dec 10, 10:22 AM

Thanks @Bstorm and @GTirloni, you both did most of the heavy work :-)

I'm closing the task now as done, since labstore1003.eqiad.wmnet is going to be handled differently.