Reboot WMCS servers for L1TF
Closed, Resolved (Public)

Description

These servers need a reboot for the L1TF vulnerabilities; they have all been upgraded to fixed kernels along with updated microcode. For labvirt/cloudvirt I'll create a separate ticket, as more steps are necessary.

  • cloudcontrol1003.wikimedia.org (jessie) scheduled 2018-11-19 13:00 UTC with announcement
  • cloudcontrol1004.wikimedia.org (jessie)
  • cloudnet1003.eqiad.wmnet (jessie)
  • cloudnet1004.eqiad.wmnet (jessie)
  • cloudservices1003.wikimedia.org (jessie) scheduled 2018-11-19 13:00 UTC with announcement
  • cloudservices1004.wikimedia.org (jessie)
  • cloudelastic1001.wikimedia.org (stretch) (spare)
  • cloudelastic1002.wikimedia.org (stretch) (spare)
  • cloudelastic1003.wikimedia.org (stretch) (spare)
  • cloudelastic1004.wikimedia.org (stretch) (spare)
  • labstore2001.codfw.wmnet (jessie) (spare)
  • labstore2002.codfw.wmnet (jessie) (spare)
  • labstore2003.codfw.wmnet (jessie)
  • labstore2004.codfw.wmnet (jessie)
  • labcontrol1001.wikimedia.org (trusty 3.13) scheduled 2018-11-19 13:00 UTC with announcement
  • labcontrol1002.wikimedia.org (trusty 3.13)
  • labmon1001.eqiad.wmnet (jessie)
  • labmon1002.eqiad.wmnet (jessie)
  • labnet1001.eqiad.wmnet (trusty 4.4) scheduled 2018-11-27 17:30 UTC with announcement
  • labnet1002.eqiad.wmnet (trusty 4.4)
  • labpuppetmaster1001.wikimedia.org (jessie)
  • labpuppetmaster1002.wikimedia.org (jessie)
  • labservices1001.wikimedia.org (trusty 3.13) scheduled 2018-11-19 13:00 UTC with announcement
  • labservices1002.wikimedia.org (trusty 3.13)
  • labstore1003.eqiad.wmnet (trusty 3.13) (replacement WIP, probably best to skip the reboot and directly move to the new replacement)
  • labstore1006.wikimedia.org (jessie) scheduled 2018-12-03 17:00 UTC with announcement
  • labstore1007.wikimedia.org (jessie) scheduled 2018-12-07 17:00 UTC with announcement

(labsdbXXXX servers in subtasks for additional coordination with other folks)

Event Timeline

Who in Cloud Services is able to look at this?

We will discuss this in our team meeting today.

Some servers may have newer packages still:

labservices1002:

linux-image-generic/trusty-updates,trusty-security 3.13.0.161.171 amd64 [upgradable from: 3.13.0.160.170]

Should we update all hosts to the latest kernel at the time of reboots? I would think so but I want to confirm if that's against any policies we may have.
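The line quoted above appears to be in the format printed by `apt list --upgradable`. As a minimal sketch, assuming that format, the package name and the installed and candidate versions can be pulled out with a regular expression. The function name `parse_upgradable` and the pattern are illustrative and not part of any tooling mentioned in this task.

```python
import re

# Pattern for one "upgradable" line in the format quoted above:
#   <package>/<suite>[,<suite>...] <candidate> <arch> [upgradable from: <installed>]
# This is an assumption based on that single sample line.
UPGRADABLE_RE = re.compile(
    r"^(?P<package>[^/\s]+)/(?P<suites>\S+)\s+"
    r"(?P<candidate>\S+)\s+(?P<arch>\S+)\s+"
    r"\[upgradable from: (?P<installed>[^\]]+)\]$"
)


def parse_upgradable(line):
    """Return a dict describing the pending upgrade, or None if the line does not match."""
    match = UPGRADABLE_RE.match(line.strip())
    if match is None:
        return None
    info = match.groupdict()
    # The suites are comma-separated, e.g. "trusty-updates,trusty-security".
    info["suites"] = info["suites"].split(",")
    return info


line = (
    "linux-image-generic/trusty-updates,trusty-security "
    "3.13.0.161.171 amd64 [upgradable from: 3.13.0.160.170]"
)
pending = parse_upgradable(line)
print(pending["package"], pending["installed"], "->", pending["candidate"])
```

Lines that do not match the pattern return `None`, so a header line in the command output would simply be skipped.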

It's fine to upgrade the kernel. I installed what was recent when I created the task, and those versions are sufficient to fix L1TF, but it's good to move to a newer kernel for additional bugfixes in any case:

  • jessie has the most recent 4.9 kernel, no update needed
  • stretch (package name linux-image-4.9.0-8-amd64) can be upgraded to 4.9.130-2
  • trusty systems running 3.13 can upgrade linux-image-generic, which will pull in the latest kernel (Ubuntu has a separate kernel package for every kernel update); systems running 4.4 can upgrade linux-image-generic-lts-xenial
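The per-distribution rules above can be written down as a small lookup table. This is only a sketch of the list in this comment: the package names come from the comment itself, while the table `KERNEL_PACKAGE` and the helper `kernel_package` are hypothetical names introduced for illustration.

```python
# (distribution, kernel series) -> package to upgrade, per the list above.
# None means no update is needed before the reboot.
KERNEL_PACKAGE = {
    ("jessie", "4.9"): None,
    ("stretch", "4.9"): "linux-image-4.9.0-8-amd64",
    ("trusty", "3.13"): "linux-image-generic",
    ("trusty", "4.4"): "linux-image-generic-lts-xenial",
}


def kernel_package(distro, series):
    """Return the package to upgrade, or None if the host is already current."""
    key = (distro, series)
    if key not in KERNEL_PACKAGE:
        # Fail loudly for combinations the comment does not cover.
        raise ValueError(f"no upgrade rule for {distro} running {series}")
    return KERNEL_PACKAGE[key]


for distro, series in [("jessie", "4.9"), ("trusty", "3.13"), ("trusty", "4.4")]:
    package = kernel_package(distro, series)
    print(distro, series, "->", package or "no update needed")
```

Raising on unknown combinations, rather than returning `None`, keeps "no update needed" distinct from "not covered by the list".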

We could start with spare systems and standby servers:

  • cloudelastic1001.wikimedia.org (stretch) (spare)
  • cloudelastic1002.wikimedia.org (stretch) (spare)
  • cloudelastic1003.wikimedia.org (stretch) (spare)
  • cloudelastic1004.wikimedia.org (stretch) (spare)
  • cloudcontrol1004.wikimedia.org (jessie)
  • labcontrol1002.wikimedia.org (trusty 3.13)
  • labmon1002.eqiad.wmnet (jessie)
  • labnet1002.eqiad.wmnet (trusty 4.4)

(only adding here those I know for sure)

Mentioned in SAL (#wikimedia-cloud) [2018-11-14T13:20:36Z] <gtirloni> rebooted labstore2001/labstore2003 after package upgrades (T207377)

Mentioned in SAL (#wikimedia-cloud) [2018-11-14T13:20:50Z] <gtirloni> rebooted labstore2003 after package upgrades (T207377)

Mentioned in SAL (#wikimedia-cloud) [2018-11-14T13:22:59Z] <gtirloni> rebooted labstore2004 after package upgrades (T207377)

Mentioned in SAL (#wikimedia-cloud) [2018-11-14T13:52:53Z] <gtirloni> rebooted labservices1002 after package upgrades (T207377)

Mentioned in SAL (#wikimedia-operations) [2018-11-14T17:37:33Z] <arturo> T207377 downtime and reboot cloudnet1004 (cloudnet1003 is the active one already)

Mentioned in SAL (#wikimedia-operations) [2018-11-14T17:53:27Z] <arturo> T207377 downtime and reboot cloudnet1003 (cloudnet1004 is the active one already)

aborrero updated the task description.

I think that by rebooting labmon1001.eqiad.wmnet we will have just a brief gap in metrics/graphs, which is not a big deal. I will reboot it now.

Mentioned in SAL (#wikimedia-operations) [2018-11-15T12:40:44Z] <arturo> T207377 downtime and reboot labmon1001

Scheduled reboot for cloudcontrol1003, cloudservices1003, labcontrol1001 and labservices1001 for next Monday 2018-11-19 at 13:00 UTC. Email sent to cloud-announce.

Mentioned in SAL (#wikimedia-cloud) [2018-11-16T11:57:03Z] <gtirloni> rebooted labpuppetmaster1002 (T207377)

Mentioned in SAL (#wikimedia-cloud) [2018-11-16T12:08:13Z] <gtirloni> rebooted labpuppetmaster1001 (T207377)

Mentioned in SAL (#wikimedia-operations) [2018-11-19T13:08:08Z] <arturo> T207377 icinga downtime and reboot of cloudcontrol1003 and cloudservices1003

Mentioned in SAL (#wikimedia-operations) [2018-11-19T13:21:20Z] <gtirloni> T207377 icinga downtime and reboot of labcontrol1001 and labservices1001

See the task description, "For labvirt/cloudvirt I'll create a separate ticket as more steps are necessary." This needs a backport of SSBD support for the qemu version cloudvirt is using and some tests for the level of L1TF mitigation required.

Hey @Bstorm, any suggestions for handling the labstore1006 and labstore1007 reboots? Those are for Dumps, right? cc @ArielGlenn

They are indeed for dumps; one is set up to be the web server, and a reboot of that one means that the broader public will notice. The other provides NFS service of dumps to stat100* and to labs VMs (includes toolforge I think), and this is a bit finicky. Definitely @Bstorm will want to weigh in on these.

So far, we haven't had a smooth maintenance on those two. They will break toolforge for at least a bit. Otherwise, there is a failover mechanism. They are both NFS and web servers (only one gets the DNS). We may just want to fail over the web functions and reboot in place for the NFS. For a very short outage, the failover is only so useful for the NFS (since it is based on symlinks); it's intended for taking one server down for a longer time.

That should require a simultaneous change to the hiera to refresh the cert and to DNS to change servers. It takes a little bit for a DNS-based failover, naturally, so the schedule should be staggered by a couple days for that.

I'll set the DNS switchover to run over the weekend so all the DNS caches in the world clear up. Once it is on 07, I'll reboot 06 and switch it back for the reboot of 07 on Friday. That should be pretty smooth for the web. The NFS should be ok as long as the reboot goes smoothly and it is announced that it will flake out a bit. We can do the failover, but I don't see much benefit for a straight reboot because the load numbers on clients will climb anyway until it comes back up.

I'll drop "pending" when I actually send the announcement.

So labstore1007 has a current certificate for dumps.wikimedia.org:

Validity
    Not Before: Nov 28 13:59:40 2018 GMT
    Not After : Feb 26 13:59:40 2019 GMT
Subject: CN=dumps.wikimedia.org

It looks like that's because it's running some new certcentral stuff. That appears to be identical to the cert its partner has, which makes failover really easy. I somehow wasn't sure if that new cert management stuff was done.

So, I'll just fail over the DNS then.
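As a quick sanity check of the reasoning above, the validity window quoted from the certificate can be compared against the two reboot slots listed in the task description. This is a minimal sketch: the helper names are hypothetical, the timestamps are copied from this task, and the month-name parsing assumes an English/C locale.

```python
from datetime import datetime, timezone


def parse_cert_time(text):
    """Parse a timestamp such as 'Nov 28 13:59:40 2018 GMT' into an aware UTC datetime."""
    stamp = text.strip()
    if not stamp.endswith(" GMT"):
        raise ValueError(f"expected a GMT timestamp, got {text!r}")
    # Drop the trailing " GMT" and attach UTC explicitly.
    naive = datetime.strptime(stamp[:-4], "%b %d %H:%M:%S %Y")
    return naive.replace(tzinfo=timezone.utc)


def cert_valid_at(not_before, not_after, when):
    """Return True if 'when' falls inside the certificate's validity window."""
    return parse_cert_time(not_before) <= when <= parse_cert_time(not_after)


NOT_BEFORE = "Nov 28 13:59:40 2018 GMT"
NOT_AFTER = "Feb 26 13:59:40 2019 GMT"

# Reboot slots from the task description (17:00 UTC on each day).
reboot_labstore1006 = datetime(2018, 12, 3, 17, 0, tzinfo=timezone.utc)
reboot_labstore1007 = datetime(2018, 12, 7, 17, 0, tzinfo=timezone.utc)

for name, when in [("labstore1006", reboot_labstore1006),
                   ("labstore1007", reboot_labstore1007)]:
    print(name, cert_valid_at(NOT_BEFORE, NOT_AFTER, when))
```

Both slots fall inside the window, which is consistent with failing over by DNS alone.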

Change 476903 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] dumps distribution: fail over dumps to labstore1007 for upgrades

https://gerrit.wikimedia.org/r/476903

Change 476903 merged by Bstorm:
[operations/dns@master] dumps distribution: fail over dumps to labstore1007 for upgrades

https://gerrit.wikimedia.org/r/476903

Mentioned in SAL (#wikimedia-operations) [2018-12-03T17:09:41Z] <bstorm_> T207377 reboot labstore1006 for upgrades

Mentioned in SAL (#wikimedia-operations) [2018-12-07T17:17:09Z] <bstorm_> T207377 rebooted labstore1007 for kernel upgrades

Thanks @Bstorm and @GTirloni, you both did most of the heavy work :-)

I'm closing the task now as done, since labstore1003.eqiad.wmnet is going to be handled differently.