
Cloud: Labvirt and instance reboots for Meltdown
Closed, ResolvedPublic

Event Timeline

For the labvirts themselves I would like to do a few things:

  • Run the profile we ran in the past for labvirt1015 (when it had CPU issues) on a labvirt in labtest (labtestvirt2001 seems to be working fine)
  • Note that in https://gerrit.wikimedia.org/r/#/c/399243/ we whitelisted possible labvirt kernel versions, so that needs to be updated for whatever is now valid (it's an array, so it can contain both during the transition)
  • Update labtestvirt2001 to a Meltdown-mitigated kernel
  • Run the profile again. The numbers are rough, but we need something.

Edit: ideally we would use a labtestvirt for this staging and comparison, but on reflection I imagine it will be a better and faster test to use the spare labvirt in production. Use the same methodology as for labvirt1015: https://phabricator.wikimedia.org/T171473#3699835

chasemp renamed this task from Labvirt reboots for Meltdown to Cloud: Labvirt and instance reboots for Meltdown.Jan 4 2018, 3:43 PM
chasemp triaged this task as High priority.

Regarding Toolforge instances, we should take into account that some of them already have pending kernel upgrades: T180809

aborrero updated the task description. (Show Details)

I think upstream has started rolling out the security update.

And the package has been updated in jessie

[22:09:48] <moritzm> !log uploaded linux-meta 1.16 for jessie-wikimedia to apt.wikimedia.org (which installs the new KPTI-enabled kernel with the new ABI)

@Paladox: Most of WMCS runs trusty with either the 3.13 or 4.4 kernel and needs an update by Canonical (which isn't available).

OS_TENANT_NAME=testlabs openstack server create --flavor 2 --image 85e8924b-b25d-4341-ad3e-56856d4de2cc --availability-zone host:labvirt1018 labvirt1018stresstest-1
OS_TENANT_NAME=testlabs openstack server create --flavor 2 --image 85e8924b-b25d-4341-ad3e-56856d4de2cc --availability-zone host:labvirt1018 labvirt1018stresstest-2
OS_TENANT_NAME=testlabs openstack server create --flavor 4 --image 85e8924b-b25d-4341-ad3e-56856d4de2cc --availability-zone host:labvirt1018 labvirt1018stresstest-3
OS_TENANT_NAME=testlabs openstack server create --flavor 4 --image 85e8924b-b25d-4341-ad3e-56856d4de2cc --availability-zone host:labvirt1018 labvirt1018stresstest-4
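
To confirm the four instances actually landed on labvirt1018 before loading them up, something like this should work (a sketch; --name matches as a regex, and --long adds the hypervisor column when run with admin credentials):

OS_TENANT_NAME=testlabs openstack server list --name 'labvirt1018stresstest' --long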

sudo cumin "name:labvirt1018stresstest*" "sudo aptitude install -y stress-ng"

sudo cumin --force D{labvirt1018stresstest-[1-4].testlabs.eqiad.wmflabs} 'stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G &'

or

sudo cumin --force D{labvirt1018stresstest-4.testlabs.eqiad.wmflabs} 'sudo screen -d -m "stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G"'

sudo cumin --force D{labvirt1018stresstest-[1-5].testlabs.eqiad.wmflabs} 'sudo screen -d -m "stress-ng --timeout 600 --fork 4 --cpu 1 --io 2 --vm 1 --vm-bytes 1G --switch 5"'
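
To sanity-check that the load is actually running everywhere (a sketch reusing the targeting from above; pgrep -c prints the number of matching processes, so a non-zero count on each host means stress-ng is live):

sudo cumin "name:labvirt1018stresstest*" 'pgrep -c stress-ng'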

First, on labvirt1018:

Create new base images with updated kernels

  1. upgrade kernels on all VMs (how? Probably we need two VMs per distro and then run distro-specific apt commands; see the sketch after this list)
    1. Trusty: apt-get install <???> (in particular, make sure we're moving everything to 4.x kernels)
    2. Stretch: apt-get install <???>
    3. Jessie: apt-get install <???>
  2. upgrade kernel on labvirt
  3. reboot labvirt
  4. restart all VMs
  5. test
  6. wait a few hours
  7. check performance stats
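
A sketch of the distro-selected dispatch from step 1; the package names here mirror the commands that ended up being used later in this task, so treat them as assumptions at this stage:

#!/bin/bash
# Pick the right kernel upgrade per distro (sketch; package names are taken
# from the commands used later in this task).
case "$(lsb_release -sc)" in
    trusty)  apt-get update && apt-get -y install linux-image-generic ;;  # patched 3.13 line
    jessie)  apt-get -y install linux-meta ;;                             # new-ABI 4.9 KPTI kernel
    stretch) apt-get -y install linux-image-amd64 ;;                      # likely a no-op once 4.9.65-3+deb9u2 is in
    *)       echo "unhandled distro: $(lsb_release -sc)" >&2; exit 1 ;;
esac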

tl;dr: PCID feature on all labvirts, INVPCID on labvirt1010+ only.

Checking for the PCID and INVPCID feature flags across labvirts (exit code 0 below means the flag is present, 1 means it's absent).

for i in `grep labvirt main`; do echo $i; ssh $i 'grep pcid /proc/cpuinfo &> /dev/null; echo $?'; done
labvirt1001.eqiad.wmnet
0
labvirt1002.eqiad.wmnet
0
labvirt1003.eqiad.wmnet
0
labvirt1004.eqiad.wmnet
0
labvirt1005.eqiad.wmnet
0
labvirt1006.eqiad.wmnet
0
labvirt1007.eqiad.wmnet
0
labvirt1008.eqiad.wmnet
0
labvirt1009.eqiad.wmnet
0
labvirt1010.eqiad.wmnet
0
labvirt1011.eqiad.wmnet
0
labvirt1012.eqiad.wmnet
0
labvirt1013.eqiad.wmnet
0
labvirt1014.eqiad.wmnet
0
labvirt1015.eqiad.wmnet
0
labvirt1016.eqiad.wmnet
0
labvirt1017.eqiad.wmnet
0
labvirt1018.eqiad.wmnet
0
labvirt1019.eqiad.wmnet
0
labvirt1020.eqiad.wmnet
0
for i in `grep labvirt main`; do echo $i; ssh $i 'grep invpcid /proc/cpuinfo &> /dev/null; echo $?'; done
labvirt1001.eqiad.wmnet
1
labvirt1002.eqiad.wmnet
1
labvirt1003.eqiad.wmnet
1
labvirt1004.eqiad.wmnet
1
labvirt1005.eqiad.wmnet
1
labvirt1006.eqiad.wmnet
1
labvirt1007.eqiad.wmnet
1
labvirt1008.eqiad.wmnet
1
labvirt1009.eqiad.wmnet
1
labvirt1010.eqiad.wmnet
0
labvirt1011.eqiad.wmnet
0
labvirt1012.eqiad.wmnet
0
labvirt1013.eqiad.wmnet
0
labvirt1014.eqiad.wmnet
0
labvirt1015.eqiad.wmnet
0
labvirt1016.eqiad.wmnet
0
labvirt1017.eqiad.wmnet
0
labvirt1018.eqiad.wmnet
0
labvirt1019.eqiad.wmnet
0
labvirt1020.eqiad.wmnet
0
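
The same check condenses to one line per host with a readable yes/no instead of raw exit codes (a sketch; `main` is the same host list file used above):

for i in $(grep labvirt main); do
    ssh "$i" 'printf "%s pcid=%s invpcid=%s\n" "$(hostname)" \
        "$(grep -qw pcid /proc/cpuinfo && echo yes || echo no)" \
        "$(grep -qw invpcid /proc/cpuinfo && echo yes || echo no)"'
done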

I've built new base images, and I'm concerned about what I'm seeing for Jessie.

Trusty:

andrew@trusty-meltdown-image:~$ lsb_release -a
No LSB modules are available.
Distributor ID:    Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:    14.04
Codename:    trusty
andrew@trusty-meltdown-image:~$ uname -a
Linux trusty-meltdown-image 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 14:43:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

(looks right)

Stretch:

andrew@stretch-meltdown-image:~$ lsb_release -a
No LSB modules are available.
Distributor ID:    Debian
Description:    Debian GNU/Linux 9.3 (stretch)
Release:    9.3
Codename:    stretch

andrew@stretch-meltdown-image:~$ uname -a
Linux stretch-meltdown-image 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04) x86_64 GNU/Linux

(that looks right to me)

Jessie:

andrew@jessie-meltdown-image:~$ lsb_release -a
No LSB modules are available.
Distributor ID:    Debian
Description:    Debian GNU/Linux 8.10 (jessie)
Release:    8.10
Codename:    jessie
andrew@jessie-meltdown-image:~$ uname -a
Linux jessie-meltdown-image 4.9.0-0.bpo.5-amd64 #1 SMP Debian 4.9.65-3+deb9u1~bpo8+2 (2018-01-04) x86_64 GNU/Linux

(So... apparently we are running 4.9 kernels on Jessie even though the security patch for Jessie is only in the 3.16 kernel. Not sure how to move forward from this. That also raises concerns about the upgrade path for existing VMs.)

Here are all the distros and kernels currently running: P6565

> Linux jessie-meltdown-image 4.9.0-0.bpo.5-amd64 #1 SMP Debian 4.9.65-3+deb9u1~bpo8+2 (2018-01-04) x86_64 GNU/Linux
>
> (So... apparently we are running 4.9 kernels on Jessie even though the security patch for Jessie is only in the 3.16 kernel. Not sure how to move forward from this. That also raises concerns about the upgrade path for existing VMs.)

No, that's correct. While Debian jessie by default uses a 3.16 kernel, we've been using a 4.9 backport for a while (to be able to run the same base kernel as on stretch, and for various features not in 3.16). 4.9.65-3+deb9u1~bpo8+2 is a kernel I built internally; if you look into /usr/share/doc/linux-image-4.9.0-0.bpo.5-amd64/changelog.Debian.gz it'll list my changelog entry for the KPTI patches.

The load-testing command I've settled on is:

sudo cumin --force --timeout 120 -o json "project:testlabs name:labvirt1018stresstest*"  'screen -d -m stress-ng --timeout 3600 --fork 4 --cpu 1 --io 2 --vm 1 --vm-bytes 1G --switch 5'

I've run three load tests with the above command. The last test started at Wed Jan 10 15:51:10 UTC 2018.

before-1.png (1×2 px, 599 KB)

before-2.png (1×2 px, 590 KB)

before-3 Screen Shot 2018-01-10 at 9.57.28 AM.png (1×2 px, 601 KB)

The terrible way to fix grub on Trusty VMs is:

sudo cumin --force --timeout 120 -o json "A:all" "lsb_release -si | grep Ubuntu && mv /boot/grub/menu.lst /boot/grub/menu.lst.old && update-grub -y"
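
A possible follow-up check (a sketch, assuming the legacy-GRUB menu.lst layout implied by the command above): the first kernel stanza of the regenerated file should now name the new kernel.

sudo cumin --force --timeout 120 -o json "A:all" "lsb_release -si | grep -q Ubuntu && grep -m1 '^kernel' /boot/grub/menu.lst"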

Change 403455 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] labvirts: whitelist the post-meltdown kernel version

https://gerrit.wikimedia.org/r/403455

Change 403455 merged by Andrew Bogott:
[operations/puppet@production] labvirts: whitelist the post-meltdown kernel version

https://gerrit.wikimedia.org/r/403455

First test was at Thu Jan 11 03:20:08 UTC 2018

after-1 Screen Shot 2018-01-10 at 9.25.40 PM.png (1×2 px, 585 KB)

Second test was at Thu Jan 11 03:35:23 UTC 2018

after-2 Screen Shot 2018-01-10 at 9.40.53 PM.png (1×2 px, 593 KB)

Third test was at Thu Jan 11 03:50:29 UTC 2018

after-3 Screen Shot 2018-01-10 at 9.55.27 PM.png (1×2 px, 589 KB)

There's a slight change in performance but not much! At least on the newer labvirts it doesn't look like we need to worry about this.

Definitely more expensive, but potentially not so severe that it causes us major pain.

I did https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Meltdown_Response#PIlot_in_Toolforge this morning, getting pilot instances of all 3 kinds (technically yuvi caught all the Stretch instances there for PAWS yesterday). See https://etherpad.wikimedia.org/p/cloud-meltdown-rollout for the live copy.

Trusty

sudo apt-get update && sudo apt-get -y install linux-image-generic && sudo mv /boot/grub/menu.lst /boot/grub/menu.lst.old && sudo update-grub -y && sudo uname -r

Jessie

sudo apt-get -y install linux-meta

Stretch [so far all no-ops today]

apt-get -s install linux-image-amd64
0 upgraded, 0 newly installed, 0 to remove and 71 not upgraded.
Linux tools-paws-worker-1001 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04) x86_64 GNU/Linux
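
After a reboot, a quick per-VM sanity check (a sketch) that the running kernel matches the newest installed one:

uname -r
# newest installed kernel package; should correspond to the uname output
dpkg -l 'linux-image-[0-9]*' | awk '/^ii/ {print $2}' | sort -V | tail -n 1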


Side note: since Trusty is not updating the grub menu for new kernels, we are sitting on a pile of unused kernels while still running the oldest thing possible.

tools-exec-1402

Found kernel: /boot/vmlinuz-3.13.0-139-generic
Found kernel: /boot/vmlinuz-3.13.0-137-generic
Found kernel: /boot/vmlinuz-3.13.0-135-generic
Found kernel: /boot/vmlinuz-3.13.0-133-generic
Found kernel: /boot/vmlinuz-3.13.0-132-generic
Found kernel: /boot/vmlinuz-3.13.0-129-generic
Found kernel: /boot/vmlinuz-3.13.0-128-generic
Found kernel: /boot/vmlinuz-3.13.0-126-generic
Found kernel: /boot/vmlinuz-3.13.0-125-generic
Found kernel: /boot/vmlinuz-3.13.0-123-generic
Found kernel: /boot/vmlinuz-3.13.0-121-generic
Found kernel: /boot/vmlinuz-3.13.0-119-generic
Found kernel: /boot/vmlinuz-3.13.0-109-generic
Found kernel: /boot/vmlinuz-3.13.0-100-generic
Updating /boot/grub/menu.lst ... done

That's 2G+ of unused kernels, which seems kinda crazy. My issue at the moment is that we were actually running 3.13.0-100-generic, which means that if I clean up everything except the last two, every kernel we keep would be one we never actually ran here :)

If 3.13.0-139-generic seems fruitful we should clean this up with something like:

sudo apt-get install bikeshed
sudo purge-old-kernels

(purge-old-kernels keeps the last 2 kernels by default)
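
Before purging for real, a dry run along these lines (a sketch) lists what would go, keeping the two newest kernels and never the one currently running:

current=$(uname -r)
# installed kernel images, oldest first, minus the two newest and the running one
dpkg -l 'linux-image-[0-9]*' | awk '/^ii/ {print $2}' | sort -V | head -n -2 | grep -v "$current"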

rush@tools-exec-1401:~$ dmesg | grep -i isolation

[ 0.000000] Kernel/User page tables isolation: enabled

rush@tools-paws-worker-1001:~$ sudo dmesg | grep -i isolation

[ 0.000000] Kernel/User page tables isolation: enabled

rush@tools-worker-1011:~$ sudo dmesg | grep -i isolation

[ 0.000000] Kernel/User page tables isolation: enabled
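
Rather than spot-checking hosts one at a time, the same verification can go fleet-wide (a sketch; the project: query is an assumption about the cumin setup here):

sudo cumin "project:tools" 'dmesg | grep -i "page tables isolation"'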

It seems tools-worker-1015 did not get the update, as I forgot to reboot it. But I'm hoping we have enough of a snapshot to know immediately whether we are in trouble or can move to the next step. If it had been rebooted for the update this morning, that would have happened right before 15:00.

Screen Shot 2018-01-11 at 1.21.51 PM.png (458×773 px, 164 KB)

Screen Shot 2018-01-11 at 1.21.57 PM.png (458×771 px, 77 KB)

Screen Shot 2018-01-11 at 1.22.03 PM.png (496×776 px, 73 KB)

Screen Shot 2018-01-11 at 1.22.12 PM.png (452×775 px, 98 KB)

Screen Shot 2018-01-11 at 1.22.20 PM.png (451×774 px, 60 KB)

Screen Shot 2018-01-11 at 1.22.27 PM.png (456×777 px, 57 KB)

Screen Shot 2018-01-11 at 1.22.36 PM.png (453×772 px, 71 KB)

Screen Shot 2018-01-11 at 1.22.43 PM.png (452×775 px, 86 KB)

Screen Shot 2018-01-11 at 1.22.51 PM.png (447×769 px, 82 KB)

Screen Shot 2018-01-11 at 1.23.00 PM.png (456×776 px, 58 KB)

Screen Shot 2018-01-11 at 1.23.07 PM.png (457×777 px, 100 KB)

Screen Shot 2018-01-11 at 1.23.14 PM.png (456×771 px, 81 KB)

Screen Shot 2018-01-11 at 1.23.21 PM.png (455×776 px, 69 KB)

Screen Shot 2018-01-11 at 1.23.26 PM.png (459×773 px, 55 KB)

These graphs use relative time ranges, so they have to be shifted if you're looking at them after today.

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515696010.908&areaMode=stacked&target=tools.tools-exec-140[1-5].cpu.total.irq&target=tools.tools-exec-140[1-5].cpu.total.nice&target=tools.tools-exec-140[1-5].cpu.total.softirq&target=tools.tools-exec-140[1-5].cpu.total.steal&target=tools.tools-exec-140[1-5].cpu.total.system&target=tools.tools-exec-140[1-5].cpu.total.user&tools.tools-exec-140[1-5].cpu.total.iowait&hideLegend=false&from=-24h

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515696212.788&areaMode=all&target=cactiStyle(tools.tools-exec-140[1-5].loadavg.05)&target=cactiStyle(tools.tools-exec-140[1-5].loadavg.01)&from=-8h

https://graphite-labs.wikimedia.org/render/?width=777&height=500&_salt=1515696212.788&areaMode=all&target=cactiStyle(tools.tools-worker-101[0-6].loadavg.05)&from=-8h&hideLegend=false

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515696718.888&areaMode=stacked&target=tools.tools-worker-1011.cpu.total.user&target=tools.tools-worker-1011.cpu.total.system&target=tools.tools-worker-1011.cpu.total.softirq&target=tools.tools-worker-1011.cpu.total.nice&target=tools.tools-worker-1011.cpu.total.irq&target=tools.tools-worker-1011.cpu.total.iowait&target=tools.tools-worker-1011.cpu.total.steal&from=-8h

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515696718.888&areaMode=stacked&target=cactiStyle(tools.tools-worker-1011.cpu.total.user)&target=cactiStyle(tools.tools-worker-1011.cpu.total.system&target=cactiStyle(tools.tools-worker-1011.cpu.total.softirq)&target=cactiStyle(tools.tools-worker-1011.cpu.total.nice)&target=cactiStyle(tools.tools-worker-1011.cpu.total.irq)&target=cactiStyle(tools.tools-worker-1011.cpu.total.iowait)&target=cactiStyle(tools.tools-worker-1012.cpu.total.steal)&from=-72h

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515696718.888&areaMode=stacked&target=cactiStyle(tools.tools-worker-1012.cpu.total.user)&target=cactiStyle(tools.tools-worker-1012.cpu.total.system&target=cactiStyle(tools.tools-worker-1012.cpu.total.softirq)&target=cactiStyle(tools.tools-worker-1012.cpu.total.nice)&target=cactiStyle(tools.tools-worker-1012.cpu.total.irq)&target=cactiStyle(tools.tools-worker-1012.cpu.total.iowait)&target=cactiStyle(tools.tools-worker-1012.cpu.total.steal)&from=-72h

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515696718.888&areaMode=stacked&target=cactiStyle(tools.tools-worker-1013.cpu.total.user)&target=cactiStyle(tools.tools-worker-1013.cpu.total.system&target=cactiStyle(tools.tools-worker-1013.cpu.total.softirq)&target=cactiStyle(tools.tools-worker-1013.cpu.total.nice)&target=cactiStyle(tools.tools-worker-1013.cpu.total.irq)&target=cactiStyle(tools.tools-worker-1013.cpu.total.iowait)&target=cactiStyle(tools.tools-worker-1013.cpu.total.steal)&from=-8h

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515696718.888&areaMode=stacked&target=cactiStyle(tools.tools-worker-1014.cpu.total.user)&target=cactiStyle(tools.tools-worker-1014.cpu.total.system&target=cactiStyle(tools.tools-worker-1014.cpu.total.softirq)&target=cactiStyle(tools.tools-worker-1014.cpu.total.nice)&target=cactiStyle(tools.tools-worker-1014.cpu.total.irq)&target=cactiStyle(tools.tools-worker-1014.cpu.total.iowait)&target=cactiStyle(tools.tools-worker-1014.cpu.total.steal)&from=-8h

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515696718.888&areaMode=stacked&target=cactiStyle(tools.tools-worker-1015.cpu.total.user)&target=cactiStyle(tools.tools-worker-1015.cpu.total.system&target=cactiStyle(tools.tools-worker-1015.cpu.total.softirq)&target=cactiStyle(tools.tools-worker-1015.cpu.total.nice)&target=cactiStyle(tools.tools-worker-1015.cpu.total.irq)&target=cactiStyle(tools.tools-worker-1015.cpu.total.iowait)&target=cactiStyle(tools.tools-worker-1015.cpu.total.steal)&from=-5h

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515696718.888&areaMode=stacked&target=cactiStyle(tools.tools-worker-1016.cpu.total.user)&target=cactiStyle(tools.tools-worker-1016.cpu.total.system&target=cactiStyle(tools.tools-worker-1016.cpu.total.softirq)&target=cactiStyle(tools.tools-worker-1016.cpu.total.nice)&target=cactiStyle(tools.tools-worker-1016.cpu.total.irq)&target=cactiStyle(tools.tools-worker-1016.cpu.total.iowait)&target=cactiStyle(tools.tools-worker-1016.cpu.total.steal)&from=-8h

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515697792.99&target=cactiStyle(tools.tools-worker-101[1-6].loadavg.05)&from=-8h

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515697959.788&areaMode=stacked&target=tools.tools-paws-worker-1001.cpu.total.irq&target=tools.tools-paws-worker-1001.cpu.total.nice&target=tools.tools-paws-worker-1001.cpu.total.softirq&target=tools.tools-paws-worker-1001.cpu.total.steal&target=tools.tools-paws-worker-1001.cpu.total.system&target=tools.tools-paws-worker-1001.cpu.total.user

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515697959.788&areaMode=stacked&target=tools.tools-paws-worker-1002.cpu.total.irq&target=tools.tools-paws-worker-1002.cpu.total.nice&target=tools.tools-paws-worker-1002.cpu.total.softirq&target=tools.tools-paws-worker-1002.cpu.total.steal&target=tools.tools-paws-worker-1002.cpu.total.system&target=tools.tools-paws-worker-1002.cpu.total.user&from=-2d

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515697959.788&areaMode=stacked&target=tools.tools-paws-worker-1003.cpu.total.irq&target=tools.tools-paws-worker-1003.cpu.total.nice&target=tools.tools-paws-worker-1003.cpu.total.softirq&target=tools.tools-paws-worker-1003.cpu.total.steal&target=tools.tools-paws-worker-1003.cpu.total.system&target=tools.tools-paws-worker-1003.cpu.total.user&from=-2d

tl;dr: I do think there is an impact, and it's workload-dependent, which is made hugely complicated by the fact that our workers, exec nodes, and paws workers do not have predictable or consistent workloads. Let's move ahead with labvirt1017, labvirt1003, and co per https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Meltdown_Response#First_canaries_from_1001-1009_(pcid)_and_1010-1019_(pcid_and_invpcid)_(both_should_have_headroom) and see how things fare. We only have simulated numbers on a labvirt at the moment, so it'll be interesting.

https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1515702661.283&target=servers.labvirt1017.cpu.total.guest_nice&target=servers.labvirt1017.cpu.total.guest&areaMode=stacked&hideLegend=false&from=-8h

https://graphite.wikimedia.org/render/?width=959&height=320&_salt=1515702706.035&areaMode=stacked&hideLegend=false&target=servers.labvirt1017.cpu.total.user&target=servers.labvirt1017.cpu.total.system&target=servers.labvirt1017.cpu.total.steal&target=servers.labvirt1017.cpu.total.softirq&target=servers.labvirt1017.cpu.total.nice&target=servers.labvirt1017.cpu.total.irq&target=servers.labvirt1017.cpu.total.iowait&from=-4h

https://graphite.wikimedia.org/render/?width=959&height=320&_salt=1515705013.51&areaMode=stacked&hideLegend=false&target=cactiStyle(servers.labvirt1017.loadavg.05)&from=-4h


https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1515702661.283&target=servers.labvirt1003.cpu.total.guest_nice&target=servers.labvirt1003.cpu.total.guest&areaMode=stacked&hideLegend=false&from=-8h

https://graphite.wikimedia.org/render/?width=959&height=320&_salt=1515702706.035&areaMode=stacked&hideLegend=false&target=servers.labvirt1003.cpu.total.user&target=servers.labvirt1003.cpu.total.system&target=servers.labvirt1003.cpu.total.steal&target=servers.labvirt1003.cpu.total.softirq&target=servers.labvirt1003.cpu.total.nice&target=servers.labvirt1003.cpu.total.irq&target=servers.labvirt1003.cpu.total.iowait&from=-4h

https://graphite.wikimedia.org/render/?width=959&height=320&_salt=1515705013.51&areaMode=stacked&hideLegend=false&target=cactiStyle(servers.labvirt1003.loadavg.05)&from=-4h

@MoritzMuehlenhoff reports in T184910 that there are servers just pending the reboot. Should that ticket be merged into this one?

Change 404588 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] cloud: labvirt settle on meltdown kernel

https://gerrit.wikimedia.org/r/404588

Change 404588 merged by Andrew Bogott:
[operations/puppet@production] cloud: labvirt settle on meltdown kernel

https://gerrit.wikimedia.org/r/404588

chasemp claimed this task.

Full working etherpad is archived at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Meltdown_Response

The initial perspective has held up, I think: not a free upgrade, but roughly 5% overhead that is difficult to pin down due to our varied workloads. labvirt1015 is the only hypervisor I've seen over 24h that seems to have really turned upward, but that's in relative terms; the absolute resource usage is still not worrisome.

https://graphite.wikimedia.org/render/?width=959&height=320&_salt=1515705013.51&areaMode=stacked&hideLegend=false&target=cactiStyle(servers.labvirt1015.loadavg.05)&from=-48h

Screen Shot 2018-01-17 at 10.02.08 AM.png (315×958 px, 61 KB)

Resolving until we have further information; further work is happening in T184910.