
Cloud: Labvirt and instance reboots for Meltdown
Closed, ResolvedPublic

Event Timeline

Andrew created this task.Jan 4 2018, 3:25 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 4 2018, 3:25 PM
chasemp added a subscriber: chasemp.EditedJan 4 2018, 3:43 PM

For the labvirts themselves I would like to do a few things:

  • Run the profiling we previously ran for labvirt1015 (when it had CPU issues) on a labvirt in labtest (labtestvirt2001 seems to be working fine)
  • Note that in https://gerrit.wikimedia.org/r/#/c/399243/ we whitelisted the possible labvirt kernel versions, so that needs to be updated for whatever is now valid (it's an array, so it can contain both versions for the transition; a sketch follows below)
  • Update labtestvirt2001 to a Meltdown-mitigated kernel
  • Run the profiling again. The numbers will be rough, but we need something.

Edit: well, ideally we would use a labtestvirt to do this staging and comparison, but on reflection I imagine it will be a better and faster test to use the spare labvirt in production. Use the same methodology as for labvirt1015: https://phabricator.wikimedia.org/T171473#3699835
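
As a minimal sketch of the whitelist-as-array idea (the version strings and variable name here are hypothetical; the real values live in the puppet change):

# Hypothetical kernel releases; keeping both in the array lets old and
# new kernels pass the check during the transition.
allowed="4.4.0-104-generic 4.4.0-112-generic"
case " $allowed " in
  *" $(uname -r) "*) echo "kernel whitelisted" ;;
  *) echo "unexpected kernel: $(uname -r)" ;;
esac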

chasemp renamed this task from Labvirt reboots for Meltdown to Cloud: Labvirt and instance reboots for Meltdown.Jan 4 2018, 3:43 PM
chasemp triaged this task as High priority.
chasemp updated the task description. (Show Details)Jan 4 2018, 3:46 PM
aborrero updated the task description. (Show Details)Jan 4 2018, 3:47 PM
aborrero updated the task description. (Show Details)Jan 4 2018, 3:52 PM
chasemp updated the task description. (Show Details)Jan 4 2018, 3:55 PM
chasemp updated the task description. (Show Details)

Regarding Toolforge instances, we should take into account that some of them already have pending kernel upgrades: T180809

aborrero updated the task description. (Show Details)Jan 4 2018, 4:06 PM
aborrero updated the task description. (Show Details)
bd808 added a subscriber: bd808.Jan 4 2018, 4:20 PM
Paladox added a subscriber: Paladox.Jan 4 2018, 4:34 PM
Paladox added a comment.EditedJan 4 2018, 10:41 PM

I think upstream has started rolling out the security update.

And the package has been updated in jessie:

[22:09:48] <moritzm> !log uploaded linux-meta 1.16 for jessie-wikimedia to apt.wikimedia.org (which installs the new KPTI-enabled kernel with the new ABI)

@Paladox: Most of WMCS runs Trusty with either the 3.13 or 4.4 kernel and needs an update from Canonical (which isn't available yet).

chasemp added a comment.EditedJan 5 2018, 2:21 PM

OS_TENANT_NAME=testlabs openstack server create --flavor 2 --image 85e8924b-b25d-4341-ad3e-56856d4de2cc --availability-zone host:labvirt1018 labvirt1018stresstest-1
OS_TENANT_NAME=testlabs openstack server create --flavor 2 --image 85e8924b-b25d-4341-ad3e-56856d4de2cc --availability-zone host:labvirt1018 labvirt1018stresstest-2
OS_TENANT_NAME=testlabs openstack server create --flavor 4 --image 85e8924b-b25d-4341-ad3e-56856d4de2cc --availability-zone host:labvirt1018 labvirt1018stresstest-3
OS_TENANT_NAME=testlabs openstack server create --flavor 4 --image 85e8924b-b25d-4341-ad3e-56856d4de2cc --availability-zone host:labvirt1018 labvirt1018stresstest-4
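
To double-check that the scheduler honored the host pin, something like this should work (sketch; the OS-EXT-SRV-ATTR:host field is only visible to admins):

OS_TENANT_NAME=testlabs openstack server show labvirt1018stresstest-1 -c OS-EXT-SRV-ATTR:host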

sudo cumin "name:labvirt1018stresstest*" "sudo aptitude install -y stress-ng"

sudo cumin --force D{labvirt1018stresstest-[1-4].testlabs.eqiad.wmflabs} 'stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G &'

or

sudo cumin --force D{labvirt1018stresstest-4.testlabs.eqiad.wmflabs} 'sudo screen -d -m "stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G"'

sudo cumin --force D{labvirt1018stresstest-[1-5].testlabs.eqiad.wmflabs} 'sudo screen -d -m "stress-ng --timeout 600 --fork 4 --cpu 1 --io 2 --vm 1 --vm-bytes 1G --switch 5"'
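
For reference, what that last invocation exercises (flag meanings per stress-ng(1); the fork and switch workers are the syscall-heaviest, i.e. closest to the KPTI worst case):

# --timeout 600         run for 10 minutes
# --fork 4              4 workers continuously forking and reaping children
# --cpu 1               1 pure-CPU worker
# --io 2                2 workers repeatedly calling sync()
# --vm 1 --vm-bytes 1G  1 worker allocating and touching 1G of memory
# --switch 5            5 process pairs forcing context switches via pipes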

Andrew added a comment.Jan 9 2018, 6:57 PM

First, on labvirt1018:

Create new base images with updated kernels

  1. upgrade kernels on all VMs (how? Probably we need two VMs per distro and then run distro-selected apt commands; a sketch follows this list)
    1. Trusty: apt-get install <???> (in particular, make sure we're moving everything to 4.x kernels)
    2. Stretch: apt-get install <???>
    3. Jessie: apt-get install <???>
  2. upgrade kernel on labvirt
  3. reboot labvirt
  4. restart all VMs
  5. test
  6. wait a few hours
  7. check performance stats
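
For reference, the per-distro packages that were eventually used (see the pilot commands in the Jan 11 comment below) would slot into step 1 roughly like this (sketch):

# Inventory distros first:
sudo cumin "A:all" 'lsb_release -sc'

# Then per distro:
#   Trusty:  apt-get install linux-image-generic
#   Jessie:  apt-get install linux-meta
#   Stretch: apt-get install linux-image-amd64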
chasemp updated the task description. (Show Details)Jan 9 2018, 7:03 PM
chasemp updated the task description. (Show Details)
chasemp updated the task description. (Show Details)Jan 9 2018, 7:09 PM
chasemp updated the task description. (Show Details)Jan 9 2018, 7:28 PM
chasemp updated the task description. (Show Details)EditedJan 9 2018, 7:31 PM

tl;dr: the PCID feature is present on all labvirts; INVPCID only on labvirt1010 and up.

Checking for the PCID and INVPCID feature flags across labvirts (exit status 0 means the flag is present, 1 means absent):

for i in `grep labvirt main`; do echo $i; ssh $i 'grep pcid /proc/cpuinfo &> /dev/null; echo $?'; done
labvirt1001.eqiad.wmnet
0
labvirt1002.eqiad.wmnet
0
labvirt1003.eqiad.wmnet
0
labvirt1004.eqiad.wmnet
0
labvirt1005.eqiad.wmnet
0
labvirt1006.eqiad.wmnet
0
labvirt1007.eqiad.wmnet
0
labvirt1008.eqiad.wmnet
0
labvirt1009.eqiad.wmnet
0
labvirt1010.eqiad.wmnet
0
labvirt1011.eqiad.wmnet
0
labvirt1012.eqiad.wmnet
0
labvirt1013.eqiad.wmnet
0
labvirt1014.eqiad.wmnet
0
labvirt1015.eqiad.wmnet
0
labvirt1016.eqiad.wmnet
0
labvirt1017.eqiad.wmnet
0
labvirt1018.eqiad.wmnet
0
labvirt1019.eqiad.wmnet
0
labvirt1020.eqiad.wmnet
0
for i in `grep labvirt main`; do echo $i; ssh $i 'grep invpcid /proc/cpuinfo &> /dev/null; echo $?'; done
labvirt1001.eqiad.wmnet
1
labvirt1002.eqiad.wmnet
1
labvirt1003.eqiad.wmnet
1
labvirt1004.eqiad.wmnet
1
labvirt1005.eqiad.wmnet
1
labvirt1006.eqiad.wmnet
1
labvirt1007.eqiad.wmnet
1
labvirt1008.eqiad.wmnet
1
labvirt1009.eqiad.wmnet
1
labvirt1010.eqiad.wmnet
0
labvirt1011.eqiad.wmnet
0
labvirt1012.eqiad.wmnet
0
labvirt1013.eqiad.wmnet
0
labvirt1014.eqiad.wmnet
0
labvirt1015.eqiad.wmnet
0
labvirt1016.eqiad.wmnet
0
labvirt1017.eqiad.wmnet
0
labvirt1018.eqiad.wmnet
0
labvirt1019.eqiad.wmnet
0
labvirt1020.eqiad.wmnet
0
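
A single-pass variant that checks both flags at once (sketch; note that grep -w keeps the plain 'pcid' pattern from also matching 'invpcid'):

for i in `grep labvirt main`; do
  echo "$i pcid=$(ssh $i 'grep -qw pcid /proc/cpuinfo; echo $?') invpcid=$(ssh $i 'grep -qw invpcid /proc/cpuinfo; echo $?')"
done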

I've built new base images, and I'm concerned about what I'm seeing for Jessie.

Trusty:

andrew@trusty-meltdown-image:~$ lsb_release -a
No LSB modules are available.
Distributor ID:    Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:    14.04
Codename:    trusty
andrew@trusty-meltdown-image:~$ uname -a
Linux trusty-meltdown-image 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 14:43:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

(looks right)

Stretch:

andrew@stretch-meltdown-image:~$ lsb_release -a
No LSB modules are available.
Distributor ID:    Debian
Description:    Debian GNU/Linux 9.3 (stretch)
Release:    9.3
Codename:    stretch

andrew@stretch-meltdown-image:~$ uname -a
Linux stretch-meltdown-image 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04) x86_64 GNU/Linux

(that looks right to me)

Jessie:

andrew@jessie-meltdown-image:~$ lsb_release -a
No LSB modules are available.
Distributor ID:    Debian
Description:    Debian GNU/Linux 8.10 (jessie)
Release:    8.10
Codename:    jessie
andrew@jessie-meltdown-image:~$ uname -a
Linux jessie-meltdown-image 4.9.0-0.bpo.5-amd64 #1 SMP Debian 4.9.65-3+deb9u1~bpo8+2 (2018-01-04) x86_64 GNU/Linux

(So... apparently we are running 4.9 kernels on Jessie even though the security patch for Jessie is only in the 3.16 kernel. Not sure how to move forward from this. That also raises concerns about the upgrade path for existing VMs.)

Here are all the distros and kernels currently running: P6565

> Linux jessie-meltdown-image 4.9.0-0.bpo.5-amd64 #1 SMP Debian 4.9.65-3+deb9u1~bpo8+2 (2018-01-04) x86_64 GNU/Linux
>
> (So... apparently we are running 4.9 kernels on Jessie even though the security patch for Jessie is only in the 3.16 kernel. Not sure how to move forward from this. That also raises concerns about the upgrade path for existing VMs.)

No, that's correct. While Debian jessie by default uses a 3.16 kernel, we've been using a 4.9 backport for a while (to be able to run the same base kernel as on stretch and for various features not in 3.16). 4.9.65-3+deb9u1~bpo8+2 is a kernel I built internally, if you look into /usr/share/doc/linux-image-4.9.0-0.bpo.5-amd64/changelog.Debian.gz it'll list my changelog entry wrt KPTI patches.
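
A quick way to confirm that from inside a Jessie VM (sketch):

zgrep -i -m1 kpti /usr/share/doc/linux-image-4.9.0-0.bpo.5-amd64/changelog.Debian.gz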

The load-testing command I've settled on is:

sudo cumin --force --timeout 120 -o json "project:testlabs name:labvirt1018stresstest*"  'screen -d -m stress-ng --timeout 3600 --fork 4 --cpu 1 --io 2 --vm 1 --vm-bytes 1G --switch 5'

I've run three load tests with the above command. The last test started at Wed Jan 10 15:51:10 UTC 2018

Andrew added a comment.EditedJan 10 2018, 5:41 PM

The terrible way to fix grub on Trusty VMs is:

sudo cumin --force --timeout 120 -o json "A:all" "lsb_release -si | grep Ubuntu && mv /boot/grub/menu.lst /boot/grub/menu.lst.old && update-grub -y"
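
(These Trusty VMs boot from the legacy menu.lst, which is why it has to be moved aside and regenerated to pick up new kernels; see the "Found kernel" output further down. A sweep to confirm the result after reboots could look like this, as a sketch:)

sudo cumin --force --timeout 120 -o json "A:all" "lsb_release -si | grep -q Ubuntu && uname -r"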

Change 403455 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] labvirts: whitelist the post-meltdown kernel version

https://gerrit.wikimedia.org/r/403455

Change 403455 merged by Andrew Bogott:
[operations/puppet@production] labvirts: whitelist the post-meltdown kernel version

https://gerrit.wikimedia.org/r/403455

chasemp updated the task description. (Show Details)Jan 10 2018, 6:38 PM
chasemp updated the task description. (Show Details)Jan 10 2018, 6:51 PM

First test was at Thu Jan 11 03:20:08 UTC 2018

Second test was at Thu Jan 11 03:35:23 UTC 2018

Third test was at Thu Jan 11 03:50:29 UTC 2018

There's a slight change in performance but not much! At least on the newer labvirts it doesn't look like we need to worry about this.

Definitely more expensive, but potentially not so severe that it causes us major pain.

chasemp added a comment.EditedJan 11 2018, 3:19 PM

I did https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Meltdown_Response#PIlot_in_Toolforge this morning, getting pilot instances of all 3 kinds (technically yuvi caught all the Stretch instances there for PAWS yesterday). See https://etherpad.wikimedia.org/p/cloud-meltdown-rollout for the live copy.

Trusty

sudo apt-get update && sudo apt-get -y install linux-image-generic && sudo mv /boot/grub/menu.lst /boot/grub/menu.lst.old && sudo update-grub -y && sudo uname -r

Jessie

sudo apt-get -y install linux-meta

Stretch [so far all no-ops today]

apt-get -s install linux-image-amd64
0 upgraded, 0 newly installed, 0 to remove and 71 not upgraded.
Linux tools-paws-worker-1001 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04) x86_64 GNU/Linux
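
A dry-run sweep to confirm the Stretch fleet really is a no-op could look like this (sketch; non-Stretch hosts will simply report failure on the grep):

sudo cumin "A:all" 'lsb_release -sc | grep -q stretch && apt-get -s install linux-image-amd64 | tail -n1'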


Side note: since Trusty is not updating the grub menu for new kernels, we are sitting on a pile of unused kernels while still running the oldest thing possible.

tools-exec-1402

Found kernel: /boot/vmlinuz-3.13.0-139-generic
Found kernel: /boot/vmlinuz-3.13.0-137-generic
Found kernel: /boot/vmlinuz-3.13.0-135-generic
Found kernel: /boot/vmlinuz-3.13.0-133-generic
Found kernel: /boot/vmlinuz-3.13.0-132-generic
Found kernel: /boot/vmlinuz-3.13.0-129-generic
Found kernel: /boot/vmlinuz-3.13.0-128-generic
Found kernel: /boot/vmlinuz-3.13.0-126-generic
Found kernel: /boot/vmlinuz-3.13.0-125-generic
Found kernel: /boot/vmlinuz-3.13.0-123-generic
Found kernel: /boot/vmlinuz-3.13.0-121-generic
Found kernel: /boot/vmlinuz-3.13.0-119-generic
Found kernel: /boot/vmlinuz-3.13.0-109-generic
Found kernel: /boot/vmlinuz-3.13.0-100-generic
Updating /boot/grub/menu.lst ... done

That's 2G+ of unused kernels, which seems kind of crazy. My issue at the moment is that we were actually running 3.13.0-100-generic, which means that if I clean up all but, say, the last 2, every kernel we keep would be one we never actually ran here :)

If 3.13.0-139-generic seems fruitful we should clean this up with something like:

sudo apt-get install bikeshed

# keeps the last 2 kernels
sudo purge-old-kernels
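
purge-old-kernels keeps the two newest kernels by default; the count is tunable if we want more headroom (sketch, per its manpage):

sudo purge-old-kernels --keep 3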

rush@tools-exec-1401:~$ dmesg | grep -i isolation

[ 0.000000] Kernel/User page tables isolation: enabled

rush@tools-paws-worker-1001:~$ sudo dmesg | grep -i isolation

[ 0.000000] Kernel/User page tables isolation: enabled

rush@tools-worker-1011:~$ sudo dmesg | grep -i isolation

[ 0.000000] Kernel/User page tables isolation: enabled
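
Rather than spot-checking hosts one at a time, a fleet-wide sweep would be something like (sketch; note the message can rotate out of the dmesg ring buffer on long uptimes):

sudo cumin "A:all" "dmesg | grep -i 'page tables isolation'"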

It seems tools-worker-1015 did not get the update, as I forgot to reboot it. But I'm hoping we have a big enough snapshot to know immediately whether we are in trouble, or whether we can move on to the next step. If a reboot to pick up the update happened this morning, it would have been right before 15:00.

These graphs use relative time ranges, so they have to be shifted if you're looking at them after today.

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515696010.908&areaMode=stacked&target=tools.tools-exec-140[1-5].cpu.total.irq&target=tools.tools-exec-140[1-5].cpu.total.nice&target=tools.tools-exec-140[1-5].cpu.total.softirq&target=tools.tools-exec-140[1-5].cpu.total.steal&target=tools.tools-exec-140[1-5].cpu.total.system&target=tools.tools-exec-140[1-5].cpu.total.user&target=tools.tools-exec-140[1-5].cpu.total.iowait&hideLegend=false&from=-24h

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515696212.788&areaMode=all&target=cactiStyle(tools.tools-exec-140[1-5].loadavg.05)&target=cactiStyle(tools.tools-exec-140[1-5].loadavg.01)&from=-8h

https://graphite-labs.wikimedia.org/render/?width=777&height=500&_salt=1515696212.788&areaMode=all&target=cactiStyle(tools.tools-worker-101[0-6].loadavg.05)&from=-8h&hideLegend=false

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515696718.888&areaMode=stacked&target=tools.tools-worker-1011.cpu.total.user&target=tools.tools-worker-1011.cpu.total.system&target=tools.tools-worker-1011.cpu.total.softirq&target=tools.tools-worker-1011.cpu.total.nice&target=tools.tools-worker-1011.cpu.total.irq&target=tools.tools-worker-1011.cpu.total.iowait&target=tools.tools-worker-1011.cpu.total.steal&from=-8h

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515696718.888&areaMode=stacked&target=cactiStyle(tools.tools-worker-1011.cpu.total.user)&target=cactiStyle(tools.tools-worker-1011.cpu.total.system)&target=cactiStyle(tools.tools-worker-1011.cpu.total.softirq)&target=cactiStyle(tools.tools-worker-1011.cpu.total.nice)&target=cactiStyle(tools.tools-worker-1011.cpu.total.irq)&target=cactiStyle(tools.tools-worker-1011.cpu.total.iowait)&target=cactiStyle(tools.tools-worker-1011.cpu.total.steal)&from=-72h

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515696718.888&areaMode=stacked&target=cactiStyle(tools.tools-worker-1012.cpu.total.user)&target=cactiStyle(tools.tools-worker-1012.cpu.total.system)&target=cactiStyle(tools.tools-worker-1012.cpu.total.softirq)&target=cactiStyle(tools.tools-worker-1012.cpu.total.nice)&target=cactiStyle(tools.tools-worker-1012.cpu.total.irq)&target=cactiStyle(tools.tools-worker-1012.cpu.total.iowait)&target=cactiStyle(tools.tools-worker-1012.cpu.total.steal)&from=-72h

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515696718.888&areaMode=stacked&target=cactiStyle(tools.tools-worker-1013.cpu.total.user)&target=cactiStyle(tools.tools-worker-1013.cpu.total.system)&target=cactiStyle(tools.tools-worker-1013.cpu.total.softirq)&target=cactiStyle(tools.tools-worker-1013.cpu.total.nice)&target=cactiStyle(tools.tools-worker-1013.cpu.total.irq)&target=cactiStyle(tools.tools-worker-1013.cpu.total.iowait)&target=cactiStyle(tools.tools-worker-1013.cpu.total.steal)&from=-8h

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515696718.888&areaMode=stacked&target=cactiStyle(tools.tools-worker-1014.cpu.total.user)&target=cactiStyle(tools.tools-worker-1014.cpu.total.system)&target=cactiStyle(tools.tools-worker-1014.cpu.total.softirq)&target=cactiStyle(tools.tools-worker-1014.cpu.total.nice)&target=cactiStyle(tools.tools-worker-1014.cpu.total.irq)&target=cactiStyle(tools.tools-worker-1014.cpu.total.iowait)&target=cactiStyle(tools.tools-worker-1014.cpu.total.steal)&from=-8h

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515696718.888&areaMode=stacked&target=cactiStyle(tools.tools-worker-1015.cpu.total.user)&target=cactiStyle(tools.tools-worker-1015.cpu.total.system)&target=cactiStyle(tools.tools-worker-1015.cpu.total.softirq)&target=cactiStyle(tools.tools-worker-1015.cpu.total.nice)&target=cactiStyle(tools.tools-worker-1015.cpu.total.irq)&target=cactiStyle(tools.tools-worker-1015.cpu.total.iowait)&target=cactiStyle(tools.tools-worker-1015.cpu.total.steal)&from=-5h

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515696718.888&areaMode=stacked&target=cactiStyle(tools.tools-worker-1016.cpu.total.user)&target=cactiStyle(tools.tools-worker-1016.cpu.total.system)&target=cactiStyle(tools.tools-worker-1016.cpu.total.softirq)&target=cactiStyle(tools.tools-worker-1016.cpu.total.nice)&target=cactiStyle(tools.tools-worker-1016.cpu.total.irq)&target=cactiStyle(tools.tools-worker-1016.cpu.total.iowait)&target=cactiStyle(tools.tools-worker-1016.cpu.total.steal)&from=-8h

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515697792.99&target=cactiStyle(tools.tools-worker-101[1-6].loadavg.05)&from=-8h

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515697959.788&areaMode=stacked&target=tools.tools-paws-worker-1001.cpu.total.irq&target=tools.tools-paws-worker-1001.cpu.total.nice&target=tools.tools-paws-worker-1001.cpu.total.softirq&target=tools.tools-paws-worker-1001.cpu.total.steal&target=tools.tools-paws-worker-1001.cpu.total.system&target=tools.tools-paws-worker-1001.cpu.total.user

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515697959.788&areaMode=stacked&target=tools.tools-paws-worker-1002.cpu.total.irq&target=tools.tools-paws-worker-1002.cpu.total.nice&target=tools.tools-paws-worker-1002.cpu.total.softirq&target=tools.tools-paws-worker-1002.cpu.total.steal&target=tools.tools-paws-worker-1002.cpu.total.system&target=tools.tools-paws-worker-1002.cpu.total.user&from=-2d

https://graphite-labs.wikimedia.org/render/?width=777&height=459&_salt=1515697959.788&areaMode=stacked&target=tools.tools-paws-worker-1003.cpu.total.irq&target=tools.tools-paws-worker-1003.cpu.total.nice&target=tools.tools-paws-worker-1003.cpu.total.softirq&target=tools.tools-paws-worker-1003.cpu.total.steal&target=tools.tools-paws-worker-1003.cpu.total.system&target=tools.tools-paws-worker-1003.cpu.total.user&from=-2d

tl;dr: I do think there is an impact, and that it's workload dependent, which is made hugely complicated because our workers, exec nodes, and paws workers do not have predictable or consistent workloads. Let's move ahead with labvirt1017, labvirt1003, and co. per https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Meltdown_Response#First_canaries_from_1001-1009_(pcid)_and_1010-1019_(pcid_and_invpcid)_(both_should_have_headroom) and see how things fare. We only have simulated numbers on a labvirt at the moment, so it'll be interesting.

https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1515702661.283&target=servers.labvirt1017.cpu.total.guest_nice&target=servers.labvirt1017.cpu.total.guest&areaMode=stacked&hideLegend=false&from=-8h

https://graphite.wikimedia.org/render/?width=959&height=320&_salt=1515702706.035&areaMode=stacked&hideLegend=false&target=servers.labvirt1017.cpu.total.user&target=servers.labvirt1017.cpu.total.system&target=servers.labvirt1017.cpu.total.steal&target=servers.labvirt1017.cpu.total.softirq&target=servers.labvirt1017.cpu.total.nice&target=servers.labvirt1017.cpu.total.irq&target=servers.labvirt1017.cpu.total.iowait&from=-4h

https://graphite.wikimedia.org/render/?width=959&height=320&_salt=1515705013.51&areaMode=stacked&hideLegend=false&target=cactiStyle(servers.labvirt1017.loadavg.05)&from=-4h


https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1515702661.283&target=servers.labvirt1003.cpu.total.guest_nice&target=servers.labvirt1003.cpu.total.guest&areaMode=stacked&hideLegend=false&from=-8h

https://graphite.wikimedia.org/render/?width=959&height=320&_salt=1515702706.035&areaMode=stacked&hideLegend=false&target=servers.labvirt1003.cpu.total.user&target=servers.labvirt1003.cpu.total.system&target=servers.labvirt1003.cpu.total.steal&target=servers.labvirt1003.cpu.total.softirq&target=servers.labvirt1003.cpu.total.nice&target=servers.labvirt1003.cpu.total.irq&target=servers.labvirt1003.cpu.total.iowait&from=-4h

https://graphite.wikimedia.org/render/?width=959&height=320&_salt=1515705013.51&areaMode=stacked&hideLegend=false&target=cactiStyle(servers.labvirt1003.loadavg.05)&from=-4h

@MoritzMuehlenhoff reports in T184910 that there are servers just pending the reboot. Should that ticket be merged into this one?

chasemp added a parent task: Restricted Task.Jan 16 2018, 1:46 PM
chasemp updated the task description. (Show Details)Jan 16 2018, 3:12 PM

Change 404588 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] cloud: labvirt settle on meltdown kernel

https://gerrit.wikimedia.org/r/404588

Change 404588 merged by Andrew Bogott:
[operations/puppet@production] cloud: labvirt settle on meltdown kernel

https://gerrit.wikimedia.org/r/404588

chasemp closed this task as Resolved.Jan 17 2018, 4:04 PM
chasemp claimed this task.

Full working etherpad is archived at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Meltdown_Response

The initial read has held up, I think: this is not a free upgrade, but the overhead is something like 5%, and it is difficult to pin down due to our varied workloads. labvirt1015 is the only hypervisor I've seen over 24h where load has really turned upward, but that's in relative terms, as the absolute resource usage is still not worrisome.

https://graphite.wikimedia.org/render/?width=959&height=320&_salt=1515705013.51&areaMode=stacked&hideLegend=false&target=cactiStyle(servers.labvirt1015.loadavg.05)&from=-48h

Resolving until we have further information; further work is happening in T184910.