Re-add intel-microcode
Closed, Resolved (Public)

Description

intel-microcode was initially added to the standard packages with https://phabricator.wikimedia.org/rOPUPe6e960b69d6764f5fbe5b1bf513c8a60be008696 but later on reverted with https://phabricator.wikimedia.org/rOPUP0fccee1f88fb312069f94fa634cb2d3b8205651c due to CPU frequency problems.

I'd like to re-add intel-microcode, since it fixes both stability/correctness bugs and around 2013 Intel also used microcode updates to address security problems on the CPU level. (These vulnerabilities are also addressed by kernel changes, but there might be some in the future which are not easily maskable by the kernel, so it'd also be good to be prepared.)

One way to add this gradually would be to not add intel-microcode to standard-packages.pp, but rather to the new meta package for the 4.4 kernel, that way all systems would get it step by step as we move to 4.4 (it needs a reboot to become effective anyway).

Comments/objections?

(We have only a single server with AMD CPUs (stat1001), that one can be dealt with manually)

Event Timeline

Change 312714 had a related patch set uploaded (by Matanya):
standard packages: re-add intel-microcode

https://gerrit.wikimedia.org/r/312714

+1 on tying this to 4.4+ and it being a good idea to get it going again. I'm not sure (unless someone's investigated already) that our latest 3.1[69] and such kernels don't still have the initramfs cpu frequency issues when installing intel-microcode, but I'd guess 4.4 is safe (also needs testing)?

We have two clusters which need updated microcode to provide support for the new IBPB instruction needed to secure KVM instances against Spectre. In addition to that, keeping the microcode updated is also recommended by upstream, since microcode updates can also address functional CPU issues.

Updated microcode is only picked up during reboots and we'll enable this per cluster (or rather per hardware generation per cluster for the bigger clusters composed of several generations of servers). If at some point we're confident enough to make it the new default, we can simply enable it in standard_packages.

Change 433160 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Allow enabling microcode updates gradually

https://gerrit.wikimedia.org/r/433160

Change 433160 merged by Muehlenhoff:
[operations/puppet@production] Allow enabling microcode updates gradually

https://gerrit.wikimedia.org/r/433160

Six out of the mw* servers have been switched to using microcode updates. (https://phabricator.wikimedia.org/P7148)

Mentioned in SAL (#wikimedia-operations) [2018-05-28T14:21:20Z] <ema> cp1045,cp2001,cp3007,cp5001: reboot with intel-microcode T127825

Change 436271 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Enable microcode updates for all mediawiki servers

https://gerrit.wikimedia.org/r/436271

It would be useful to check if a new microcode is available (and thus a system restart is needed). Something along these lines should do the trick:

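#!/bin/bash

# Revision of the currently loaded microcode (first "microcode" line in /proc/cpuinfo)
running="$(awk '/microcode/ { printf "%x", $3; exit }' /proc/cpuinfo)"
# Revision of the first on-disk microcode that iucode-tool selects for this system (-S);
# NR>1 skips the header line of the listing
available="$(sudo iucode-tool -tb -lS /lib/firmware/intel-ucode/* 2> /dev/null | awk 'NR>1 { printf "%x", $9 ; exit }')"

msg="running microcode 0x${running}, 0x${available} available"

# Any mismatch between the two revisions is reported as a pending reboot
if [ "${running}" = "${available}" ]; then
    echo "OK - ${msg}"
    exit 0
else
    echo "WARNING - ${msg}. Reboot needed."
    exit 1
fi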

Prerequisites:

  • iucode-tool (which is a dependency of intel-microcode)
  • the cpuid kernel module needs to be loaded
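
The two prerequisites above could be verified before the comparison runs, so that a missing tool or module is not mistaken for a pending reboot. The following is only a sketch: the function names module_loaded and check_prereqs, the optional modules-file argument and the use of exit status 3 for an "unknown" result are assumptions made for illustration, and a cpuid driver built into the kernel (rather than loaded as a module) would not show up in /proc/modules.

```shell
#!/bin/bash
# Sketch only: module_loaded/check_prereqs are illustrative names, not part
# of the script above or of any existing tooling.

# Succeeds if the named kernel module appears in the given modules file.
# Defaults to /proc/modules, whose first field is the module name.
module_loaded() {
    local module="$1" modules_file="${2:-/proc/modules}"
    awk -v m="${module}" '$1 == m { found = 1 } END { exit !found }' "${modules_file}"
}

# Prints one line per missing prerequisite and returns 3 if any is missing
# (assumption: 3 is used here to mean "unknown", distinct from 0 and 1).
check_prereqs() {
    local rc=0
    if ! command -v iucode-tool > /dev/null 2>&1; then
        echo "UNKNOWN - iucode-tool not found in PATH"
        rc=3
    fi
    if ! module_loaded cpuid; then
        echo "UNKNOWN - cpuid kernel module not loaded"
        rc=3
    fi
    return "${rc}"
}
```

Calling `check_prereqs || exit $?` at the top of the check would then stop early with a distinct status instead of reporting a spurious mismatch.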

The following cache hosts have been running with updated microcodes for the past two days:

Host    CPU                       Microcode version on die  Updated microcode version
cp2001  CPU E5-2680 v3 @ 2.50GHz  0x2d                      0x3c
cp5001  CPU E5-2650 v4 @ 2.20GHz  0xb000021                 0xb00002c
cp3007  CPU E5-2640 0 @ 2.50GHz   0x70a                     0x713
cp1045  CPU E5-2680 0 @ 2.70GHz   0x70d                     0x713

I've checked the output of lscpu and none of the above exhibit the low-frequency issues mentioned here: https://phabricator.wikimedia.org/rOPUP0fccee1f88fb312069f94fa634cb2d3b8205651c
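
For checking more hosts, the manual lscpu inspection could be approximated with a small filter. This is a rough sketch under stated assumptions, not the check that was actually performed: it assumes the low-frequency problem would show up as a "CPU max MHz" value below the nominal clock given in the "Model name" line (e.g. "@ 2.50GHz"), that lscpu prints both fields on the host, and that the output uses a "." decimal separator (run it with LC_ALL=C). The function name check_cpu_freq and the exit statuses are illustrative.

```shell
#!/bin/bash
# Sketch only: check_cpu_freq is an illustrative name. Reads lscpu output on
# stdin and compares "CPU max MHz" with the nominal clock in "Model name".
check_cpu_freq() {
    awk -F': *' '
        /^Model name/ {
            # Nominal clock from a suffix such as "@ 2.50GHz", converted to MHz.
            if (match($2, /@ *[0-9.]+GHz/)) {
                s = substr($2, RSTART, RLENGTH)
                gsub(/[^0-9.]/, "", s)
                nominal = int(s * 1000 + 0.5)
            }
        }
        /^CPU max MHz/ { max = int($2 + 0.5) }
        END {
            if (!nominal || !max) {
                print "UNKNOWN - could not parse lscpu output"
                exit 3
            }
            if (max < nominal) {
                printf "WARNING - max %d MHz below nominal %d MHz\n", max, nominal
                exit 1
            }
            printf "OK - max %d MHz, nominal %d MHz\n", max, nominal
        }'
}
```

Usage would be `LC_ALL=C lscpu | check_cpu_freq`. The current "CPU MHz" value is deliberately ignored, since an idle CPU may legitimately clock down.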

Change 436490 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache hosts: enable microcode updates

https://gerrit.wikimedia.org/r/436490

Change 436271 merged by Muehlenhoff:
[operations/puppet@production] Enable microcode updates for all mediawiki servers

https://gerrit.wikimedia.org/r/436271

Change 436490 merged by Ema:
[operations/puppet@production] cache hosts: enable microcode updates

https://gerrit.wikimedia.org/r/436490

Change 436553 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] prometheus: export intel-microcode information via node_exporter

https://gerrit.wikimedia.org/r/436553

Change 437206 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Enable microcode for Hadoop cluster

https://gerrit.wikimedia.org/r/437206

Change 437206 merged by Muehlenhoff:
[operations/puppet@production] Enable microcode for Hadoop cluster

https://gerrit.wikimedia.org/r/437206

Change 437430 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Enable microcode updates for Parsoid hosts

https://gerrit.wikimedia.org/r/437430

Change 437430 merged by Muehlenhoff:
[operations/puppet@production] Enable microcode updates for Parsoid hosts

https://gerrit.wikimedia.org/r/437430

Change 437464 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Move microcode hiera call to profile::base

https://gerrit.wikimedia.org/r/437464

Change 437464 merged by Ema:
[operations/puppet@production] Move microcode hiera call to profile::base

https://gerrit.wikimedia.org/r/437464

Change 436553 merged by Ema:
[operations/puppet@production] prometheus: export intel-microcode information via node_exporter

https://gerrit.wikimedia.org/r/436553

Change 442269 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Enable microcode for all database roles

https://gerrit.wikimedia.org/r/442269

Change 442269 merged by Muehlenhoff:
[operations/puppet@production] Enable microcode for all database roles

https://gerrit.wikimedia.org/r/442269

Change 443038 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Enable microcode updates for all elasticsearch hosts

https://gerrit.wikimedia.org/r/443038

Change 443038 merged by Muehlenhoff:
[operations/puppet@production] Enable microcode updates for all elasticsearch hosts

https://gerrit.wikimedia.org/r/443038

Change 443944 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Enable microcode updates for CI masters

https://gerrit.wikimedia.org/r/443944

Change 443944 merged by Muehlenhoff:
[operations/puppet@production] Enable microcode updates for CI masters

https://gerrit.wikimedia.org/r/443944

Mentioned in SAL (#wikimedia-operations) [2018-07-09T07:40:36Z] <ema> reboot lvs canaries for microcode updates: lvs4007 lvs3004 lvs2006 lvs2010 lvs1006 T127825

Change 445573 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Enable microcode on restbase servers

https://gerrit.wikimedia.org/r/445573

Change 445576 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Enable microcode for Swift backend servers

https://gerrit.wikimedia.org/r/445576

Change 445573 merged by Muehlenhoff:
[operations/puppet@production] Enable microcode on restbase servers

https://gerrit.wikimedia.org/r/445573

Change 445576 merged by Muehlenhoff:
[operations/puppet@production] Enable microcode for Swift backend servers

https://gerrit.wikimedia.org/r/445576

Change 451646 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Enable microcode on spare hosts

https://gerrit.wikimedia.org/r/451646

Change 451646 merged by Muehlenhoff:
[operations/puppet@production] Enable microcode on spare hosts

https://gerrit.wikimedia.org/r/451646

Change 451830 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Enable microcode for WMCS puppet masters

https://gerrit.wikimedia.org/r/451830

Change 451830 merged by Muehlenhoff:
[operations/puppet@production] Enable microcode for WMCS puppet masters

https://gerrit.wikimedia.org/r/451830

Change 451858 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Enable microcode for LVS load balancers

https://gerrit.wikimedia.org/r/451858

Change 451858 merged by Muehlenhoff:
[operations/puppet@production] Enable microcode for LVS load balancers

https://gerrit.wikimedia.org/r/451858

Change 453997 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Enable intel-microcode for all bare metal servers with an Intel CPU

https://gerrit.wikimedia.org/r/453997

Change 453997 merged by Muehlenhoff:
[operations/puppet@production] Enable intel-microcode for all bare metal servers with an Intel CPU

https://gerrit.wikimedia.org/r/453997

Change 454203 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Remove enable_microcode logic

https://gerrit.wikimedia.org/r/454203

Change 312714 abandoned by Matanya:
standard packages: re-add intel-microcode

Reason:
Per comment

https://gerrit.wikimedia.org/r/312714

Change 454203 merged by Muehlenhoff:
[operations/puppet@production] Remove enable_microcode logic

https://gerrit.wikimedia.org/r/454203

Microcode is now enabled on all bare metal servers with an Intel CPU and we haven't seen any issues so far. Closing the task.