
Cross fleet runc upgrades
Closed, ResolvedPublic

Description

https://security-tracker.debian.org/tracker/CVE-2024-21626

The link above has the details, but the gist is that runc versions prior to

  • 1.0.0~rc93+ds1-5+deb11u3 for bullseye
  • 1.1.5+ds1-1+deb12u1 for bookworm
  • 1.0.0~rc6+dfsg1-3+deb10u3 for buster

are susceptible to container escape attacks.

This task is to coordinate deploying the fixes.

Since runc is a CLI tool without a long-running daemon, we don't need to restart any of the container runtime daemons (e.g. containerd/docker). However, it's prudent to force a restart of all containers/pods in our environments so that we don't end up with forgotten containers that remain vulnerable.
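For reference, a minimal sketch (untested, assuming the stock Debian runc package and lsb-release being installed) of how a single host could be checked against the fixed versions listed above:

#!/bin/bash
# Sketch only: report whether the locally installed runc is older than the
# fixed version for this Debian release. The version strings are the ones
# from the task description; everything else is illustrative.
set -euo pipefail

codename=$(lsb_release -sc)   # e.g. buster, bullseye, bookworm
case "$codename" in
    buster)   fixed='1.0.0~rc6+dfsg1-3+deb10u3' ;;
    bullseye) fixed='1.0.0~rc93+ds1-5+deb11u3' ;;
    bookworm) fixed='1.1.5+ds1-1+deb12u1' ;;
    *) echo "unknown release: $codename" >&2; exit 1 ;;
esac

installed=$(dpkg-query -W -f='${Version}' runc)
if dpkg --compare-versions "$installed" lt "$fixed"; then
    echo "VULNERABLE: runc $installed < $fixed"
else
    echo "OK: runc $installed"
fi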

Event Timeline

akosiaris changed the visibility from "Public (No Login Required)" to "Custom Policy".Feb 5 2024, 5:25 PM
akosiaris changed the edit policy from "All Users" to "Subscribers".

https://security-tracker.debian.org/tracker/CVE-2024-21626 got updated; we now have runc updates for bullseye and buster.

I've already tested the newer runc on staging-eqiad and everything appears to be OK.

runc was updated: 1.0.0~rc93+ds1-5+deb11u2 -> 1.0.0~rc93+ds1-5+deb11u3

kubernetes[2005-2060].codfw.wmnet,kubernetes[1005-1062].eqiad.wmnet,mw[2260,2267,2291-2297,2355,2357,2366,2368,2370,2381,2395,2420-2425,2427,2429-2431,2434-2437,2440,2442-2443,2445-2451].codfw.wmnet,mw[1360-1363,1374-1383,1419,1423-1425,1439-1440,1457,1459-1466,1469-1475,1482,1486,1488,1495-1496].eqiad.wmnet (195 hosts)
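For spot-checking the fleet, something along these lines with cumin should do (a sketch; the host expressions are just the worker ranges quoted above):

sudo cumin 'kubernetes[1005-1062].eqiad.wmnet,kubernetes[2005-2060].codfw.wmnet' \
    "dpkg-query -W -f='\${Version}\n' runc"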

We have 6 hosts that are for some reason still on buster and need to be reimaged:

mw[2318-2319,2350,2352,2354,2356].codfw.wmnet

I'll file a separate task for those. They aren't proper nodes in the codfw cluster right now (they don't even appear in the API).
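A quick way to confirm that from a control plane host (hypothetical spot-check; the names are the hosts listed above):

for host in mw2318 mw2319 mw2350 mw2352 mw2354 mw2356; do
    kubectl get node "${host}.codfw.wmnet"
done
# expected per host, since they are not registered with the API server:
#   Error from server (NotFound): nodes "mw2318.codfw.wmnet" not found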

I've started the rolling restart of all pods in eqiad. For now this is done with this magnificent piece of code (</sarcasm>):

#!/bin/bash

#set -x

# Rolling restart of every pod in the cluster, one node at a time.
NODES=$(/usr/bin/kubectl get nodes -o json | jq -r '.items[].metadata.name')

for node in $NODES
do
    # Cordon the node and evict everything that can be evicted.
    /usr/bin/kubectl drain --ignore-daemonsets --delete-emptydir-data "$node"
    # After the drain, only DaemonSet-managed pods are left on the node;
    # delete them too so they get recreated with the fixed runc.
    daemonsets=$(/usr/bin/kubectl get pods -A -o json --field-selector "spec.nodeName=${node}" | jq -r -c '.items[] | .metadata.namespace + "+" + .metadata.name')
    for daemonset in $daemonsets
    do
        namespace=$(echo "$daemonset" | awk -F "+" '{print $1}')
        pod=$(echo "$daemonset" | awk -F "+" '{print $2}')
        /usr/bin/kubectl -n "${namespace}" delete pods "$pod"
    done
    /usr/bin/kubectl uncordon "$node"
    # Give the node time to settle before moving on to the next one.
    sleep 300
done

which arguably should be transformed into a cookbook.
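If it does become a cookbook, one obvious refinement (just a sketch, untested) would be to replace the fixed sleep with an explicit readiness check on the node before moving on:

/usr/bin/kubectl uncordon "$node"
# wait until the node reports Ready again instead of sleeping blindly
/usr/bin/kubectl wait --for=condition=Ready "node/${node}" --timeout=300s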

Rolling restart started in codfw (plus a repeat run on the staging clusters, just to make sure my code didn't miss something yesterday).

We have 6 hosts that are for some reason still on buster and need to be reimaged:

mw[2318-2319,2350,2352,2354,2356].codfw.wmnet

Those are failing to be reimaged and tracked in T356709

akosiaris renamed this task from WikiKube runc upgrades to Cross fleet runc upgrades.Feb 6 2024, 10:46 AM
akosiaris updated the task description. (Show Details)

Summary for non-wikikube services:

  • Toolforge is tracked in https://phabricator.wikimedia.org/T356507
  • The DSE and aux k8s clusters are already updated; ml-serve is a work in progress.
  • The gitlab runners are also upgraded
  • releases* is upgraded as well

Staging clusters done; the production clusters are proceeding at a rate of ~10m per node (I deliberately put a big sleep 300 after each node to let the situation settle).

Summary for non-wikikube services:

I noticed the runc upgrade; to be on the safe side, I've roll-restarted the pods in the dse and aux-k8s clusters. I've left ml-serve alone, though.

  • The gitlab runners are also upgraded
  • releases* is upgraded as well

Cool, thanks for the update.

The rest of ml-serve is now upgraded as well; Tobias will drain/undrain the nodes.

This leaves the buster hosts: alert (in the process of being reimaged to bookworm currently), cloudweb and the deployment servers. I'll update the task when a Buster/LTS update is out.

All of the wikikube production clusters are done as well. Adding @klausman as an FYI regarding the state of the rest of the clusters and the low-quality, ready-to-use code used to do the rolling restarts.

Adding @CDanis, @jhathaway and @fgiunchedi as an FYI for the aux-k8s cluster.

Roll-restart of the staging ML cluster is done, eqiad and codfw prod clusters today and tomorrow.

ml-serve in codfw also done, so all done for ML team

ml-serve in codfw also done, so all done for ML team

Cool, thanks! I'll resolve this one then.

akosiaris changed the visibility from "Custom Policy" to "Public (No Login Required)".Feb 8 2024, 2:29 PM
akosiaris changed the edit policy from "Subscribers" to "All Users".

Marking this as public now that we are done patching.

Mentioned in SAL (#wikimedia-operations) [2024-02-09T17:18:28Z] <cdanis> rolling restart of pods on k8s aux eqiad T356661

All pods on k8s-aux-eqiad restarted, thanks @akosiaris for the script.

Adding @rzl, @Scott_French and @Volans per the recent discussion on spicerack/cumin training/onboarding. The 15-line candidate patch is at T356661#9516327. T277677 might also have some useful information.

This leaves the buster hosts: alert (in the process of being reimaged to bookworm currently), cloudweb and the deployment servers. I'll update the task when a Buster/LTS update is out.

The LTS update is now also out: the alert* and deployment hosts are now upgraded as well (cloudweb got reimaged to bullseye in the meantime).