
Expand the Lift Wing workers' kubelet partition
Closed, ResolvedPublic

Description

We learned that the models pulled by the storage-initializer to /mnt/models are using space on the kubelet's disk partition (via k8s emptyDirs). The partition is currently small (~40G), so we should expand it on all nodes since we have some space left on the LVM physical volume.

sudo lvextend -L+80g /dev/mapper/vg0-kubelet
sudo resize2fs /dev/mapper/vg0-kubelet
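
For reference, a quick way to confirm on a node where the emptyDir-backed model data ends up (a sketch; it assumes the default kubelet root directory /var/lib/kubelet):

sudo du -sh /var/lib/kubelet/pods/*/volumes/kubernetes.io~empty-dir/*   # per-pod emptyDir usage
df -h /var/lib/kubelet                                                  # overall partition usage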

We have enough space (with ml-serve1001 already done of course):

elukey@cumin1001:~$ sudo cumin 'ml-serve[1,2]*' 'pvs'
16 hosts will be targeted:
ml-serve[2001-2008].codfw.wmnet,ml-serve[1001-1008].eqiad.wmnet
OK to proceed on 16 hosts? Enter the number of affected hosts to confirm or "q" to quit: 16
===== NODE GROUP =====                                                                                                                      
(1) ml-serve1001.eqiad.wmnet                                                                                                                
----- OUTPUT of 'pvs' -----                                                                                                                 
  PV         VG  Fmt  Attr PSize   PFree                                                                                                    
  /dev/md0   vg0 lvm2 a--  446.72g 9.35g                                                                                                    
===== NODE GROUP =====                                                                                                                      
(4) ml-serve[1005-1008].eqiad.wmnet                                                                                                         
----- OUTPUT of 'pvs' -----                                                                                                                 
  PV         VG  Fmt  Attr PSize   PFree                                                                                                    
  /dev/md0   vg0 lvm2 a--  446.21g 89.25g                                                                                                   
===== NODE GROUP =====                                                                                                                      
(11) ml-serve[2001-2008].codfw.wmnet,ml-serve[1002-1004].eqiad.wmnet                                                                        
----- OUTPUT of 'pvs' -----                                                                                                                 
  PV         VG  Fmt  Attr PSize   PFree                                                                                                    
  /dev/md0   vg0 lvm2 a--  446.72g 89.35g

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2023-06-27T13:32:03Z] <elukey> expand ml-staging200[12] kubelet partitions - T339231

Icinga downtime and Alertmanager silence (ID=3c9cc021-58b9-4756-9cf7-4880a033e42a) set by elukey@cumin1001 for 0:30:00 on 1 host(s) and their services with reason: Expand the kubelet disk partition

ml-serve2001.codfw.wmnet

Just did ml-serve2001, but of course the resize had to be done online, since most of our pods store data on the kubelet partition (like the model binary).

We should drain the nodes first as a precaution.

Mentioned in SAL (#wikimedia-operations) [2023-08-11T08:32:09Z] <elukey> expand kubelet partition on ml-serve2001 - T339231

(copied from T343900; this ticket is the more appropriate place for this info)

I have done ml-serve2002 and ml-serve2003 today (doing two machines forces some pods back onto 2002, so we can see that it works properly). So far, everything seems fine.

Steps I took:

On deployXXXX, useful things to run in watch (after kube_env ml-serve-XXXX); a concrete example follows this list:

  • See node status: kubectl get nodes
  • When waiting for drain to complete: kubectl get pods -A --field-selector spec.nodeName=ml-serveXXXX.codfw.wmnet
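
A concrete invocation might look like this (the 5-second interval is arbitrary):

watch -n 5 'kubectl get pods -A --field-selector spec.nodeName=ml-serveXXXX.codfw.wmnet'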

Steps for drain/resize/undrain:

On ml-serveXXXX, verify you're dealing with a machine that still has a small partition (<30G): df -h /var/lib/kubelet/

On deployXXXX:

  • Drain node: kubectl drain --ignore-daemonsets --delete-emptydir-data ml-serveXXXX.codfw.wmnet
  • Wait until only daemonsets are running (istio etc.); see the get pods command above and the sketch after this list.
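
A sketch for checking that only DaemonSet-owned pods remain on the drained node (the custom-columns spec is just one convenient way to surface the owner kind):

kubectl get pods -A --field-selector spec.nodeName=ml-serveXXXX.codfw.wmnet \
  -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,OWNER:.metadata.ownerReferences[0].kind'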

On ml-serveXXXX:

  • Optionally, watch the number of docker containers (it will not reach 0 due to daemonsets): watch 'sudo docker ps | wc -l'
  • Stop kubelet, resize partition and filesystem (does not need to be unmounted):
sudo systemctl stop kubelet
sudo lvextend -L+80g /dev/mapper/vg0-kubelet
sudo resize2fs /dev/mapper/vg0-kubelet
  • Verify increased size:
$ df -h /var/lib/kubelet
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/vg0-kubelet  107G  480K  102G   1% /var/lib/kubelet
  • Start kubelet: sudo systemctl start kubelet.service
  • Check kubelet logs: sudo journalctl -xefu kubelet.service. There must be no ExecSync errors! Keep watching these logs while doing the second machine, to see whether errors appear as pods are moved (optional cross-checks are sketched after this list).
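
Optional cross-checks after the resize and kubelet restart (a sketch; VG/LV names as in the lvextend command above, and the journal filter is just a convenience):

sudo lvs vg0/kubelet                                                          # LSize should reflect the +80g extension
sudo journalctl -u kubelet.service --since "15 min ago" | grep -i execsync    # should print nothing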

On deployXXXX: Undrain/uncordon: kubectl uncordon ml-serveXXXX.codfw.wmnet

If you do two machines one after the other, the first machine will receive new pods during the drain of the second; you can use this to make sure everything is working correctly.
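
For example, once the first machine is uncordoned and the second one is being drained, something like this (same placeholders as above) confirms that pods are coming back:

kubectl get nodes ml-serveXXXX.codfw.wmnet                                      # STATUS should be Ready again, not Ready,SchedulingDisabled
kubectl get pods -A --field-selector spec.nodeName=ml-serveXXXX.codfw.wmnet     # evicted pods should start reappearing here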

Machines ml-serve2001-2006 are now done. Zero errors or irregularities. Will do 7 and 8 later this week.

2007 and 2008 are now also done, again without problems.

I will leave eqiad to Luca next week.

The reboots for codfw from T344587 I will do tomorrow. I had meant to coordinate them with this work, but I think it's better to do them separately, so that any problems don't have too many probable causes.

Mentioned in SAL (#wikimedia-operations) [2023-09-20T08:40:17Z] <klausman> Draining ml-serve1002 for kubelet partition increase (T339231)

Mentioned in SAL (#wikimedia-operations) [2023-09-20T08:47:50Z] <klausman> Draining ml-serve1003 for kubelet partition increase (T339231)

Mentioned in SAL (#wikimedia-operations) [2023-09-20T08:57:03Z] <klausman> Draining ml-serve1004 for kubelet partition increase (T339231)

Mentioned in SAL (#wikimedia-operations) [2023-09-20T09:06:00Z] <klausman> Draining ml-serve1005 for kubelet partition increase (T339231)

Mentioned in SAL (#wikimedia-operations) [2023-09-20T09:15:38Z] <klausman> Draining ml-serve1006 for kubelet partition increase (T339231)

Mentioned in SAL (#wikimedia-operations) [2023-09-20T09:24:00Z] <klausman> Draining ml-serve1007 for kubelet partition increase (T339231)

Mentioned in SAL (#wikimedia-operations) [2023-09-20T09:29:23Z] <klausman> Draining ml-serve1008 for kubelet partition increase (T339231)

klausman moved this task from Backlog/SRE to Complete Q3 2022/23 on the Machine-Learning-Team board.

I've done 1002-1008 today, and everything went smoothly. All done!