
Expand the Lift Wing workers' kubelet partition
Closed, ResolvedPublic

Description

We learned that the models pulled by the storage-initializer to /mnt/models are using space on the kubelet's disk partition (via k8s emptyDirs). The partition is currently small (~40G), so we should expand it on all nodes since we have some space left on the LVM physical volume.

sudo lvextend -L+80g /dev/mapper/vg0-kubelet
sudo resize2fs /dev/mapper/vg0-kubelet
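
For reference, a quick way to confirm on a node where the emptyDir-backed model data ends up (a sketch; it assumes the default kubelet root directory /var/lib/kubelet):

sudo du -sh /var/lib/kubelet/pods/*/volumes/kubernetes.io~empty-dir/*   # per-pod emptyDir usage
df -h /var/lib/kubelet                                                  # overall partition usage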

We have enough space (with ml-serve1001 already done of course):

elukey@cumin1001:~$ sudo cumin 'ml-serve[1,2]*' 'pvs'
16 hosts will be targeted:
ml-serve[2001-2008].codfw.wmnet,ml-serve[1001-1008].eqiad.wmnet
OK to proceed on 16 hosts? Enter the number of affected hosts to confirm or "q" to quit: 16
===== NODE GROUP =====                                                                                                                      
(1) ml-serve1001.eqiad.wmnet                                                                                                                
----- OUTPUT of 'pvs' -----                                                                                                                 
  PV         VG  Fmt  Attr PSize   PFree                                                                                                    
  /dev/md0   vg0 lvm2 a--  446.72g 9.35g                                                                                                    
===== NODE GROUP =====                                                                                                                      
(4) ml-serve[1005-1008].eqiad.wmnet                                                                                                         
----- OUTPUT of 'pvs' -----                                                                                                                 
  PV         VG  Fmt  Attr PSize   PFree                                                                                                    
  /dev/md0   vg0 lvm2 a--  446.21g 89.25g                                                                                                   
===== NODE GROUP =====                                                                                                                      
(11) ml-serve[2001-2008].codfw.wmnet,ml-serve[1002-1004].eqiad.wmnet                                                                        
----- OUTPUT of 'pvs' -----                                                                                                                 
  PV         VG  Fmt  Attr PSize   PFree                                                                                                    
  /dev/md0   vg0 lvm2 a--  446.72g 89.35g

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2023-06-27T13:32:03Z] <elukey> expand ml-staging200[12] kubelet partitions - T339231

Icinga downtime and Alertmanager silence (ID=3c9cc021-58b9-4756-9cf7-4880a033e42a) set by elukey@cumin1001 for 0:30:00 on 1 host(s) and their services with reason: Expand the kubelet disk partition

ml-serve2001.codfw.wmnet

Just did ml-serve2001, but of course the resize had to be done online, since most of our pods store data on the kubelet partition (like the model binary).

We should drain the nodes first as a precaution.

Mentioned in SAL (#wikimedia-operations) [2023-08-11T08:32:09Z] <elukey> expand kubelet partition on ml-serve2001 - T339231

(copied from T343900; this ticket is the more appropriate place for this info)

I have done ml-serve2002 and ml-serve2003 today (doing two machines forces some pods back onto 2002, so we can see that it works properly). So far, everything seems fine.

Steps I took:

On deployXXXX, useful things to run in watch (after kube_env ml-serve-XXXX); a concrete example follows this list:

  • See node status: kubectl get nodes
  • When waiting for drain to complete: kubectl get pods -A --field-selector spec.nodeName=ml-serveXXXX.codfw.wmnet
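
A concrete invocation might look like this (the 5-second interval is arbitrary):

watch -n 5 'kubectl get pods -A --field-selector spec.nodeName=ml-serveXXXX.codfw.wmnet'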

Steps for drain/resize/undrain:

On ml-serveXXXX, verify you're dealing with a machine that still has a small partition (<30G): df -h /var/lib/kubelet/

On deployXXXX:

  • Drain node: kubectl drain --ignore-daemonsets --delete-emptydir-data ml-serveXXXX.codfw.wmnet
  • Wait until only daemonsets are running (istio etc.); see the get pods command above and the sketch after this list.
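
A sketch for checking that only DaemonSet-owned pods remain on the drained node (the custom-columns spec is just one convenient way to surface the owner kind):

kubectl get pods -A --field-selector spec.nodeName=ml-serveXXXX.codfw.wmnet \
  -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,OWNER:.metadata.ownerReferences[0].kind'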

On ml-serveXXXX:

  • Optionally, watch the number of docker containers (it will not reach 0 due to daemonsets): watch 'sudo docker ps | wc -l'
  • Stop kubelet, resize partition and filesystem (does not need to be unmounted):
sudo systemctl stop kubelet
sudo lvextend -L+80g /dev/mapper/vg0-kubelet
sudo resize2fs /dev/mapper/vg0-kubelet
  • Verify increased size:
$ df -h /var/lib/kubelet
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/vg0-kubelet  107G  480K  102G   1% /var/lib/kubelet
  • Start kubelet: sudo systemctl start kubelet.service
  • Check kubelet logs: sudo journalctl -xefu kubelet.service. There must be no ExecSync errors! Keep watching these logs while doing the second machine, to see whether errors appear as pods are moved (optional cross-checks are sketched after this list).
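
Optional cross-checks after the resize and kubelet restart (a sketch; VG/LV names as in the lvextend command above, and the journal filter is just a convenience):

sudo lvs vg0/kubelet                                                          # LSize should reflect the +80g extension
sudo journalctl -u kubelet.service --since "15 min ago" | grep -i execsync    # should print nothing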

On deployXXXX: Undrain/uncordon: kubectl uncordon ml-serveXXXX.codfw.wmnet

If you do two machines one after the other, the first machine will receive new pods during the drain of the second; you can use this to make sure everything is working correctly.
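
For example, once the first machine is uncordoned and the second one is being drained, something like this (same placeholders as above) confirms that pods are coming back:

kubectl get nodes ml-serveXXXX.codfw.wmnet                                      # STATUS should be Ready again, not Ready,SchedulingDisabled
kubectl get pods -A --field-selector spec.nodeName=ml-serveXXXX.codfw.wmnet     # evicted pods should start reappearing here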

Machines ml-serve2001-2006 are now done. Zero errors or irregularities. Will do 7 and 8 later this week.

2007 and 2008 are now also done, again without problems.

I will leave eqiad to Luca next week.

The reboots for codfw from T344587 I will do tomorrow. I had meant to coordinate them with this work, but I think it's better to do them separately, so that any problems don't have too many probable causes.

Mentioned in SAL (#wikimedia-operations) [2023-09-20T08:40:17Z] <klausman> Draining ml-serve1002 for kubelet partition increase (T339231)

Mentioned in SAL (#wikimedia-operations) [2023-09-20T08:47:50Z] <klausman> Draining ml-serve1003 for kubelet partition increase (T339231)

Mentioned in SAL (#wikimedia-operations) [2023-09-20T08:57:03Z] <klausman> Draining ml-serve1004 for kubelet partition increase (T339231)

Mentioned in SAL (#wikimedia-operations) [2023-09-20T09:06:00Z] <klausman> Draining ml-serve1005 for kubelet partition increase (T339231)

Mentioned in SAL (#wikimedia-operations) [2023-09-20T09:15:38Z] <klausman> Draining ml-serve1006 for kubelet partition increase (T339231)

Mentioned in SAL (#wikimedia-operations) [2023-09-20T09:24:00Z] <klausman> Draining ml-serve1007 for kubelet partition increase (T339231)

Mentioned in SAL (#wikimedia-operations) [2023-09-20T09:29:23Z] <klausman> Draining ml-serve1008 for kubelet partition increase (T339231)

klausman moved this task from Backlog/SRE to Complete Q3 2022/23 on the Machine-Learning-Team board.

I've done 1002-1008 today, and everything went smoothly. All done!