
Update GPU labels in Hadoop's Yarn config
Closed, Resolved, Public

Description

Hi folks,

We currently have only two nodes with a GPU on Hadoop, but the corresponding Yarn label is still assigned to six nodes:

elukey@an-master1003:~$ for host in 1096 1097 1098 1099 1100 1101; do echo "an-worker${host}"; sudo -u yarn kerberos-run-command yarn yarn node -status an-worker$host.eqiad.wmnet:8041 2>&1| grep Labels; done
an-worker1096
	Node-Labels : GPU
an-worker1097
	Node-Labels : GPU
an-worker1098
	Node-Labels : GPU
an-worker1099
	Node-Labels : GPU
an-worker1100
	Node-Labels : GPU
an-worker1101
	Node-Labels : GPU

To fix this, I would run:

sudo -u yarn kerberos-run-command yarn yarn rmadmin -replaceLabelsOnNode "an-worker1096.eqiad.wmnet="
sudo -u yarn kerberos-run-command yarn yarn rmadmin -replaceLabelsOnNode "an-worker1097.eqiad.wmnet="
sudo -u yarn kerberos-run-command yarn yarn rmadmin -replaceLabelsOnNode "an-worker1098.eqiad.wmnet="
sudo -u yarn kerberos-run-command yarn yarn rmadmin -replaceLabelsOnNode "an-worker1099.eqiad.wmnet="
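
The four commands above differ only in the host number, so they could also be written as a loop. This is a minimal sketch, not what was actually run: the `clear_labels` function and the `DRY_RUN` switch are hypothetical names introduced here for illustration, and the sketch assumes the same `an-worker<N>.eqiad.wmnet` naming as above. With `DRY_RUN=1` (the default in this sketch) it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: clear all Yarn node labels on a list of an-worker hosts.
# DRY_RUN and clear_labels are illustrative names, not existing tooling.
set -e

# Default to printing the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

clear_labels() {
  for host in "$@"; do
    node="an-worker${host}.eqiad.wmnet"
    if [ "$DRY_RUN" = "1" ]; then
      # Print the exact command that would be executed.
      echo "sudo -u yarn kerberos-run-command yarn yarn rmadmin -replaceLabelsOnNode \"${node}=\""
    else
      # "<node>=" with nothing after "=" replaces the node's labels with an empty set.
      sudo -u yarn kerberos-run-command yarn yarn rmadmin -replaceLabelsOnNode "${node}="
    fi
  done
}

clear_labels 1096 1097 1098 1099
```

Setting `DRY_RUN=0` in the environment would make the same script execute the commands instead of printing them.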

The ML team is currently testing training models on Hadoop with GPUs :)

Event Timeline

That's excellent, please feel free to proceed @elukey. I had forgotten to remove them.

elukey claimed this task.

Commands executed, new status:

an-worker1096
	Node-Labels : 
an-worker1097
	Node-Labels : 
an-worker1098
	Node-Labels : 
an-worker1099
	Node-Labels : 
an-worker1100
	Node-Labels : GPU
an-worker1101
	Node-Labels : GPU
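
To check a status listing like the one above without reading it line by line, the output can be filtered down to the hosts that still carry the GPU label. This is only a sketch: `gpu_labelled_hosts` is a hypothetical helper name, and it assumes the input has exactly the shape shown above (a host name line followed by a `Node-Labels :` line).

```shell
#!/bin/sh
# Sketch: read "host / Node-Labels" pairs on stdin and print the hosts
# whose label is GPU. gpu_labelled_hosts is an illustrative name.
gpu_labelled_hosts() {
  awk '
    # Remember the most recent host name line.
    /^an-worker/ { host = $1; next }
    # A label line whose last field is GPU belongs to that host.
    /Node-Labels/ && $NF == "GPU" { print host }
  '
}

# Example with two entries copied from the status above.
printf 'an-worker1099\n\tNode-Labels : \nan-worker1100\n\tNode-Labels : GPU\n' | gpu_labelled_hosts
```

The output of the status loop from the description could be piped into `gpu_labelled_hosts` in the same way; after the change it should list only the two hosts that actually have a GPU.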

Mentioned in SAL (#wikimedia-analytics) [2024-03-28T15:00:47Z] <elukey> remove GPU labels in Hadoop Yarn for an-worker[1096-1099] (the hosts don't have a GPU anymore) - T361225