Review the Yarn Capacity scheduler and see if we can move to it
Closed, ResolvedPublic

Assigned To

Authored By

	elukey
	Mar 10 2021, 4:36 PM

Description

There are two main reasons:

more granular limits for users to avoid hammering the cluster
apply labels to GPU nodes

Current config for fair-scheduler:

elukey@an-master1001:~$ cat /etc/hadoop/conf/fair-scheduler.xml 
<?xml version="1.0"?>
<allocations>

  <queue name="nice">
    <!--
    The nice queue is for big long running jobs that don't need to finish
    fast. Having this queue helps smaller requests to finish faster.
     -->
    <weight>1.0</weight>
    <maxRunningApps>50</maxRunningApps>
    <schedulingMode>fair</schedulingMode>
  </queue>

  <queue name="sequential">
    <!--
      Applications submitted to this queue will be run sequentially. This
      is for heavy jobs that might be automatically scheduled concurrently
      and are not concerned with timeliness.
    -->
    <weight>1.0</weight>
    <maxRunningApps>1</maxRunningApps>
    <schedulingMode>fifo</schedulingMode>
  </queue>

  <queue name="default">
    <weight>2.0</weight>
    <maxRunningApps>50</maxRunningApps>
    <schedulingMode>fair</schedulingMode>
  </queue>

  <queue name="priority">
    <!--
    The priority queue is for non-adhoc jobs that should get some priority.
    This queue has a higher weight than default, but will never preempt.
     -->
    <weight>10.0</weight>
    <maxRunningApps>50</maxRunningApps>
    <schedulingMode>fair</schedulingMode>
  </queue>

  <queue name="production">
    <schedulingMode>fair</schedulingMode>
    <aclSubmitApps>hdfs</aclSubmitApps>

    <!--
    The production queue has a higher priority than default,
    and it will start killing (preempting) jobs in other queues
    if it can't get its minimum share within 10 minutes, and
    fair share within 30 minutes.
     -->
    <weight>10.0</weight>
    <minSharePreemptionTimeout>600</minSharePreemptionTimeout>
    <maxRunningApps>50</maxRunningApps>
    <fairSharePreemptionThreshold>1800</fairSharePreemptionThreshold>
  </queue>

  <!-- essential jobs will aggressively preempt jobs in other queues -->
  <queue name="essential">
    <!--
    Use FIFO for essential queue.  We want jobs submitted
    here to run in sequential order.
    -->
    <schedulingMode>fifo</schedulingMode>
    <aclSubmitApps>hdfs</aclSubmitApps>

    <!--
    The essential queue has a much higher priority than production,
    and it will start killing (preempting) jobs in other queues,
    first after 60 seconds if it can't get its minimum share,
    and then more after 5 minutes if it can't get its fair share.
     -->
    <weight>20.0</weight>
    <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
    <fairSharePreemptionThreshold>300</fairSharePreemptionThreshold>
    <maxRunningApps>50</maxRunningApps>
  </queue>

</allocations>
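
For the migration it can help to pull the per-queue weights and scheduling modes out of this file programmatically. A minimal sketch in Python (standard library only), assuming an abbreviated inline copy of the config above; the helper name `summarize_queues` is hypothetical, and the fallback values for a missing weight or scheduling mode are assumptions based on the Fair Scheduler defaults:

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the fair-scheduler.xml shown above
# (comments, ACLs and preemption settings omitted for brevity).
FAIR_SCHEDULER_XML = """<allocations>
  <queue name="nice">
    <weight>1.0</weight>
    <schedulingMode>fair</schedulingMode>
  </queue>
  <queue name="sequential">
    <weight>1.0</weight>
    <schedulingMode>fifo</schedulingMode>
  </queue>
  <queue name="default">
    <weight>2.0</weight>
    <schedulingMode>fair</schedulingMode>
  </queue>
  <queue name="priority">
    <weight>10.0</weight>
    <schedulingMode>fair</schedulingMode>
  </queue>
  <queue name="production">
    <weight>10.0</weight>
    <schedulingMode>fair</schedulingMode>
  </queue>
  <queue name="essential">
    <weight>20.0</weight>
    <schedulingMode>fifo</schedulingMode>
  </queue>
</allocations>
"""


def summarize_queues(xml_text):
    """Map each top-level queue name to its weight and scheduling mode."""
    root = ET.fromstring(xml_text)
    summary = {}
    for queue in root.findall("queue"):
        summary[queue.get("name")] = {
            # Assumed fallbacks when an element is missing: weight 1.0, mode "fair".
            "weight": float(queue.findtext("weight", default="1.0")),
            "mode": queue.findtext("schedulingMode", default="fair"),
        }
    return summary


queues = summarize_queues(FAIR_SCHEDULER_XML)
for name, info in queues.items():
    print(f"{name:<12} weight={info['weight']:<5} mode={info['mode']}")
```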

Reading: https://blog.cloudera.com/yarn-capacity-scheduler/

Details

Subject	Repo	Branch	Lines +/-
Enable the Yarn Labels for Hadoop Analytics	operations/puppet	production	+3 -0
Enable the Yarn Capacity scheduler for Hadoop Analytics	operations/puppet	production	+15 -12
Add user 'yarn' among the admins in Hadoop test	operations/puppet	production	+3 -1
User the 'yarn' user in the Hadoop test yarn's UI	operations/puppet	production	+5 -4
hadoop: set dr.who as yarn admin in the Test cluster	operations/puppet	production	+4 -1
hadoop: fix the test cluster's yarn queue settings	operations/puppet	production	+1 -3
hadoop: refactor and simplify the Yarn Capacity scheduler's settings	operations/puppet	production	+36 -64
hadoop: enable yarn.admin.acl in Hadoop test	operations/puppet	production	+8 -0
hadoop: add analytics-search keytab on the test client	operations/puppet	production	+4 -0
hadoop: add QueueMetrics for Yarn ResourceManager	operations/puppet	production	+1 -0
hadoop: tune Yarn's capacity scheduler defaults	operations/puppet	production	+23 -1
analytics: fix admin/submit policies for the yarn capacity scheduler	operations/puppet	production	+15 -8
analytics: move druid load jobs to the analytics Yarn queue	operations/puppet	production	+2 -1
analytics: set test refine jobs in a different Yarn queue	operations/puppet	production	+2 -1
camus: switch to the yarn 'ingest' queue for Hadoop test	operations/puppet	production	+3 -1
hadoop: fix Yarn capacity scheduler queue mappings	operations/puppet	production	+3 -1
hadoop: set the Yarn capacity scheduler for the test cluster	operations/puppet	production	+36 -17
hadoop: add a profile to deploy the capacity scheduler's settings	operations/puppet	production	+158 -0

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		elukey	T276791 Configure the Hadoop cluster to use the GPUs available on some workers
		Resolved		elukey	T277062 Review the Yarn Capacity scheduler and see if we can move to it

Event Timeline

elukey created this task. Mar 10 2021, 4:36 PM

elukey updated the task description. Mar 10 2021, 4:41 PM

Video from ApacheCon about the fs2cs tool (https://www.youtube.com/watch?v=kYBKQmBrAgg), which is available from Yarn 3. The tool spins up a FairScheduler instance to do its work, so I think it may be tailored for Hadoop 3, but the video has a nice example of how it works at a high level. Since we have a small setup, I think this is how it would work in our use case:

Fair scheduler config (high level)

nice queue - weight 1
sequential - weight 2
default - weight 2
production - weight 10
essential - weight 20

Capacity scheduler:

queue	min %	max %
nice	2	100
sequential	4	100
default	4	100
production	30	100
essential	60	100

The above is a quick and very rough proportion between weights and min/max memory usable in queues. The 100% max value is the "elasticity" part of capacity, since queues can grow if there is free capacity. So for example, essential doesn't keep its min 60% of resources busy when nothing runs on it; other queues can take that capacity.
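
To see how rough the rounding is, the exact proportional split can be computed from the weights. A minimal sketch in Python, assuming the high-level weights listed above; the helper name `weights_to_min_capacity` is hypothetical:

```python
# Weights as listed in the high-level fair scheduler summary above.
weights = {
    "nice": 1,
    "sequential": 2,
    "default": 2,
    "production": 10,
    "essential": 20,
}


def weights_to_min_capacity(weights):
    """Turn fair-scheduler weights into proportional min capacity percentages."""
    total = sum(weights.values())
    return {queue: 100.0 * weight / total for queue, weight in weights.items()}


shares = weights_to_min_capacity(weights)
for queue, share in shares.items():
    print(f"{queue:<12}{share:6.2f}%")
```

The exact shares are a bit higher for the small queues and a bit lower for production and essential than the rounded 2 / 4 / 4 / 30 / 60 table, which tilts slightly in favour of the two high-priority queues.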

The important bit in capacity scheduler, IIUC, is that at any given level of the queue tree (we only have root -> leaves, so it is simpler), the min capacities must sum to 100.
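
That constraint is easy to check before deploying a config. A minimal sketch in Python, assuming the settings are held as a flat dict of `yarn.scheduler.capacity.*` properties; the function name `check_capacity_sums` is hypothetical, and the sample values are the rough percentages from the table above:

```python
PREFIX = "yarn.scheduler.capacity."


def check_capacity_sums(props, parent="root"):
    """Return (queue_path, total) for every parent whose children do not sum to 100."""
    children = props.get(f"{PREFIX}{parent}.queues")
    if not children:
        return []  # leaf queue, nothing to check
    names = [name.strip() for name in children.split(",")]
    total = sum(float(props[f"{PREFIX}{parent}.{name}.capacity"]) for name in names)
    problems = [] if abs(total - 100.0) < 1e-6 else [(parent, total)]
    for name in names:
        problems += check_capacity_sums(props, f"{parent}.{name}")
    return problems


props = {
    PREFIX + "root.queues": "nice,sequential,default,production,essential",
    PREFIX + "root.nice.capacity": "2",
    PREFIX + "root.sequential.capacity": "4",
    PREFIX + "root.default.capacity": "4",
    PREFIX + "root.production.capacity": "30",
    PREFIX + "root.essential.capacity": "60",
}

print(check_capacity_sums(props))  # an empty list means every level sums to 100
```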

Another important bit, that I can see in https://blog.cloudera.com/yarn-capacity-scheduler/, is the order of resource assignment within a queue. The default is FIFO, so first come first served - if a job keeps requesting resources, the other ones submitted later on will have to wait. Otherwise there is a FAIR ordering, that allows newest apps to get some resources if they need them, even if other jobs are running. We could use FIFO for the sequential queue, and FAIR for the rest.

I would also allow the 100% value only for the production and essential queues, not for the other ones, so that people cannot allocate TBs of memory even if the cluster is temporarily empty (something that doesn't last long since we have a regular amount of jobs running).

Thanks for the nice prep work @elukey :)

"We could use FIFO for the sequential queue, and FAIR for the rest"

"I would also allow the 100% value only for the production and essential queues"

I'm not sure about that one... Elasticity is really great, and since the capacity scheduler allows for resource preemption, I think it'd be fine to let any queue use any amount of resources :)

Some other considerations: With the adoption of Spark for user queries/jobs (instead of Hive), resource consumption patterns on the cluster have changed for user jobs. Under the practice we ask users to follow, a Spark job has a predefined maximum amount of resources it can allocate ((executor-memory + memory-overhead) * max-executors). This is a big difference from Hive, as a Hive job would request the whole cluster/queue resources (we could have set parameters to limit this, but never did). What that means in terms of patterns is that there is no more use for the nice queue, as Spark jobs are nice by construction :)
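
The upper bound for a Spark job follows directly from that formula. A minimal sketch in Python; the function name and the sample numbers (8 GiB executors, 2 GiB overhead, 64 executors) are hypothetical and only illustrate the arithmetic, and the driver's own memory is left out, as in the formula above:

```python
def spark_max_memory_gib(executor_memory_gib, memory_overhead_gib, max_executors):
    """Upper bound on executor memory a job can hold: (memory + overhead) * max executors."""
    return (executor_memory_gib + memory_overhead_gib) * max_executors


# Hypothetical job settings, not taken from any real job on the cluster.
print(spark_max_memory_gib(executor_memory_gib=8, memory_overhead_gib=2, max_executors=64))
```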

Given the radical change of resource allocation management (minimum guaranteed capacity vs priority), I suggest we take advantage of the change to update our queue settings using hierarchical queues. I think having 2 parent queues for production and users jobs allows for easy understanding, and also flexibility through sub-queues.

queue	min %	comment
production	60%	Parent queue of production jobs
production.ingest	50%	Leaf queue for ingestion jobs - We have this to make sure that our ingestion jobs always get the amount of resource they need
production.default	50%	Default leaf queue for production jobs
users	40%	Parent queue of users jobs
users.default	80%	Default leaf queue for user jobs
users.fifo	20%	Leaf queue for user jobs that need to be run in a FIFO way (special case, small minimal amount of resource)

Note: I purposely don't set max %, assuming we could disable it to maximize elasticity.
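
Since a leaf queue's percentage is relative to its parent, the share of the whole cluster guaranteed to each leaf is the product of the two. A minimal sketch in Python, assuming the percentages from the table above; the `absolute_shares` helper and the nested-dict layout are hypothetical:

```python
# parent queue -> (parent min %, {leaf queue -> leaf min % of the parent})
tree = {
    "production": (60, {"ingest": 50, "default": 50}),
    "users": (40, {"default": 80, "fifo": 20}),
}


def absolute_shares(tree):
    """Guaranteed share of the whole cluster for each leaf queue, in percent."""
    shares = {}
    for parent, (parent_pct, leaves) in tree.items():
        for leaf, leaf_pct in leaves.items():
            shares[f"{parent}.{leaf}"] = parent_pct * leaf_pct / 100
    return shares


for queue, share in absolute_shares(tree).items():
    print(f"{queue:<20}{share:5.1f}% of the cluster")
```

With these numbers the two production leaves get 30% each, users.default gets 32% and users.fifo gets 8%.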

Let me know what you think :)

@JAllemandou very nice and clean, I like it, going to prep a puppet code change to start testing it :)

Ottomata edited projects, added Analytics-Clusters; removed Analytics. Mar 11 2021, 6:11 PM

I forgot some settings I think would be interesting for us when reading the capacity-scheduler docs:

To set

yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent - Using this should help improve fairness between users in the user queues
yarn.scheduler.capacity.<queue-path>.user-limit-factor - We should set this value in order for users not to be able to run jobs larger than the upper limit we define.
yarn.scheduler.capacity.queue-mappings - To automatically assign jobs to queues based on user running the job
yarn.scheduler.capacity.queue-mappings-override.enable - To allow for users to manually specify the queue for their jobs
yarn.scheduler.capacity.<queue-path>.maximum-application-lifetime - To force long-running applications to be killed (for user queues only, or we might be willing to have a dedicated queue for never-ending jobs)
yarn.resourcemanager.scheduler.monitor.enable and yarn.resourcemanager.scheduler.monitor.policies - To enable resource preemption. I think we should use default values for preemption settings, and tweak if we have issues.
yarn.scheduler.capacity.<queue-path>.disable_preemption - To disable preempting resource from production queue
yarn.scheduler.capacity.<queue-path>.intra-queue-preemption.disable_preemption - To control whether intra-queue preemption is enabled for the ingest queue in the production queue

To keep in mind

yarn.scheduler.capacity.<queue-path>.state - We should keep this property in mind when we wish to drain the cluster
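
To make the queue-mappings setting more concrete, here is a minimal sketch in Python of how a `u:user:queue,g:group:queue` string can be read, assuming mappings are evaluated left to right and the first match wins. The function names are hypothetical, the `fallback` argument is an assumption standing in for whatever queue the job would otherwise land in, and placeholders such as `%user` are not handled:

```python
def parse_queue_mappings(value):
    """Split 'u:user:queue,g:group:queue' into (kind, source, queue) tuples."""
    mappings = []
    for entry in value.split(","):
        kind, source, queue = entry.strip().split(":")
        mappings.append((kind, source, queue))
    return mappings


def resolve_queue(mappings, user, groups, fallback=None):
    """Return the queue of the first mapping that matches the user or one of its groups."""
    for kind, source, queue in mappings:
        if kind == "u" and source == user:
            return queue
        if kind == "g" and source in groups:
            return queue
    return fallback


mappings = parse_queue_mappings(
    "u:analytics:production,g:analytics-privatedata-users:default"
)
print(resolve_queue(mappings, "analytics", []))
print(resolve_queue(mappings, "alice", ["analytics-privatedata-users"]))
```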

Still trying to figure out the best way to set this in puppet, but I have created this first draft to kick off a discussion:

yarn-site.xml (listed as properties for clarity)
yarn.resourcemanager.scheduler.monitor.enable: true
yarn.acl.enable: true

capacity-scheduler.xml (listed as properties for clarity)
  # Global config
  # Maximum number of applications that can be pending and running.
  yarn.scheduler.capacity.maximum-applications: 10000
  # Maximum percent of resources in the cluster which can be used to run 
  # application masters i.e. controls number of concurrent running applications.
  yarn.scheduler.capacity.maximum-am-resource-percent: 0.1
  # The ResourceCalculator implementation to be used to compare  Resources in the scheduler.
  # The default DefaultResourceCalculator only uses Memory while DominantResourceCalculator
  #  uses dominant-resource to compare multi-dimensional resources such as Memory, CPU etc.
  yarn.scheduler.capacity.resource-calculator: 'org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator'
  # Number of missed scheduling opportunities after which the CapacityScheduler 
  # attempts to schedule rack-local containers. 
  # Typically this should be set to number of nodes in the cluster.
  yarn.scheduler.capacity.node-locality-delay: 78
  # If a queue mapping is present, will it override the value specified by the user?
  yarn.scheduler.capacity.queue-mappings-override.enable: false
  # Useful to enable/disable any new job in the cluster (for example to let it drain before maintenance)
  # yarn.scheduler.capacity.root.state: 'STOPPED'

  # Queue definitions
  # Sum of capacity (not max) needs to be 100 at any level/branch of the tree
  # First layer
  yarn.scheduler.capacity.root.queues: 'users, production'
  yarn.scheduler.capacity.root.production.capacity: 60
  yarn.scheduler.capacity.root.production.maximum-capacity: 100
  yarn.scheduler.capacity.root.users.capacity: 40
  yarn.scheduler.capacity.root.users.maximum-capacity: 100
  # Second layer (users)
  yarn.scheduler.capacity.root.users.queues: 'default, fifo'
  yarn.scheduler.capacity.root.users.default.capacity: 80
  yarn.scheduler.capacity.root.users.default.maximum-capacity: 100
  yarn.scheduler.capacity.root.users.fifo.capacity: 20
  yarn.scheduler.capacity.root.users.fifo.maximum-capacity: 100
  # Second layer (production)
  yarn.scheduler.capacity.root.production.queues: 'default, ingest'
  yarn.scheduler.capacity.root.production.default.capacity: 50
  yarn.scheduler.capacity.root.production.default.maximum-capacity: 100
  yarn.scheduler.capacity.root.production.ingest.capacity: 50
  yarn.scheduler.capacity.root.production.ingest.maximum-capacity: 100

  # Default mappings
  yarn.scheduler.capacity.queue-mappings: 'u:analytics:production.ingest,u:analytics-search:production.default,u:analytics-product:production.default,g:analytics-privatedata-users:users.default'

  # Limits
  # https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/bk_yarn-resource-management/content/setting_user_limits.html
  yarn.scheduler.capacity.root.production.user-limit-factor: 2
  yarn.scheduler.capacity.root.users.default.user-limit-factor: 2  
  yarn.scheduler.capacity.root.production.default.minimum-user-limit-percent: 20
  yarn.scheduler.capacity.root.users.default.minimum-user-limit-percent: 10
  yarn.scheduler.capacity.root.users.default.maximum-application-lifetime: 2629746 # 1 month in seconds

  # ACLs
  # Permissions cannot be reduced on the lower layer of the tree once set for a specific
  # queue, they can only be incremented.
  yarn.scheduler.capacity.root.acl_submit_applications: " "
  yarn.scheduler.capacity.root.acl_administer_queue: " "
  yarn.scheduler.capacity.root.production.default.acl_submit_applications: 'analytics,analytics-search,analytics-product'
  yarn.scheduler.capacity.root.production.default.acl_administer_queue: 'analytics-admin,analytics-product-users,analytics-search-users'
  yarn.scheduler.capacity.root.production.ingest.acl_submit_applications: 'analytics'
  yarn.scheduler.capacity.root.production.ingest.acl_administer_queue: 'analytics-admins'
  yarn.scheduler.capacity.root.users.default.acl_submit_applications: 'analytics-privatedata-users'
  yarn.scheduler.capacity.root.users.default.acl_administer_queue: 'analytics-privatedata-users'
  yarn.scheduler.capacity.root.users.fifo.acl_submit_applications: 'analytics-privatedata-users'
  yarn.scheduler.capacity.root.users.fifo.acl_administer_queue: 'analytics-privatedata-users'

  # Preemption
  yarn.scheduler.capacity.root.production.ingest.disable_preemption: true
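
Since the draft is listed as flat properties, a minimal sketch in Python of how such a mapping could be turned into the `<configuration>`/`<property>` layout used by Hadoop XML config files may help when comparing against the generated capacity-scheduler.xml. The function name `render_hadoop_config` is hypothetical, only a few of the properties above are included, and this is not how the puppet profile renders the file:

```python
import xml.etree.ElementTree as ET


def render_hadoop_config(props):
    """Render a flat {name: value} mapping as Hadoop-style configuration XML."""
    configuration = ET.Element("configuration")
    for name, value in props.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # lowercase booleans
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(configuration, encoding="unicode")


# A small subset of the draft above, for illustration only.
draft = {
    "yarn.scheduler.capacity.root.queues": "users, production",
    "yarn.scheduler.capacity.root.production.capacity": 60,
    "yarn.scheduler.capacity.root.users.capacity": 40,
    "yarn.scheduler.capacity.root.production.ingest.disable_preemption": True,
}

xml_text = render_hadoop_config(draft)
print(xml_text)
```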

Change 672373 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hadoop: add a profile to deploy the capacity scheduler's settings

https://gerrit.wikimedia.org/r/672373

gerritbot added a project: Patch-For-Review. Mar 15 2021, 11:45 AM

Ottomata assigned this task to elukey. Mar 15 2021, 3:42 PM

Ottomata added a project: Analytics-Kanban.

Ottomata moved this task from Backlog to Q3 2020/2021 on the Analytics-Clusters board.

Change 672654 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hadoop: set the Yarn capacity scheduler for the test cluster

https://gerrit.wikimedia.org/r/672654

Change 672373 merged by Elukey:
[operations/puppet@production] hadoop: add a profile to deploy the capacity scheduler's settings

https://gerrit.wikimedia.org/r/672373

Change 672654 merged by Elukey:
[operations/puppet@production] hadoop: set the Yarn capacity scheduler for the test cluster

https://gerrit.wikimedia.org/r/672654

Change 673936 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hadoop: fix Yarn capacity scheduler queue mappings

https://gerrit.wikimedia.org/r/673936

Change 673936 merged by Elukey:
[operations/puppet@production] hadoop: fix Yarn capacity scheduler queue mappings

https://gerrit.wikimedia.org/r/673936

Change 673943 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] camus: switch to the yarn 'ingest' queue for Hadoop test

https://gerrit.wikimedia.org/r/673943

Change 673943 merged by Elukey:
[operations/puppet@production] camus: switch to the yarn 'ingest' queue for Hadoop test

https://gerrit.wikimedia.org/r/673943

Change 673948 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] analytics: set test refine jobs in a different Yarn queue

https://gerrit.wikimedia.org/r/673948

Change 673948 merged by Elukey:
[operations/puppet@production] analytics: set test refine jobs in a different Yarn queue

https://gerrit.wikimedia.org/r/673948

Change 673954 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] analytics: move druid load jobs to the analytics Yarn queue

https://gerrit.wikimedia.org/r/673954

Change 673954 merged by Elukey:
[operations/puppet@production] analytics: move druid load jobs to the analytics Yarn queue

https://gerrit.wikimedia.org/r/673954

Change 674071 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] analytics: fix admin/submit policies for the yarn capacity scheduler

https://gerrit.wikimedia.org/r/674071

Change 674071 merged by Elukey:
[operations/puppet@production] analytics: fix admin/submit policies for the yarn capacity scheduler

https://gerrit.wikimedia.org/r/674071

Change 674806 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/puppet@production] hadoop: tune Yarn's capacity scheduler defaults

https://gerrit.wikimedia.org/r/674806

Change 674806 merged by Elukey:
[operations/puppet@production] hadoop: tune Yarn's capacity scheduler defaults

https://gerrit.wikimedia.org/r/674806

Change 675448 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/puppet@production] hadoop: add QueueMetrics for Yarn ResourceManager

https://gerrit.wikimedia.org/r/675448

Change 675448 merged by Elukey:
[operations/puppet@production] hadoop: add QueueMetrics for Yarn ResourceManager

https://gerrit.wikimedia.org/r/675448

Change 675458 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/puppet@production] hadoop: add analytics-search keytab on the test client

https://gerrit.wikimedia.org/r/675458

Change 675458 merged by Elukey:
[operations/puppet@production] hadoop: add analytics-search keytab on the test client

https://gerrit.wikimedia.org/r/675458

Change 675467 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/puppet@production] hadoop: enable yarn.admin.acl in Hadoop test

https://gerrit.wikimedia.org/r/675467

Change 675467 merged by Elukey:
[operations/puppet@production] hadoop: enable yarn.admin.acl in Hadoop test

https://gerrit.wikimedia.org/r/675467

Ok how about:

default
fifo
production
essential

With a GPU version of just fifo?

We could do:

fifo - 5%
default - 35%
production - 50%
essential - 10%

Agreed on GPU for fifo only, even with a limit of a single AM launched in that queue. Let's talk with @EBernhardson about that.

Change 677102 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] hadoop: refactor and simplify the Yarn Capacity scheduler's settings

https://gerrit.wikimedia.org/r/677102

Change 677103 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] hadoop: fix the test cluster's yarn queue settings

https://gerrit.wikimedia.org/r/677103

Change 677102 merged by Elukey:

[operations/puppet@production] hadoop: refactor and simplify the Yarn Capacity scheduler's settings

https://gerrit.wikimedia.org/r/677102

Change 677103 merged by Elukey:

[operations/puppet@production] hadoop: fix the test cluster's yarn queue settings

https://gerrit.wikimedia.org/r/677103

Change 677112 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] hadoop: set dr.who as yarn admin in the Test cluster

https://gerrit.wikimedia.org/r/677112

Change 677112 merged by Elukey:

[operations/puppet@production] hadoop: set dr.who as yarn admin in the Test cluster

https://gerrit.wikimedia.org/r/677112

@JAllemandou these are the settings now, lemme know what needs to be tuned :)

# Global config
# Maximum number of applications that can be pending and running.
'yarn.scheduler.capacity.maximum-applications' => 10000,
# Maximum percent of resources in the cluster which can be used to run
# application masters i.e. controls number of concurrent running applications.
'yarn.scheduler.capacity.maximum-am-resource-percent' => 0.1,
# The ResourceCalculator implementation to be used to compare  Resources in the scheduler.
# The default DefaultResourceCalculator only uses Memory while DominantResourceCalculator
#  uses dominant-resource to compare multi-dimensional resources such as Memory, CPU etc.
'yarn.scheduler.capacity.resource-calculator' => 'org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator',
# Number of missed scheduling opportunities after which the CapacityScheduler
# attempts to schedule rack-local containers.
# Typically this should be set to number of nodes in the cluster.
'yarn.scheduler.capacity.node-locality-delay' => 78,
# If a queue mapping is present, will it override the value specified by the user?
'yarn.scheduler.capacity.queue-mappings-override.enable' => false,
# Useful to enable/disable any new job in the cluster (for example to let it drain before maintenance)
# 'yarn.scheduler.capacity.root.state' => 'STOPPED'
# Or a specific leaf queue:
# 'yarn.scheduler.capacity.root.users.default.state' => 'STOPPED'

# Queue definitions
# Sum of capacity (not max) needs to be 100 at any level/branch of the tree
# First layer
'yarn.scheduler.capacity.root.queues' => 'fifo,default,production,essential',
'yarn.scheduler.capacity.root.fifo.capacity' => 5,
'yarn.scheduler.capacity.root.fifo.maximum-capacity' => -1,
'yarn.scheduler.capacity.root.default.capacity' => 35,
'yarn.scheduler.capacity.root.default.maximum-capacity' => -1,
'yarn.scheduler.capacity.root.production.capacity' => 50,
'yarn.scheduler.capacity.root.production.maximum-capacity' => -1,
'yarn.scheduler.capacity.root.essential.capacity' => 10,
'yarn.scheduler.capacity.root.essential.maximum-capacity' => -1,

# Default mappings
# PLEASE NOTE: use only the leaf queue names, not full path.
# Example: root.production.analytics BAD, analytics GOOD
'yarn.scheduler.capacity.queue-mappings' => 'u:druid:production,u:analytics:production,u:analytics-search:production,u:analytics-product:production,g:analytics-privatedata-users:default',

# Limits
# https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/bk_yarn-resource-management/content/setting_user_limits.html
# https://hadoop.apache.org/docs/r2.10.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
# The user limit factor is a multiplier used to allow users of a specific queue to take up to X
# times the resource allocated (as min value) for the queue. It is needed to allow/control elasticity,
# so users can overcome Yarn default limits in case there are free resources.
'yarn.scheduler.capacity.root.fifo.user-limit-factor' => 5,
'yarn.scheduler.capacity.root.default.user-limit-factor' => 2,
'yarn.scheduler.capacity.root.production.user-limit-factor' => 2,
'yarn.scheduler.capacity.root.essential.user-limit-factor' => 10,
# The user limit percent is different from the factor, since it is about how many users can run jobs on a queue
# at any given time. For example, if we set:
# 'yarn.scheduler.capacity.root.production.analytics.minimum-user-limit-percent' => 50,
# we want to allow up to two users concurrently in the queue (druid and analytics), leaving the others waiting.
# If we use '25', we'll allow a max of 4 different users, etc..
'yarn.scheduler.capacity.root.fifo.minimum-user-limit-percent' => 100,
'yarn.scheduler.capacity.root.default.minimum-user-limit-percent' => 10,
'yarn.scheduler.capacity.root.production.minimum-user-limit-percent' => 5,
'yarn.scheduler.capacity.root.essential.minimum-user-limit-percent' => 50,

# Max lifetime for a Yarn application
'yarn.scheduler.capacity.root.default.maximum-application-lifetime' => 604800, # 1 week in seconds
'yarn.scheduler.capacity.root.fifo.maximum-application-lifetime' => 604800, # 1 week in seconds

# Ordering policy
'yarn.scheduler.capacity.root.fifo.ordering-policy' => 'fifo',
'yarn.scheduler.capacity.root.default.ordering-policy' => 'fair',
'yarn.scheduler.capacity.root.production.ordering-policy' => 'fair',
'yarn.scheduler.capacity.root.essential.ordering-policy' => 'fair',

# Labels
# https://hadoop.apache.org/docs/r2.10.0/hadoop-yarn/hadoop-yarn-site/NodeLabel.html
# Only one label can be assigned to every node, by default ending up in the DEFAULT_PARTITION.
# When a label is assigned, it creates a partition between the nodes, and the Capacity scheduler
# settings get "duplicated" (so all the queues, etc..). In this case we want just one queue to
# use the GPU label, so we concentrate all the capacity to it.
'yarn.scheduler.capacity.root.accessible-node-labels' => 'GPU',
'yarn.scheduler.capacity.root.accessible-node-labels.GPU.capacity' => '100',
'yarn.scheduler.capacity.root.fifo.accessible-node-labels' => 'GPU',
'yarn.scheduler.capacity.root.fifo.accessible-node-labels.GPU.capacity' => '100',

# ACLs
# Permissions cannot be reduced on the lower layer of the tree once set for a specific
# queue, they can only be incremented.
'yarn.scheduler.capacity.root.acl_submit_applications' => ' ',
'yarn.scheduler.capacity.root.acl_administer_queue' => ' ',
'yarn.scheduler.capacity.root.fifo.acl_submit_applications' => ' analytics-privatedata-users',
'yarn.scheduler.capacity.root.fifo.acl_administer_queue' => ' analytics-privatedata-users',
'yarn.scheduler.capacity.root.default.acl_submit_applications' => ' analytics-privatedata-users',
'yarn.scheduler.capacity.root.default.acl_administer_queue' => ' analytics-privatedata-users',
'yarn.scheduler.capacity.root.production.acl_submit_applications' => 'analytics,druid,analytics-search,analytics-product',
'yarn.scheduler.capacity.root.production.acl_administer_queue' => ' analytics-admins,analytics-search-users,analytics-product-users',
'yarn.scheduler.capacity.root.essential.acl_submit_applications' => 'analytics,druid',
'yarn.scheduler.capacity.root.essential.acl_administer_queue' => ' analytics-admins',

# Preemption
'yarn.scheduler.capacity.root.essential.disable_preemption' => true,
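
To make the limits above easier to reason about, here is a minimal sketch in Python that follows the simplified reading given in the config comments: a single user can grow to capacity * user-limit-factor (capped by maximum-capacity, with -1 treated as no cap), and minimum-user-limit-percent bounds how many users can be active at once. The helper names and dict layout are hypothetical, and the real scheduler takes more inputs into account:

```python
# Values copied from the settings above; the dict layout itself is hypothetical.
QUEUES = {
    "fifo": {"capacity": 5, "maximum_capacity": -1,
             "user_limit_factor": 5, "minimum_user_limit_percent": 100},
    "default": {"capacity": 35, "maximum_capacity": -1,
                "user_limit_factor": 2, "minimum_user_limit_percent": 10},
    "production": {"capacity": 50, "maximum_capacity": -1,
                   "user_limit_factor": 2, "minimum_user_limit_percent": 5},
    "essential": {"capacity": 10, "maximum_capacity": -1,
                  "user_limit_factor": 10, "minimum_user_limit_percent": 50},
}


def per_user_ceiling(queue):
    """Max % of the cluster one user can take: capacity * factor, capped by max capacity."""
    max_capacity = queue["maximum_capacity"]
    if max_capacity < 0:  # -1 is treated here as "no explicit cap", i.e. 100%
        max_capacity = 100
    return min(queue["capacity"] * queue["user_limit_factor"], max_capacity)


def max_concurrent_users(queue):
    """How many users can be active at once for a given minimum-user-limit-percent."""
    return 100 // queue["minimum_user_limit_percent"]


for name, queue in QUEUES.items():
    print(f"{name:<12} per-user ceiling {per_user_ceiling(queue):>3}% "
          f"max concurrent users {max_concurrent_users(queue)}")
```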

elukey moved this task from Next Up to In Code Review on the Analytics-Kanban board. Apr 7 2021, 6:35 AM

Ottomata moved this task from Q3 2020/2021 to Q4 2020/2021 on the Analytics-Clusters board. Apr 19 2021, 3:51 PM

Change 681400 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] User the 'yarn' user in the Hadoop test yarn's UI

https://gerrit.wikimedia.org/r/681400

Change 681400 merged by Elukey:

[operations/puppet@production] User the 'yarn' user in the Hadoop test yarn's UI

https://gerrit.wikimedia.org/r/681400

elukey moved this task from In Code Review to In Progress on the Analytics-Kanban board. Apr 21 2021, 9:25 AM

Change 681698 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add user 'yarn' among the admins in Hadoop test

https://gerrit.wikimedia.org/r/681698

Change 681698 merged by Elukey:

[operations/puppet@production] Add user 'yarn' among the admins in Hadoop test

https://gerrit.wikimedia.org/r/681698

Change 681700 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Enable the Yarn Capacity scheduler for Hadoop Analytics

https://gerrit.wikimedia.org/r/681700

elukey moved this task from In Progress to In Code Review on the Analytics-Kanban board. Apr 21 2021, 2:49 PM

elukey moved this task from In Code Review to Ready to Deploy on the Analytics-Kanban board.

Change 681700 merged by Elukey:

[operations/puppet@production] Enable the Yarn Capacity scheduler for Hadoop Analytics

https://gerrit.wikimedia.org/r/681700

Capacity scheduler deployed, all good up to now. In order to add labels, the following commands are needed:

sudo -u yarn kerberos-run-command yarn yarn rmadmin -addToClusterNodeLabels "GPU(exclusive=false)"
sudo -u yarn kerberos-run-command yarn yarn rmadmin -replaceLabelsOnNode "an-worker1096.eqiad.wmnet=GPU"
sudo -u yarn kerberos-run-command yarn yarn rmadmin -replaceLabelsOnNode "an-worker1097.eqiad.wmnet=GPU"
sudo -u yarn kerberos-run-command yarn yarn rmadmin -replaceLabelsOnNode "an-worker1098.eqiad.wmnet=GPU"
sudo -u yarn kerberos-run-command yarn yarn rmadmin -replaceLabelsOnNode "an-worker1099.eqiad.wmnet=GPU"
sudo -u yarn kerberos-run-command yarn yarn rmadmin -replaceLabelsOnNode "an-worker1100.eqiad.wmnet=GPU"
sudo -u yarn kerberos-run-command yarn yarn rmadmin -replaceLabelsOnNode "an-worker1101.eqiad.wmnet=GPU"
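
The per-node commands differ only in the hostname, so they can be generated rather than typed by hand. A minimal sketch in Python that only builds and prints the command strings shown above (it does not execute anything); the `label_commands` helper is hypothetical:

```python
def label_commands(label, hosts, exclusive=False):
    """Build the rmadmin commands that create a node label and assign it to hosts."""
    prefix = "sudo -u yarn kerberos-run-command yarn yarn rmadmin"
    exclusive_flag = "true" if exclusive else "false"
    commands = [
        f'{prefix} -addToClusterNodeLabels "{label}(exclusive={exclusive_flag})"'
    ]
    commands += [
        f'{prefix} -replaceLabelsOnNode "{host}={label}"' for host in hosts
    ]
    return commands


# GPU workers an-worker1096 through an-worker1101, as in the commands above.
hosts = [f"an-worker{n}.eqiad.wmnet" for n in range(1096, 1102)]
commands = label_commands("GPU", hosts)
for command in commands:
    print(command)
```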

Change 682889 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Enable the Yarn Labels for Hadoop Analytics

https://gerrit.wikimedia.org/r/682889

Change 682889 merged by Elukey:

[operations/puppet@production] Enable the Yarn Labels for Hadoop Analytics

https://gerrit.wikimedia.org/r/682889

GPU label added! \o/

elukey moved this task from Ready to Deploy to Done on the Analytics-Kanban board. Apr 27 2021, 7:20 AM

Very cool!

Added https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Yarn_Labels

elukey mentioned this in T276791: Configure the Hadoop cluster to use the GPUs available on some workers. May 10 2021, 4:31 PM

Ottomata moved this task from Q4 2020/2021 to Done on the Analytics-Clusters board. May 17 2021, 3:34 PM