
Cannot request more than 4 cores per spark executor
Closed, Resolved · Public · 1 Estimated Story Points

Description

A Spark job submitted with --executor-cores greater than 4 starts, but YARN never assigns it any executors. This appears to be limited by the configuration key yarn.scheduler.maximum-allocation-vcores:

ebernhardson@stat1005:~$ hdfs getconf -confKey yarn.scheduler.maximum-allocation-vcores
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
4
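As a rough illustration of the check above, the requested cores per executor can be compared against the reported cap before submitting. This is only a sketch: the function name check_executor_cores is made up for this example, and the value 8 is an arbitrary request.

```shell
# Sketch: compare a requested --executor-cores value with the YARN per-container vcore cap.
# check_executor_cores is a hypothetical helper, not part of Hadoop or Spark.

# Print "ok" if requested ($1) <= cap ($2), otherwise "exceeds limit".
check_executor_cores() {
  if [ "$1" -le "$2" ]; then
    echo "ok"
  else
    echo "exceeds limit"
  fi
}

# Read the cap from the cluster config when the hdfs CLI is available.
# stderr is discarded and only the last line is kept, which is the value
# in the transcript above.
if command -v hdfs >/dev/null 2>&1; then
  max_vcores="$(hdfs getconf -confKey yarn.scheduler.maximum-allocation-vcores 2>/dev/null | tail -n 1)"
  echo "8 cores per executor against a cap of ${max_vcores}: $(check_executor_cores 8 "$max_vcores")"
fi
```

With the cap of 4 reported above, a request for 8 cores would be flagged before the job sits waiting for executors.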

I'd like to experiment with different values to find the most efficient use of resources when training ML models. Fewer executors with more cores each may (or may not) be more efficient in terms of total CPU time used. To find out, I would need to be able to test.
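One way to set up such a comparison is to hold the total core count fixed and vary how it is split across executors. The sketch below only prints the commands as a dry run. The total of 32 cores, the list of splits, and the job name train_model.py are all assumptions for illustration, and any split above the current cap of 4 cores would not get executors until the limit is raised.

```shell
# Sketch: hold total cores fixed and vary how they are split across executors.
# total_cores=32 and the candidate list below are illustrative assumptions.
total_cores=32

# Print the spark-submit resource flags for a given cores-per-executor value ($1).
executor_flags() {
  n=$(( total_cores / $1 ))
  echo "--num-executors ${n} --executor-cores $1"
}

for c in 1 2 4 8 16; do
  # train_model.py is a hypothetical job; echo keeps this a dry run.
  echo "spark-submit --master yarn $(executor_flags "$c") train_model.py"
done
```

Each printed line describes a run with the same nominal parallelism, so differences in total CPU time can be attributed to the executor layout rather than to the amount of resources requested.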

Event Timeline

Restricted Application added a subscriber: Aklapper.
Ottomata set the point value for this task to 1.

Change 368806 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Set maximum yarn vcore allocation to 32

https://gerrit.wikimedia.org/r/368806

Change 368806 merged by Ottomata:
[operations/puppet@production] Set maximum yarn vcore allocation to 32

https://gerrit.wikimedia.org/r/368806

Hm, the default should be 32, not sure why you are seeing 4. Anyway, just merged ^. We'll have to wait for a cluster restart (or at least a ResourceManager restart?) for this to take effect. How urgent is this?

Not super urgent. Everything certainly works now with 4 cores; I was just doing some measurements to see if there was a sweet spot in vcore-seconds with varied parallelism. Turns out I can use 10k or 30k vcore-seconds to do basically the same thing with different parallelism configs.
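For reference, vcore-seconds can be approximated as executors times cores per executor times runtime in seconds. The sketch below uses made-up numbers, not measurements from this task, and it ignores the driver and application master containers, so it is only a rough estimate.

```shell
# Sketch: approximate vcore-seconds as executors * cores per executor * runtime in seconds.
# Ignores the driver / application master container, so this undercounts slightly.
vcore_seconds() {
  echo $(( $1 * $2 * $3 ))
}

# Hypothetical runs, not measurements from this task:
vcore_seconds 8 4 300    # 8 executors x 4 cores x 300 s -> 9600
vcore_seconds 4 8 900    # 4 executors x 8 cores x 900 s -> 28800
```

Both hypothetical runs hold 32 cores at a time, but the second occupies them three times as long, which is the kind of gap the comparison above is meant to surface.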

@EBernhardson, Luca just restarted the cluster. Can you tell if the change we merged fixes this?

hdfs getconf now reports 32, and a Spark REPL with 8 cores per executor now gets executors and runs code. Looks to be working! I'll try it out with model training a little later, but I'm not expecting any problems.