Kerberos-run-command doesn't work with spark-submit [workaround]
Closed, Resolved | Public | 5 Estimated Story Points

Description

We've run into this issue a number of times when submitting a Druid ingestion Spark job with kerberos-run-command. When running the following command:

sudo -u analytics kerberos-run-command analytics spark2-submit --class org.wikimedia.analytics.refinery.job.HiveToDruid --master yarn --deploy-mode cluster --conf spark.driver.extraClassPath=/usr/lib/hive/lib/hive-jdbc.jar:/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-common.jar:/usr/lib/hive/lib/hive-service.jar --files /etc/hive/conf/hive-site.xml,/home/fdans/properties --conf spark.dynamicAllocation.maxExecutors=64 --driver-memory 8G /srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-job-0.0.105.jar --config_file properties --since 2020-02-01T00:00:0 --until 2020-02-02T00:00:00

Kerberos-run-command will throw the following error:

Traceback (most recent call last):
  File "/usr/local/bin/kerberos-run-command", line 82, in <module>
    main()
  File "/usr/local/bin/kerberos-run-command", line 78, in main
    subprocess.call(cmd)
  File "/usr/lib/python3.5/subprocess.py", line 247, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/usr/lib/python3.5/subprocess.py", line 676, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.5/subprocess.py", line 1282, in _execute_child
    raise child_exception_type(errno_num, err_msg)
OSError: [Errno 8] Exec format error

This problem can be worked around by putting the spark-submit command inside a shell script that starts with a shebang, like this:

#!/bin/bash
spark2-submit --class org.wikimedia.analytics.refinery.job.HiveToDruid --master yarn --deploy-mode cluster --conf spark.driver.extraClassPath=/usr/lib/hive/lib/hive-jdbc.jar:/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-common.jar:/usr/lib/hive/lib/hive-service.jar --files /etc/hive/conf/hive-site.xml,/home/fdans/properties --conf spark.dynamicAllocation.maxExecutors=64 --driver-memory 8G /srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-job-0.0.105.jar --config_file properties --since 2020-02-01T00:00:0 --until 2020-02-02T00:00:00

And then running the script like this:

sudo -u analytics kerberos-run-command analytics /home/fdans/loading-data-1-liner.sh
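For anyone who wants to reproduce the behaviour outside the cluster, here is a minimal sketch. It assumes the cause is what the traceback suggests: subprocess.call(cmd) executes the target file directly, without a shell, so an executable text file with no shebang is rejected with ENOEXEC (Errno 8). The file names and the make_script/try_exec helpers are made up for illustration and are not part of kerberos-run-command.

```python
import errno
import os
import stat
import subprocess
import tempfile


def make_script(directory, name, body):
    """Write an executable text file and return its path (hypothetical helper)."""
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write(body)
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def try_exec(path):
    """Run path without a shell, as subprocess.call(cmd) does.

    Returns (exit_code, None) on success or (None, errno) if exec fails.
    """
    try:
        return subprocess.call([path]), None
    except OSError as exc:
        return None, exc.errno


with tempfile.TemporaryDirectory() as workdir:
    # Same body in both files; only the first line differs.
    bare = make_script(workdir, "no-shebang", "exit 0\n")
    wrapped = make_script(workdir, "with-shebang", "#!/bin/sh\nexit 0\n")
    bare_result = try_exec(bare)
    wrapped_result = try_exec(wrapped)

print("no shebang:  ", bare_result)
print("with shebang:", wrapped_result)
```

An interactive shell usually hides this difference, because when exec fails with ENOEXEC it falls back to interpreting the file itself. That would explain why the same command works when typed directly at a prompt.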

So it's not an urgent fix, but I wonder whether the workaround would be as straightforward for other use cases.

Event Timeline

Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.

Change 595859 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/debs/spark2@debian] Add bash shabang to all bin scripts

https://gerrit.wikimedia.org/r/595859

Change 595859 merged by Elukey:
[operations/debs/spark2@debian] Add bash shabang to all bin scripts

https://gerrit.wikimedia.org/r/595859

Mentioned in SAL (#wikimedia-analytics) [2020-05-13T06:47:33Z] <elukey> upgrade spark2 on stat1004 - canary host - T250161

Just tried sudo -u analytics-privatedata kerberos-run-command analytics-privatedata spark2-submit on stat1004 and the exec format error is gone. Let's test the package to be sure that everything is OK, then roll it out on all client nodes.
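One way to check a package before rolling it out is to scan its bin directory for executable files that start with neither a shebang nor the ELF magic bytes. The sketch below is only an illustration: the function names are invented here, and the demo runs against a temporary directory rather than the real Spark install path, which is not stated in this task.

```python
import os
import tempfile


def lacks_exec_header(path):
    """Return True if the file starts with neither '#!' nor the ELF magic."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    return not (head.startswith(b"#!") or head == b"\x7fELF")


def scripts_without_shebang(directory):
    """List executable regular files in directory that lack a shebang or ELF header."""
    flagged = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.access(path, os.X_OK) and lacks_exec_header(path):
            flagged.append(name)
    return flagged


# Demo on a throwaway directory; point scripts_without_shebang() at the
# package's bin directory to check a real install.
with tempfile.TemporaryDirectory() as workdir:
    for name, body in [("bare-script", "exit 0\n"), ("good-script", "#!/bin/bash\nexit 0\n")]:
        path = os.path.join(workdir, name)
        with open(path, "w") as handle:
            handle.write(body)
        os.chmod(path, 0o755)
    flagged = scripts_without_shebang(workdir)

print(flagged)
```

An empty list means every executable file in the directory has a header that a direct exec can use.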

Change 596142 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/debs/spark2@debian] Update version in changelog for Buster and update README

https://gerrit.wikimedia.org/r/596142

Change 596142 abandoned by Elukey:
Update version in changelog for Buster and update README

Reason:
Seems not needed!

https://gerrit.wikimedia.org/r/596142

Mentioned in SAL (#wikimedia-analytics) [2020-05-13T13:46:45Z] <elukey> upgrade spark2 on all stat100x hosts - T250161

elukey set the point value for this task to 5. May 13 2020, 2:09 PM
elukey moved this task from Next Up to Done on the Analytics-Kanban board.
Nuria set Final Story Points to 5.