Migrate CI job search-xgboost-maven to use a Docker container
Closed, Resolved (Public)

Description

The Jenkins job search-xgboost-maven runs on disposable virtual machines (via Nodepool). It should be migrated to use a Docker container.

TLDR:

jvm-packages/xgboost4j/src/main/java/ml/dmlc/xgboost4j/java/RabitTracker.java shells out to python tracker.py but does not handle a non-zero exit code (e.g. ImportError: No module named argparse).


Docker job: https://integration.wikimedia.org/ci/job/search-xgboost-maven-java8-docker/

The job fails, though, and the failure has to be investigated.

Maven is invoked with --file jvm-packages/pom.xml clean verify

Type      Result   Console
Nodepool  Success  search-xgboost-maven #23
Docker    Failure  search-xgboost-maven-java8-docker #13

Some differences:

       Nodepool  Docker
Maven  3.5.0     3.5.2
gcc    4.9.2     6.3.0

When the tests start, there is a major difference though. Under Docker, the tracker environment variables seem to be missing:

--- nodepool
+++ docker
 [INFO] --- scalatest-maven-plugin:1.0:test (test) @ xgboost4j-spark ---
 Discovery starting.
 Discovery completed in X milliseconds.
 Run starting. Expected test count is: 43
 SparkParallelismTrackerSuite:
 Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
 18/03/19 10:MM:ss WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 
 [Stage 0:>                                                          (0 + 0) / 2]
 
 - tracker should not affect execution result
 - tracker should throw exception if parallelism is not sufficient
 XGBoostDFSuite:
-Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.68.23.140, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=2}
+Tracker started, with env={}
+- test consistency and order preservation of dataframe-based model *** FAILED ***

Somehow DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.x.y.z and DMLC_TRACKER_PORT=9091 are not set under Docker. Maybe they come from a process that is started in the background and fails on Docker.

They are supposed to be in the environment:

jvm-packages/xgboost4j/src/main/java/ml/dmlc/xgboost4j/java/RabitTracker.java
public class RabitTracker implements IRabitTracker {
...
  public boolean start(long timeout) {
...
    if (startTrackerProcess()) {
      logger.debug("Tracker started, with env=" + envs.toString());
      System.out.println("Tracker started, with env=" + envs.toString());

The environment is loaded from the output of a Python script:

  private boolean startTrackerProcess() {
    try {
      String trackerExecString = this.addTrackerProperties("python " + tracker_py +
          " --log-level=DEBUG --num-workers=" + String.valueOf(numWorkers));

      trackerProcess.set(Runtime.getRuntime().exec(trackerExecString));
      loadEnvs(trackerProcess.get().getInputStream());
      return true;
    } catch (IOException ioe) {
      ioe.printStackTrace();
      return false;
    }
  }
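
One way to surface this kind of failure is to check how the child process ended whenever its output stops before the environment has been reported. The sketch below is not the upstream fix. TrackerLaunchSketch, startAndLoadEnvs and the endMarker argument are hypothetical names, and it assumes the tracker prints KEY=VALUE lines on stdout followed by a terminating marker line. The demo in main also assumes a POSIX sh is available.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of xgboost4j.
final class TrackerLaunchSketch {

  /**
   * Starts the command and reads KEY=VALUE lines from its stdout until
   * endMarker is seen. If stdout ends before the marker, the exit code of
   * the child is reported instead of silently returning an empty map.
   */
  public static Map<String, String> startAndLoadEnvs(List<String> command, String endMarker)
      throws IOException, InterruptedException {
    Process process = new ProcessBuilder(command)
        // Let a Python traceback show up in the parent's (CI) console.
        .redirectError(ProcessBuilder.Redirect.INHERIT)
        .start();

    Map<String, String> envs = new LinkedHashMap<>();
    boolean sawEndMarker = false;
    // The reader is left open on purpose: the child may keep running.
    BufferedReader stdout = new BufferedReader(
        new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8));
    String line;
    while ((line = stdout.readLine()) != null) {
      if (line.trim().equals(endMarker)) {
        sawEndMarker = true;
        break;
      }
      int sep = line.indexOf('=');
      if (sep > 0) {
        envs.put(line.substring(0, sep).trim(), line.substring(sep + 1).trim());
      }
    }

    if (!sawEndMarker) {
      // stdout reached end-of-file early, so the child has most likely died.
      if (process.waitFor(5, TimeUnit.SECONDS)) {
        throw new IOException("tracker exited with code " + process.exitValue()
            + " before reporting its environment");
      }
      process.destroy();
      throw new IOException("tracker closed stdout without reporting its environment");
    }
    return envs;
  }

  public static void main(String[] args) throws Exception {
    // Simulates a tracker that dies immediately, like the ImportError case.
    try {
      startAndLoadEnvs(Arrays.asList("sh", "-c", "exit 1"), "END");
    } catch (IOException e) {
      System.out.println("Launch failed: " + e.getMessage());
    }
  }
}
```

With a check along these lines, a tracker that dies on import would likely show up as an IOException naming the exit code rather than as an empty env={}.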

Using a local checkout of search/xgboost and the container:

$ cd projects/search/xgboost
$ docker run --pull --rm -it --entrypoint=/bin/bash -v "$(pwd):/src" docker-registry.wikimedia.org/releng/java8-xgboost:0.1.0
nobody:/src$ python ./dmlc-core/tracker/dmlc_tracker/tracker.py --log-level=DEBUG --num-workers=1
Traceback (most recent call last):
  File "./dmlc-core/tracker/dmlc_tracker/tracker.py", line 19, in <module>
    import argparse
ImportError: No module named argparse
$ echo $?
1
$

That is because the container only has python-minimal installed. According to /usr/share/doc/python2.7-minimal/README.Debian, that package is stripped of many standard library modules, and argparse is evidently one of them.
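
A missing module like this can also be caught before the tracker is launched, by asking the interpreter to import it and looking at the exit status. This is a minimal sketch only. PythonModuleCheckSketch and canImport are hypothetical names, and the interpreter name is a parameter because nothing here guarantees which python is on the PATH.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical preflight check, not part of xgboost4j.
final class PythonModuleCheckSketch {

  /** Returns true only if `interpreter -c "import module"` exits with status 0. */
  public static boolean canImport(String interpreter, String module)
      throws InterruptedException {
    ProcessBuilder builder = new ProcessBuilder(interpreter, "-c", "import " + module);
    // Show any traceback in the parent's console instead of buffering it.
    builder.inheritIO();
    try {
      Process process = builder.start();
      if (!process.waitFor(30, TimeUnit.SECONDS)) {
        process.destroyForcibly();
        return false;
      }
      return process.exitValue() == 0;
    } catch (IOException e) {
      // The interpreter itself could not be started (not found, not executable).
      return false;
    }
  }

  public static void main(String[] args) throws InterruptedException {
    String interpreter = args.length > 0 ? args[0] : "python";
    System.out.println("argparse importable: " + canImport(interpreter, "argparse"));
  }
}
```

Run inside the java8-xgboost:0.1.0 container, a check like this would be expected to report false for argparse, given the ImportError shown above.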

hashar created this task. Mar 19 2018, 10:40 AM
hashar triaged this task as High priority.
hashar updated the task description. Mar 19 2018, 11:15 AM
hashar updated the task description. Mar 19 2018, 11:45 AM
hashar updated the task description.
hashar added a subscriber: EBernhardson.

Change 420311 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Use a full python for xgboost

https://gerrit.wikimedia.org/r/420311

Change 420311 merged by jenkins-bot:
[integration/config@master] Use a full python for xgboost

https://gerrit.wikimedia.org/r/420311

Mentioned in SAL (#wikimedia-releng) [2018-03-19T11:56:01Z] <hashar> Creating docker container docker-registry.wikimedia.org/releng/java8-xgboost:0.1.1 | https://gerrit.wikimedia.org/r/#/c/420311/ | T190032

00:09:17.674 XGBoostDFSuite:
00:09:24.660 Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=172.17.0.2, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=2}

The search-xgboost-maven-java8-docker job passed.

Change 420316 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Migrate xgboost maven job to Docker

https://gerrit.wikimedia.org/r/420316

Change 420316 merged by jenkins-bot:
[integration/config@master] Migrate xgboost maven job to Docker

https://gerrit.wikimedia.org/r/420316

hashar closed this task as Resolved. Mar 19 2018, 1:05 PM
hashar claimed this task.

Tested on a dummy change https://gerrit.wikimedia.org/r/#/c/410441/ and the build worked \o/