We have been shuffling data through Spark to get it to the executors for training, but this causes unexpected memory use: Spark shuffles the data around, and we duplicate it between the JVM and the C++ side. This has led to a variety of problems with executors being killed for exceeding even very generous memory allocations.
Add a new utility script that reads the Spark DataFrame and writes binary xgboost datasets to HDFS. Convert training to copy these files from HDFS to the local filesystem and load the DMatrix directly from the file. The copy itself needs minimal memory, so training will only require the memory actually needed for training.