Currently, we hold out a test set when training each of the models -- usually about 20% of the data.
Instead of randomly sampling and withholding a single test set, why not perform k-fold cross-validation on the entire dataset (say, 5 or 10 folds) and then train the final model on the entire dataset?
This way, we'd (1) get a more stable set of test statistics -- since they'd represent a mean across many train/test folds -- and (2) be able to train our models on all available data, which should improve real-world performance.
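
As a rough sketch of the proposal (hypothetical: this uses scikit-learn and synthetic data as stand-ins for our actual models and dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data; in practice this would be our real dataset.
X, y = make_classification(n_samples=1000, random_state=0)

model = RandomForestClassifier(random_state=0)

# (1) Estimate generalization performance as the mean over k folds
# rather than from a single random train/test split.
scores = cross_val_score(model, X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# (2) Refit the final model on all available data for use in practice.
model.fit(X, y)
```

The key point is that the cross-validated scores and the final fit are decoupled: the fold scores estimate how well this model configuration generalizes, while the model we actually ship is trained on 100% of the data.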