
Test GPUs with an end-to-end training task (Photo vs Graphics image classifier)
Closed, Resolved · Public


We would like to test the performance of the GPU on stat1005 with an end-to-end training task for image classification.
Inspired by @Gilles' idea in T215413, I will try to build a classifier that distinguishes photos from graphics (diagrams, maps, etc.).

  • Collect data from graphics and photo Commons categories -- important to time this as we have to estimate how much of a bottleneck this stage is (see T215250)
  • Train the model on stat1005 with GPU (using simple architecture for now)
  • Evaluate the model in terms of accuracy and computation time
  • Repeat the training using CPU only and compare time/accuracy performances
  • [optional] repeat everything using a pre-trained model, and compare time/accuracy performance

Event Timeline

Data collection is done: it took 271,044 s (~75 hours) for 320,815 images (~160k per class), i.e. 0.85 s/image.
I downloaded 600-px thumbnails, and used 4 parallel sessions, as suggested in T215250.
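A minimal sketch of such a download loop with 4 parallel sessions (names are illustrative, not the actual job): the fetcher is passed in as a function, so the same loop works whether the URLs point at the public endpoint or the internal cluster.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, workers=4):
    """Fetch every URL using `workers` parallel sessions.

    Returns (results, seconds_elapsed) so the per-image time
    can be reported, as in the numbers above.
    """
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fetch, urls))
    return results, time.monotonic() - start

# Usage: plug in a real downloader built on urllib/requests, e.g.
# results, secs = fetch_all(thumb_urls, download_one, workers=4)
```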

To clarify, was that via the internet or internal cluster?

This was via the internet. But we should try to do this from the internal cluster too, for comparison, if possible. I just need a few instructions on how to do this!

As an example, NY_308_in_Rhinebeck_4.jpg maps to https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/thumb/7/7f/NY_308_in_Rhinebeck_4.jpg/800px-NY_308_in_Rhinebeck_4.jpg in the cluster if fetching directly from stat1005.

So if you have the list of URLs at a prebucketed size, you're very likely able to pull the files into /srv or HDFS.

You'd want to go for a small prebucketed size if pulling a lot of images, both to save network transfer even internally and to get a fast response on a hit (no re-render, less data, etc.). I think for your own purposes you're already not grabbing large files, but the bucketed size is important IIRC.

Do you have a list of files bearing the sharded thumb paths like above? I was poking around and didn't see an obvious reference to a database table bearing the paths and couldn't remember where to look (I think MediaWiki generates the paths at runtime via a hashing scheme, a sha1/base36 thing). But if someone here knows how to query for all the things, please chime in!
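If I remember right, the shard directories actually come from the md5 hex digest of the underscored filename (the sha1/base36 scheme is used elsewhere in MediaWiki, I believe), so the paths can be rebuilt client-side from a plain file list. A sketch, which reproduces the example URL above if that assumption holds:

```python
import hashlib

def commons_thumb_url(filename, width):
    """Rebuild the sharded thumb path: the two shard directories are
    the first one and two hex chars of md5(filename with underscores)."""
    name = filename.replace(" ", "_")
    h = hashlib.md5(name.encode("utf-8")).hexdigest()
    return (f"https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/thumb/"
            f"{h[0]}/{h[:2]}/{name}/{width}px-{name}")

# Should reproduce the /7/7f/ path from the example above
print(commons_thumb_url("NY 308 in Rhinebeck 4.jpg", 800))
```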

Earlier napkin math suggested that with, say, 40 million files at 70 kB per file, that's about 2.8 TB of data transfer — not too bad. The thought was that 100 GETs per second shouldn't be detrimental to Swift; that would take on the order of 5 days (however, I might suggest pinging @fgiunchedi and @Gilles as usual before kicking off such a process, in case there will be any maintenance or intensive jobs going on otherwise). I know you all were talking on that other ticket.
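The napkin math spelled out (all figures are the rough estimates from above, not measurements):

```python
FILES = 40_000_000        # candidate files
BYTES_PER_FILE = 70_000   # ~70 kB per bucketed thumbnail
RATE = 100                # GETs per second against Swift

total_tb = FILES * BYTES_PER_FILE / 1e12   # total transfer volume
days = FILES / RATE / 86_400               # wall-clock time at that rate

print(f"{total_tb:.1f} TB over ~{days:.1f} days")  # 2.8 TB over ~4.6 days
```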

You necessarily will have some misses on files that for one reason or another have never been pregenerated or stored on fetch so I imagine keeping track of the misses will be helpful for reconciling before any model training.
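A sketch of that bookkeeping (illustrative only; `FileNotFoundError` stands in for whatever the real fetcher raises on an HTTP 404 from Swift):

```python
def fetch_with_miss_log(urls, fetch):
    """Try each URL; keep successes and record misses (thumbs that were
    never pregenerated/stored) for reconciliation before training."""
    hits, misses = {}, []
    for url in urls:
        try:
            hits[url] = fetch(url)
        except FileNotFoundError:  # stand-in for a 404 on a missing thumb
            misses.append(url)
    return hits, misses
```

The miss list could then be re-requested through the public thumbnail endpoint, which renders on demand.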

@elukey hi! Did anything change on stat1005 since the last time we tried TensorFlow? I now get this message when I try to import it in Python. Thanks!

>>> import tensorflow as tf
Traceback (most recent call last):
ImportError: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.

hiprand looks to be ROCm-specific; it should come from the rocrand package:

That repo doesn't seem to be part of the apt sources on stat1005; do we still need to make a mirror of the rocm repos on our prod repositories, perhaps? I also looked through puppet but didn't find where the packages to install are defined.

The deb itself does seem to be in place on stat1005, at /opt/apt/apt_1.9.1.211/pool/main/r/rocrand/rocrand_1.8.1_amd64.deb. I confirmed with ar x that it includes the missing shared object.

@EBernhardson you are right, no import of the packages has been done yet; I need to sort that out, and upstream still doesn't answer (will also follow up with SRE). For the moment I manually add the repos when needed (puppet cleans the config afterwards). Still very hacky, but I am planning to get proper puppet code in soon.

@Miriam yes I have updated rocm, and probably some packages got cleaned up. Let's meet when we are both on IRC so we can figure out the list of packages needed!

@Miriam rocrand installed, but I guess it will not be the last one :D

@Miriam `import tensorflow as tf` should work now (I fixed other packages in need of an upgrade; we'll need a proper upgrade procedure in the future). You may need to upgrade tensorflow-rocm via pip so it supports the new packages :)

@elukey thanks so much, everything works now!!


@EBernhardson opened T224723 to track the package imports, feel free to chime in :)

Finally I managed to make some progress. I built a simple model for testing purposes.

TL;DR: I built a TensorFlow-based model to classify images as photo vs. graphics, and tested it first on CPU, then on GPU, before and after optimization.

  • System performance: When the system is optimized for GPU, our GPU trains the model at 6.5x the speed of CPU only.
  • Model performance: The network is very simple, so everything works well on test/validation data, but additional evaluation is showing a little bit of overfitting <- TODO for me

Training time comparison_ CPU VS GPU VS GPU Optimized (10 epochs).png (473×740 px, 27 KB)
Accuracy Comparison_ CPU VS GPU .png (473×740 px, 24 KB)

System Details:

  • Model objective: distinguish between photographic and non photographic images
  • Data: ~200,000 images from Commons, divided into 2 classes
    • Photographic images: all images from the "Quality" and "Low Quality" categories from Commons
    • Non-Photographic images: images in "Graphics"-related categories and subcategories (mined through Magnus's PetScan)
  • Network architecture: simple network with 3 convolutional + 2 fully connected layers
  • Hyperparameters: 10 epochs, batch size 128, image size 128, 2-20% validation data
  • Library: tensorflow-rocm 1.14.0
  • Infrastructure: compared 3 architectures on stat1005:
    • CPU
    • GPU without tensorflow optimization for GPU
    • GPU after optimizing code and input pipeline
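As a quick sanity check on the shapes in that architecture: assuming each of the 3 convolutional layers is followed by 2×2 pooling (the pooling is my assumption; only "3 convolutional + 2 fully connected" is stated above), the 128-pixel input shrinks to a 16×16 feature map before the fully connected layers:

```python
def feature_map_sizes(input_size, n_conv_blocks, pool=2):
    """Spatial size after each conv + pool block ('same' padding assumed,
    so only the pooling shrinks the map)."""
    sizes, size = [], input_size
    for _ in range(n_conv_blocks):
        size //= pool
        sizes.append(size)
    return sizes

print(feature_map_sizes(128, 3))  # [64, 32, 16]
```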

While model performance in terms of classification accuracy looks good, looking at the individual epochs I see that the model is overfitting. Indeed, I validated this observation by testing on a few new images. While accuracy on Commons images is pretty high, especially for non-photographic images, the model at times confuses stock photos with graphics. To overcome this issue, I need to change the hyperparameters and add dropout and other mechanisms against overfitting to the network architecture.

However, the main purpose of this task was to test the performance of the GPU for image classification with TensorFlow. The results above show that even without optimization, our GPU trains at 2x compared to CPU; when we optimize the input pipeline and data processing for the GPU, we get up to 6.5x. If needed, performance can be further improved by pre-serializing the input images. While that would make training more efficient, it would require a lot of storage space, so this is a trade-off we need to think about.
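To put a rough number on that trade-off (my assumptions: ~200k images serialized as decoded uint8 tensors at the 128-pixel training size; storing larger thumbnails for flexibility would multiply this several times over):

```python
IMAGES = 200_000   # approximate dataset size from the task
H = W = 128        # training resolution
CHANNELS = 3       # RGB

bytes_total = IMAGES * H * W * CHANNELS
print(f"{bytes_total / 1e9:.1f} GB of decoded uint8 tensors")  # 9.8 GB
```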

That's all for now; I'll come back with more details once I've solved the overfitting problem.

CC @leila @elukey (thanks so much for the support!!!) and @Nuria

Update after changing learning rate and modifying few parameters in the training:

  • We reach 91% accuracy on the validation set (a 2% improvement over the previous model) and a significant loss reduction.
  • Manual validation on an external dataset shows reduced overfitting (80% of new photos and 100% of new graphics are correctly recognized)
  • Problems persist for near-abstract photos and for extremely low-res photos
  • All this with a training time of only 18 minutes for a dataset of 150k images!