
Test GPUs with an end-to-end training task (Photo vs Graphics image classifier)
Open, Needs Triage, Public

Description

We would like to test the performance of the GPU on stat1005 with an end-to-end training task for image classification.
Inspired by @Gilles' idea in T215413, I will try to build a classifier that distinguishes photos from graphics (diagrams, maps, etc.).

  • Collect data from graphics and photo Commons categories -- important to time this as we have to estimate how much of a bottleneck this stage is (see T215250)
  • Train the model on stat1005 with GPU (using a simple architecture for now)
  • Evaluate the model in terms of accuracy and computation time
  • Repeat the training using CPU only and compare time and accuracy
  • [optional] Repeat everything with a pre-trained model, and compare time and accuracy

Event Timeline

Miriam created this task. Apr 24 2019, 11:16 AM
fdans moved this task from Incoming to Radar on the Analytics board. Apr 25 2019, 4:40 PM
Miriam updated the task description. Apr 30 2019, 10:44 AM

Data collection is over: it took 271044 s (~75 hours) for 320815 images (~160k per class), i.e. roughly 0.84 sec/image.
I downloaded 600-px thumbnails, and used 4 parallel sessions, as suggested in T215250.
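
The per-image rate and the total in hours can be re-derived from the two totals reported above. A minimal sketch (the variable names are mine, not from the task):

```python
# Totals reported in the comment above.
total_seconds = 271044
total_images = 320815

# Wall-clock time per image (with the 4 parallel sessions already folded in).
seconds_per_image = total_seconds / total_images
total_hours = total_seconds / 3600

print(f"{seconds_per_image:.3f} s/image over {total_hours:.1f} hours")
```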

To clarify, was that via the internet or the internal cluster?

This was via the internet. But we should try to do this from the internal cluster too, for comparison, if possible. I just need a few instructions on how to do this!

As an example, https://upload.wikimedia.org/wikipedia/commons/thumb/7/7f/NY_308_in_Rhinebeck_4.jpg/800px-NY_308_in_Rhinebeck_4.jpg maps to https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/thumb/7/7f/NY_308_in_Rhinebeck_4.jpg/800px-NY_308_in_Rhinebeck_4.jpg in the cluster if fetching directly from stat1005.
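
The mapping in this example is just a host swap with the path left untouched. A minimal sketch, assuming that holds for all thumbnail URLs (the function name `to_internal_url` is hypothetical; only the two host names come from the example above):

```python
from urllib.parse import urlsplit, urlunsplit

# Both hosts are taken from the example URLs above.
PUBLIC_HOST = "upload.wikimedia.org"
INTERNAL_HOST = "ms-fe.svc.eqiad.wmnet"


def to_internal_url(public_url: str) -> str:
    """Swap the public upload host for the internal one, keeping the path."""
    parts = urlsplit(public_url)
    if parts.netloc != PUBLIC_HOST:
        raise ValueError(f"not an {PUBLIC_HOST} URL: {public_url}")
    return urlunsplit(parts._replace(netloc=INTERNAL_HOST))
```

Applied to the public URL above, this returns the internal URL quoted in the same comment.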

So if you have the list of URLs at a prebucketed size, you should be able to pull the files into /srv or HDFS.

If you are pulling a lot of images, you'd want to go for a small prebucketed size: it saves network transfer even internally, and a hit is fast (no re-render, less data, etc.). I think for your own purposes you're already not grabbing large files, but the bucketed size is important IIRC.

Do you have a list of files bearing the sharded thumb paths like above? I was poking around and didn't see an obvious reference to a database table bearing the paths and couldn't remember where to look (I think MediaWiki generates the paths at runtime via a hook...it's a sha1, base36 thing). But if someone here knows how to query for all the things please chime in!
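
On the path question: as far as I know, MediaWiki derives the two shard directories from the MD5 hex digest of the file name (spaces replaced by underscores) rather than from sha1/base36, so the paths can be computed from a list of file names. Treat that as an assumption and check it against a known URL, such as the one quoted earlier, before relying on it. A minimal sketch (the function name and its parameters are hypothetical):

```python
import hashlib


def thumb_path(filename: str, width: int,
               project: str = "wikipedia", site: str = "commons") -> str:
    """Build a sharded thumbnail path, ASSUMING MD5-based shard directories."""
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # First shard is the first hex digit, second shard is the first two.
    return (f"/{project}/{site}/thumb/{digest[0]}/{digest[:2]}"
            f"/{name}/{width}px-{name}")
```

This does not handle URL-encoding of special characters, first-letter capitalization of titles, or file types whose thumbnail name differs from the original name.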

Earlier napkin math suggested that with, say, 40 million files at 70 KB per file, that's about 2.8 TB of data transfer, which is not too bad. The thought was that 100 GETs per second shouldn't be detrimental to Swift; at that rate the transfer would take about 5 days. (I'd still suggest pinging @fgiunchedi and @Gilles as usual before kicking off such a process, in case any maintenance or intensive jobs are going on.) I know you all were talking on that other ticket.
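
The napkin math can be written out as follows. This is only a sketch of the estimate: the file count, file size, and request rate are the round figures from the comment above, with "70k" read as 70,000 bytes.

```python
files = 40_000_000
bytes_per_file = 70_000   # "70k" read as decimal kilobytes (assumption)
gets_per_second = 100

total_tb = files * bytes_per_file / 1e12       # decimal terabytes
days = files / gets_per_second / 86_400        # 86,400 seconds per day

print(f"{total_tb:.1f} TB, {days:.1f} days")   # prints "2.8 TB, 4.6 days"
```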

You will necessarily have some misses on files that, for one reason or another, have never been pregenerated or stored on fetch, so I imagine keeping track of the misses will be helpful for reconciling before any model training.
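
One way to keep track of misses while capping the request rate is sketched below. It is an illustration, not a tested downloader: `fetch` is a hypothetical callable that you would supply (for example, an HTTP GET that saves the file and returns False on a 404), and the loop is sequential, so the cap is an upper bound on the request rate.

```python
import time
from typing import Callable, Iterable, List, Tuple


def fetch_all(urls: Iterable[str],
              fetch: Callable[[str], bool],
              max_per_second: float = 100.0,
              sleep: Callable[[float], None] = time.sleep,
              ) -> Tuple[int, List[str]]:
    """Call fetch(url) for each URL; return (hit count, list of missed URLs)."""
    interval = 1.0 / max_per_second
    hits = 0
    misses: List[str] = []
    for url in urls:
        started = time.monotonic()
        if fetch(url):
            hits += 1
        else:
            misses.append(url)  # keep for reconciliation before training
        # Sleep off whatever is left of this request's time slot.
        remaining = interval - (time.monotonic() - started)
        if remaining > 0:
            sleep(remaining)
    return hits, misses
```

The returned list of misses can be written to a file and retried or excluded before building the training set.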

Miriam added a comment. Edited May 29 2019, 2:35 PM

@elukey hi! Did anything change on stat1005 since the last time we tried tensorflow? I now get this message when I try to import it in Python. Thanks!

>>> import tensorflow as tf
Traceback (most recent call last):
  [....]
ImportError: libhiprand.so.1: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.

hiprand looks to be ROCm-specific; it should come from the package here: http://repo.radeon.com/rocm/apt/debian/pool/main/r/rocrand/

That repo doesn't seem to be part of the apt sources on stat1005. Do we still need to make a mirror of the ROCm repos on our prod repositories, perhaps? I also looked through puppet but didn't find where the packages to install are defined.

The deb itself does seem to be in place on stat1005, at /opt/apt/apt_1.9.1.211/pool/main/r/rocrand/rocrand_1.8.1_amd64.deb. I confirmed with `ar x` that this includes a libhiprand.so.1.

@EBernhardson you are right, no import of the packages has been done yet due to https://github.com/RadeonOpenCompute/ROCm/issues/761. I need to sort that out; upstream still hasn't answered (I will also follow up with SRE). For the moment I manually add the repos when needed (puppet cleans the config afterwards). It's still very hacky, but I am planning to get proper puppet code soon.

@Miriam yes I have updated rocm, and probably some packages got cleaned up. Let's meet when we are both on IRC so we can figure out the list of packages needed!

@Miriam rocrand installed, but I guess it will not be the last one :D

@Miriam `import tensorflow as tf` should work now (I fixed other packages in need of an upgrade; we will need a proper upgrade procedure in the future). You may need to upgrade tensorflow-rocm via pip so that it supports the new packages :)

@elukey thanks so much, everything works now!!

This comment was removed by elukey.

@EBernhardson opened T224723 to track the package imports, feel free to chime in :)