To figure out how Commons images can best be made available for image processing research, we need to estimate how large the dataset is at a given image size. This will tell us whether the whole dataset could, for example, be copied over to the GPU-equipped machines, or whether those machines could only hold a portion of the dataset at any given time.
|Open||Miriam||T215413 Image Classification Working Group|
|Resolved||Gilles||T215250 Estimate size of Commons image corpus at given resolution|
I'm going to consider only JPG images, which is where the vast majority of photographs are. If you're interested in lossless image types, which tend to contain diagrams/maps (SVG, TIFF, PNG), let me know. Those are a lot heavier individually (roughly 3-5 times larger), even though their corpus is much smaller than the JPG one. Here's the page with the breakdown: https://commons.wikimedia.org/wiki/Special:MediaStatistics
Looking at 10 random JPGs on Commons, the average size of their 320px thumbnails is 32895 bytes. We have 45021664 JPGs on Commons, which means a total corpus of around 1.48TB. That sounds like something that can fit on the server(s) used for image processing.
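As a sanity check, the back-of-envelope estimate above can be reproduced in a few lines (the sample average and file count are the figures quoted above):

```python
# Back-of-envelope corpus size estimate at 320px.
avg_thumb_bytes = 32895    # average 320px thumbnail size from a 10-image sample
jpg_count = 45_021_664     # JPG files on Commons (Special:MediaStatistics)

total_bytes = avg_thumb_bytes * jpg_count
total_tb = total_bytes / 1e12  # decimal terabytes

print(f"Estimated 320px JPG corpus: {total_tb:.2f} TB")
```

Note the small sample size: 10 random files is enough for an order-of-magnitude figure, but the true average could easily differ.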
320px sounds like a good choice. In general, neural networks resize images to 256px before processing them.
One issue we might run into if we exclude lossless image types, especially PNGs, is that they tend to be heavily used for scientific topics and logos, so removing them from training might bias the corpus and the model.
E.g. @dcausse in T202339 found that the image quality model was behaving inconsistently on graphics images.
Resizing between sizes so close to each other (320px → 256px) could have a significant impact on quality in the form of blurriness and artefacts. 1024px is probably a better choice: it's exactly 4 times larger than 256px, so it downscales cleanly.
Running the numbers again at 1024, the estimated corpus size would be 9.7TB for lossy thumbnails (JPG + TIFFs + PDFs + DJVUs).
For lossless images (SVG + PNG), the 1024px corpus would be approximately 1.9TB.
Given that what you really need is 256px, I would suggest downloading the 1024px thumbnails and resizing them on the fly to 256px on the GPU machine, which should be a fast operation.
Seeing that downloading files between 2 production machines happens at 112MB/s, transferring a total of roughly 12TB would take a bit more than a day.
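A minimal sketch of that transfer-time arithmetic, combining the 9.7TB lossy and 1.9TB lossless 1024px corpora from above with the observed 112MB/s link speed:

```python
# Rough transfer time for the full 1024px corpus between production machines.
corpus_tb = 9.7 + 1.9       # lossy + lossless 1024px corpora, in TB
throughput_mb_s = 112       # observed file transfer rate, MB/s

seconds = corpus_tb * 1e12 / (throughput_mb_s * 1e6)
hours = seconds / 3600
print(f"{hours:.1f} hours (~{hours / 24:.1f} days)")
```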
This goes back to the other thing we said, which is that copying the whole corpus over is only worthwhile if the neural network training can process images faster than 112MB/s, which would be about 539 lossy images per second (or 225 lossless images per second).
It depends on what thumbnail sizes they're downloading. If it's heavily cached sizes, they can use very high concurrency, as they'll be getting them from Varnish most of the time. If there's a high chance that the requests will trigger the generation of new thumbnails (such as a size that's not widely used on-wiki), then no more than 4 concurrent downloads per IP address.
512 is the 41st most common size and 600 the 20th most common (as of 2017-04, the last time we ran this analysis). 1024 would be a much better choice (6th most common), followed by a local resize, imho.