
Estimate size of Commons image corpus at given resolution
Closed, ResolvedPublic

Description

In order to figure out how Commons images can be best made available for the purpose of image processing research, we need to figure out how large the dataset is for a given image size. This will inform whether the whole dataset could for example be copied over to the machines equipped with GPUs, or if they would only be able to store a portion of the dataset at any given time.

Event Timeline

Gilles created this task.Feb 5 2019, 10:46 AM
Restricted Application added a subscriber: Aklapper. Feb 5 2019, 10:46 AM

@Miriam for the past couple of years these thumbnail sizes have been prerendered at upload time:

320, 640, 800, 1024, 1280, 1920

I'm going to do my estimate based on 320px, assuming it would be a sufficient size for your purpose. Let me know if it isn't.

I'm going to consider only JPG images, which is where the vast majority of photographs would be. If you're interested in lossless image types that tend to contain diagrams/maps (SVG, TIFF, PNG), let me know. Those would be a lot heavier individually (roughly 3-5 times larger), even if their corpus is much smaller than the JPG one. Here's the page with the breakdown: https://commons.wikimedia.org/wiki/Special:MediaStatistics

Looking at 10 random JPGs on Commons, the average size of their 320px thumbnails is 32,895 bytes. We have 45,021,664 JPGs on Commons, which means a total corpus of around 1.48TB. That sounds like something that can fit on the server(s) used for image processing.
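The estimate above is just multiplication; as a sanity check (both input numbers are from the comment above):

```python
# Back-of-the-envelope corpus size estimate.
AVG_THUMB_BYTES = 32_895    # mean 320px thumbnail size over 10 random JPGs
JPG_COUNT = 45_021_664      # JPGs on Commons per Special:MediaStatistics

total_bytes = AVG_THUMB_BYTES * JPG_COUNT
print(f"{total_bytes / 1e12:.2f} TB")  # → 1.48 TB
```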

Gilles triaged this task as Medium priority.Feb 5 2019, 11:06 AM
Miriam added a subscriber: dcausse.Feb 5 2019, 11:33 AM

Thanks @Gilles!
320px sounds like a good solution. In general, neural networks resize images to 256px before processing them.

One issue we might run into if we exclude lossless image types, especially PNGs, is that they tend to be heavily used for scientific topics and logos, so removing them from the training set might bias the corpus and the model.
E.g. @dcausse in T202339 found that the image quality model was behaving inconsistently on graphics images.

Gilles added a comment.Feb 5 2019, 3:07 PM

Downscaling between sizes so close to each other (320 to 256) could have a significant impact on quality in the form of blurriness and artefacts. 1024 is probably a better choice: it's exactly 4 times larger than 256px, so it downscales cleanly.
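A quick check of which prerendered sizes divide evenly by 256 (the assumption being that integer downscale factors resample most cleanly); 1024 is the smallest size that qualifies:

```python
# Prerendered thumbnail widths from the earlier comment.
PRERENDERED = [320, 640, 800, 1024, 1280, 1920]
TARGET = 256

# Sizes that are an exact integer multiple of the target width.
clean = [s for s in PRERENDERED if s % TARGET == 0]
print(clean)  # → [1024, 1280]
```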

Running the numbers again at 1024, the estimated corpus size would be 9.7TB for lossy thumbnails (JPG + TIFFs + PDFs + DJVUs).

For lossless images (SVG + PNG), the 1024px corpus would be approximately 1.9TB.

Given that what you really need is 256px, I would suggest downloading the 1024px thumbnails and resizing them on the fly to 256px on the GPU machine, which should be a fast operation.

Seeing that downloading files between 2 production machines happens at 112MB/s, transferring a total of roughly 12TB would take a bit more than a day.

This goes back to what we said earlier: copying the whole corpus over is only worthwhile if the neural network training can process images faster than 112MB/s, which would be about 539 lossy images per second (or 225 lossless images per second).
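The transfer-time and break-even figures can be reproduced with quick arithmetic (the ~12TB total and 112MB/s rate are taken from the comments above; `avg_bytes` in `break_even_rate` is whatever the corpus actually averages per image):

```python
LINK_BPS = 112e6        # observed prod-to-prod transfer rate, bytes/s
CORPUS_BYTES = 12e12    # ~9.7 TB lossy + ~1.9 TB lossless, rounded up

# How long a full copy takes at the observed link rate.
transfer_days = CORPUS_BYTES / LINK_BPS / 86_400
print(f"{transfer_days:.2f} days")  # → 1.24 days

def break_even_rate(avg_bytes, link_bps=LINK_BPS):
    """Images/s the training must consume for a full copy to pay off."""
    return link_bps / avg_bytes
```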

Let's revisit this once you've been able to benchmark the processing speed of a typical training task on a GPU machine.

Gilles changed the task status from Open to Stalled.Feb 7 2019, 12:07 PM
bd808 added a subscriber: bd808.Feb 20 2019, 4:28 PM

Hi @Gilles @fgiunchedi, a group of researchers from HTW Berlin would be interested in doing Commons image visualization. They would like to download ~1M images to their machines, and they are asking what a reasonable number of parallel download requests would be.
Thanks! Grazie! Merci :)

It depends on what thumbnail sizes they're downloading. If it's highly cached sizes, they can use very high concurrency, as they will be getting them from Varnish most of the time. If there's a high chance that it will trigger the generation of new thumbnails (such as a size that's not widely used on wiki), then no more than 4 concurrent downloads per IP address.

Thank you @Gilles, I think they aimed for thumbnails sized 512 or 600. Do you think those are sizes reasonably widely used on wikis? I can advise them to keep the number of concurrent downloads below 5 just in case.

512 is the 41st most common size and 600 the 20th most common (as of 2017-04, last time we ran this analysis). 1024 would be a much better choice (6th most common), followed by a local resize, imho.

@Gilles, sounds good, thanks!


Nothing substantial to add to what @Gilles already said, except that it looks good to me!

Hi @Gilles,

512 is the 41st most common size and 600 the 20th most common (as of 2017-04, last time we ran this analysis). 1024 would be a much better choice (6th most common), followed by a local resize, imho.

Could you tell me where I can find these stats? Thanks!

@Miriam I've shared Filippo's google doc that contains the data with you.

Gilles closed this task as Resolved.Sep 25 2019, 1:50 PM

The original estimates were done a while ago; if you need fresh data, feel free to open a new task.