To figure out how Commons images can best be made available for image processing research, we need to estimate how large the dataset is at a given image size. This will tell us whether the whole dataset could, for example, be copied over to the GPU-equipped machines, or whether those machines could only hold a portion of the dataset at any given time.
|Open||Miriam||T215413 Image Classification Working Group|
|Resolved||Gilles||T215250 Estimate size of Commons image corpus at given resolution|
I'm going to consider only JPG images, which is where the vast majority of photographs are. If you're interested in lossless image types, which tend to contain diagrams/maps (SVG, TIFF, PNG), let me know. Those are a lot heavier individually (roughly 3-5 times larger), even though their corpus is much smaller than the JPG one. Here's the page with the breakdown: https://commons.wikimedia.org/wiki/Special:MediaStatistics
Looking at 10 random JPGs on Commons, the average size of their 320px thumbnails is 32895 bytes. We have 45021664 JPGs on Commons, which means a total corpus of around 1.48TB. That sounds like something that can fit on the server(s) used for image processing.
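As a sanity check, the back-of-envelope estimate above can be reproduced in a few lines (the sample average and file count are the figures quoted above):

```python
# Back-of-envelope corpus size estimate at 320px.
avg_thumb_bytes = 32895    # average 320px thumbnail size from a 10-image sample
jpg_count = 45_021_664     # JPG files on Commons (Special:MediaStatistics)

total_bytes = avg_thumb_bytes * jpg_count
total_tb = total_bytes / 1e12  # decimal terabytes

print(f"Estimated 320px JPG corpus: {total_tb:.2f} TB")
```

Note the small sample size: 10 random files is enough for an order-of-magnitude figure, but the true average could easily differ.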
320px sounds like a good choice. In general, neural networks resize images to 256px before processing them.
One issue we might run into if we exclude lossless image types, especially PNGs, is that they tend to be heavily used for scientific topics and logos, so removing them from training might bias the corpus and the model.
E.g. @dcausse in T202339 found that the image quality model was behaving inconsistently on graphics images.
Resizing between sizes so close to each other (320px → 256px) could have a significant impact on quality in the form of blurriness and artefacts. 1024px is probably a better choice: it's exactly 4 times larger than 256px, so it downscales cleanly.
Running the numbers again at 1024, the estimated corpus size would be 9.7TB for lossy thumbnails (JPG + TIFFs + PDFs + DJVUs).
For lossless images (SVG + PNG), the 1024px corpus would be approximately 1.9TB.
Given that what you really need is 256px, I would suggest downloading the 1024px thumbnails and resizing them on the fly to 256px on the GPU machine, which should be a fast operation.
Seeing that downloading files between 2 production machines happens at 112MB/s, transferring a total of roughly 12TB would take a bit more than a day.
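A minimal sketch of that transfer-time arithmetic, combining the 9.7TB lossy and 1.9TB lossless 1024px corpora from above with the observed 112MB/s link speed:

```python
# Rough transfer time for the full 1024px corpus between production machines.
corpus_tb = 9.7 + 1.9       # lossy + lossless 1024px corpora, in TB
throughput_mb_s = 112       # observed file transfer rate, MB/s

seconds = corpus_tb * 1e12 / (throughput_mb_s * 1e6)
hours = seconds / 3600
print(f"{hours:.1f} hours (~{hours / 24:.1f} days)")
```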
This goes back to the other thing we said, which is that copying the whole corpus over is only worthwhile if the neural network training can process images faster than 112MB/s, which would be about 539 lossy images per second (or 225 lossless images per second).
It depends on what thumbnail sizes they're downloading. If it's heavily cached sizes, they can use very high concurrency, as they'll be getting them from Varnish most of the time. If there's a high chance that the requests will trigger the generation of new thumbnails (such as a size that's not widely used on-wiki), then no more than 4 concurrent downloads per IP address.
512 is the 41st most common size and 600 the 20th most common (as of 2017-04, the last time we ran this analysis). 1024 would be a much better choice (6th most common), followed by a local resize, imho.