
Medium-sized image dump
Open, LowPublicFeature

Description

Author: bugzilla.wikimedia.stuff

Description:
Current image dumps include the full-resolution files; as such, it is very difficult to download a large (> 10⁶) number of images. E.g., the 2013-11-29 image grab (the latest available from https://archive.org/details/wikimediacommons) is 68 GB but contains only ~22k media files, which amounts to ~3 MB/file.

Images resized to 1024×800 typically use 300 KB/image.
Images resized to 800×600 typically use 200 KB/image.
Images resized to 500×375 typically use 90 KB/image.

It would be great if torrents of resized media uploads were available on a monthly or yearly basis.

So why would anyone need that? I am currently working on an open-source, scalable implementation of an image search engine. This engine can return the images in a database that are most similar to a given query image. This is useful for casual browsing of an image database, but also for copyright-infringement or duplicate-image detection. Now that the engine is ready (based on a published, state-of-the-art method: http://www.robots.ox.ac.uk/~vgg/publications/2013/arandjelovic13/?update=1 ), I am looking for a large (10⁶-10⁸ images) dataset to provide a convincing demo.

For that demo, 10⁷ images at 500×375 would "just" require ~1 TB, which is a tractable torrent download.
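As a back-of-envelope check on that figure (illustrative only; the 90 KB average is the estimate quoted above):

```python
# Illustrative back-of-envelope estimate, using the ~90 KB per 500x375 image
# figure quoted above and decimal units (1 TB = 10**12 bytes).
n_images = 10**7
bytes_per_image = 90 * 10**3            # ~90 KB per resized image
total_tb = n_images * bytes_per_image / 10**12
print(f"~{total_tb:.1f} TB")            # ~0.9 TB, i.e. roughly 1 TB
```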

Of course, this dataset would be relevant not just to me, but also to the computer vision community at large. Since category information is associated with each image, it would constitute a great resource for image recognition and classification.

It should be noted that we are talking about a subset of all media files: e.g. not PDFs, animated GIFs, or video files.


Version: unspecified
Severity: normal

Details

Reference
bz71405

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 3:46 AM
bzimport set Reference to bz71405.

I have met Régis at my coworking space. I believe a dump of 800px image thumbnails could be useful to a wide research audience :)

Nemo_bis set Security to None.
Nemo_bis subscribed.

For a complete dump, it's much easier to request files on the fly. There is really no reason not to. https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/mwoffliner.js already does so for all the images needed by each wiki.
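For illustration, here is a minimal Python sketch of that on-the-fly approach; the API endpoint, page title, and width are example values, and API continuation is omitted for brevity:

```python
# Sketch: fetch the thumbnails of all images used on a page on the fly via the
# MediaWiki API, in the same spirit as mwoffliner. Endpoint/title/width are
# example values; continuation handling is omitted for brevity.
import requests

API = "https://en.wikipedia.org/w/api.php"  # any MediaWiki API endpoint

def page_image_thumbs(title, width=800):
    """Yield (file title, thumbnail URL) for every image used on `title`."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "images",      # iterate over images used on the page
        "gimlimit": "max",
        "titles": title,
        "prop": "imageinfo",
        "iiprop": "url",
        "iiurlwidth": width,        # ask the thumbnailer for this width
    }
    data = requests.get(API, params=params).json()
    for page in data.get("query", {}).get("pages", {}).values():
        for info in page.get("imageinfo", []):
            yield page["title"], info.get("thumburl") or info["url"]

for name, url in page_image_thumbs("Wikipedia"):
    print(name, url)
```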

It's been a long time since this was filed. We don't currently even have full-size dumps of the media used on the projects available, or even a mirror of them. I'm working on a stealth project to bring those back first; expect it to take some time. Whether we would then also provide (most likely via a mirror) media scaled to the sizes at which they are used on the projects is yet to be determined. See T53001 for that task.

What if, instead of dumps, which have the obvious difficulty of size and scope, we dumped lists of public URLs? The use case here is allowing researchers to collect a set of all image thumbnails at approximately 300 pixels wide. Currently they need to go through the MediaWiki APIs and request thumbnails to be generated, without knowing whether close-enough thumbnails already exist.
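For reference, this is roughly what that per-file API round trip looks like (a sketch only; the file title and width are example values):

```python
# Sketch of the per-file request researchers currently make: ask the API to
# return (and, if needed, generate) a thumbnail URL at a requested width.
import requests

API = "https://commons.wikimedia.org/w/api.php"

def thumb_url(file_title, width=300):
    params = {
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "iiprop": "url",
        "iiurlwidth": width,
        "titles": file_title,
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    info = next(iter(pages.values()))["imageinfo"][0]
    return info["thumburl"]

print(thumb_url("File:Example.jpg", 300))
```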

I've already mocked up something that generates a dump of the URIs for all rendered thumbnails on Commons. This is ~6 GB compressed and contains a list of ~1.3 billion thumbnail URIs. For this use case of finding thumbnails close to 300 pixels, it looks like 55 out of 59 million images already have a rendered thumbnail within 10% of the desired dimensions. For comparison, wikiview.net has been ingesting data from Commons for months but still has less than 50% of these thumbnails indexed.
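As an illustration of how such a URI list could be used, here is a sketch that parses rendered widths out of thumbnail URIs and checks the within-10% criterion; it assumes a hypothetical one-URI-per-line dump and the usual .../thumb/<a>/<ab>/<File>/<width>px-<File> URL layout:

```python
# Sketch: group rendered thumbnail widths by original file and check whether
# any rendered width falls within 10% of a 300 px target. Assumes one URI per
# line and the usual Commons thumb URL layout.
import re
from collections import defaultdict

THUMB_RE = re.compile(
    r"/thumb/(?:[0-9a-f]/[0-9a-f]{2}/)?(?P<file>[^/]+)/(?P<width>\d+)px-")

def widths_per_file(lines):
    widths = defaultdict(set)
    for line in lines:
        m = THUMB_RE.search(line)
        if m:
            widths[m.group("file")].add(int(m.group("width")))
    return widths

def has_close_thumb(widths, target=300, tolerance=0.10):
    return any(abs(w - target) <= tolerance * target for w in widths)

sample = [  # placeholder data standing in for the real dump
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Foo.jpg/320px-Foo.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Foo.jpg/1024px-Foo.jpg",
]
for name, widths in widths_per_file(sample).items():
    print(name, sorted(widths), has_close_thumb(widths))
```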

Mostly, I'm not 100% sure whether there are any risks in providing a complete list of known URIs into Swift storage. I could imagine ways to generate a whitelist of known files to filter the list if necessary, but if it's not necessary, that would be a ton of extra work for no benefit.

Aklapper changed the subtype of this task from "Task" to "Feature Request".
Aklapper added subscribers: ArielGlenn, Aklapper.

I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task.

I think there are two actionable things to do here:

  • All media files should be made available through IPFS. Then it would be easy to host a copy of the files, or to contribute to hosting part of a copy, and you could pin the files you are interested in (see the sketch after this list). It would work like a torrent, except that it is dynamic: new files can be added as they are uploaded, and removed files can be unpinned by Wikimedia and hosted by others, or be lost from IPFS. It could probably be arranged so that Wikimedia does not have to store files twice, i.e. IPFS would serve the same files already used for the web/API. The people behind IPFS are thinking along these lines as well, so it could align: https://filecoin.io/store/#foundation. A static dump of all media files is hard to make at the current size, so making this more distributed and fluid could help.
  • For resized images, instead of making a special dump for them, or spending CPU to regenerate all those resized images, a simpler approach would be to also put on IPFS the thumbnails already generated by the thumbnail service. They are generated and cached anyway, and could be made available for others to download if they want.
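To make the pinning idea concrete, here is a minimal sketch, assuming a local IPFS daemon with the standard `ipfs` CLI (kubo) installed; the file path is a placeholder:

```python
# Minimal sketch of selectively mirroring files via IPFS. Assumes a local IPFS
# daemon and the standard `ipfs` CLI; the file path below is a placeholder.
import subprocess

def ipfs_add(path):
    """Add a local file to IPFS and return its content identifier (CID)."""
    result = subprocess.run(["ipfs", "add", "-q", path],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

def ipfs_pin(cid):
    """Pin a CID so this node keeps hosting the file."""
    subprocess.run(["ipfs", "pin", "add", cid], check=True)

# A mirror operator could pin only the subset of files they care about.
cid = ipfs_add("thumbs/Foo.jpg")  # hypothetical local thumbnail file
ipfs_pin(cid)                     # `ipfs add` pins locally by default; pinning
                                  # matters most for CIDs fetched from other nodes
print("pinned", cid)
```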

So this is similar to the "list all URLs" idea above, but the list could be on IPFS, both for original files and thumbnails. Wikimedia servers would not have to do more work, just expose what they already have. Then, if somebody wants to make a snapshot in time, they can pin a particular set of files on IPFS and host them.