We have been considering creating image dumps (T73405) for some time, but have never been able to get around to it due to various constraints. Something that seems much easier to implement is a dump of valid thumbnail URLs. These URLs are generally (some ancient URLs are different, and other data can be prepended to the px size) in the form of:
By producing a dump of resized thumbnail URLs, external users can choose on a per-image basis which existing thumbnail is close enough to their needs and download the appropriate images from our existing infrastructure. Use cases that only want a few thousand images should still go to the public APIs, but any use case that wants all ~60M images will be better served by this dump than by hitting our public APIs tens of millions of times (which, if they follow our rate limit guidelines, would take many months).
This dump can be generated relatively easily by paginating the swift container listings for the 255 commons thumbnail containers and transforming all the internal swift URLs into public external URLs. An open question is whether swift still holds thumbnails or files that should have been deleted but were not. There are no known offences, but that is far from a guarantee, so any dump will need to be whitelisted against the set of valid pages; a first draft of the whitelisting should show whether it is actually necessary. There are only ~60M valid pages on commons, so an implementation could likely build an in-memory set of all the known files and check all the thumbnails (~1.3B) against it.
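The paginate, transform, and whitelist steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed implementation: `fetch_page` is a hypothetical callable standing in for a marker-based swift container listing request, the `x/xy/<file>/<thumb>` object-name layout and the `PUBLIC_PREFIX` value are assumptions that are not taken from this task, and names that do not match the assumed layout (for example the ancient URL forms) are simply skipped.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Set
from urllib.parse import quote

# Assumed public prefix for commons thumbnails (an assumption, not from the task).
PUBLIC_PREFIX = "https://upload.wikimedia.org/wikipedia/commons/thumb/"

# Hypothetical listing hook: returns up to `limit` object names that sort
# strictly after `marker` (or from the start when marker is None).
FetchPage = Callable[[Optional[str], int], List[str]]


def list_container(fetch_page: FetchPage, limit: int = 10000) -> Iterator[str]:
    """Yield every object name in one container using marker pagination."""
    marker: Optional[str] = None
    while True:
        page = fetch_page(marker, limit)
        if not page:
            return
        yield from page
        # The last name of this page becomes the marker for the next request.
        marker = page[-1]


def original_name(object_name: str) -> Optional[str]:
    """Return the original file name from an assumed 'x/xy/<file>/<thumb>' name."""
    parts = object_name.split("/")
    if len(parts) != 4 or len(parts[0]) != 1 or len(parts[1]) != 2:
        return None  # Layout not recognised; skip rather than guess.
    return parts[2]


def public_urls(object_names: Iterable[str], known_files: Set[str]) -> Iterator[str]:
    """Yield public URLs only for thumbnails whose original file is whitelisted."""
    for name in object_names:
        original = original_name(name)
        if original is not None and original in known_files:
            # Percent-encode the path but keep the slashes between segments.
            yield PUBLIC_PREFIX + quote(name)
```

Keeping the listing call behind `fetch_page` means the filtering logic can be exercised against a small in-memory list before it is pointed at the real containers, and the whitelist check stays a single set lookup per thumbnail.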
To be investigated:
- What kind of purging is going on? How long will the URLs in the dumps be valid?
- Should the dump be taken from the dumps infrastructure, or from analytics and shipped to dumps? In particular, analytics has on-demand compute resources which might simplify the work, but scheduling is more complicated.
- Where to get the list of known files on commons? We could extract them from CirrusSearch dumps in analytics, or directly from the search clusters on dumps infra, but perhaps there are better ways.
- Can we provide guidelines for how external users should rate limit their retrieval of thumbnail images?