We have been considering creating image dumps for some time (T73405), but have never been able to get around to it due to various constraints. Something that seems much easier to implement is producing a dump of valid thumbnail URLs. These URLs are generally of the following form (some ancient URLs differ, and other data can be prepended to the px width prefix):
```https://upload.wikimedia.org/wikipedia/commons/thumb/c/cb/The_Green_and_Golden_Bell_Frog.jpg/750px-The_Green_and_Golden_Bell_Frog.jpg```
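For illustration, a rough Python sketch of how such a URL decomposes; the regex is an assumption derived from the example above and deliberately ignores the older/variant forms and any extra data prepended to the px prefix:

```
import re

# Rough pattern for the common thumbnail URL form shown above. This is an
# illustrative assumption only; it does not cover older/variant URL forms
# or prefixes such as page/quality markers before "<width>px-".
THUMB_RE = re.compile(
    r"^https://upload\.wikimedia\.org/wikipedia/commons/thumb/"
    r"(?P<shard>[0-9a-f]/[0-9a-f]{2})/"      # e.g. c/cb
    r"(?P<file>[^/]+)/"                      # original file name
    r"(?P<width>\d+)px-(?P<thumb>[^/]+)$"    # requested width + thumb name
)

m = THUMB_RE.match(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/"
    "c/cb/The_Green_and_Golden_Bell_Frog.jpg/"
    "750px-The_Green_and_Golden_Bell_Frog.jpg"
)
if m:
    # -> The_Green_and_Golden_Bell_Frog.jpg 750
    print(m.group("file"), m.group("width"))
```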
By producing a dump of resized thumbnail URLs, external users can choose on a per-image basis which existing thumbnail is close enough to their needs and download the appropriate images from our existing infrastructure.
This dump can be generated relatively easily by paginating the swift container listings for the 255 commons thumbnail containers and transforming all the internal swift URLs into public external URLs. There is an open question of whether there are thumbnails or files inside swift that should have been deleted but were not. For this reason any dump will need to be whitelisted against the set of valid pages. There are only ~60M valid pages on commons, so an implementation could likely build a set of all known file names in a few tens of GB of memory and check all of the thumbnails (~1.3B) against it.
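A minimal sketch of the listing-and-whitelisting step. Both `list_container` (standing in for a paginated swift listing, e.g. python-swiftclient's `get_container` with a marker) and the internal object naming (`c/cb/<file>/<width>px-<file>`) are assumptions for illustration, not a description of the production setup:

```
from typing import Iterable, Iterator, Set

PUBLIC_PREFIX = "https://upload.wikimedia.org/wikipedia/commons/thumb/"


def list_container(container: str) -> Iterable[str]:
    """Placeholder for a paginated swift container listing (e.g. repeated
    calls to python-swiftclient's Connection.get_container using the last
    seen object name as the marker). Yields internal object names such as
    'c/cb/The_Green_and_Golden_Bell_Frog.jpg/750px-...' -- the exact
    naming scheme is an assumption here."""
    raise NotImplementedError


def thumb_urls(containers: Iterable[str], known_files: Set[str]) -> Iterator[str]:
    """Turn internal swift object names into public thumbnail URLs,
    keeping only thumbnails whose original file is in the whitelist of
    known valid pages (~60M names held in memory)."""
    for container in containers:
        for name in list_container(container):
            # Assumed object layout: "<shard>/<file>/<width>px-<file>";
            # the original file name is the second-to-last path segment.
            parts = name.split("/")
            if len(parts) < 2:
                continue
            original = parts[-2]
            if original in known_files:
                yield PUBLIC_PREFIX + name
```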
To be investigated:
[ ] What kind of purging is going on? How long will the URLs in the dumps be valid?
[ ] Should the dump be taken from the dumps infrastructure, or from analytics and shipped to dumps? In particular, analytics has on-demand compute resources which might simplify the work, but scheduling is more complicated there.
[ ] Where to get the list of known files on commons? We could extract them from the CirrusSearch dumps in analytics, or directly from the search clusters on the dumps infrastructure, but perhaps there are better ways (a rough sketch of the CirrusSearch dump approach is below).
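If we do go the CirrusSearch dump route, something along these lines might work. This assumes the dump is gzipped JSON lines where the document lines carry `namespace` and `title` fields, that file pages are namespace 6, and that spaces become underscores in swift/URL names:

```
import gzip
import json
from typing import Set


def known_files_from_cirrus_dump(path: str) -> Set[str]:
    """Build the whitelist of known file names from a CirrusSearch dump.

    Assumes gzipped JSON lines alternating between index-action lines and
    document lines; action lines have no 'namespace' field and are skipped.
    """
    files: Set[str] = set()
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            doc = json.loads(line)
            if doc.get("namespace") == 6 and "title" in doc:
                files.add(doc["title"].replace(" ", "_"))
    return files
```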