As part of Commons Upload Wizard's planned improvements, I'd like to request access to Commons deleted images in the backup cluster.
See T340546#9005987 for the initial conversation with @jcrespo.
Technical details
- Download an initial dataset of all files deleted on Commons within a 1-year interval. This should amount to roughly 600k files.
- There are no hard requirements on download concurrency; it would be great if you could suggest a reasonable value.
- Store the dataset either in the Analytics Hadoop cluster or on an Analytics client machine, depending on the final size.
- Use the dataset to develop experimental machine learning models.
- Keep the dataset until development is over.
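Since no concurrency value is fixed above, here is a minimal sketch of how the download could bound its concurrency with a thread pool. The `fetch` callable and the worker count are placeholders, not part of any agreed setup; in the real job `fetch` would hit whatever endpoint the backup cluster exposes:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_all(file_names, fetch, max_workers=4):
    """Fetch every file with at most `max_workers` concurrent requests.

    `fetch` is a callable taking a file name and returning its bytes
    (hypothetical here). Returns a dict mapping file name -> content,
    or the raised exception for files that failed.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, name): name for name in file_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # keep going; record the failure
                results[name] = exc
    return results
```

At ~600k files, even a small worker count (say 4-8) would keep the load on the backup cluster modest while still finishing in a reasonable time.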
Access schedule
- The initial dataset collection requires one-off access.
- If we then decide to productionize the models, access will be required roughly quarterly.
Update: file names
@Ladsgroup, @MatthewVernon - please find below 5 attachments, each containing a list of deleted file names we'd like to access:
Can you please let me know when the files are available on stat1008? Thanks a lot!
Notes
- 500 px thumbnails are enough; we don't need full-resolution files
- would you be so kind as to group image files by list? One directory per list would be perfect, i.e., album covers, books, logos, screenshots, and out of domain
- file names come from deletion requests that ended in a deletion, like this one. Files that were later restored/undeleted might therefore still be present in the lists
- the last attachment is gzipped to get around Phabricator's upload limit