Goal
As part of T341907, I am working on releasing more datasets that can help AI practitioners work on models relevant to Wikimedia's needs. Two of the most exciting of those datasets deal with image data, however, which is a major challenge for our public data-sharing infrastructure. This task is a request for support in working out what pipelines/infrastructure we would need in place to share that image data.
Motivation
Wikimedia produces a ton of data and makes the vast majority of it available as regular snapshots through the dumps. This allows researchers and developers to use the data for research, training AI models, etc. when hitting our APIs is infeasible given how many requests would have to be issued. One major gap is image data -- specifically images on Commons and DjVu/PDF files on Wikisource.
Engineering needs
The Commons image data likely represents the largest challenge, so I'll focus on that:
- Filtering: for Commons imagery, we'd likely need to do some best-effort filtering based on Legal review -- e.g., NSFW removal such as that described for WIT. This would require running each image through various computer vision models, which could also create headaches at this scale (a rough sketch of this step follows this list).
- Augmentation: we might also consider generating embeddings for each image, which would again require running a model across all the images and has the same scaling challenges (see the embedding sketch after this list).
- Storage: ideally we would have a public endpoint where this data can be stored and downloaded. At a minimum, we need a way of moving the files off our internal infrastructure so they can be uploaded to other public data platforms.
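
To make the filtering step concrete, here is a minimal per-image sketch using a Hugging Face image-classification pipeline. The model id, label name, and threshold are placeholders for illustration only -- the actual classifier(s) and cut-offs would come out of Legal review, and at Commons scale this would run as a batch/distributed job rather than a loop like this.

```python
# Best-effort NSFW filtering sketch (placeholder model and threshold, not a vetted choice).
from pathlib import Path

from PIL import Image
from transformers import pipeline

NSFW_MODEL = "Falconsai/nsfw_image_detection"  # placeholder classifier id
NSFW_THRESHOLD = 0.5                           # placeholder score cut-off

classifier = pipeline("image-classification", model=NSFW_MODEL)


def is_probably_safe(image_path: Path) -> bool:
    """Return True if the classifier does not flag the image as NSFW."""
    image = Image.open(image_path).convert("RGB")
    for prediction in classifier(image):
        if prediction["label"].lower() == "nsfw" and prediction["score"] >= NSFW_THRESHOLD:
            return False
    return True


def filter_batch(image_paths: list[Path]) -> list[Path]:
    """Keep only the images that pass the best-effort NSFW check."""
    return [p for p in image_paths if is_probably_safe(p)]
```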
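
For the augmentation step, a sketch of embedding generation with CLIP via Hugging Face transformers is below. The model choice, batching, and output format are assumptions for illustration; the same scaling caveat applies -- in practice this would need GPU batch inference across the whole corpus.

```python
# Embedding-generation sketch (model id and normalisation are illustrative assumptions).
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # placeholder embedding model

model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)


def embed_images(image_paths: list[Path]) -> torch.Tensor:
    """Return one embedding vector per input image."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    # L2-normalise so downstream consumers can use cosine similarity directly.
    return features / features.norm(dim=-1, keepdim=True)
```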