
[Research Engineering Request] Produce image datasets
Open, High, Public

Description

Goal

As part of T341907, I am working on releasing more datasets that can help AI practitioners work on models relevant to Wikimedia's needs. Two of the most exciting of those datasets deal with image data, however, which is a major challenge for our public data-sharing infrastructure. This task is a request for support in working out what pipelines/infrastructure we would need to have in place to share that image data.

Motivation

Wikimedia produces a ton of data and makes the vast majority of it available as regular snapshots through the dumps. This allows researchers and developers to use the data in research, training AI models, etc., when it's infeasible to hit our APIs given how many requests would have to be issued. One major gap there is image data -- specifically, images on Commons and DjVu/PDF files on Wikisource.

Engineering needs

The Commons image data likely represents the largest challenge, so I'll focus on that:

  • Filtering: for Commons imagery, we'd likely need to do some best-effort filtering based on Legal review -- e.g., NSFW removal such as that described for WIT. This would require running each image through various computer vision models, which could also create some headaches given the scale.
  • Augmentation: we might also consider generating embeddings for each image, which again would require running a model across all the images and has scaling challenges (see the sketch after this list).
  • Storage: ideally we'd have a public endpoint where this data can be stored and downloaded. At a minimum, we need a way of moving the files off of our internal infrastructure so they can be uploaded to other public data platforms.
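As a rough illustration of what the filtering and augmentation steps could look like per image, here is a minimal sketch that scores each image with an off-the-shelf NSFW classifier and computes a CLIP embedding. The specific model names (Falconsai/nsfw_image_detection, openai/clip-vit-base-patch32), the threshold, and the local file layout are assumptions for illustration only; a real pipeline would be distributed across the full corpus and the filtering criteria would follow the Legal review.

```python
# Illustrative sketch only: filter images with an NSFW classifier and compute CLIP embeddings.
# Model choices, threshold, and paths are assumptions, not a decided pipeline.
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, pipeline

NSFW_THRESHOLD = 0.9  # illustrative cut-off; any real threshold would follow Legal review

# Example models; the actual choices would come out of the Legal/Research review.
nsfw_classifier = pipeline("image-classification", model="Falconsai/nsfw_image_detection")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def process_image(path: Path):
    """Return (keep, embedding) for one image; (False, None) if it fails the filter."""
    image = Image.open(path).convert("RGB")

    # Filtering: drop the image if the classifier is confident it is NSFW.
    scores = {r["label"]: r["score"] for r in nsfw_classifier(image)}
    if scores.get("nsfw", 0.0) >= NSFW_THRESHOLD:
        return False, None

    # Augmentation: compute a CLIP image embedding for downstream search/similarity.
    inputs = clip_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embedding = clip_model.get_image_features(**inputs)[0]
    return True, embedding


if __name__ == "__main__":
    for path in Path("./commons_sample").glob("*.jpg"):  # hypothetical local sample
        keep, emb = process_image(path)
        print(path.name, "kept" if keep else "filtered", None if emb is None else emb.shape)
```

Even at this per-image level, the scaling headache is visible: both steps are a model forward pass per file, so the work would have to be batched and spread over whatever compute the pipeline ends up running on.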

Event Timeline

Miriam triaged this task as High priority. Dec 5 2023, 6:42 PM

Will not prioritize this task for now; Adam Baso may be able to explore the possibility of "how to store/retrieve large datasets".

More context:

Hosting large datasets has been a challenge at the Foundation, and this task has highlighted the need for that capability. Since this functionality is a prerequisite for completing this task, and given the resources/effort that overhead would require, Adam Baso will be helping in this regard, with the goal of helping researchers access dumps as well as potentially helping other parts of the organization/Enterprise. ETA: 12-18 months.

It doesn't have to be dumps downloadable online (by the way, if they were online they could also be distributed via torrents). They could be cloned/copied to hard drives that people/organizations can buy, which could also generate a little bit of revenue.

Moreover, I think videos are also valuable training data, and Wikimedia Commons dumps are definitely not only about AI training data but also have all sorts of other potential applications, including making the data more resilient against data loss. Please see the proposal here and the Wikimedia Commons page about Backups and dumps linked there.

For NSFW removal, deepcategory:Nude people could be used, and one should definitely include the metadata of the files, like their categories, so people can selectively remove or select such subsets afterwards.

One example of how dumps can be used for purposes other than AI training is identifying essentially duplicate and very similar images and then letting a bot e.g. link the other versions in the file description or suggest duplicate files for deletion, and so on (that's just one example; there are more ways this could improve WMC itself or the open source software ecosystem).

Edit: it seems like the main issue is rather at T298394.
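To make the duplicate-detection example concrete, here is a minimal sketch using perceptual hashes via the third-party imagehash library. The directory path, file glob, and Hamming-distance threshold are assumptions for illustration, and a real run over Commons-scale data would need an index (e.g. hash bucketing) rather than the pairwise comparison shown here.

```python
# Minimal sketch of near-duplicate detection with perceptual hashes.
# Assumes images are already extracted locally; paths and threshold are illustrative.
from pathlib import Path

from PIL import Image
import imagehash  # third-party: pip install ImageHash

HAMMING_THRESHOLD = 5  # max bit difference to treat two images as near-duplicates (tunable)


def find_near_duplicates(image_dir: str):
    """Return pairs of images whose perceptual hashes differ by at most HAMMING_THRESHOLD bits."""
    hashes = []  # list of (path, phash)
    for path in Path(image_dir).glob("*.jpg"):
        try:
            with Image.open(path) as img:
                hashes.append((path, imagehash.phash(img)))
        except OSError:
            continue  # skip unreadable/corrupt files

    duplicates = []
    for i, (path_a, hash_a) in enumerate(hashes):
        for path_b, hash_b in hashes[i + 1:]:
            # imagehash overloads '-' as the Hamming distance between two hashes
            if hash_a - hash_b <= HAMMING_THRESHOLD:
                duplicates.append((path_a, path_b))
    return duplicates


if __name__ == "__main__":
    for a, b in find_near_duplicates("./commons_sample"):  # hypothetical local sample
        print(f"possible duplicate: {a} ~ {b}")
```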

Just a quick note that we are still evaluating the priority of this task, which will depend on the directions identified by the AI Strategy work (T340693). So this task currently has a dependency upon the strategy work being completed.