A major blocker to getting more researchers and developers involved in training AI models for Wikimedia problems is the lack of datasets that map to real tasks Wikimedians face in their work. The goal of this task is to identify potential tasks and associated datasets, then extract and publish those datasets to support research and AI-model development in these areas. We will aim for the datasets to cover at least a few different Wikimedia projects.
The steps are as follows:
- Identify a list of 10+ potential datasets and associated tasks
- Evaluate how feasible it is to extract the data for each and select several to prioritize
- Develop pipelines to do at least a one-off extraction of the data for the prioritized datasets
- Develop basic task descriptions and data sheets for each published dataset
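
As a rough illustration of the last two steps, a one-off extraction could write records to a JSON Lines file and emit a small data sheet alongside it. This is only a sketch under stated assumptions: the `DataSheet` fields, the `extract_once` helper, and the file naming are all hypothetical and not part of any existing Wikimedia pipeline, and the input records are assumed to already be plain dictionaries produced by whatever source (dump, API, database query) a given dataset uses.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class DataSheet:
    """Minimal, hypothetical data sheet; the field names are illustrative only."""
    name: str
    task: str
    source_project: str
    num_records: int
    fields: list


def extract_once(records, out_dir, name, task, source_project):
    """Write records to <out_dir>/<name>.jsonl and a data sheet next to it.

    `records` is any iterable of dicts; how they are obtained is left open.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    count = 0
    fields = set()
    with open(out_dir / f"{name}.jsonl", "w", encoding="utf-8") as f:
        for rec in records:
            # One JSON object per line keeps the file streamable.
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
            fields.update(rec)  # collect the keys seen across all records
            count += 1

    sheet = DataSheet(name, task, source_project, count, sorted(fields))
    with open(out_dir / f"{name}.datasheet.json", "w", encoding="utf-8") as f:
        json.dump(asdict(sheet), f, indent=2)
    return sheet
```

For example, `extract_once(rows, "out", "demo", "toy classification task", "enwiki")` would produce `out/demo.jsonl` and `out/demo.datasheet.json`. A real data sheet would likely also record licensing, collection dates, and known limitations.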