
Release datasets in support of Wikimedia-related AI modeling
Open, Needs Triage, Public

Description

A major blocker to getting more researchers and developers involved in training AI models to solve Wikimedia problems is the lack of datasets that map to real tasks that would benefit Wikimedians in their work. The goal of this task is to come up with some potential tasks and associated datasets, then extract and publish those datasets to support research and AI-model development in these spaces. We will aim for the datasets to cover at least a few different Wikimedia projects.

The steps are as follows:

  • Identify a list of 10+ potential datasets and associated tasks
  • Evaluate each for the feasibility of extracting the data and select several to prioritize
  • Develop pipelines to do at least a one-off extraction of the data for the prioritized datasets
  • Develop basic task descriptions and data sheets for each published dataset

Event Timeline

Isaac and I spent some time brainstorming about this last month. Here's a Google Doc with a bunch of existing ideas in it!

Thanks @Htriedman! Copying over a quick summary of the ideas in the doc. Eventually we'll move them to a more accessible place. Some of these datasets already exist, but I'm mentioning them here because I see them as relevant to this broader goal.

  • SPARQL + natural language (see the SPARQL sketch after this list)
  • SQL + natural language
    • Dataset: https://huggingface.co/datasets/htriedman/wikidb (see the loading sketch after this list)
    • Project: all (?) but probably mostly Wikipedia
    • Modeling ideas:
      • Generate replica query from natural language
      • Embed replica queries (multilingually) to enable natural-language search
  • Diffs + edit summaries
    • Dataset: doesn't exist, though there have been some initial explorations (see the API sketch after this list)
    • Project: Wikipedia
    • Modeling ideas:
      • Generate edit summary from diff
      • Flag edits where edit summary doesn't match diff
  • Dump of page images + transcribed text from Wikisource
    • Dataset: doesn't exist as far as I know
    • Project: Wikisource
    • Modeling ideas:
      • Improved OCR (especially for Indic languages)
  • Plaintext paragraphs + language codes
  • Commons images + captions / alt-text
  • Unstructured text -> QuickStatements
    • Dataset: doesn't exist (see the QuickStatements sketch after this list)
    • Project: Wikidata
    • Modeling ideas:
      • Generate QuickStatements from unstructured text
  • Anything audio-related from Commons?
  • Sentences needing citations
  • Citations + claims?
    • Dataset: Facebook put this together once for the Side Verifier project, which aligned a Creative Commons dataset with Wikipedia citations and passages.
  • Issue templates + passages
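
To make the SPARQL + natural language idea concrete, here's a minimal sketch of what one (question, query) pair could look like, run against the public Wikidata Query Service. The example question and the pairing itself are illustrative assumptions, not drawn from an existing dataset:

```python
import requests

# One hypothetical (natural language, SPARQL) training pair.
question = "Which humans were born in Paris?"
sparql = """
SELECT ?person ?personLabel WHERE {
  ?person wdt:P31 wd:Q5;    # instance of: human
          wdt:P19 wd:Q90.   # place of birth: Paris
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""

# Run the query side of the pair against the Wikidata Query Service.
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": sparql, "format": "json"},
    headers={"User-Agent": "dataset-sketch/0.1 (research prototype)"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["personLabel"]["value"])
```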
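The wikidb dataset above already exists on Hugging Face, so inspecting it is a one-liner. The split name and record layout here are assumptions; the dataset card at https://huggingface.co/datasets/htriedman/wikidb has the real schema:

```python
from datasets import load_dataset

# Split name and field names are assumptions; check the dataset card.
ds = load_dataset("htriedman/wikidb", split="train")
print(ds[0])  # expected: a natural-language question paired with a replica SQL query
```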
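For diffs + edit summaries, the pairing can be illustrated with the public MediaWiki Action API: recentchanges yields revision IDs and edit summaries, and action=compare returns an HTML diff between two revisions. This is only a sketch of the pairing; a real extraction pipeline would more likely work from the dumps:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "dataset-sketch/0.1 (research prototype)"}

# Recent edits, each with its edit summary and old/new revision IDs.
changes = requests.get(API, headers=HEADERS, params={
    "action": "query", "list": "recentchanges", "rctype": "edit",
    "rcprop": "ids|comment|title", "rclimit": 5, "format": "json",
}).json()["query"]["recentchanges"]

for change in changes:
    # action=compare returns an HTML diff between the parent and the new revision.
    resp = requests.get(API, headers=HEADERS, params={
        "action": "compare", "fromrev": change["old_revid"],
        "torev": change["revid"], "format": "json",
    }).json()
    diff_html = resp.get("compare", {}).get("*", "")
    print(change["title"], "|", change.get("comment", ""), "|", len(diff_html), "chars of diff")
```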
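For unstructured text -> QuickStatements, the target side of the dataset would be QuickStatements commands. A minimal sketch, assuming a hypothetical upstream extractor that emits (item, property, value) triples; V1 commands are just tab-separated fields:

```python
# Hypothetical model outputs, not from a real extraction pipeline.
extracted_triples = [
    ("Q937", "P31", "Q5"),     # Albert Einstein -> instance of -> human
    ("Q937", "P19", "Q3012"),  # Albert Einstein -> place of birth -> Ulm
]

for item, prop, value in extracted_triples:
    # QuickStatements V1 syntax: item, property, and value separated by tabs.
    print(f"{item}\t{prop}\t{value}")
```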

I also gave myself a TODO to review the more commonly used tools and see whether there are any interesting opportunities to collaborate with those developers to build additional datasets from their usage.

Weekly updates:

  • Began talking with Lucie-Aimée Kaffee about a Wikimedia + NLP workshop for ACL. We likely won't center it on a specific dataset (à la Wiki-M3L), but we talked about the possibility of focusing on some specific problems or presenting datasets in the workshop, so it could eventually be a good venue to discuss this effort.

Weekly updates:

  • No work this week. I'll be out for the next several weeks, so updates will presumably pause during that time, though I welcome input and will attend to it when I'm back.

Returned but no updates yet.

Isaac updated the task description.

I changed this into the umbrella task and work will continue under T348331, which will focus on getting started with a few datasets.