Build dataset of Quarry queries
Closed, Resolved, Public

Description

The goal is to publish a publicly available dump of Quarry queries so that AI models can potentially be fine-tuned to generate a Wikimedia replica query from a natural-language description.

The dataset would include three fields:

  • Description/title of the query
  • Database the query runs against
  • SQL text of the query

The dataset would be released under CC0, the same license under which the queries' SQL is currently released.
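
As a rough illustration of the three fields listed above, one record could look something like the sketch below. The field names (`description`, `database`, `sql`), the `QuarryRecord` class, the `to_jsonl` helper, the JSON Lines layout, and the sample row are all assumptions made for this example; they are not taken from the actual dump.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class QuarryRecord:
    """One dataset row. Class and field names are hypothetical."""

    description: str  # description/title of the query
    database: str     # database the query runs against, e.g. "enwiki_p"
    sql: str          # SQL text of the query


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


# Made-up sample row, for illustration only (not real Quarry data).
example = QuarryRecord(
    description="Count pages in the main namespace",
    database="enwiki_p",
    sql="SELECT COUNT(*) FROM page WHERE page_namespace = 0;",
)

jsonl = to_jsonl([example])
print(jsonl)
```

A flat, one-object-per-line layout like this is only one possible choice; it is used here because it keeps each description paired with its database and SQL, which is the shape a fine-tuning pipeline would typically consume.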

Event Timeline

Mentioned in SAL (#wikimedia-cloud-feed) [2023-05-19T09:59:01Z] <wm-bot2> added user isaacj to the project as reader (T337019) - cookbook ran by arturo@endurance

Isaac claimed this task.

Hal put together an initial dataset, which we'll work on cleaning a bit more: https://huggingface.co/datasets/htriedman/wikidb