
[long] Train model for auto-generating SQL queries
Open, Needs Triage, Public

Description

Overview

A common task for Wikimedia developers looking to build bots or analyze trends on Wikipedia is to execute SQL queries against the Mediawiki database replicas. For example, if you have a bot that is looking to fix redirects, you might use this query to generate a list of candidate broken redirects and then loop through those pages to fix each instance. Or if you're doing an analysis of article deletion trends, you might use this query to generate the statistics.

While some really great tools have been developed to make it technically easy to access the database replicas -- e.g., Quarry -- generating queries can still be quite difficult: understanding the database layout can be a tall order, SQL isn't always the most intuitive, and searching for similar queries is not straightforward. There have been huge improvements in AI coding assistants, which can go a long way towards enabling those less familiar with SQL or the Mediawiki database schema to quickly generate queries. Not everyone has access to ChatGPT and similar tools, though, and these assistants generally don't have enough data on Mediawiki queries to cover most use-cases effectively.

Task

This research will focus on developing an AI-based approach to assisting developers in writing Mediawiki SQL queries based on natural language descriptions of what they'd like the query to do. For example, someone who articulates that they want to generate all new editor IDs and creation times on Czech Wikipedia since the start of 2023 should be able to get to something like:

SELECT
  actor_user,
  log_timestamp
FROM cswiki_p.logging l
INNER JOIN cswiki_p.actor a 
  ON (l.log_actor = a.actor_id)
WHERE
  log_action = 'create'
  AND log_type='newusers'
  AND log_timestamp > 20230101000000

Hal Triedman compiled a dataset of existing Mediawiki SQL queries based on the history of user-generated queries on Quarry to assist in training / evaluating a solution (data; code).

There are a number of approaches that could be explored:

  • Prompt-engineer natural language -> query (simplest)
    • Overview: this task would seek to automate the full process of going from a natural-language description of a potential Mediawiki query to executable SQL using an existing third-party-hosted AI chat-bot such as ChatGPT or SQL-specific agents like SQLCoder.
    • Idea: develop an effective prompt that developers could reuse for generating SQL queries for their needs. This likely will require some trial-and-error to discover what works well with particular chat-bots as far as providing examples, relevant table schema, giving clear instructions, and possibly even breaking down the problem into smaller pieces. This could be tested with the provided dataset and then made available to folks as e.g., a GPT for OpenAI or just guidance for what to submit to other chat bots.
  • Natural language -> related query (medium difficulty)
    • Overview: this approach would seek to find the most similar existing SQL query to a natural language description of the desired query. It might not exactly solve the need but hopefully is similar enough that it can be tweaked. This would likely use relatively simple AI models to support retrieval that could be run on existing Wikimedia infrastructure.
    • Idea: generate embeddings using a pre-trained model for each SQL query in the provided dataset based on its natural language description and index them in a database -- e.g., similar to this example with Wikitech Help pages. For any user-provided natural language description, use the same model to convert to an embedding and look up the most similar queries. This is the simplest but the existing natural-language descriptions of the queries might not be good enough for this to be effective. There are many models available via sentence-transformers.
    • Extension: augment the above with the SQL code. This might require a more specialized model or careful cleaning of the SQL queries to turn them into the most relevant keywords.
  • Fine-tuned natural language -> query (hardest)
    • Overview: this is the same end goal as the prompt-engineering but focuses on fine-tuning a model so it is more effective for Mediawiki. While feasible with today's generative AI models, it likely requires a very large self-hosted model that would not function on existing Wikimedia hardware.
    • Idea: fine-tune an existing open-source AI model that has a good understanding of SQL -- e.g., sqlcoder (though note that even the smallest 7B-parameter sqlcoder model still has >10GB of weights so GPUs would be necessary to fine-tune) -- with the provided dataset. The model could then be hosted with a basic interface for submitting natural-language descriptions and getting the resulting query.
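As a concrete illustration of the retrieval idea (the medium-difficulty approach above), here is a minimal sketch that indexes query titles and returns the closest match to a user's description. It uses a toy bag-of-words cosine similarity as a stand-in for a real pre-trained embedding model such as one from sentence-transformers; the index contents are made-up placeholders for illustration.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": bag-of-words token counts. A real system would
    # instead call a pre-trained sentence-transformers model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Index mapping natural-language titles of existing Quarry queries to
# their SQL (placeholder snippets, not real dataset entries).
index = {
    "list of new accounts since July 31, 2014": "SELECT ... FROM enwiki_p.logging ...",
    "articles deleted per month": "SELECT ... FROM enwiki_p.logging ...",
}

def most_similar(description):
    # Embed the user's description and return the closest indexed title.
    q = embed(description)
    return max(index, key=lambda title: cosine(q, embed(title)))

print(most_similar("all new editor IDs and creation times on Czech Wikipedia since the start of 2023"))
# → list of new accounts since July 31, 2014
```

Note that this toy lexical approach only matches on the few shared words ("new", "since", "of"); as the example below illustrates, a semantic embedding model is needed precisely because relevant descriptions may share almost no vocabulary.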

The example query provided above highlights the challenge inherent in this task. There is an example query in the provided dataset with the title list of new accounts since July 31, 2014 that is:

SELECT
  log_title,
  log_user,
  log_timestamp
FROM enwiki_p.logging
WHERE
  log_action = 'create'
  AND log_type='newusers'
  AND log_timestamp > 20140731000000
GROUP BY
  1
ORDER BY
  log_timestamp;

For the request all new editor IDs and creation times on Czech Wikipedia since the start of 2023 to retrieve this query as relevant, the two descriptions would have to be treated as highly similar despite having very few words in common.

Even if the existing query is retrieved effectively, an AI model adapting it to the specific user need would have to make a few adjustments:

  • Switching the English Wikipedia database (enwiki_p) to the Czech Wikipedia database (cswiki_p)
  • Replacing the now-deprecated log_user column with a join to the new actor table to get user IDs
  • Updating the timestamp filter from July 2014 to the start of 2023
  • Dropping columns, ordering, etc. that might no longer be needed
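One way to delegate adjustments like these to a chat-bot is to supply the retrieved example and schema notes as prompt context and let the model rewrite the query. The template and function below are hypothetical, not part of any existing tool; the example SQL is the dataset query shown above.

```python
# Hypothetical prompt construction: combine a retrieved example query
# with schema notes and the user's request.
RETRIEVED_SQL = """SELECT log_title, log_user, log_timestamp
FROM enwiki_p.logging
WHERE log_action = 'create' AND log_type = 'newusers'
  AND log_timestamp > 20140731000000"""

def build_prompt(request, example_sql, schema_notes=""):
    # The exact wording would need trial-and-error per chat-bot,
    # as described in the prompt-engineering approach above.
    return (
        "You write SQL for the Mediawiki database replicas.\n"
        f"Schema notes: {schema_notes}\n"
        f"Here is a similar existing query:\n{example_sql}\n"
        f"Adapt it to this request: {request}\n"
        "Return only the SQL."
    )

prompt = build_prompt(
    "all new editor IDs and creation times on Czech Wikipedia since the start of 2023",
    RETRIEVED_SQL,
    schema_notes="log_user is deprecated; join logging.log_actor to actor.actor_id instead.",
)
```

The schema notes are the key ingredient here: without being told that log_user is deprecated, a model adapting the old example would likely reproduce the stale column.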

A fully-working system might involve multiple of the suggested components -- e.g., query retrieval to find the most relevant example for a given user need, combined with prompt-engineering to give the AI model the most useful context (either similar queries or the schema of tables used in the most similar queries). And there might be value in more extensive data cleaning.
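Such a combined system can be sketched as a three-step pipeline. The `retrieve` and `call_llm` functions below are stubs standing in for a real vector index and a hosted or fine-tuned model; only the overall structure is the point.

```python
def generate_sql(description, retrieve, call_llm):
    # Step 1: find the most similar existing query
    # (a vector search over embeddings in a real system).
    example_title, example_sql = retrieve(description)
    # Step 2: build a prompt combining the user's request
    # with the retrieved context.
    prompt = (
        "Adapt this Mediawiki SQL query to the request below.\n"
        f"Example ({example_title}):\n{example_sql}\n"
        f"Request: {description}\n"
        "Return only the SQL."
    )
    # Step 3: generate with a hosted or fine-tuned model (stubbed here).
    return call_llm(prompt)

# Stub components for illustration only.
def retrieve(description):
    return ("list of new accounts since July 31, 2014",
            "SELECT ... FROM enwiki_p.logging ...")

def call_llm(prompt):
    return "-- model output would go here --"

sql = generate_sql("all new editor IDs on Czech Wikipedia since 2023",
                   retrieve, call_llm)
```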

I highly recommend starting by exploring the data and prompt engineering with different types of context before considering how to more fully automate the task. This task is considered [long] and presupposes familiarity with AI models and Mediawiki SQL. Initial approaches such as prompt engineering of third-party AI can likely be done in a short period of time, but the more custom approaches could easily take several months of consistent work to explore and test. The actual time needed, however, will depend greatly on your level of experience.

Rationale

AI models could be developed to make it easier to devise effective SQL queries for the Mediawiki databases, lowering the barrier to working with that data.

Process

  • This is an open task for anyone interested. You do not need permission to work on it. This task acts as a place to share findings and ask any questions that arise.
  • Though I (@Isaac) am likely not available to provide much support on this task, I am quite curious and encourage sharing updates on it. I'll also do my best to respond to questions that you post here.

Resources