[SPIKE] Explore search query datasets for Q&A question generation
Closed, Resolved | Public | 5 Estimated Story Points | Spike

Description

The semantic search's Q&A use case requires a set of (natural language question, Wikipedia article segment) pairs as input.
We currently have 2 approaches in mind:

  • question-to-question - machine-generated, see demo
  • real-world search queries - user-generated

In this spike, we should explore available Google datasets of user search queries tied to URLs of Wikimedia projects.
The main goal is to understand whether we can extract enough fine-grained data, at least at the section level, i.e., (query, URL with section fragment) pairs.
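As a rough illustration of what "section level" means here, the sketch below keeps only the pairs whose URL carries a section fragment. It assumes the standard `/wiki/<Title>#<Section>` URL layout and that anything after a `:~:` marker in the fragment is a browser text directive rather than a section anchor. The function names and the sample rows are hypothetical and do not come from any of the datasets under review.

```python
from urllib.parse import urlsplit, unquote

WIKI_PREFIX = "/wiki/"  # assumption: standard article path layout


def split_section(url):
    """Return (article_title, section_or_None) for a Wikipedia article URL.

    Returns (None, None) when the path does not look like an article path.
    """
    parts = urlsplit(url)
    if not parts.path.startswith(WIKI_PREFIX):
        return None, None
    title = unquote(parts.path[len(WIKI_PREFIX):]).replace("_", " ")
    # Assumption: drop text directives such as "#:~:text=...",
    # which are not section anchors.
    anchor = parts.fragment.split(":~:", 1)[0]
    section = unquote(anchor).replace("_", " ") or None
    return title, section


def section_level_pairs(pairs):
    """Keep only (query, title, section) triples that have a section."""
    kept = []
    for query, url in pairs:
        title, section = split_section(url)
        if title and section:
            kept.append((query, title, section))
    return kept


# Hypothetical sample rows, for illustration only.
sample = [
    ("how tall is mount everest",
     "https://en.wikipedia.org/wiki/Mount_Everest#Height"),
    ("mount everest",
     "https://en.wikipedia.org/wiki/Mount_Everest"),
]

if __name__ == "__main__":
    for row in section_level_pairs(sample):
        print(row)
```

The share of rows that survive this filter would give a first estimate of how much section-level data a dataset can provide.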

Event Timeline

mfossati changed the subtype of this task from "Task" to "Spike". Jan 14 2026, 11:25 AM
mfossati moved this task from Incoming/Inbox to Needs Refinement on the Reader Growth Team board.
mfossati triaged this task as Medium priority. Jan 14 2026, 11:33 AM
egardner set the point value for this task to 5. Jan 14 2026, 5:43 PM
egardner subscribed.

Let's treat this task as a time-box - see how far you can get and return with findings.

mfossati changed the task status from Open to In Progress. Jan 30 2026, 3:06 PM
mfossati claimed this task.

I've explored 2 datasets:

The first one is a manually annotated dataset for search performance evaluation. While it contains curated natural language queries, I don't think it's a fit for the Q&A use case: it currently has too few queries, and they aren't connected to Wikipedia articles.
The second one is a much larger dataset of real-world user queries paired with Wikipedia article URLs. I've investigated the schema and a sample, and it looks promising. The next steps are to assess the overall data quality and the coverage of Wikipedia.