
[SPIKE] Assess relevance and coverage of Google dataset's Wikipedia URLs for Q&A question generation
Open, High, Public, Spike

Description

In T414569: [SPIKE] Explore search query datasets for Q&A question generation, we identified a private Google dataset as a potential fit for sourcing natural language questions.
This spike has two goals:

  1. determine whether section-level Wikipedia URLs are good enough to serve as answers to the paired queries
  2. understand coverage of the target Wikipedias: English, but ideally also Arabic, Chinese, French, Indonesian, and Vietnamese

Tasks

  • manually evaluate a sample of (query, URL) pairs
  • compare URLs with vector search results: run each paired query against our vector search prototype, look for the paired URL among the results, and grab the matching text snippet
  • coverage: for how many articles in a given Wikipedia do we have no queries?
  • general shape:
    • how many queries do we lose by restricting to natural language ones?
    • how many queries do we lose by restricting to section-level URLs?
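
The coverage and general shape questions above could be answered with a small script along these lines. This is only a sketch under stated assumptions: the dataset is taken to be an iterable of (query, URL) string pairs, a URL counts as section-level when it carries a non-empty fragment (ignoring `:~:text=` text directives), and "natural language" is approximated with a crude English-only question-word heuristic. All function names are hypothetical, and the heuristic would need replacing for the non-English Wikipedias.

```python
from urllib.parse import urlsplit, unquote

# Assumption: a crude, English-only proxy for "natural language" queries.
QUESTION_WORDS = {
    "who", "what", "when", "where", "why", "how", "which",
    "is", "are", "was", "were", "do", "does", "did", "can",
}


def parse_wikipedia_url(url):
    """Return (language, title, section) or None for non-article URLs."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if not host.endswith(".wikipedia.org") or not parts.path.startswith("/wiki/"):
        return None
    language = host.split(".")[0]  # also handles mobile hosts such as en.m.wikipedia.org
    title = unquote(parts.path[len("/wiki/"):]).replace("_", " ")
    # Drop any text directive (":~:text=...") so it is not mistaken for a section.
    fragment = parts.fragment.split(":~:")[0]
    section = unquote(fragment) or None
    return language, title, section


def is_natural_language(query):
    """Heuristic: the query ends with '?' or starts with a question word."""
    tokens = query.strip().lower().split()
    if not tokens:
        return False
    return query.strip().endswith("?") or tokens[0] in QUESTION_WORDS


def shape_report(pairs):
    """Count how many pairs survive each constraint, alone and combined."""
    counts = {"total": 0, "natural_language": 0, "section_level": 0, "both": 0}
    for query, url in pairs:
        parsed = parse_wikipedia_url(url)
        if parsed is None:
            continue  # not a Wikipedia article URL
        natural = is_natural_language(query)
        sectioned = parsed[2] is not None
        counts["total"] += 1
        counts["natural_language"] += natural
        counts["section_level"] += sectioned
        counts["both"] += natural and sectioned
    return counts


def uncovered_titles(pairs, language, all_titles):
    """Return the titles of a given Wikipedia that have no paired query."""
    covered = set()
    for _, url in pairs:
        parsed = parse_wikipedia_url(url)
        if parsed is not None and parsed[0] == language:
            covered.add(parsed[1])
    return {title.replace("_", " ") for title in all_titles} - covered
```

Subtracting `natural_language` or `section_level` from `total` gives the number of queries lost under each constraint, and `len(uncovered_titles(...))` divided by the number of articles gives the uncovered share for one Wikipedia. Redirects and title capitalization are not handled here, so real coverage numbers would need titles normalized against the wiki first.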

Event Timeline

mfossati changed the subtype of this task from "Task" to "Spike". Thu, Feb 5, 11:36 AM