
Create a dataset for evaluation of search on Wikipedia
Open, Needs Triage, Public

Description

In order to develop models for improving search, we need a dataset of queries annotated with relevant results. Since we currently lack such a dataset, the goal of this task is to collect one. For the first version, we will restrict ourselves to English (potential follow-up work could expand to other languages, e.g., via translation).

The details of this task still need to be determined.

Things to figure out

  • What level of annotation: passage/paragraph level
  • What types of queries (T407603)
  • Collect set of queries for the benchmark dataset (T408121)
  • Collect candidate search results for queries (T409559)
  • Annotate queries with relevant results (T409561)

Additional information:

  • An example of what such a dataset could look like is Google's Natural Questions dataset, which contains natural language queries aggregated and anonymized from queries issued by users to the Google search engine. These queries are then annotated with relevant paragraphs from Wikipedia articles containing the answer.

Event Timeline

weekly update:

  • no major updates this week
  • trying to scope the task
  • coordinating potential external support (contractor)

weekly update:

  • Onboarded @Trokhymovych to the project
  • Scoped out first subtask to identify relevant query types (e.g. keyword queries vs natural language questions) T407603
  • Coordinating how to capture this work as a separate hypothesis in WE3.1

weekly update:

  • We identified 3 main dimensions for categorizing different types of queries, based on existing literature, that we think are relevant for search on Wikipedia (details in this doc)
    • query intent: following the traditional web query taxonomy, we focus mostly on informational queries (e.g. navigational queries are well served by autocomplete search and are not considered part of this work). The main distinction among informational queries is whether they are closed or open-ended.
    • query form: this is the distinction between, e.g., (short) lexical queries and (longer) natural language queries.
    • query result: a common distinction is by the expected result, e.g. a description, an entity, or a numeric value.
  • Understanding the different types of queries is important to i) make sure that the benchmark dataset captures a representative sample of queries, and ii) improve different search models by identifying for which types of queries they perform well or poorly.
  • We started work on collecting a set of queries for the benchmark dataset. We are considering different potential sources.
  • We scoped the granularity of annotation of search results. We aim to annotate queries with relevant passages (paragraph-level) of Wikipedia articles. This is motivated by findings in search research stating that "retrieving a passage or a shorter piece of text is sufficient to properly answer almost all questions." Source: An Intent Taxonomy for Questions Asked in Web Search (pdf). In addition, this level of granularity will allow us to quantitatively evaluate the performance of different models for semantic search.
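As a rough illustration, the three dimensions above could be encoded along these lines (the category names below are placeholders for illustration; the actual taxonomy lives in the linked doc):

```python
from dataclasses import dataclass
from enum import Enum

class Intent(Enum):
    INFORMATIONAL_CLOSED = "informational-closed"
    INFORMATIONAL_OPEN = "informational-open"
    NAVIGATIONAL = "navigational"  # excluded from the benchmark

class Form(Enum):
    LEXICAL = "lexical"                      # short keyword query
    NATURAL_LANGUAGE = "natural-language"    # full question/sentence

class ExpectedResult(Enum):
    DESCRIPTION = "description"
    ENTITY = "entity"
    NUMERIC = "numeric"

@dataclass
class QueryLabel:
    """A query labeled along the three taxonomy dimensions."""
    query: str
    intent: Intent
    form: Form
    expected_result: ExpectedResult

label = QueryLabel(
    query="when was the eiffel tower built",
    intent=Intent.INFORMATIONAL_CLOSED,
    form=Form.NATURAL_LANGUAGE,
    expected_result=ExpectedResult.NUMERIC,
)
```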
Gehel subscribed.

Tagging Discovery-Search to make sure this is visible to the Search Platform team.

@Cparle, @mfossati, as @dcausse hinted, you have been working on an annotation tool in the past. Is it still around, or even running somewhere? I would be interested in whether it could be adapted for annotating search results.

Yeah, it was for images though. The tool is here https://media-search-signal-test.toolforge.org/

And all the code (with the labelled results in sql/) is here https://github.com/cormacparle/media-search-signal-test

Also fyi we did some work on figuring out the relationship between the size of the labelled data set and precision@25 which suggested that you don't need a great deal of labelled data to get a fairly accurate search, see T280368

We did some follow-up work on this more recently, though we didn't get as far as we'd have liked before the team was re-assigned - @matthiasmullie was working on it

weekly update:

  • Collecting candidate search results:
    • We are looking into different options for using external search engines (e.g. Google) to generate candidate search results for queries, complementing results from Wikipedia's search
    • Some challenges we are facing are i) rate limiting when using publicly available packages, and ii) restrictive terms of use for, e.g., Google's Custom Search JSON API
    • I captured this work in a separate subtask: T409559
  • Using annotation tool:
    • I learned about annotool, which the ML Team is using to collect human feedback on the tone-check model. I am coordinating with the ML Team to figure out how we could add a search result annotation task to the tool.
    • I captured the other suggestions in a separate subtask: T409561
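A minimal sketch of how the rate-limiting issue could be handled with exponential backoff. The `fetch` callable and the `RuntimeError` are placeholders, not any specific search client or HTTP library:

```python
import random
import time

def fetch_with_backoff(fetch, query, max_retries=5, base_delay=1.0):
    """Call fetch(query), retrying with exponential backoff when the
    call raises a rate-limit error. `fetch` stands in for whatever
    search client is eventually used; RuntimeError stands in for an
    HTTP 429 ("too many requests") response."""
    for attempt in range(max_retries):
        try:
            return fetch(query)
        except RuntimeError:
            # wait base_delay * 2^attempt, plus a little jitter so
            # parallel workers do not retry in lockstep
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RuntimeError(f"giving up after {max_retries} attempts: {query!r}")
```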

weekly update:

  • Collect a set of representative queries in WP search:
    • Added filter for navigational queries when there is an exact match of the query with an existing page title
  • Collecting candidate search results:
    • For each query, we now fetch potential candidate results from WP-search
    • We implemented a re-ranking model to select the top-n paragraphs as candidate search results to be shown to annotators (n will be small, ~5)
    • We are exploring different options to complement candidate search results from other search models
  • Using annotation tool:
    • We will likely use Prolific's AI task builder to get annotations for the relevance of search results.
    • This allows us to easily define an annotation task with a simple interface. The only requirement is a CSV file containing the data to be annotated.
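A minimal sketch of the top-n selection step, using plain token overlap as a stand-in for the actual re-ranking model:

```python
import heapq

def top_n_paragraphs(query, paragraphs, n=5):
    """Rank candidate paragraphs for a query and keep the top n.
    The token-overlap score below is a placeholder for illustration;
    the real pipeline uses a trained re-ranking model."""
    q_tokens = set(query.lower().split())

    def score(paragraph):
        # fraction of query tokens that appear in the paragraph
        p_tokens = set(paragraph.lower().split())
        return len(q_tokens & p_tokens) / (len(q_tokens) or 1)

    return heapq.nlargest(n, paragraphs, key=score)
```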

weekly update:

  • We are continuing to make progress on setting up the full pipeline for dataset generation.
  • Collect a set of representative queries in WP search:
    • This is complete from a technical side. We have a pipeline to extract a set of representative queries.
    • We are waiting for feedback from the privacy consultation about whether and how we can store and publish the selected queries for annotation
  • Collecting candidate search results:
    • We are testing different options for selecting the most relevant paragraphs from a set of search results (obtained from, e.g., Wikipedia search) to present as candidates for annotation. This is important to avoid selection bias: potentially relevant paragraphs that are missed will be implicitly marked as irrelevant, since they will not be available for annotation.
  • Using annotation tool:
    • We are testing the study setup in Prolific using mock-up data (not actual query data).
    • In order to conduct the actual study, I am requesting a survey privacy statement. Once I have the details figured out (e.g. retention time and publication) I will submit the request, probably early next week.
    • I confirmed that the team has budget available to run the study on Prolific. I am figuring out the details of how to request/spend the budget correctly.

weekly update:

  • Collect a set of representative queries in WP search:
    • Conducted privacy check-in about publishing set of queries. As a one-off dataset for English Wikipedia this was approved.
    • We will implement an additional filter on query frequency so that the analysis is considered high-level (>=25 users)
  • Collecting candidate search results:
    • Decided and implemented scheme for selecting top-5 paragraphs as candidate search results
  • Using annotation tool:
    • Requested a privacy survey statement for conducting the data annotation via prolific
    • We set up a test study with synthetic data in the Prolific AI task builder to finalize the UI of the annotation task

weekly updates:

  • Overall, we are fully on track to get a search result dataset. Before running a small pilot study, we need to make minor tweaks to the query filtering.
  • Collect a set of representative queries in WP search:
    • We implemented a filter on query frequency so that the analysis is considered high-level (>=25 users). For this, we also needed to optimize the processing pipeline so that we can consider queries from all 3 months available in the logs.
    • We are iterating on our query filtering to remove, e.g., navigational queries. One example is making sure we remove queries that exactly match a page title, including all potential redirects.
    • We are adding an additional bucket for queries that are formulated as questions. Even when considering long queries (8+ terms), few are actually in the form of natural language questions. However, we want to capture those in our dataset as well, even if they are currently rare in our logs.
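The filtering steps above could be sketched roughly as follows. The data structures and the question-word heuristic are illustrative assumptions, not the actual pipeline:

```python
def filter_queries(query_counts, titles, redirects, min_users=25):
    """Keep queries issued by at least min_users distinct users,
    dropping navigational queries that exactly match a page title or
    a redirect, and bucketing natural-language questions separately.

    query_counts: dict query -> number of distinct users (assumed shape)
    titles: set of normalized page titles (assumed shape)
    redirects: dict normalized redirect title -> target (assumed shape)
    """
    # crude question heuristic for illustration only
    QUESTION_WORDS = ("who", "what", "when", "where", "why", "how", "which")
    kept, questions = [], []
    for query, n_users in query_counts.items():
        if n_users < min_users:
            continue  # too rare: keep the analysis high-level
        norm = query.strip().lower()
        if norm in titles or norm in redirects:
            continue  # navigational: exact match with a title or redirect
        tokens = norm.split()
        if tokens and tokens[0] in QUESTION_WORDS:
            questions.append(query)
        else:
            kept.append(query)
    return kept, questions
```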

weekly updates:

  • We collected a set of queries with applied manual filtering (sheet)
  • Submitted a request via L3SC for reviewing Terms of Services of 3rd party search platforms for generation of candidate results (asana ticket)
  • We are updating the processing pipeline to extract paragraphs of all articles from Enterprise's structured content snapshots instead of wikitext, as this provides a cleaner representation of the article text.
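A rough sketch of the paragraph-extraction step. The dict shape assumed here (sections containing paragraph strings) is purely illustrative, not the actual schema of the Enterprise structured content snapshots:

```python
def extract_paragraphs(article):
    """Flatten one article into a list of non-empty paragraph strings.
    The input shape is an assumption for illustration: a dict with a
    'sections' list, each section holding a 'paragraphs' list."""
    out = []
    for section in article.get("sections", []):
        for para in section.get("paragraphs", []):
            text = para.strip()
            if text:  # drop empty/whitespace-only paragraphs
                out.append(text)
    return out
```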

weekly update:

Next steps:

  • expanding candidate search results with results from 3rd party search
  • set up dataset and study for the full set of queries

weekly update:

  • We collected candidate results for the 600 final selected queries
  • We are freezing/storing the corpus containing all paragraphs from all enwiki articles using the 20260125 snapshot

Next step:

  • Running the annotation of the candidate search results. This should be completed next week

weekly update:

  • we ran the relevance annotation for the full dataset of 600 queries.
  • we will spend another week cleaning the dataset and putting together documentation before closing