Identify a set of relevant query types
Closed, Resolved · Public

Description

As a first step towards building a benchmark dataset for search, we want to figure out which types of queries we should include. For example, we will want to include both keyword queries and natural language questions (see for example Questions vs. Queries in Informational Search Tasks). There are many other potentially relevant groupings of queries, such as the classic taxonomy of web search (navigational, informational, transactional) or whether they are closed or open-ended (Quriosity: Analyzing Human Questioning Behavior and Causal Inquiry through Curiosity-Driven Queries). It is important to identify a set of relevant query types to make sure that the benchmark dataset contains a representative sample.

The goal of this task is to identify a (small) set of query types that we believe are relevant for Wikipedia search. At a minimum, we will have 2 groups (keyword queries vs natural language queries). Ideally, we would like to align these query types with the different use cases of Wikipedia readers from the Readers Foundational research.

Event Timeline

It's worth keeping in mind that this is a two-stage search system. Navigational queries, which dominate normal search pipelines, are not seen at nearly the same rate in the on-wiki fulltext search because the first stage of search, the autocomplete, sends users directly to the page and typically satisfies the navigational needs.

Good point. Our current thinking is to focus on informational queries and not consider navigational ones at this point. The reason is that the latter are well served by autocomplete, which is a very different search mode from full-text search.

We identify 3 main dimensions for types of queries based on existing literature:

  • Query intent (see Subtype in the proposed taxonomy): Directed (Closed) or Undirected (Open)
  • Query form: lexical (shorter, focuses on word matches) or semantic (focuses on meaning rather than wording, e.g., natural language questions)
  • Type of expected result: e.g., description, numeric, entity, location, person (based on MS MARCO); the specific types are still to be defined.

A more detailed explanation can be found here:
https://docs.google.com/document/d/1vFg335TnKCmnPxg9PBxkxuJQ4gs02g4N7WrausU2kh0/edit?usp=sharing
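
To make the three dimensions concrete, here is a minimal sketch of how a labeled query could be represented. The class and field names (`Intent`, `Form`, `QueryLabel`, `expected_result`) are hypothetical and not part of the proposed taxonomy, and the example label is only illustrative. The expected-result type is left as a free-form string because the specific types are still to be defined.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Intent(Enum):
    """Query intent dimension (hypothetical names)."""
    DIRECTED = "directed"      # closed
    UNDIRECTED = "undirected"  # open


class Form(Enum):
    """Query form dimension (hypothetical names)."""
    LEXICAL = "lexical"    # shorter, focuses on word matches
    SEMANTIC = "semantic"  # focuses on meaning, e.g. natural language questions


@dataclass(frozen=True)
class QueryLabel:
    """One query annotated along the three dimensions."""
    query: str
    intent: Intent
    form: Form
    # Free-form placeholder, e.g. "numeric" or "person"; the final
    # set of expected-result types has not been decided.
    expected_result: Optional[str] = None


# Illustrative label only; not taken from real data.
example = QueryLabel(
    query="when was the eiffel tower built",
    intent=Intent.DIRECTED,
    form=Form.SEMANTIC,
    expected_result="numeric",
)
```

Keeping each dimension as a separate field means a query can be sampled or stratified along any one of them independently.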

I began examining web request logs to compile a dataset of Wikipedia search queries, collecting all queries over a two-day period and exploring their structure and characteristics (based on notebook). Early analysis confirms that most queries are short (around two words). About 3% of queries include advanced prefixes (e.g., “insource:”), about 3% include links, and about 55% contain named entities.
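
As a rough illustration of the kind of per-query features described above, the sketch below computes word count and flags advanced prefixes and links. It is not the code from the notebook: the function name `describe_query`, the prefix list, and the URL pattern are assumptions, and the prefix list is deliberately non-exhaustive. Named-entity detection is left out because it would need an NER model.

```python
import re

# Assumed, non-exhaustive list of advanced search keyword prefixes.
ADVANCED_PREFIXES = ("insource:", "intitle:", "incategory:", "hastemplate:")

# Assumed heuristic for "links": anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def describe_query(query: str) -> dict:
    """Return simple structural features for a single search query."""
    text = query.strip()
    lowered = text.lower()
    return {
        # Whitespace-separated token count.
        "n_words": len(text.split()),
        # True if any known keyword prefix appears anywhere in the query.
        "has_advanced_prefix": any(p in lowered for p in ADVANCED_PREFIXES),
        # True if the query contains something that looks like a URL.
        "has_link": bool(URL_PATTERN.search(text)),
    }


def share(queries: list, feature: str) -> float:
    """Fraction of queries for which a boolean feature is True."""
    if not queries:
        return 0.0
    return sum(describe_query(q)[feature] for q in queries) / len(queries)
```

For example, `share(queries, "has_advanced_prefix")` over a list of query strings gives the fraction that use one of the listed prefixes.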

I am continuing the exploratory data analysis (EDA). The main goal is to use this initial exploration to define heuristics that identify “good” queries (e.g., natural-language searches) and filter out those that should be excluded (e.g., queries containing PII or adult content). The next step is to write a document outlining these rules and the proposed actions based on the data analysis.