
Analysis: how many search queries are using natural language vs keywords
Closed, Resolved · Public · 5 Estimated Story Points

Description

As part of the Semantic Search hypothesis WE3.1.x, we want to know how many natural language searches are happening on-wiki at this point. This is mostly to establish a baseline, so that we can see whether improvements to how we respond to natural language queries have an impact on the number of those queries. A better understanding of how users are searching might also influence our design. We also want to have some measure of how successful we are at answering natural language searches.

We already have a working definition of success (used, for example, in the Search Metrics Superset dashboard).

The definition of "natural language queries" needs to be created. The research team (@MGerlach) can provide support / review. A simple heuristic is sufficient (for example: queries that contain "how" / "where" / "when" / ...).

Ideally, we want a dashboard that can track changes over time to the number of natural language queries. A first step can be a static report.

AC

  • decision and documentation on what heuristic to use to identify natural language queries
  • report on the number of natural language queries used in on-wiki search

Details

Related Changes in GitLab:
| Title | Reference | Author | Source Branch | Dest Branch |
| Analysis of percentage of natural language queries | repos/search-platform/notebooks!11 | ebernhardson | work/ebernhardson/natural-language-query-estimate | main |

Event Timeline

The NaturalQuestions dataset (natural questions from Google search queries, annotated with relevant Wikipedia article sections) uses a heuristic to identify natural language queries (described in Sec. 3.1 of their paper) which might serve as a good starting point for us to adapt. Copying here for reference:

  • query was issued by multiple users
  • query contains 8 words or more
  • query matches one of the following conditions
    • start with "who", "when", or "where" directly followed by: a) a finite form of "do" or a modal verb; or b) a finite form of "be" or "have" with a verb in some later position;
    • start with "who" directly followed by a verb that is not a finite form of "be";
    • contain multiple entities as well as an adjective, adverb, verb, or determiner;
    • contain a categorical noun phrase immediately preceded by a preposition or relative clause;
    • end with a categorical noun phrase, and do not contain a preposition or relative clause.
  • query yields a Wikipedia page in the top 5 search results

Note: We don't need to implement that exact definition; something much simpler will probably do.

I poked around the data a bit and experimented with a few things; I suspect we can do something like:

  • Source data from one week of the discovery.query_clicks_hourly table.
    • Despite the name, this contains both clicked and unclicked queries.
    • Contains only searches that recorded events from Special:Search (no API)
    • Contains only searches using the default namespace filter (a simple string search without external filters)
    • This dataset has not had any bot filtering applied
  • Filter bots by dropping all identities with >1k queries/day
    • An identity is a weak fingerprint. Roughly, hash("{$ip}|{$user_agent}|{$x_forwarded_for}")
  • Group queries after both stemming and stop word removal. Consider the variant that comes from the most identities as the representative query. Break ties with the longest query.
  • Label queries as potentially natural language queries by looking for a simple list of question words: \b(who|what|where|when|why|how)\b (the normalization and flagging steps are sketched after this list)
  • Sample ~200 queries from each class and manually classify them as either natural language or not natural language. Use this as an adjustment factor to get closer to the "true count" of natural language queries.
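
A minimal Python sketch of the normalize/group/flag steps, assuming NLTK for stemming and stop words; the helper names here are illustrative, not the notebook's actual code:

```python
import re
from collections import defaultdict

# Requires: nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

QUESTION_RE = re.compile(r"\b(who|what|where|when|why|how)\b", re.IGNORECASE)
STEMMER = SnowballStemmer("english")
STOP_WORDS = set(stopwords.words("english"))

def normalize(query: str) -> str:
    """Grouping key: lowercase, drop stop words, stem the remaining tokens."""
    tokens = re.findall(r"\w+", query.lower())
    return " ".join(STEMMER.stem(t) for t in tokens if t not in STOP_WORDS)

def representative(variants: dict) -> str:
    """The variant issued by the most identities wins; longest query breaks ties."""
    return max(variants, key=lambda q: (len(variants[q]), len(q)))

def group_queries(rows):
    """rows: iterable of (query, identity) pairs -> {grouping key: representative query}."""
    groups = defaultdict(lambda: defaultdict(set))
    for query, identity in rows:
        groups[normalize(query)][query].add(identity)
    return {key: representative(variants) for key, variants in groups.items()}

def has_question_word(query: str) -> bool:
    return QUESTION_RE.search(query) is not None
```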

After further consideration, I remembered that query_clicks_hourly still does not contain mobile web requests, but those will need to be included here. To include mobile web we will need to start the analysis from web requests. This is more tedious as the dataset is quite large, but likely necessary. We'll have to see if we can analyze a full week; due to data sizes we may have to break the analysis up into per-day numbers and aggregate those.

This would look something like:

  • Filter for web requests that have x_analytics_map['special'] == 'Search'
  • Filter for web requests with a 2xx response (many responses are 3xx redirects to articles)
  • Extract query from the request query string
  • Verify the namespaces in the request are for the default set of namespaces (additional bot filter)
  • Generate an identity by hashing user agent and source ip address.
  • Continue as above, filtering bots, normalizing/grouping queries, and identifying question words. (A rough sketch of the webrequest filtering follows.)
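
A PySpark sketch of those filters, assuming the standard wmf.webrequest columns (ip, user_agent, uri_query, http_status, x_analytics_map) and a notebook SparkSession named spark; the namespace verification step is omitted for brevity, and the notebook's actual query may differ:

```python
from urllib.parse import parse_qs

from pyspark.sql import functions as F

# One day of Special:Search web requests. Loop over days and aggregate
# if a full week is too large to process in one pass.
requests = (
    spark.table("wmf.webrequest")
    .where((F.col("year") == 2025) & (F.col("month") == 9) & (F.col("day") == 8))
    .where(F.col("x_analytics_map").getItem("special") == "Search")
    .where(F.col("http_status").startswith("2"))  # drop 3xx redirects to articles
    # The agent_type == 'user' filter discussed later in this thread would go here.
    .withColumn(
        "identity",
        F.sha2(F.concat_ws("|", F.col("ip"), F.col("user_agent")), 256),
    )
    .select("uri_query", "identity")
)

def extract_search(uri_query):
    """Pull the search= parameter out of the raw query string, if present."""
    if not uri_query:
        return None
    values = parse_qs(uri_query.lstrip("?")).get("search")
    return values[0] if values else None
```
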
pfischer set the point value for this task to 5. (Sep 22 2025, 3:33 PM)

Initial estimate for the week of Sept 8 - 15.

The initial estimate is a relatively simple flagging of question words (who/what/where/when/why/how). These counts represent the number of HTTP requests to the Special:Search page.

| strata | num queries | % of queries |
| has question word | 188230 | 2.4% |
| other | 7681762 | 97.6% |

To try to constrain these a bit, I calculated the sample sizes necessary to say with 90% confidence that the true false positive rate is within 2% of the calculated rate, and then manually reviewed the queries. This amounted to manually reviewing 267 queries that were flagged as other, and 555 queries flagged as having a question word.
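
For reference, this is the standard normal-approximation sample-size calculation for a proportion; the p_guess value below is a hypothetical prior estimate of the false positive rate, not a number from the notebook:

```python
from math import ceil

from scipy.stats import norm

def sample_size(p_guess, margin=0.02, confidence=0.90):
    """Smallest n so a normal-approximation CI on a proportion has
    half-width <= margin at the given confidence level."""
    z = norm.ppf(1 - (1 - confidence) / 2)  # ~1.645 at 90% confidence
    return ceil(z ** 2 * p_guess * (1 - p_guess) / margin ** 2)

# e.g. a prior guess of a ~4% false positive rate needs roughly 260 reviews
print(sample_size(0.04))
```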

First, we consider only queries that are clearly natural language, such as "What does Internet stand for?", or less clear but still very natural-language-ish, such as "song written by George Young Harry Vanda recorded by singer with theatrical stage name":

| strata | false positive rate | updated size estimate | updated % of queries |
| natural language | 13.5% ± 2% | 188230 - 25411 + 146721 = 309540 | 3.9% |
| other | 1.9% ± 2% | 7681762 - 146721 + 25411 = 7560452 | 96.1% |

We can also be a bit more flexible, considering queries such as "Roles of kidey in homestasis" or "Richest Filipino who cames from poverty" (verbatim user queries) as natural language:

| strata | false positive rate | updated size estimate | updated % of queries |
| natural language | 7.75% | 188230 - 14588 + 376406 = 550048 | 7% |
| other | 4.9% | 7681762 - 376406 + 14588 = 7319944 | 93% |
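
The "updated size estimate" column in both tables is a simple reclassification: remove the estimated false positives from each stratum and add in the queries the other stratum misclassified. A sketch with the broad-definition numbers:

```python
flagged, other = 188_230, 7_681_762

# Broad-definition rates from the manual review above.
fp_rate_flagged = 0.0775  # flagged queries that are not natural language
nl_rate_other = 0.049     # unflagged queries that are natural language

false_pos = round(flagged * fp_rate_flagged)  # 14588
missed_nl = round(other * nl_rate_other)      # 376406

updated_nl = flagged - false_pos + missed_nl   # 550048
updated_other = other - missed_nl + false_pos  # 7319944

total = flagged + other
print(f"natural language: {updated_nl} ({updated_nl / total:.1%})")  # ~7.0%
```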

Unfortunately I wasn't quite able to figure out how to put confidence intervals on the updated % of queries, but based on the sample sizes they will still be fairly wide. The 7% estimate is probably something closer to 7% ± 3%.

Not having the final confidence intervals was unsatisfying, so I went through and worked it up properly, with references for how this is supposed to work within a stratified sample. The notebook has been updated to contain the calculation (please review! I am not an expert here).

Strict definition (grade 2 only):

  • Estimate: 3.9%
  • 95% CI: (0.023, 0.055)

Broad definition (grade 1+2):

  • Estimate: 7.0%
  • 95% CI: (0.044, 0.095)
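
These match what the textbook stratified estimator produces: weight each stratum's graded proportion by its population share, and sum the per-stratum binomial variances for the confidence interval. A sketch using the strict-definition numbers from this thread (the notebook's exact calculation may differ):

```python
from math import sqrt

# stratum -> (population size, graded sample size, graded share natural language)
# Strict definition (grade 2 only), numbers from the comments above.
strata = {
    "has question word": (188_230, 555, 0.865),
    "other": (7_681_762, 267, 0.019),
}

N = sum(size for size, _, _ in strata.values())
estimate = sum(size / N * p for size, _, p in strata.values())
variance = sum((size / N) ** 2 * p * (1 - p) / n for size, n, p in strata.values())
half_width = 1.96 * sqrt(variance)

print(f"{estimate:.3f} (95% CI: {estimate - half_width:.3f}, {estimate + half_width:.3f})")
# -> 0.039 (95% CI: 0.023, 0.055), matching the strict-definition numbers above
```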

@EBernhardson Thanks for putting together the notebook. Looks really good, I appreciate the level of detail with respect to manual verification and having confidence intervals.

  • from what I understand, you operationalize natural language queries as all queries which contain one of the words who|what|where|when|why|how (and later do some additional manual filtering). Could you confirm? I think that approach makes sense and is sufficient to get a rough idea of the order of magnitude.
  • Do you think it would be (easily) feasible to compare the average number of words in lexical vs natural language queries? I think this could be relevant in the context of the planned search hypothesis around relaxing the requirement to match all keywords.
  • I think that the current code is not filtering bot/automated traffic from the webrequest data (agent_type == "user"). Do you think there are many of those requests for search, such that the results could significantly change? Similarly, should we filter to searches in the main article namespace only? (though I assume that there are very few queries that are not in the main namespace)
Gehel triaged this task as Medium priority. (Sep 25 2025, 2:23 PM)

@EBernhardson Thanks for putting together the notebook. Looks really good, I appreciate the level of detail with respect to manual verification and having confidence intervals.

  • from what I understand, you operationalize natural language queries as all queries which contain one of the words who|what|where|when|why|how (and later do some additional manual filtering). Could you confirm? I think that approach makes sense and is sufficient to get a rough idea of the order of magnitude.

We can think of the question-word filter as an initial, rough operationalization. The primary purpose of the question-word classifier is as a sampling strategy to create groups for manual review. The actual classification as natural language or not is entirely manual. We use those manual grades to estimate the probability of a query being natural language.

From the manual classification we have some data on how accurate the question word classifier is.

Strict definition:
  P(natural language | has question words) = 86.5% (95% CI: 83.6%, 89.3%)
  P(natural language | no question words) = 1.9% (95% CI: 0.2%, 3.5%)

Broad definition:
  P(natural language | has question words) = 92.3% (95% CI: 90.0%, 94.5%)
  P(natural language | no question words) = 4.9% (95% CI: 2.3%, 7.5%)

We can also use that manual classification to evaluate how well the question word classifier performs as a natural language classifier. Confidence intervals come from 1000 rounds of bootstrapping. I've also updated the notebook to include this calculation at the end:

Strict Definition (grade 2 only):
Precision: 0.865 (95% CI: 0.836, 0.892)
Recall:    0.531 (95% CI: 0.367, 0.849)
F1:        0.635 (95% CI: 0.518, 0.853)

Broad Definition (grades 1+2):
Precision: 0.923 (95% CI: 0.899, 0.944)
Recall:    0.317 (95% CI: 0.231, 0.498)
F1:        0.445 (95% CI: 0.369, 0.644)

We find the precision is good, but recall is indeterminate: it might be reasonable for the strict definition, or it might be pretty bad. We can at least say recall is low under the broad definition. We would have to grade significantly more queries to tighten those intervals. The question-word classifier is probably a reasonable proxy for some use cases, but we will want to keep its limitations in mind. There are also, of course, limits to this analysis: there was a single grader, and the definition of a natural language query was, as in many tasks involving human language, a bit arbitrary.
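
For reference, a sketch of how such a bootstrap can work: resample the graded queries within each stratum, recompute the population-weighted confusion counts, and take percentiles of the resulting metric distributions. The names here are illustrative, not the notebook's code:

```python
import numpy as np

rng = np.random.default_rng(42)

def weighted_metrics(flagged_grades, other_grades, n_flagged, n_other):
    """Precision/recall of the question-word flag. The grade arrays are 0/1
    numpy arrays over the graded samples; each stratum is scaled up to its
    population size."""
    tp = flagged_grades.mean() * n_flagged  # flagged and natural language
    fp = n_flagged - tp                     # flagged but not natural language
    fn = other_grades.mean() * n_other      # unflagged but natural language
    return tp / (tp + fp), tp / (tp + fn)

def bootstrap_ci(flagged_grades, other_grades, n_flagged, n_other, rounds=1000):
    """Percentile bootstrap: resample grades within each stratum."""
    samples = [
        weighted_metrics(
            rng.choice(flagged_grades, size=len(flagged_grades), replace=True),
            rng.choice(other_grades, size=len(other_grades), replace=True),
            n_flagged,
            n_other,
        )
        for _ in range(rounds)
    ]
    precision, recall = np.array(samples).T
    return np.percentile(precision, [2.5, 97.5]), np.percentile(recall, [2.5, 97.5])
```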

  • Do you think it would be (easily) feasible to compare the average number of words in lexical vs natural language queries? I think this could be relevant in the context of the planned search hypothesis around relaxing the requirement to match all keywords.

For the small manually graded sample this can be calculated directly. In this case I've run the queries through the plain (no stopword removal) tokenizer we use for production search in English to get the token count. As the queries come from a stratified sample rather than a random sample, this is not representative of the general dataset, but it's probably close enough for our current investigation. We can see a clear rightward shift in the distribution towards longer queries, with 1-5 tokens being typical for grade 0 (not natural language) and 3-10 tokens being typical for grade 2 (clearly natural language). But it's worth remembering there are perhaps 20x as many grade 0 queries as grade 2, even if the counts here appear similar due to the stratified sampling.

grade_0.png (453×571 px, 15 KB)

grade_2.png (453×571 px, 16 KB)
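
The histograms above use the production tokenizer; a rough stand-in with a plain regex tokenizer and matplotlib would look like the following, where graded_queries is a hypothetical list of (query, grade) pairs from the manual review:

```python
import re

import matplotlib.pyplot as plt

def token_count(query):
    # crude stand-in for the production analyzer's plain tokenizer
    return len(re.findall(r"\w+", query.lower()))

for grade in (0, 2):
    counts = [token_count(q) for q, g in graded_queries if g == grade]
    plt.figure()
    plt.hist(counts, bins=range(1, 16))
    plt.title(f"grade {grade} token counts")
    plt.xlabel("tokens per query")
    plt.ylabel("queries")
plt.show()
```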

  • I think that the current code is not filtering bot/automated traffic from the webrequest data (agent_type == "user"). Do you think there are many of those requests for search, such that the results could significantly change? Similarly, should we filter to searches in the main article namespace only? (though I assume that there are very few queries that are not in the main namespace)

Indeed, the lack of an agent_type filter is an oversight on my part. There is still some bot filtering applied here: I used the queries-per-ip-per-day filter that we often use in search for filtering out high-volume traffic (there are IPs that issue 100k requests/day). To get an idea, I pulled the top-level stats for how agent_type is distributed in the dataset. Across the full dataset spiders account for ~4% of queries, and the existing high-volume query filter already removed ~43% of that ~4%. I'll update the notebook so future evaluations include the agent_type filter, but I think this is small enough that we would get minimal benefit from redoing the current manual classification.

As an aside, there is significantly more bot traffic in search than implied by this table. For the most part bots use the API, and this table only looks at web.

| | spider | user |
| num_queries | 324756 | 8462330 |
| num_unique_queries | 161639 | 5900494 |
| num_unique_norm_queries | 157789 | 5347387 |
| num_identities | 58883 | 3072975 |
| num_high_volume | 141078 | 1271449 |

  • Similarly, should we filter to searches in the main article namespace only? (though I assume that there are very few queries that are not in the main namespace)

I did initially check into this on our query_clicks data, and indeed it was a very small portion of the data. Special:Search queries are overwhelmingly issued to the default search namespaces.