
Investigate dropping obvious question words ('what is' 'who is') to get better results
Open, Low, Public


More than once, we've been asked about the occasional lack of search results when a user types in a question, such as: "what is cognitive inertia?" (probable result) or "who is the oldest baseball pitcher still playing in the MLB?" (probable result).

Let's investigate to see if search results would be better or worse if we stripped out 'what is' or 'who is' and similar obvious questions.

Event Timeline

A couple approaches off the top of my head:

  • Reduce the required match level of search tokens. Right now we require 100% of tokens to exist in the target document. This requirement is a leftover from the old custom search engine and has been the standard for wiki search basically forever. We could perhaps relax that to some lower threshold (80%? needs evaluation). Depending on weighting, it still might not pull the appropriate documents into the first page, although common enough words like 'what' should have low enough IDF values to give them very little weight. Of course, to answer the proposed question 'what is cognitive inertia' the match level has to drop all the way to 50%, assuming neither 'what' nor 'is' appears in the target.
  • Apply some sort of common terms filtering to the search queries. This might make results for pages like 'to be or not to be' or 'what if' harder to find though.
  • When I talked to the ex-CTO of Blekko, a now-defunct search engine that was purchased by IBM, he said that they found they could only get so far with common terms filtering, and the best solution they came up with was to build a grammar that was able to recognize questions. This has the downside of not being generalizable to all languages; the improvement would only be seen in languages for which we could build a reasonable parser to recognize questions.
    • This perhaps has a side benefit of being able to tie into structured data (Wikidata). A query like 'who is john adams' would be recognized as a 'who is' question asking about a human, and we could check Wikidata for humans named john adams. This might actually need a ton of NLP work, though, to not then choke on a longer question such as 'who is john adams and how many wives did he have'.
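To make the match-level arithmetic in the first bullet concrete, here is a minimal sketch. It assumes a naive tokenizer and a plain set-overlap check; `tokenize`, `match_fraction`, and `matches` are hypothetical names for illustration, not how the production search engine scores documents.

```python
import re


def tokenize(text):
    """Lowercase and split on non-word characters (deliberately naive)."""
    return re.findall(r"\w+", text.lower())


def match_fraction(query, document):
    """Fraction of distinct query tokens that also occur in the document."""
    query_tokens = set(tokenize(query))
    if not query_tokens:
        return 0.0
    doc_tokens = set(tokenize(document))
    return len(query_tokens & doc_tokens) / len(query_tokens)


def matches(query, document, required=1.0):
    """True if at least `required` (0.0 to 1.0) of the query tokens match."""
    return match_fraction(query, document) >= required


# Assumption: the target contains only the words of its title.
target = "Cognitive inertia"
query = "what is cognitive inertia?"

print(match_fraction(query, target))         # 2 of 4 tokens match
print(matches(query, target, required=1.0))  # current behaviour: all tokens required
print(matches(query, target, required=0.8))  # a relaxed threshold still fails here
print(matches(query, target, required=0.5))  # only 50% lets the page through
```

In this toy setup an 80% threshold is not enough for a four-token question with two question words, which is the point the first bullet makes.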

Filtering out "what is" is tricky...try searching for "what is life" and then "what is life?" (without and with a question mark). You'll find that if you were actually looking for "life" you wouldn't find it at all, unless you just type in "life."

This begs for a multi-pass analysis that returns results that include the "what is" part, because you may actually be looking for the song "what is life," as well as results for "life" itself, by tokenizing and not doing direct title match on the whole.
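A rough sketch of what such a multi-pass approach could look like on the query side: keep the query as typed (so the song "What Is Life" can still match by title), and add a second pass with the leading question words dropped. The prefix pattern, the pass labels, and the decision to strip a trailing question mark are all assumptions made for illustration.

```python
import re

# Hypothetical, English-only list of leading question phrases.
QUESTION_PREFIX = re.compile(
    r"^\s*(what|who)\s+(is|are|was|were)\s+", re.IGNORECASE
)


def query_passes(query):
    """Return (label, query_text) pairs to run, in priority order."""
    # Assumption: a trailing question mark carries no search value.
    cleaned = query.strip().rstrip("?").strip()
    passes = [("as_typed", cleaned)]
    stripped = QUESTION_PREFIX.sub("", cleaned, count=1)
    # Only add the second pass if stripping changed something and left text.
    if stripped and stripped != cleaned:
        passes.append(("question_words_dropped", stripped))
    return passes


print(query_passes("what is life?"))
print(query_passes("life"))
```

How the results of the two passes would be merged and ranked is the hard part, and this sketch does not attempt it.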

Anyway, another example to consider.

Yeah, back in the 90s I built NL query parsers for seven languages as a front end to a search engine. It included stop word filtering after tossing question words. At the first demo someone wanted to search for "to be or not to be"—all of which were stop words.

Anyway, you don't need to speak the languages to build the parser if you have a native speaker who can provide or identify the most common questions. We could, for example, have NDA-covered people find the most common questions in queries for English, French, Russian, Hebrew, Spanish, and maybe a couple others, and then show the question types to speakers of other languages who could translate them.

The tricky part is making sure you have a parser mechanism that can handle the languages you want to handle. Starting with English, we used an FSA that really only parsed best from the beginning of the sentence, so it didn't work well for German, which can drop an important verb at the end of the sentence. Regexes can handle that, but you have to be careful not to be too aggressive about matching or you can eat too much of the query. Etc., etc., etc.
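As an illustration of the regex point, here is a small sketch with one hypothetical pattern per language. The German pattern has to match a verb at the end of the sentence, which a parser that only reads from the front would miss. Both patterns are fully anchored and fall back to the unchanged query, which is one way to avoid eating too much of it. The pattern table and `extract_topic` are made up for this example.

```python
import re

# Hypothetical per-language question patterns; a real table would be
# built from the most common questions seen in each language.
PATTERNS = {
    "en": [re.compile(r"^who invented (?P<topic>.+)$", re.IGNORECASE)],
    # German puts the participle last: "Wer hat ... erfunden"
    "de": [re.compile(r"^wer hat (?P<topic>.+?) erfunden$", re.IGNORECASE)],
}


def extract_topic(query, lang):
    """Return the topic of a recognized question, else the query unchanged."""
    text = query.strip().rstrip("?").strip()
    for pattern in PATTERNS.get(lang, []):
        found = pattern.match(text)
        if found:
            return found.group("topic").strip()
    # No pattern matched: do not touch the query at all.
    return text


print(extract_topic("who invented the telephone?", "en"))
print(extract_topic("Wer hat das Telefon erfunden?", "de"))
print(extract_topic("to be or not to be", "en"))
```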

This begs for a multi-pass analysis...

a.k.a. "So Many Search Options" (T156019). I agree—we need to build a proper framework for organizing that. Remember the Puppenspielerin! We definitely have a tension between supporting sophisticated searchers who want power tools and naive searchers who probably really want to talk to a librarian.

Can we do a 'reasonable approximation' of looking through queries to see how often questions actually show up in our logs? It doesn't have to be perfect or totally accurate, just a good swag at the number of questions that our users type in as queries.

It wouldn't be that hard to look through 1-2K queries for questions, or to grep through a larger batch of, say, 10K-100K queries and look for question marks (already known to be rare—~0.1%) and obvious likely question patterns (contain who, what, when, where, why, how; start with is, are, was, were or do, does, etc.) and get some rough stats. (And I'd be happy to do the data slogging.)
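A first pass at those rough stats could look something like the sketch below. The word lists come from the patterns named above; `question_stats` and the sample queries are invented for illustration, and the counts are only as good as the heuristics (note the deliberate false positive).

```python
import re

# Heuristics from the comment above; English only.
QUESTION_WORD = re.compile(r"\b(who|what|when|where|why|how)\b", re.IGNORECASE)
LEADING_AUX = re.compile(r"^\s*(is|are|was|were|do|does)\b", re.IGNORECASE)


def question_stats(queries):
    """Count queries that look like questions under each heuristic."""
    stats = {
        "total": 0,
        "question_mark": 0,
        "question_word": 0,
        "leading_aux": 0,
        "any": 0,
    }
    for query in queries:
        has_mark = "?" in query
        has_word = bool(QUESTION_WORD.search(query))
        has_aux = bool(LEADING_AUX.match(query))
        stats["total"] += 1
        stats["question_mark"] += int(has_mark)
        stats["question_word"] += int(has_word)
        stats["leading_aux"] += int(has_aux)
        stats["any"] += int(has_mark or has_word or has_aux)
    return stats


# Made-up sample; real input would be a batch of logged queries.
sample = [
    "what is life?",
    "life",
    "is pluto a planet",
    "the who discography",  # false positive: the band, not a question
    "does it rain on mars",
]
print(question_stats(sample))
```

Anything flagged by "any" would still need a manual look, since names like The Who or titles like "What If" trip the same patterns.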

(And I'd be happy to do the data slogging.)

Yay, thanks @TJones !

TJones lowered the priority of this task from Medium to Low. Aug 27 2020, 7:54 PM