
Exact phrase match trumps title match
Closed, Resolved, Public

Description

Searching for articles should have (without quotes) doesn't find the page "List of articles every Wikipedia should have" among its first results.

I'd expect a title which contains all the words I searched to come before one that doesn't. If I cared about exact phrase match that much, I would have used quotes.

Event Timeline

Nemo_bis raised the priority of this task from to Needs Triage.
Nemo_bis updated the task description.
Nemo_bis added a subscriber: Nemo_bis.
Restricted Application added a subscriber: Aklapper.

This is an interesting search query: none of the words here is discriminating, so their tf.idf scores won't be very high. So yes, the phrase rescore (and certainly the incoming links/template rescore) will kick the page you want out of the first page.

To fix this issue, I'd like to experiment with shingles. We have the suggest field, which contains shingles (from titles and redirects), and I'd like to reuse it to add a new boolean clause at the query level. I think it will help in this case because 4 terms would match: articles, should, have, "should have".
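
As a rough illustration of where the count of four comes from, here is a minimal sketch in Python. It assumes shingles are simply word unigrams and bigrams over lowercased, whitespace-split text; the actual analysis chain behind the suggest field may differ, and the function name shingles is made up for this example.

```python
def shingles(text, max_size=2):
    """Return the set of word n-grams of size 1..max_size (an assumed, simplified analysis)."""
    words = text.lower().split()
    out = set()
    for size in range(1, max_size + 1):
        for i in range(len(words) - size + 1):
            out.add(" ".join(words[i:i + size]))
    return out


title = "List of articles every Wikipedia should have"
query = "articles should have"

# Terms (single words and two-word shingles) shared by the query and the title.
matched = sorted(shingles(title) & shingles(query))
print(matched)
```

The query bigram "articles should" is not a shingle of the title, so only the three single words plus "should have" match.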

If shingle analysis is proven useful it could be a good replacement candidate for the phrase rescore we have today.

Deskana added a subscriber: Deskana.

@Nemo_bis The Search Team is investigating this issue. Can you give us some other examples of queries that you're running into where this happens? Thanks!

I don't have other examples in mind right now; I would just go through longer namespace 0 titles to find examples.

TJones set Security to None.

As @dcausse pointed out above, these terms are not strong search terms ("should" and "have" especially will have terrible TF/IDF scores), and "articles" and "should have" are fairly far apart in the target title. Adding a somewhat more contentful term, "wikipedia", to the query (i.e., articles wikipedia should have) gives the desired result in second place (and a link to it in the snippet of the first result).

This is the request of a sophisticated user, looking for a specific article. We have to serve novice users, too (and more of them, and more simply). @Nemo_bis, you said if you wanted a phrase, you could have used quotes—but we've been working with quoted phrases recently, and most human users (i.e., unsophisticated users) don't actually use quotes (and most quotes are not used by human users). Conversely, it is possible (though awkward) to search for everything in the title—intitle:articles intitle:should intitle:have—and that gives exactly the desired result.
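
For anyone who wants to script that workaround, a small sketch in Python that turns a plain query into the intitle: form shown above. The helper name intitle_query is hypothetical, and it assumes plain whitespace-separated words with no quoting or special characters.

```python
def intitle_query(query):
    """Prefix every whitespace-separated word with intitle: (simplified; no quoting handled)."""
    return " ".join("intitle:" + word for word in query.split())


print(intitle_query("articles should have"))
```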

When I first saw this ticket a while back, I went trolling through enwiki looking at random articles with long enough names to construct a query like this—spatially separated, non-distinctive words that can be somewhat meaningfully grouped into a search query. I couldn't find anything.

It's easier to find examples in Meta-Wiki because there are more non-contentful names/labels, especially using sub-pages. But even then it was hard for me to come up with another good example. (See below for some not-so-good examples.)

I believe we need more examples of this type of problem before we commit to trying to fix it. This seems to be just an odd combination of quirks of language and Meta-Wiki page naming. (I think we have to be careful about making global changes based on non-Wikipedia usage, too, since the patterns of titles and searches are very different there.)

100% perfect search results are impossible—and even human level intelligence without the context of Meta-Wiki contents might not be able to figure out the intent of the original query. (I—having approximately human-level intelligence—originally read it as a fragment of "articles should have X" not two fragments of "articles X should have".)

I went back to the query mines and went looking for examples of this kind of problem. Using the Random Page button, I looked for long titles where I could put together a plausible seeming two- or three-word phrase using some of the words near the beginning and some near the end, with at least a couple of intervening words.

I searched these extracted phrases, and noted cases where the desired result did not appear in the top 5 results, but a "semi-phrasal" result did appear in the top three results, with no more than one of the search terms in the page title. By "semi-phrasal" I mean that I didn't require an exact phrase match, but words are near each other in the provided snippet, so it's a proximity match.
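
The selection criteria above can be written down as a small predicate. This is only a sketch of how I read them: whether a result is "semi-phrasal" was judged by eye from the snippet, so it is passed in as a set of titles, and all function names are made up. The placeholder titles in the usage lines are not real search results.

```python
import re


def title_term_count(title, query):
    """Count how many query terms appear as words in the page title."""
    title_words = set(re.findall(r"\w+", title.lower()))
    return sum(1 for term in query.lower().split() if term in title_words)


def is_problem_case(results, desired_title, query, semi_phrasal_titles):
    """True if the desired page is outside the top 5 while a semi-phrasal
    result with at most one query term in its title is in the top 3."""
    if desired_title in results[:5]:
        return False
    return any(
        title in semi_phrasal_titles and title_term_count(title, query) <= 1
        for title in results[:3]
    )


# Ranked result titles; the two placeholders are hypothetical filler.
results = ["placeholder A", "HydroGeoSphere/Selecting Zones", "placeholder B"]
print(is_problem_case(
    results,
    desired_title="Change Issues in Curriculum and Instruction",
    query="issues instruction",
    semi_phrasal_titles={"HydroGeoSphere/Selecting Zones"},
))
```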

These kinds of examples are hard to find in many wikis. Proportionally, there aren't that many long article titles with words in them that aren't very contentful in Wikipedias. I looked on en.wikipedia.org, es.wikipedia.org, and fr.wikipedia.org, and couldn't find anything that met all my criteria—there are lots of short titles; words that are less vague bring up good articles with those words in the title (including the one I was looking for), etc.

I found longer titles in en.wikiversity.org, en.wikisource.org, and en.wikinews.org, but no problematic searches.

I found a few examples on meta.wikimedia.org, and one each on www.mediawiki.org and en.wikibooks.org. I looked at at least several dozen pages and tried as many examples as I could on each wiki. I was hoping to find 10 examples, but stopped after finding only six.

In the list below, "phr hit" is a top-three result for the search. The number in parentheses before the page title is its rank, so "(2) HydroGeoSphere/Selecting Zones" means that page was the second result.

___site:  meta.wikimedia.org
__title:  Wikipedia is not a convalescent center
_search:  wikipedia center
phr hit:  (1) Grants:IEG/Mbazzi Village writes Wikipedia
snippet:  Create a Wikipedia-center in the Mbazzi Village in Uganda,

___site:  meta.wikimedia.org
__title:  Wikimedia Blog/Drafts/Wikinews launches education program
_search:  Wikimedia Education Program
phr hit:  (1) Grants:Evaluation/Learning modules/2What is a Program?
snippet:  content partnerships * Wikimedia Education Program Wiki Loves Monuments * 

___site:  meta.wikimedia.org
__title:  International Wikinews Writing Contest
_search:  International contest
phr hit:  (2) Grants:Learning patterns/Delivering Prizes
snippet:  store might be an attractive prize option for an international contest, but the cost of

___site:  meta.wikimedia.org
__title:  Translation requests/WMF/Press releases/Wikipedia stable, but still dependent on your support
_search:  translation support
phr hit:  (2) Fundraising 2009/supplementary messages/en
snippet:  2009/Translations. Translations of wmf:Support Wikipedia: [+/-] da/dansk 

___site:  www.mediawiki.org
__title:  Extension:PhpTags Functions/Functions/Function handling
_search:  extension handling
phr hit:  (1) Extension:Disambiguator (category MIT licensed extensions)
snippet:  his allows other extensions to handle disambiguation pages as a separate class of page.

___site:  en.wikibooks.org
__title:  Change Issues in Curriculum and Instruction
_search:  issues instruction
phr hit:  (2) HydroGeoSphere/Selecting Zones
snippet:  due to previously issued instructions

None of these are great examples, but I don't think the original example was a great query either (adding "wikipedia" to it improves it immensely).

Unless there are other, better concrete examples, I suggest we close this issue.

I'm just adding a note concerning some thoughts we had (Erik, Trey and I) on the way we boost content in titles with Cirrus.

Cirrus uses an unconventional way to boost words in titles. This technique is based on copy_to, an Elasticsearch parameter used in the mapping configuration.
The relevant commit is https://gerrit.wikimedia.org/r/#/c/146793/

It works by copying a boosted field multiple times to the all field. For instance, a word in the title is copied 10 times to the all field. This was implemented mostly for performance reasons: with the all field we can run a query over only 2 fields (plain & stemmed).
The drawbacks are:

  • boosting a field by increasing its number of occurrences has limited impact on high-frequency words. To grossly oversimplify, the tf part of the tf.idf formula is implemented as sqrt(wordfreq). This boost technique simply adds +10 to wordfreq if the word is in the title; if wordfreq is already high, the impact is very limited.
  • we analyze the same text multiple times
  • we increase disk utilization (by copying multiple times we generate more positions).
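
To put numbers on the first drawback, here is a minimal sketch in Python using the same gross simplification, tf = sqrt(wordfreq). The function names are made up for the example, and real Lucene scoring involves more factors than this.

```python
import math

TITLE_COPIES = 10  # a title word is copied 10 times into the all field


def tf(wordfreq):
    """Grossly simplified term-frequency factor: sqrt(wordfreq)."""
    return math.sqrt(wordfreq)


def title_gain(body_freq):
    """How much the tf factor grows when the word is also in the title."""
    return tf(body_freq + TITLE_COPIES) / tf(body_freq)


for body_freq in (1, 10, 100):
    print(body_freq, round(title_gain(body_freq), 2))
```

For a word that appears once in the text, being in the title multiplies the tf factor by about 3.3. For a word that already appears 100 times, the gain is only about 5%.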

The conventional way to boost a word from a specific field with Lucene is to apply a boost factor on the tf.idf result. This was the default behavior before we implemented the boosted all field.
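
A sketch of that conventional approach in Python, again with the simplified tf = sqrt(wordfreq). The per-field sum, the idf value of 1.0 and the boost factor of 10 are assumptions for illustration only; they are not the values Cirrus or Lucene actually use.

```python
import math


def tfidf(freq, idf):
    """Simplified tf.idf for one field: sqrt(freq) * idf."""
    return math.sqrt(freq) * idf


def per_field_score(title_freq, body_freq, idf, title_boost):
    """Score title and body separately, then boost the title's tf.idf result."""
    return title_boost * tfidf(title_freq, idf) + tfidf(body_freq, idf)


def title_contribution(body_freq, idf=1.0, title_boost=10.0):
    """Extra score a page gets for having the word once in its title."""
    with_title = per_field_score(1, body_freq, idf, title_boost)
    without_title = per_field_score(0, body_freq, idf, title_boost)
    return with_title - without_title


for body_freq in (1, 100):
    print(body_freq, title_contribution(body_freq))
```

Under these assumptions the title match adds the same amount to the score no matter how often the word occurs in the text, which is why frequent words in a title still count.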

You can see the results of such queries by adding &cirrusUseAllFields=no to the search results URL. In that case, your article becomes the first result: https://meta.wikimedia.org/w/index.php?title=Special%3ASearch&profile=default&search=articles+should+have&fulltext=Search&cirrusUseAllFields=no
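
If you want to try this on several queries, a small Python sketch that appends the parameter to an existing search URL using only the standard library. The helper name with_all_fields_disabled is made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_all_fields_disabled(url):
    """Return the search URL with cirrusUseAllFields=no set (replacing any existing value)."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "cirrusUseAllFields"
    ]
    params.append(("cirrusUseAllFields", "no"))
    return urlunsplit(parts._replace(query=urlencode(params)))


search_url = (
    "https://meta.wikimedia.org/w/index.php"
    "?title=Special%3ASearch&profile=default&search=articles+should+have&fulltext=Search"
)
print(with_all_fields_disabled(search_url))
```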


Fascinating, thanks for this investigation. :) (The expectations I had when filing this report were vastly exceeded.)
That trick was introduced 4 months before Cirrus was successfully enabled on en.wiki; some of the performance considerations were probably (rightly) over-conservative and can be reviewed.