
Search finds a section name but not a page name
Open, Medium, Public

Description

Searching for “Je suis venir te dire que je m'en vais” on fr.wp finds the “#Je_suis_venue_te_dire_que_je_m'en_vais” section as the second result, but does not find the following pages:

  • Je suis venu te dire que je m'en vais
  • Je suis venue te dire que je m'en vais…
  • Je suis venue te dire que je m'en vais - Sheila live à l'Olympia 89

which are the top three results when searching for the correct phrase, “Je suis venu te dire que je m'en vais”.

Note that the Wdsearch gadget results do include the “Je suis venu te dire que je m'en vais” pages, but that is probably T219108.

Event Timeline

Pols12 created this task. Dec 20 2019, 6:34 PM
Restricted Application added a subscriber: Aklapper. Dec 20 2019, 6:34 PM
dcausse triaged this task as Medium priority. Jan 2 2020, 2:21 PM
dcausse moved this task from needs triage to later on... on the Discovery-Search board.
dcausse added a subscriber: dcausse.

One problem is that the French stemmer does not conflate venir with its conjugated forms venu or venue.
The page Je suis venu te dire que je m'en vais does not contain venir, meaning that it cannot match the query Je suis venir te dire que je m'en vais.
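
As a rough illustration of that point, here is a minimal Python sketch. `toy_stem` is a hypothetical stand-in, not the stemmer CirrusSearch actually uses; it only assumes that a light stemmer strips simple endings, so venu and venue end up with the same stem while venir does not.

```python
def toy_stem(word: str) -> str:
    """Hypothetical light stemmer: lowercase, then strip a trailing 's' and a trailing 'e'."""
    w = word.lower()
    for suffix in ("s", "e"):
        # Leave very short words (je, te, que) untouched.
        if len(w) > 3 and w.endswith(suffix):
            w = w[:-1]
    return w


title = "Je suis venu te dire que je m'en vais"
title_stems = {toy_stem(w) for w in title.split()}

# The two participle forms share a stem, so either one can match the title.
print(toy_stem("venu") == toy_stem("venue"))

# The infinitive keeps a different stem, so it is absent from the title's terms.
print(toy_stem("venir") in title_stems)
```

Under these assumptions the query term venir has nothing to match in the page title, which is the situation described above.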

To add to the confusion, we display:
“Showing results for je suis venu te dire que je m'en vais. Search instead for Je suis venir te dire que je m'en vais.”
But we are actually displaying results for Je suis venir te dire que je m'en vais.
This suggestion is handled by the new glent system, which is currently broken. I think this issue is already fixed in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/554603

Pols12 added a comment. Jan 2 2020, 9:30 PM

> One problem is that the French stemmer does not conflate venir with its conjugated forms venu or venue.
> The page Je suis venu te dire que je m'en vais does not contain venir, meaning that it cannot match the query Je suis venir te dire que je m'en vais.

That makes sense, but how does it manage to find “Sheila (section Je suis venue te dire que je m'en vais (1989))” as the first result?

Pols12 updated the task description. Jan 2 2020, 9:31 PM
dcausse added a comment. (Edited) Jan 3 2020, 8:29 AM

> One problem is that the French stemmer does not conflate venir with its conjugated forms venu or venue.
> The page Je suis venu te dire que je m'en vais does not contain venir, meaning that it cannot match the query Je suis venir te dire que je m'en vais.

> That makes sense, but how does it manage to find “Sheila (section Je suis venue te dire que je m'en vais (1989))” as the first result?

I think the reason is that venir, in its infinitive form, is present elsewhere on the page for Sheila. Note that venue is not highlighted in the section snippet. Also, the fact that a section snippet is shown does not mean that all the words in your query are in that section. The smallest unit in search is the whole page; the section snippet in the search result is just a hint that some words matched the section name, nothing more.
Here, venir is found for example in

dans le but de venir en aide aux femmes du Sahel (“with the aim of helping the women of the Sahel”)

in the 1983-1989 : Virage artistique section.
On the other hand, venir is never found on the page about Gainsbourg's song, nor on the page about the Jane Birkin album.
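
To make the page-level matching concrete, here is a small Python sketch. It is only a model of the behavior described above, not CirrusSearch code: `tokens` and `match_page` are hypothetical helpers, stemming is left out, the page content is reduced to the fragments quoted in this task, and the empty section body is a placeholder.

```python
import re


def tokens(text: str) -> set:
    """Lowercased word tokens; apostrophes are kept so m'en stays one token."""
    return set(re.findall(r"[\w']+", text.lower()))


def match_page(page: dict, query: str):
    """Return (matched, section_hint). The whole page is the matching unit."""
    q = tokens(query)
    page_terms = tokens(page["title"])
    for heading, body in page["sections"].items():
        page_terms |= tokens(heading) | tokens(body)
    if not q <= page_terms:  # every query word must occur somewhere on the page
        return False, None
    # The hint is just the heading sharing the most words with the query.
    hint = max(page["sections"], key=lambda h: len(q & tokens(h)), default=None)
    return True, hint


sheila = {
    "title": "Sheila",
    "sections": {
        "1983-1989 : Virage artistique": "dans le but de venir en aide aux femmes du Sahel",
        "Je suis venue te dire que je m'en vais (1989)": "",  # placeholder body
    },
}
song = {"title": "Je suis venu te dire que je m'en vais", "sections": {}}

query = "Je suis venir te dire que je m'en vais"
print(match_page(sheila, query))
print(match_page(song, query))
```

In this model the Sheila page matches because venir appears in a different section, while the section shown as the hint is the one whose heading overlaps most with the query. The song page is rejected because venir appears nowhere on it.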

Pols12 added a comment. Jan 3 2020, 2:00 PM

Indeed! =)

In fact, I didn’t expect the stemmer to manage to search for “venu” instead of “venir”, but I would have thought CirrusSearch would try dropping the “venir” keyword to find articles where all the other words are in the title.
(Before your explanation, I believed it had already done so to find the Sheila article, so I found that behavior inconsistent.)

> Indeed! =)
>
> In fact, I didn’t expect the stemmer to manage to search for “venu” instead of “venir”, but I would have thought CirrusSearch would try dropping the “venir” keyword to find articles where all the other words are in the title.
> (Before your explanation, I believed it had already done so to find the Sheila article, so I found that behavior inconsistent.)

Oh I see, I should have started with this: yes, Cirrus applies an AND between all the words of your query, even if the query contains many words like this one.
We have been looking into relaxing this and allowing some words not to appear in the documents found (see T112178), but this work was never finished due to other priorities. We still believe it might be helpful.
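
For illustration only, the difference between the strict AND and a relaxed variant can be sketched in a few lines of Python. The `matches` function and its `max_missing` parameter are hypothetical names made up for this example; they do not describe how T112178 would be implemented.

```python
def matches(title: str, query: str, max_missing: int = 0) -> bool:
    """True if at most `max_missing` query words are absent from the title.

    max_missing=0 models the strict AND; a higher value models a relaxed query.
    """
    title_terms = set(title.lower().split())
    query_terms = set(query.lower().split())
    missing = query_terms - title_terms
    return len(missing) <= max_missing


title = "Je suis venu te dire que je m'en vais"
query = "Je suis venir te dire que je m'en vais"

print(matches(title, query))                 # strict AND: venir is missing
print(matches(title, query, max_missing=1))  # relaxed: one missing word tolerated
```

With the strict setting the song title is rejected because of the single word venir; tolerating one missing word would let it through, which is the behavior Pols12 expected.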

Pols12 added a comment. Jan 3 2020, 6:18 PM

So I think the help text should be fixed; it currently says: “There may be results that do not contain one or more of your search terms.”

Many thanks for your answers. Please feel free to close this as a duplicate of T112583, or keep it for tracking French stemming.