
Inconsistent search behavior when asciifolding is not activated on text/plain
Closed, ResolvedPublic

Description

Searching sydvast (Sydväst) on sv.wikipedia.org triggers inconsistent behavior:
Completion finds it (even if not properly ranked): https://sv.wikipedia.org/w/api.php?action=opensearch&format=json&formatversion=2&search=sydvast&namespace=0&limit=10&suggest=true
The go feature works: https://sv.wikipedia.org/w/index.php?search=sydvast&title=Special:S%C3%B6k&go=G%C3%A5+till
But a classic fulltext search does not find the page: https://sv.wikipedia.org/w/index.php?title=Special:S%C3%B6k&profile=default&fulltext=Search&search=sydvast&searchengineselect=on

This is most likely a regression introduced by the new fulltext builder activated for BM25. The new builder uses a filter + scoring query approach. Sadly, asciifolding is on by default for some special fields (all_near_match) but not for the other text fields, so the filter excludes this page.

A quick workaround would be to add a new SHOULD clause on the all_near_match field.
The proper fix would be to fix this inconsistency in the analysis chain.

Event Timeline

Restricted Application added a subscriber: Aklapper.
debt triaged this task as Medium priority. Jan 26 2017, 11:21 PM
debt edited projects, added Discovery-Search (Current work); removed Discovery-Search.
debt subscribed.

Let's do the quick work-around for this for now.

Did some testing and all_near_match won't do the trick, but all_near_match.asciifolding in the filter looks to do the trick. I'm not 100% clear if this should be for all queries, or just some. Going to do the first patch applying it everywhere. I think this might be a bit problematic though, because even though it makes it through the filter stage it still doesn't match for any text scoring. The entire score of the document comes from incoming links.
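To make the concern above concrete, here is a minimal sketch of a filter + scoring query with the extra clause added to the filter only. It is not the actual CirrusSearch query: the field names `all`, `title` and `text` and the function `build_fulltext_query` are assumptions for illustration, and only `all_near_match.asciifolding` comes from this thread.

```python
import json


def build_fulltext_query(term: str) -> dict:
    """Build an illustrative filter + scoring bool query as a plain dict."""
    return {
        "query": {
            "bool": {
                # Filter stage: decides which documents are kept, adds no score.
                "filter": [
                    {
                        "bool": {
                            "should": [
                                # Assumed catch-all text field (hypothetical name).
                                {"match": {"all": {"query": term, "operator": "and"}}},
                                # Workaround clause: lets folded near matches through.
                                {"match": {"all_near_match.asciifolding": {"query": term}}},
                            ],
                            "minimum_should_match": 1,
                        }
                    }
                ],
                # Scoring stage: only these clauses contribute a text score.
                "should": [
                    {"match": {"title": {"query": term}}},
                    {"match": {"text": {"query": term}}},
                ],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_fulltext_query("sydvast"), indent=2, ensure_ascii=False))
```

In this sketch a page that passes the filter only through the folded near-match clause matches none of the scoring clauses, so it gets no text score. That is the situation described above, where the whole score comes from incoming links.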

Change 342117 had a related patch set uploaded (by EBernhardson):
[mediawiki/extensions/CirrusSearch] Workaround inconsistent search behaviour without asciifolding

https://gerrit.wikimedia.org/r/342117

Change 342141 had a related patch set uploaded (by EBernhardson):
[mediawiki/extensions/CirrusSearch] [WIP] Alternate fix for swedish folding

https://gerrit.wikimedia.org/r/342141

Deskana added subscribers: TJones, Deskana.

We thought this would be easy... but it's not! @TJones is taking a long look at this one.

Do we have any analysis of search patterns on Swedish Wikipedia that would indicate this is a problem? It might be you are fully aware of everything I write and have good reasons to do this, but:

The Swedish alphabet includes the letters Å, Ä and Ö. In Swedish, the tréma in Ä is not considered to be an A with a diacritical mark giving hints how to pronounce it in this particular case, like e.g. naïve in English. It's a letter in its own right, as much as I and U are, with a sound that's distinctly different from A. For example, when Swedish children learn to write, they normally don't confuse Å or Ä with A. The sounds are different enough that this is not a problem. They do confuse Ä with the often similar-sounding E, and Å with the often similar-sounding O. Someone who speaks Swedish is probably more prone to misspell "sydväst" as "sydvest" than as "sydvast".

My name ("Jönsson") is the 16th most common surname in Sweden. Jonsson is the 12th most common surname in Sweden. If you search for Jönsson and get all the Jonsson hits, or the other way around, that's a pretty irritating bug. This will be true for a large number of words. If we assume most people searching for something on the Swedish wikis speak Swedish well enough to read the pages, it's usually not more likely they'd want results for Å and Ä to find A than they'd want results for I to find Y.

Now, I've never used search that much on Wikipedia, but I can't remember "sydvast" ever being able to find "sydväst". I'd assume it's been done this way on purpose, because if you don't differentiate between Å, Ä and A, or O and Ö, it has pretty much the same effect as if you didn't differentiate between M and N in English.

Forgot to post this earlier: my analysis of the effects on Swedish of character folding (which turns out to be ICU folding, even if configured as ascii-folding).

This is not in response to @Johan's comments above—his comments are in response to this analysis (thanks for the review!)

I remember now that in the past we specifically called out å, ä, and ö in Swedish (and Finnish) as things we don't want to fold (which is why we need to do some minimal investigation into each language before turning on folding—to find such things). I just got caught up in doing the analysis after @EBernhardson asked for help.

I'm wondering if this is a big deal. From the point of view of the completion suggester, sydvast is pretty much the same as sydvqst—it's sydväst with some random character stuck where the ä goes. Lots of other letters will give the same results: q, k, t— sydvast only happens to work out because there aren't any real words that are closer. sydvest and sydvost give different results because there are partial matches that are better than sydväst.

I'm leaning toward thinking we are fixing the wrong thing. I feel like the completion suggester is doing the right thing, and it's just that sydvast and sydväst happen to look similar and only one is a word.

So, is the Go feature doing the right thing? Clearly it's doing some folding, as searching for ţĥę ḿâțȑīˣ gets you The Matrix—as it does on English, Spanish, French, and Hungarian Wikipedias... so I guess that's right.

If you find articles (or redirects) that differ only by the diacritics (e.g., äpple vs Apple or töm vs tom or tår vs tar), the Go feature does what you'd expect.

I'm leaning heavily towards this not really being a problem, but I've got one or two more Swedish speakers who I'm hoping will share their opinion, too.

Yes, sorry for not contextualising my comment properly.

I'd agree with @TJones that folding of e.g. á, à, ã etc is expected and appreciated behaviour – anything that's not considered to be a full, normal character in the local alphabet.

(What's said here also goes for Norwegian and Danish, and – I'm fairly certain – Icelandic and Faroese.)

I agree with @Johan. The sydvast - sydväst example is a bad example; I don't think any Swedish speaker would make that spelling mistake. And we are pretty used to searches without the ÅÄÖ resulting in a miss, so that is not a big issue (for me at least). I would say Ö is never mistaken for O, and ÅÄ never for A.

However, we have different dialects in Sweden, and where I'm from Ä is often mistaken for E. So for example fox, "räv" in Swedish, is pronounced "rev". But "rev" in written Swedish means clawed. That is something totally different, so I don't want to get a hit for it when I search. And Johan already said Å could be mistaken for O.

I think changing something like this would need a massive analysis, and there's so much that could go wrong. I don't see how it works now as a big issue. I would say this is a no fix.

I also looked into @TJones's analysis of the effects on Swedish of character folding: I'm surprised there are so many words there that are not Swedish words (to the left in Collision Examples). Something like 28-29 of the 100 aren't Swedish words at all. And I mean not at all. And the words to the right are not the Swedish version.

There's also some bad examples of merging:
[1 törstande] -> [13 Torsten] - mapping thirsty to the person name Torsten
[1 störar] -> [31 Stor][78 Stora][4 Store][264 stor][210 stora] [13 store][3 stores][3 storhet] - mapping a poke with large
[1 fårade] -> [5 Far][5 Fara][7 Faras][2 Farlig][40 Fars] [66 far][8 fara][2 farlig][5 farliga][1 farligaste] [4 farligt][5 fars] - mapping "lines in the face" with father and dangerous.

But also some good ones:
[75 stränder] -> [13 Strand][42 strand][8 stranden] - mapping multiple beaches with one beach.

Thanks, @Peter & @Johan! I'll dig into the origin of this issue and make sure there's not something beyond sydväst/sydvast that is an issue. If not, I think I'll vote to close it. Though I might argue for leaving the general folding in place (for non-Swedish diacritics), but just exclude å ä and ö from folding.

@Peter, the reason there are so many non-Swedish words is that Wikipedias always have lots of foreign words (maybe not many instances of each, but lots of individual words), and a lot of them have "foreign" accents; folding those is usually helpful. The one that stuck out the most to me was French cœur: very few people are going to figure out how to type œ if it's not on their keyboard.

Okay, so the consensus is that the original goal here was to have more consistency in behavior between the Go feature/near match/upper right search box and full text search. Folding å, ä, and ö is the wrong way to go about it.

Since we're 90% of the way done with doing folding in Swedish the right way (folding everything but å, ä, and ö), I'm going to go ahead and do that real quick in T160562.
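As a rough illustration of what "folding everything but å, ä, and ö" means, here is a minimal Python sketch. It is not the Elasticsearch/ICU folding filter used in production, and the function name `fold_swedish` is made up for this example. It only strips combining marks after Unicode decomposition, so it does not handle things real ICU folding covers, such as œ.

```python
import unicodedata

# Letters that are part of the Swedish alphabet and must not be folded.
PRESERVED = set("åäöÅÄÖ")


def fold_swedish(text: str) -> str:
    """Strip diacritics from every character except å, ä and ö."""
    out = []
    # NFC first, so that "a" + combining diaeresis is seen as a single "ä".
    for ch in unicodedata.normalize("NFC", text):
        if ch in PRESERVED:
            out.append(ch)
            continue
        # Decompose the character and drop its combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


if __name__ == "__main__":
    for word in ["sydväst", "Jönsson", "måst", "café", "naïve"]:
        print(word, "->", fold_swedish(word))
```

With this kind of folding, sydväst and sydvast stay different terms, while á, à, ã and similar "foreign" diacritics are still folded to their base letter.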

A better approach to solving the inconsistency in general is to incorporate all_near_match_ascii_folding into the full text scoring, so that a full text search for sydvast also has a chance to return sydväst when there is no competing title match for sydvast.

That's a much bigger task than finishing off proper folding for Swedish, so I'm going to drop this task and do that. There's also a question of what priority adding all_near_match_ascii_folding into the full text scoring should have. As far as I know, no users are complaining loudly about sydvast not finding sydväst in full text search because it's a typo on par with sydvost or sydvest—it's always nice when search could read your mind, but it's still a typo.

debt lowered the priority of this task from Medium to Low. Mar 21 2017, 5:15 PM
debt edited projects, added Discovery-Search; removed Discovery-Search (Current work).

Moving to backlog board for now, until we get T160562 done and we come to a consensus on how to move forward with this.

Change 342141 abandoned by EBernhardson:
[WIP] Alternate fix for swedish folding

Reason:
went a different direction, see linked ticket for details

https://gerrit.wikimedia.org/r/342141

Change 342117 abandoned by EBernhardson:
Workaround inconsistent search behaviour without asciifolding

Reason:
went a different direction, see ticket for more details

https://gerrit.wikimedia.org/r/342117

There's a related discussion on Swedish Wikipedia where users are complaining about how a search for måst will, if there's no article with that name, lead you to the article Mast. This is confusing for two reasons. First, if there's no mention of måst at all on the wiki, that's not obvious. You just ended up at an article about a different word for no apparent reason: it's not necessarily obvious to a Swedish-speaking reader that folding has been done, or why you would do so.

Second, since there are plenty of occurrences of the word måst in Swedish Wikipedia, you're not seeing results with the word you're looking for even though it's there in a good number of articles.

I am not completely sure this is related, but:

Typing "styckhylm" into the search field on sv.wikipedia.org gives search results for "stockholm", an option to create the article Styckhylm, and to search for (literal) "styckhylm" instead. This is all expected.

But typing "stöckhölm" gives no search results, but directly displays the article Stockholm.

This is not good behaviour, since the letter Ö (together with Å and Ä) is a letter in its own right.

A specific example given in the discussion linked by Johan above is the island Gälön, which could have an article in Swedish Wikipedia. Typing "gälön" into the search field directly transfers to the article Galon. Instead, it should provide search results for "galon" (or similar), and an option to create the page Gälön.

The Swedish alphabet is ABC...ZÅÄÖ. As it works now, the search function behaves very differently depending upon what letters are in the words you search for.

Thanks for the feedback, @NH!

There are a lot of glaring problems with search, such as the ones you pointed out. In many cases, they've existed for many years. The code is often old and confusing, which makes fixing these problems hard. The "go" behaviour (as in, what happens when you hit enter in the search box) is one of the most confusing and old bits there is. I'm not surprised that there are issues with it. Often it's hard for the Search Team, which is relatively small, to know what is right and wrong in different languages; Swedish is not a language we know well, but fortunately we have people like you, @Johan, and @Peter to help us.

@TJones mentioned above that folding ö to o is the wrong approach, which you are confirming here by saying that "stöckhölm" should not take you to "stockholm". That's good! It means we're moving in the right direction. Thanks for helping us figure that out.

In summary, the behaviour that you're describing as problematic is exactly what we're trying to fix. We might not get to it first try though. Please bear with us while we work on it! If you have more examples, we'd appreciate that too.

Thanks. Maybe it should be pointed out that this is new behaviour. I don't know since when, maybe a month or so. (It makes creating pages with Å,Ä or Ö in the title hard, which has not been the case before. This is how it was discovered.)

In T155822#3157812, @NH wrote:

Thanks. Maybe it should be pointed out that this is new behaviour. I don't know since when, maybe a month or so. (It makes creating pages with Å,Ä or Ö in the title hard, which has not been the case before. This is how it was discovered.)

I didn't know that. Thanks! I'm not sure whether it's directly related to this issue or not; perhaps @TJones and @dcausse can help me figure that out.

The timing makes it likely that the Elasticsearch upgrade is at fault. With a system as complex as ours, in so many languages, with content changing so quickly, it's hard to be aware of every effect that an upgrade could have. (It wasn't until I went to upgrade Ukrainian that I noticed that the ES5 upgrade had caused regressions over there, too.)

It might be something else, and @dcausse is probably the best person to look into it, since he knows that code the best. (David, now that I've thrown you under the bus, let me know if you want me to try to figure out what's going on.)

TJones claimed this task.

I think everything here is fixed. ö, ä, and å are all treated as independent letters and using a instead of ä is the same as using u instead of ä, and other diacritics like á are ignored. Depending on whether you use the completion suggester, go feature, or full text search, you get additional suggestions depending on the place of the typos or the frequency of the incorrect word—all as expected.