
Allow autocomplete (search) link selection in VE to tolerate more typos
Open, Low, Public

Description

In VE, the autocomplete (search) link selection tolerates a certain number of typos. That tolerance works well for English and other more or less analytic languages.

However, synthetic languages have endings that the autocomplete doesn't ignore. As a result, users have to fix the same small, regular things again and again whenever they want to add a link. Please have a look at an example: Polish case endings.

In Polish, when I add a link to an article entitled "Sąd Najwyższy Stanów Zjednoczonych" (US Supreme Court) in a sentence where the genitive case is correct ("Sądu Najwyższego Stanów Zjednoczonych"), I don't have to change the first word (-u is like a typo) but I do have to change the second one (y -> ego). It gets messy when more words have endings. And that's without mentioning Finnish and the like (e.g. very simple changes: Helsinki -> Helsingissä or talo -> taloissani).

Each language has its own specific endings. It's impossible to list all of them for all languages in one place. Instead, you could allow wikis to set "tolerated typos" on their local pages, just as has been done with Citoid.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Deskana added subscribers: TJones, Deskana.

Before I moved to the Editing team, I was the product manager for the Search team for a few years, so I happen to know a bit about this!

The completion suggester was introduced by the Search team a few years ago as an improvement to the old prefix search system. One of the features the completion suggester introduced was the ability to correct up to two typos. The computational power required to correct typos increases significantly as you try to correct more and more typos, so we had to choose a number to correct, and we chose two. This doesn't represent an English language bias, merely an unfortunate fact of engineering.
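To make the two-typo limit concrete, here is a minimal sketch that measures how many single-character edits separate the inflected Polish forms from the article title. It uses plain Levenshtein distance as an assumption; the real suggester's distance metric and matching logic may differ (for instance, it may treat a transposition as one edit). The names `MAX_TYPOS`, `levenshtein` and `within_typo_budget` are made up for this illustration.

```python
import unicodedata

MAX_TYPOS = 2  # the limit described above; the constant name is hypothetical


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute), by code point."""
    # Normalise so that "ą" is one code point whichever way it was typed.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def within_typo_budget(typed: str, title: str, budget: int = MAX_TYPOS) -> bool:
    """True if the typed text is at most `budget` edits away from the title."""
    return levenshtein(typed, title) <= budget


if __name__ == "__main__":
    pairs = [
        ("Sądu", "Sąd"),                         # 1 edit: fits the budget
        ("Najwyższego", "Najwyższy"),            # 3 edits: over the budget
        ("Sądu Najwyższego", "Sąd Najwyższy"),   # 4 edits: over the budget
    ]
    for typed, title in pairs:
        print(typed, "->", title, levenshtein(typed, title),
              within_typo_budget(typed, title))
```

Under these assumptions, the first word fits within two edits but the second word alone already needs three, which matches the behaviour described in the task.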

For the above reasons, it's not quite as simple as "allowing" the system to complete more typos; by doing so, there could be too much strain placed on the system. From a product perspective, giving users the option of breaking the entire search system is out of the question. However, it's possible that the number of typos corrected could be changed for languages that would benefit from it, such as Polish. Since I'm not on that team any more, I'll defer to them as to whether this number can be increased now, or if there are any better solutions than the one I thought of off the top of my head. :-)

P.S. The Search team has native speakers of English, French, Russian, and Mandarin, as well as people with intermediate skills in Cantonese, Hebrew, Spanish, German, Ukrainian, and quite a few other languages. The team also has a specialist in linguistics, @TJones. We did a lot of work for non-English languages, such as deploying specialised language analysers, incidentally including one for Polish (T158682: Deploy new Polish language analyser). The team continues work along these lines. :-)

Ofc, I didn't assume that there could be an English-centric bias. It just works for languages that don't use many endings.

Thanks for pushing forward. I'll keep an eye on this task.

@tarlocesilion, can you point us to more info on the configuration of "tolerated typos" for Citoid? Is it just the fuzziness of the match, or does it allow specification of specific variants that don't really count as typos?

Unfortunately, some languages are just more computationally difficult than others. English has it very easy in this regard. @Amire80 gave a recent talk on this; the keyboard and font aspects are partly historical accident, but English morphology and orthography really are easier to handle for search, autocompletion, spell-checking, etc.

I'm also curious whether this is largely an inconvenience, or if it keeps people from being able to find and link articles. In English, if I want to link to the Dog article in the sentence "I like dogs.", I would create a link to Dog, but then I have to lowercase the initial letter and add the plural -s. There can always be differences between the citation form of an entry and the form used in running text. Obviously, the Polish or Finnish situation could have an even larger divergence, and other grammatical processes like noun incorporation (uncommon in major languages, but it does happen in languages we support) or stem-changing grammatical markers (like English strong verbs and some German plurals) would require more significant editing. But is it at least clear to people what they need to do when adding links in the Visual Editor in Polish?

@dcausse may also have more detailed info on the performance cost of allowing different kinds of matching and scoring schemes for autocompletion. I don't know whether it's plausible to do prefix matches on each word, for example. If so, then "Sądu Najwyższego Stanów Zjednoczonych" might try to match any title that contains words starting with Sąd-, Naj-, Sta-, and Zje-. On the other hand, while this might work for longer titles, it might give too many results, and too many poor results, for single-word titles. Cleverly scoring the results would help, but that can also be computationally expensive.
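The per-word prefix idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the three-character cut-off (`PREFIX_LEN`), the function names, and the tiny `SAMPLE_TITLES` list are all invented here, and a real search index would not scan titles one by one like this.

```python
from typing import List

PREFIX_LEN = 3  # hypothetical cut-off, mirroring "Sąd-, Naj-, Sta-, Zje-"

# Hand-picked sample list for the demo, not real index data.
SAMPLE_TITLES = [
    "Sąd Najwyższy Stanów Zjednoczonych",
    "Sąd Najwyższy",
    "Stany Zjednoczone",
    "Sądownictwo",
]


def word_prefixes(text: str, length: int = PREFIX_LEN) -> List[str]:
    """Truncate every word of the text to at most `length` characters."""
    return [word[:length].casefold() for word in text.split()]


def matches_by_word_prefix(query: str, title: str, length: int = PREFIX_LEN) -> bool:
    """True if every truncated query word is a prefix of some title word.

    Word order is deliberately ignored, which is one of the weaknesses
    of this scheme.
    """
    prefixes = word_prefixes(query, length)
    title_words = [word.casefold() for word in title.split()]
    return bool(prefixes) and all(
        any(title_word.startswith(prefix) for title_word in title_words)
        for prefix in prefixes
    )


def suggest(query: str, titles: List[str], length: int = PREFIX_LEN) -> List[str]:
    """Return the titles that match the query under the per-word prefix rule."""
    return [title for title in titles if matches_by_word_prefix(query, title, length)]


if __name__ == "__main__":
    # The long inflected phrase narrows down to a single title.
    print(suggest("Sądu Najwyższego Stanów Zjednoczonych", SAMPLE_TITLES))
    # A single short word matches several titles.
    print(suggest("Sądu", SAMPLE_TITLES))
```

With this sample, the four-word genitive phrase selects only the intended title, while the single word "Sądu" matches three of the four titles, which is the "too many results" concern in miniature.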

Ofc, I didn't assume that there could be an English-centric bias. It just works for languages that don't use many endings.

Apologies for misunderstanding. Thanks for the feedback!

can you point us to more info on the configuration of "tolerated typos" for Citoid?

I didn't mean the configuration of "tolerated typos" for Citoid - there's no such thing. I put on their local pages in italics, because I imagined that wikis could be allowed to define local settings on local pages. Citoid allows things to be set locally on wikis.

is it at least clear to people what they need to do when adding links in the Visual Editor in Polish?
if it keeps people from being able to find and link articles

It depends on how tech-savvy a given user is. There are people who don't distinguish the target page from the text that works as a link, and they're OK with the fact that "[[Sądu Najwyższego]]", which means "[[of the Supreme Court]]", is a redlink.

whether it's plausible to do prefix matches on each word, for example. If so, then "Sądu Najwyższego Stanów Zjednoczonych" might try to match any title that contains words starting with Sąd-, Naj-, Sta-, and Zje-

On each word, or - let's say - on the first two or three words. That would resolve a greater part of our problem. Yup, I like that idea!

Plugging a stemmer into the completion suggester is possible, but I'm having difficulty anticipating all the drawbacks that may occur in doing so.
The reason is that it's an autocomplete search, meaning that we do partial matching. In all the examples given above, only fully written phrases have been studied, but we have to keep in mind that we still suggest pages when the phrase/word is not fully written. In other words, we would be applying a stemming algorithm to partially typed words.
Perhaps the best approach would be to set up a small demo so that Polish speakers could try this approach and tell us if it's worthwhile.
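The concern about stemming partially typed words can be illustrated with a toy example. The suffix list below is invented for this sketch and is in no way a real Polish stemmer; `toy_stem`, `suggests` and `typing_trace` are hypothetical helpers, and the real completion suggester does not work this way internally.

```python
from typing import List, Tuple

# Illustrative suffixes only; real Polish morphology is far richer.
TOY_SUFFIXES = ("ego", "ych", "ów", "u", "y")


def toy_stem(word: str) -> str:
    """Strip the longest matching toy suffix, keeping at least two characters."""
    word = word.casefold()
    for suffix in sorted(TOY_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def suggests(typed_word: str, title_word: str) -> bool:
    """Autocomplete-style check: stemmed input must be a prefix of the stemmed title word."""
    return toy_stem(title_word).startswith(toy_stem(typed_word))


def typing_trace(full_word: str, title_word: str) -> List[Tuple[str, bool]]:
    """Simulate typing `full_word` one character at a time."""
    return [
        (full_word[:n], suggests(full_word[:n], title_word))
        for n in range(1, len(full_word) + 1)
    ]


if __name__ == "__main__":
    for typed, matched in typing_trace("Najwyższego", "Najwyższy"):
        print(f"{typed:<12} {'match' if matched else 'no match'}")
```

In this toy setup the suggestion is offered up to "Najwyższ", disappears for "Najwyższe" and "Najwyższeg" (the stemmer does not recognise the incomplete ending), and only comes back once "Najwyższego" is fully typed. That kind of flicker is one plausible drawback a demo with Polish speakers could help evaluate.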

Concerning other approaches:
1/ Doing prefix matching on every word is possible but it has drawbacks too:

  • we'll lose the word ordering (to do so, we split the title into words and then index the prefixes; information about positions is lost unless we do a costly phrase query)
  • it'll certainly pollute the search results, because queries with short words may suggest unrelated results: starting to type "to be" could suggest "Tony Bennett".

2/ Allowing more typos is certainly not the right approach, for a few reasons:

  • technical limitation: our backend only supports 2 typos
  • it'd be too slow, the search space would be too large
  • it'll pollute the results: the typo correction algorithm is not aware of any stemming rules.
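To give a feel for why the search space grows so quickly with each extra typo, here is a brute-force sketch that counts how many distinct strings lie within a given number of edits of a word. It assumes a 26-letter lowercase alphabet and plain insert/delete/substitute edits; the backend certainly does not enumerate candidates like this, so the numbers only illustrate the growth rate, not actual cost. All names are made up for the example.

```python
import string
from typing import List, Set

# Assumption: a 26-letter alphabet. Real titles use far more characters,
# which makes the growth even steeper.
ALPHABET = string.ascii_lowercase


def one_edit_away(word: str) -> Set[str]:
    """All strings reachable from `word` by one insert, delete or substitution."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    substitutes = {left + c + right[1:] for left, right in splits if right for c in ALPHABET}
    inserts = {left + c + right for left, right in splits for c in ALPHABET}
    return deletes | substitutes | inserts


def neighbourhood_sizes(word: str, max_edits: int) -> List[int]:
    """Number of distinct strings within 0, 1, ..., max_edits edits of `word`."""
    seen = {word}
    frontier = {word}
    sizes = [1]
    for _ in range(max_edits):
        # Expand only the strings found in the previous round.
        frontier = {v for w in frontier for v in one_edit_away(w)} - seen
        seen |= frontier
        sizes.append(len(seen))
    return sizes


if __name__ == "__main__":
    # Kept to two edits on purpose: a third edit multiplies the count again
    # and is already noticeably slow with this naive approach.
    for edits, size in enumerate(neighbourhood_sizes("talo", 2)):
        print(f"within {edits} edit(s): {size} candidate strings")
```

Each additional edit multiplies the number of candidates by roughly two orders of magnitude for a short word like this, and none of those candidates are informed by stemming rules, which ties in with the three points above.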

The only solution worth trying would be to make the completion suggester aware of some stemming rules, but it's hard to anticipate how it will behave in autocomplete use (applying stemming rules to partially typed words).

EBjune subscribed.

We'll take a look when time permits

MPhamWMF subscribed.

Closing out low/est priority tasks over 6 months old with no activity within last 6 months in order to clean out the backlog of tickets we will not be addressing in the near term. Please feel free to reopen if you think a ticket is important, but bear in mind that given current priorities and resourcing, it is unlikely for the Search team to pick up these tasks for the indefinite future. We hope that the requested changes have either been addressed by or made irrelevant by work the team has done or is doing -- e.g. upgrading Elasticsearch to a newer version will solve various ES-related problems -- or will be subsumed by future work in a more generalized way.

RhinosF1 removed a project: Discovery-Search.
RhinosF1 subscribed.

Re-opening tasks and removing from team workboard per IRC feedback given yesterday and discussion with MPham.