searching I as 1 in Kabardian Wikipedia
Open, Medium, Public

Description

As of now, it is hard to search in the Kabardian wiki using 1 instead of I (this substitution is quite common). For example, when one searches the Kabardian wiki for 'к1элъ' (meaning 'кIэлъ'), autocomplete starts suggesting appropriate variants only after 3-4 characters, and the search engine cannot find the page at all.
Apparently, the problem has been solved for Adyghe wiki.

Event Timeline

Thanks for the report, @Isbms27. A few more details.

Many languages of the Caucasus area use the Cyrillic alphabet with the addition of the character known as Palochka. It looks like the Latin capital I, but it's a distinct and unrelated character. People who write in these languages often don't have this character on their keyboards, and instead they write one of the following characters:

  • 1 (digit 1)
  • I (capital Latin I, as in Idaho)
  • l (small Latin l, as in linkage)
  • І (Cyrillic I; rare, but possible)

Wikipedia editors sometimes write the replacement characters, too, but eventually the preferred character is the real Unicode Palochka, and this is fixed by other editors or by bots. However, many people who use the Wikipedia search box may use the replacement characters.

The search system should automatically attempt to search for the character Ӏ when Cyrillic words with 1, l, or I are entered.

For example:

  • к1элъ -> кIэлъ
  • кIэлъ -> кIэлъ
  • кlэлъ -> кIэлъ
  • кІэлъ -> кIэлъ
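The substitution described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the search backend does it: the function name `normalize_palochka` is made up, and the rule "replace a lookalike only when it sits next to a Cyrillic character" is an assumption added here to keep plain Latin or numeric queries untouched.

```python
import re

PALOCHKA = "\u04C0"         # CYRILLIC LETTER PALOCHKA
LOOKALIKES = "1Il\u0406"    # digit one, Latin I, Latin l, Cyrillic I (U+0406)
CYRILLIC = "\u0400-\u04FF"  # main Cyrillic block (assumed sufficient here)

# Hypothetical rule: a lookalike counts as Palochka only if a Cyrillic
# character comes directly before it or directly after it.
_PATTERN = re.compile(
    f"(?<=[{CYRILLIC}])[{LOOKALIKES}]"
    f"|[{LOOKALIKES}](?=[{CYRILLIC}])"
)


def normalize_palochka(query: str) -> str:
    """Replace Palochka lookalikes in Cyrillic context with the real character."""
    return _PATTERN.sub(PALOCHKA, query)


if __name__ == "__main__":
    for q in ["к1элъ", "кIэлъ", "кlэлъ", "к\u0406элъ", "1992 год", "Linux"]:
        print(repr(q), "->", repr(normalize_palochka(q)))
```

With this rule, all four spellings of the example word come out identical, while "1992 год" and "Linux" are left alone because their lookalike characters have no Cyrillic neighbor.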

It should be safe to do it for all languages written in Cyrillic - ady, kbd, ce, av, and even Russian, Ukrainian, etc.

At the moment, Wikipedia's search box doesn't handle it well. For example, in kbd.wikipedia.org:

  • Writing к1 doesn't autocomplete to anything, while КӀ autocompletes to "КӀэрей Республикэ", "КӀах Адыгэбзэ", etc.
  • Writing к1элъ autocompletes to "КIэлъ", which is good, but it probably happens because it's similar enough, and not because of special handling of the Palochka character. But writing кIэлъ and pressing Enter produces zero results. It should produce КIэлъ as the result, or even go directly to the article.

I think that it's broken in both ady and kbd the same way.

EBjune triaged this task as Medium priority. Feb 6 2018, 6:08 PM

Great info, @Amire80! Thanks! It's certainly possible to map characters onto each other for the purposes of search and autocomplete. For Russian we map Ё/ё to Е/е (see T124592) in both search and for autocomplete. (More commonly we only conflate characters in search, but Russian really doesn't care about the dots on ё.)

Mapping all those І-like characters together is possible, but could have some weird consequences. Looking at Russian Wikipedia—which is the largest WP using Cyrillic—and the examples of кI- and plain I- in autocomplete, we can see that:

  • only к1- (with digit one) has any prefix matches, so there wouldn't be much competition there, but...
  • combining І/l/1/I would mean that on typing any one of those four characters, the following (top 3 current suggestions for each) would likely be intermixed as autocomplete suggestions: І (кириллица), Ірій, ІО, L, Linux, Led Zeppelin, 1, 1992 год, 1991 год, I, Internet Movie Database, iPhone. There would be some sorting based on exactness of match, but I worry it could still be confusing out of the context of к or other Cyrillic character before it.
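To make the intermixing concern concrete, here is a toy Python sketch. It assumes a case-insensitive prefix matcher that unconditionally folds all the lookalikes onto one key; `fold`, `suggest`, and the title list are illustrative only (the titles are the ones mentioned above plus one unrelated control title), and real autocomplete ranking is much more involved.

```python
PALOCHKA = "\u04C0"

# After lowercasing, fold digit 1, Latin i and l, Cyrillic i (U+0456),
# and both cases of Palochka onto a single key character.
FOLD = {ord(c): PALOCHKA for c in "1il\u0456\u04CF\u04C0"}


def fold(text: str) -> str:
    """Lowercase the text, then conflate all Palochka lookalikes."""
    return text.lower().translate(FOLD)


def suggest(prefix: str, titles: list[str]) -> list[str]:
    """Return titles whose folded form starts with the folded prefix."""
    key = fold(prefix)
    return [t for t in titles if fold(t).startswith(key)]


TITLES = [
    "\u0406 (кириллица)", "\u0406рій", "\u0406О",
    "L", "Linux", "Led Zeppelin",
    "1", "1992 год", "1991 год",
    "I", "Internet Movie Database", "iPhone",
    "Москва",  # control title that should never match
]

if __name__ == "__main__":
    # Typing a single "1" now competes with every lookalike-initial title.
    print(suggest("1", TITLES))
```

In this toy setup, typing "1", "l", or "I" returns the same twelve titles, which is the kind of mixing described above. Restricting the fold to positions next to a Cyrillic character would avoid it for these single-character prefixes.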

On the Kabardian wiki, only I (Latin capital i) gets more than one suggestion at the moment (and it seems to be far and away preferred as the first letter for titles).

"Apparently, the problem has been solved for Adyghe wiki."

It looks to me like Adyghe WP has a lot of redirects with 1s. Autocomplete for К1- does show suggestions that all have explicit 1's in them.

"... it probably happens because it's similar enough, and not because of special handling of the Palochka character."

Yeah, I think that's exactly it, and that's why you have to go a few characters beyond the 1 to get a good match. к1э shows suggested titles starting with Къэ- because they are ranking higher than those starting with КӀэ-. It's essentially matching к?э- at that point.

"It should be safe to do it for all languages written in Cyrillic - ady, kbd, ce, av, and even Russian, Ukrainian, etc."

I would worry about languages that use both Cyrillic and Latin scripts, like Serbian. I also worry a bit about the impact on Wiktionaries, where there are likely to be entries in lots of scripts. (Note that we don't have different language processing by project—so Wikipedia and Wiktionary get the same treatment.)

So, my suggestion would be to try it for Kabardian (and do the usual tests of the impact of the change), and deploy it. If it works well, expand to the languages listed as using the character (ady, av, ce, lbe, lez, i.e., Adyghe, Avar, Chechen, Lak, Lezgian; but not inh/Ingush, since it is still in the incubator), and then revisit applying it to larger wikis, like Russian or Ukrainian, and mixed-script wikis like Serbian.

There's also a longer-term goal of refactoring the way language analysis is configured, which would make it easier to enable this for several languages using a shared configuration, so going a little slower would make the overall search language config less messy.

@TJones just to be clear, when you say:

my suggestion would be to try it for Kabardian (and do the usual tests of the impact of the change), and deploy it

do you mean change the mappings in the cirrus config for the Kabardian wiki?

Yep, that's it. Map all of І/l/1/I to one character, and see what happens. I'd also want to look into how smart the mapping can be, since "l992" (with a Latin lowercase L) seems weirder than necessary.
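For illustration, here is roughly what the two options could look like as Elasticsearch analysis settings, written as a Python dict so it can be dumped to JSON. This is a sketch under assumptions, not the actual CirrusSearch configuration: the filter and analyzer names (`palochka_mapping`, `palochka_after_cyrillic`, `kbd_text`) are hypothetical, and the "only after a Cyrillic character" pattern is just one possible way to make the mapping smarter.

```python
import json

PALOCHKA = "\u04C0"

settings = {
    "analysis": {
        "char_filter": {
            # Option 1 (hypothetical name): unconditional mapping of every
            # lookalike, which would also rewrite things like "l992".
            "palochka_mapping": {
                "type": "mapping",
                "mappings": [
                    f"{src} => {PALOCHKA}"
                    for src in ("1", "I", "l", "\u0406")
                ],
            },
            # Option 2 (hypothetical name): rewrite a lookalike only when it
            # directly follows a character from the Cyrillic block.
            "palochka_after_cyrillic": {
                "type": "pattern_replace",
                "pattern": "(?<=[\u0400-\u04FF])[1Il\u0406]",
                "replacement": PALOCHKA,
            },
        },
        "analyzer": {
            # Hypothetical analyzer wiring in the context-sensitive filter.
            "kbd_text": {
                "type": "custom",
                "char_filter": ["palochka_after_cyrillic"],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }
}

if __name__ == "__main__":
    print(json.dumps(settings, ensure_ascii=False, indent=2))
```

The character filter runs before tokenization, so the digit in к1элъ is replaced before the tokenizer sees the word. Note that the second option as written would not cover a word-initial lookalike, so it would need a companion pattern for that case.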

Nemo_bis renamed this task from searching I as 1 in Kabardian Wiki to searching I as 1 in Kabardian Wikipedia. Feb 20 2018, 5:27 PM