Reorder aliases in the search suggestions on ukwiki so that Ukrainian aliases always come first
Closed, Resolved, Public

Description

If possible, the search suggestions showing special page aliases in Russian shouldn't be displayed. If not, they should be deprioritized, so that the search suggestions appear in Ukrainian first. This is per T39314.

Now if one tries to search for special pages starting with "У", one gets 6 suggestions in Russian, 2 in Ukrainian and 2 more in Russian, in that order. Anyway, a Ukrainian wiki should display special page name suggestions in Ukrainian first. I think this is obvious.

Russian fallback is being removed from ukwiki, and the special page aliases were supposed to go too, but the devs decided to keep them for backwards compatibility (essentially they shouldn't appear anywhere in ukwiki unless they're used somewhere in the wikitext). That's why we're requesting that they be removed from the suggestions.

Can this be done?

Event Timeline

Restricted Application added subscribers: Base, Aklapper.
Piramidion renamed this task from Don't show special page aliases in russian in the search suggestions on ukwiki to Don't show special page aliases in Russian in the search suggestions on ukwiki. Nov 26 2016, 1:20 AM

I think a more correct formulation of the request would be to give a (much) lower match score to all title aliases coming from a fallback language (including English).

That's probably a sensible thing to do in all cases, I think: personally, on wikis in languages other than my own I like to get suggestions for fallback (English) special page names when I start typing them explicitly, but I'm sometimes surprised when I get a mix of English and Italian suggestions on an Italian wiki.

Yes, it would be good to deprioritize all the fallback languages in the search suggestions, but I think that would be a somewhat different request from the current one. Quote:

These are not fallbacks, but aliases. Ukrainian users will never see this, except when a Russian keyword is manually added to a wiki page.

Is this about suggestions displayed when you start to type Special: ? (the screenshot below is taken from uk wikipedia)

specialpages_fallback.png (507×576 px, 57 KB)

The request made in this ticket would be to change the ranking of these suggestions and always prefer the ones written in the wiki language?
So searching for Special: on ukwiki would display special pages in Ukrainian instead of English?
Would it be possible to have some examples of input search queries with actual vs desired behavior?

No, Ukrainian uses Cyrillic script, so you don't have to do anything with English. The problem is with Ukrainian/Russian results.

This is what we see now, if we start to type Special:У (I highlighted the unwanted suggestions in Russian; the two that are not highlighted, are in Ukrainian):

special2.png (544×740 px, 66 KB)

Since the Russian fallback is being removed from all Ukrainian wikis, and we are not supposed to see any Russian messages in the interface, except when some aliases are used in the wikitext, the suggestions in Russian ideally shouldn't show. We'd like the suggestions to look something like this:
special4.png (144×737 px, 21 KB)

If this is achievable, it would be perfect. If not, we'd like to see Ukrainian results first, with the Russian ones after them, like this:
special3.png (529×739 px, 66 KB)

TL;DR (i.e. "in short"):

The request made in this ticket would be to change the ranking of these suggestions and always prefer the ones written in the wiki language?

Yes.

Yes.

I think I made myself clear enough. If it is possible to remove Russian suggestions completely, we prefer this to be done. The other case is true only if the first one is impossible to accomplish for some technical reasons. If you want to enable this for all the other languages, start another ticket. And please do not upset our applecart. You've shown enough disrespect to our community's decisions already.

Thanks for the clarification, I'll have a look to see if it's possible to blacklist a particular language. If it's not, I think we should create another ticket to discuss the ranking behavior with regard to languages.

This thing looks tricky. The aliases are specified in languages/messages/MessagesUk.php in $specialPageAliases and there's no indication at all which alias belongs to which language - they are all in the same set. It would be easy to drop extra aliases and that'd fix the issue, but otherwise I see no good way of knowing which alias is "correct" and which is not.

So maybe a way to do this would be, instead of putting all the aliases in $specialPageAliases, to split it into two variables, like $specialPageAliases and $specialPageAliasesBC or something, and use the latter only in specific contexts... not sure, I'll look a bit more into this.

If this could work somehow, I can assist with creating a list of Russian aliases based on that php file.

Looks like this part:

essentially they shouldn't appear anywhere in ukwiki unless they're used somewhere in the wikitext

is wrong, at least with the current code. The current code uses getSpecialPageAliases() to fetch the list of current page aliases, which all the Special: search and suggestions are based on. There is no difference between the items in the result: they are not marked with languages or anything, they are just lists of aliases per item. So in the current code, all the aliases are basically the same. Changing those semantics may be dangerous, as other code may rely on them.

What might be possible to do is to rerank the result according to the place in the alias list (SpecialPageFactory assumes the first alias is the "true" one and treats it as "local name" for the purpose of redirects). Not sure if that'd play well with other languages but since that's what SpecialPage::getLocalName does anyway, we may have a basis for it.
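The reranking idea above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the MediaWiki patch itself: the alias table, the `suggest` helper and its parameters are hypothetical, and the sample aliases are illustrative rather than copied from MessagesUk.php.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for $specialPageAliases: canonical name -> ordered
# aliases. By assumption the first alias is the "local name"; the entries
# below are illustrative sample data only.
SAMPLE_ALIASES: Dict[str, List[str]] = {
    "Log": ["Журнали", "Журналы", "Журнал"],
    "Watchlist": ["Список_спостереження", "Список_наблюдения"],
    "Listfiles": ["Список_файлів", "Список_файлов", "Список_изображений"],
}


def suggest(prefix: str, aliases: Dict[str, List[str]], limit: int = 10) -> List[str]:
    """Return aliases starting with `prefix`, ranked by position in their list.

    Aliases at position 0 come first, then position 1, and so on;
    ties within one position are broken alphabetically.
    """
    needle = prefix.casefold()
    matches: List[Tuple[int, str]] = []
    for names in aliases.values():
        for position, name in enumerate(names):
            if name.casefold().startswith(needle):
                matches.append((position, name))
    matches.sort()  # sorts by (position, name)
    return [name for _, name in matches[:limit]]


if __name__ == "__main__":
    for alias in suggest("Список", SAMPLE_ALIASES):
        print(alias)
```

With this ordering, a purely alphabetical tie-break only applies among aliases that share the same position, so every first alias is listed before any second or later alias.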

Change 324332 had a related patch set uploaded (by Smalyshev):
Rank aliases in search in order they appear in the messages file.

https://gerrit.wikimedia.org/r/324332

I'm not sure if I understood everything correctly, but I doubt this would make any difference. The suggestions are ranked alphabetically, and the same aliases usually use names that start with different letters in Ukrainian and Russian. So in the example described above, nothing would change. Of course, that's if I got you right.

So you say there's no technical means to split Russian and Ukrainian aliases, so that the Russian ones wouldn't pop up in the search dialog?

Could you think of any ways of changing the $specialPageAliases code even to improve the ranking by the main language as compared to the fallback (described above by Nemo_bis)? To achieve this the split would be needed. And it would make it possible to blacklist Russian aliases as well.

So in the example described above, nothing would change.

This is not correct. The Ukrainian aliases would be ranked before the Russian ones, since they are the first alias and the Russian ones are the second alias. Isn't that what you wanted?

So you say there's no technical means to split Russian and Ukrainian aliases, so that the Russian ones wouldn't pop up in the search dialog?

No. There's no technical means to split Russian and Ukrainian aliases, but there are technical means to order aliases in the same order they appear in the message file. Since the Ukrainian alias is always first, and the Russian alias is second or further down, this gives Ukrainian aliases a higher ranking.

Could you think of any ways of changing the $specialPageAliases code

It may be possible but since this code is used in many other places, I would be very reluctant to change semantics of the API functions that retrieve aliases. The semantics is that the aliases specified in $specialPageAliases are the ones used to do all alias stuff. There is, however, special meaning to the first alias, if I understand SpecialPage::getLocalName right - it is treated as "true local name". Basing on this, we can assign to it a higher search position than other aliases. But we can not remove other aliases, because that would break other functionality (and we have no idea which of the aliases are "Russian", they are not marked in any way). Unless, of course, you're ok with removing Russian ones completely, in that case the problem goes away naturally.

to improve the ranking by the main language as compared to the fallback

There's no fallback here. The aliases are in the same Uk message file, which means, to the code, they are all Ukrainian aliases.

And it would make it possible to blacklist Russian aliases as well.

No, because there are no "Russian aliases" as such. All aliases are in the Uk message file. It's not possible to remove some of them without actually removing them from the message file. It is possible, however, to make search ranking prefer the first alias to later ones.

I chatted with James F and he said that @Amire80 may be able to help offer some insight here. :-)

Unless, of course, you're ok with removing Russian ones completely, in that case the problem goes away naturally.

We're ok with that, but the language engineers and other devs are not (for those backwards compatibility reasons), unless we find some way to hunt up all the Russian aliases wikitext inclusions in all the Ukrainian wikis and replace them with Ukrainian ones.

Anyway, thanks for the explanations and your time. Hope this works.
Nevertheless I also hope that Amire80 has some other ideas about solving this issue.

Unless, of course, you're ok with removing Russian ones completely, in that case the problem goes away naturally.

We're ok with that, but the language engineers are not (for those backwards compatibility reasons), unless we find some way to hunt up all the Russian aliases wikitext inclusions in all the Ukrainian wikis and replace them with Ukrainian ones.

I can perhaps help a little with this using the search engine, but unfortunately we don't index links to special pages directly, so I have to use a rough heuristic of a regex search against the source text. This particular search query is too expensive to be allowed directly in Special:Search (although it could be broken up into pieces and done as a couple of queries against each possible wiki). I ran a regex search against all the uk language wikis for all the special page aliases that are not the first in the list.

wikis queried: ukwiki, ukwikibooks, uawikimedia, ukwikinews, ukwikisource, ukwikivoyage, and ukwiktionary
regex query: (Special|Спеціальна):(Активные_участники|Системные_сообщения|Все_мои_файлы|Все_страницы|Недопустимое_название|Пустая_страница|Заблокировать|Источники_книг|Разорванные_перенаправления|Категории|Сменить_e\-mail|Сменить_почту|Сменить_пароль|Сравнение_страниц|Подтвердить_e\-mail|Подтвердить_почту|Вклад|Создать_учётную_запись|Создать_пользователя|Зарегистрироваться|Тупиковые_страницы|Удалённый_вклад|Двойные_перенаправления|Править_список_наблюдения|Письмо_участнику|Отправить_письмо|Развёртка_шаблонов|Экспорт|Выгрузка|Редко_редактируемые|Поиск_дубликатов_файлов|Путь_к_файлу|Импорт|Отменить_подтверждение_адреса|Тестирование_JavaScript|Блокування|Блокування_IP\-адрес|Список_блокировок|Блокировки|Поиск_ссылок|Список_администраторов|Список_ботов|Список_файлов|Список_изображений|Права_груп_користувачів|Права_групп_участников|Список_прав_групп|Список_перенаправлений|Список_файлов\-дубликатов|Список_участников|Заблокировать_БД|Заблокировать_базу_данных|Журналы|Журнал|Изолированные_страницы|Длинные_страницы|Объединение_историй|Поиск_по_MIME|Самые_категоризованные|Самые_используемые_файлы|Наибольшее_количество_интервики\-ссылок|Найбільше_посилань|Самые_используемые_страницы|Самые_используемые_категории|Самые_используемые_шаблоны|Наибольшее_количество_версий|Переименовать_страницу|Переименование|Переименовать|Мой_вклад|Мой_язык|Моя_страница|Моё_обсуждение|Мои_загрузки|Новые_файлы|Новые_страницы|Сброс_пароля|Постоянная_ссылка|Настройки|Указатель_по_началу_названия|Защищённые_страницы|Защищённые_названия|Случайная_страница|Случайная|Случайное_перенаправление|Свежие_правки|Связанные_правки|Удаление_правки|Поиск|Короткие_страницы|Спецстраницы|Метки|Разблокировка|Некатегоризованные_категории|Некатегоризованные_файлы|Некатегоризованные_страницы|Некатегоризованные_шаблоны|Восстановить|Восстановление|Разблокировка_БД|Неиспользуемые_категории|Неиспользуемые_файлы|Неиспользуемые_шаблоны|Загрузка|Скрытная_загрузка|Вход|Завершение_сеанса|Выход|Управление_правами|Версия|Требуемые_категории|Требуемые_файлы|Требуемые_страницы|Требуемые_шаблоны|Список_наблюдения|Ссылки_сюда|Без_интервики)

This gives 120 pages across the uk language wikis that perhaps would need to be edited. List at: P4542
Someone else might want to review what I've done to ensure this is actually collecting all the pages it needs to.
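A query of this shape could also be generated from the alias lists rather than assembled by hand. The sketch below is a hedged Python illustration: it uses Python's `re` syntax, not the search engine's regex dialect, and the `build_alias_regex` helper, the sample alias table and the namespace names passed in are assumptions made for the example.

```python
import re
from typing import Dict, Iterable, List, Pattern

# Hypothetical sample data: canonical name -> ordered aliases (first = local name).
SAMPLE_ALIASES: Dict[str, List[str]] = {
    "Contributions": ["Внесок", "Вклад"],
    "Watchlist": ["Список_спостереження", "Список_наблюдения"],
    "ChangeEmail": ["Змінити_e-mail", "Сменить_e-mail", "Сменить_почту"],
}


def build_alias_regex(
    aliases: Dict[str, List[str]],
    namespaces: Iterable[str] = ("Special", "Спеціальна"),
) -> Pattern[str]:
    """Build a pattern matching any alias that is NOT first in its list."""
    secondary = [name for names in aliases.values() for name in names[1:]]
    # Try longer aliases first so a short alias cannot shadow a longer one.
    secondary.sort(key=len, reverse=True)
    ns_part = "|".join(re.escape(ns) for ns in namespaces)
    alias_part = "|".join(re.escape(name) for name in secondary)
    return re.compile(f"(?:{ns_part}):(?:{alias_part})")


if __name__ == "__main__":
    pattern = build_alias_regex(SAMPLE_ALIASES)
    print(bool(pattern.search("[[Спеціальна:Вклад/Example]]")))   # secondary alias
    print(bool(pattern.search("[[Спеціальна:Внесок/Example]]")))  # first alias
```

Skipping index 0 mirrors the "not the first in the list" rule, so it inherits the same weakness: a genuine Ukrainian alias in second or later position would still be included.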

These are Ukrainian titles, not Russian:

Блокування
Блокування_IP-адрес
Права_груп_користувачів
Найбільше_посилань

That's part of the trickiness: some of the additional aliases are genuine Ukrainian ones, not Russian duplicates. Generally if there are only 2 aliases, then the second one is Russian, but if there are more, some of them would be Ukrainian.

There is, however, special meaning to the first alias, if I understand SpecialPage::getLocalName right - it is treated as "true local name". Basing on this, we can assign to it a higher search position than other aliases.

This is reasonable.

Nemo_bis triaged this task as Medium priority. Nov 30 2016, 2:00 PM

Thanks Stas, I agree with you that this is the best we can do in a reasonable amount of time.
@Piramidion: I tested the patch and it seems to do what you want (at least on my local MediaWiki)

before:

before.png (369×571 px, 39 KB)

after:
after.png (359×571 px, 39 KB)

I hope it's just transitional and that the Russian labels can be removed in the future after a cleanup of all the references in wikitext.

I thank all of you for devoting your time and efforts to resolving this issue.

@EBernhardson can you please do the same for the magic words as well (if that's possible)? I'd greatly appreciate it. It's that I'd like to kill two birds with one stone :)

And incidentally I'd like to ask: if I fix all the links in Wikimedia wikis, what to do with the third party wikis that use Ukrainian as the primary language? Just like @Base mentioned somewhere, I don't think there are many big Ukrainian third party wikis, if any. And personally I don't think they're old enough to "remember" the time when the special pages had been translated into Russian (I'm not sure, but I think that's the main source for the Russian links in Ukrainian wikis). Besides, having only 120 search results in the largest (Wikimedia) wikis in Ukrainian means that there would be only a few links to special pages in Russian in the third party wikis, if any at all. Can this be considered negligible (just like some outdated MediaWiki messages are)? Or should we find a way to check those wikis as well? I mean, before removing all those aliases MediaWiki-wide?

I have checked several pages from the list P4542 and it seems that most of them are false positives: four Ukrainian aliases had been included in the search query, and some of the Russian aliases, like "Журнал", also result in false positives (pages containing Special:Журнали, which is in Ukrainian). I'll count the pages with true positives, but even now I think that backwards compatibility is a weak justification for keeping all those Russian aliases.

@EBernhardson can you please do the same for the magic words as well (if that's possible)? I'd greatly appreciate it. It's that I'd like to kill two birds with one stone :)

If this is super straightforward then we can try, but Discovery-ARCHIVED's primary focus is the search system, not general language fallback issues.

And incidentally I'd like to ask: if I fix all the links in Wikimedia wikis, what to do with the third party wikis that use Ukrainian as the primary language? Just like @Base mentioned somewhere, I don't think there are many big Ukrainian third party wikis, if any. And personally I don't think they're old enough to "remember" the time when the special pages had been translated into Russian (I'm not sure, but I think that's the main source for the Russian links in Ukrainian wikis). Besides, having only 120 search results in the largest (Wikimedia) wikis in Ukrainian means that there would be only a few links to special pages in Russian in the third party wikis, if any at all. Can this be considered negligible (just like some outdated MediaWiki messages are)? Or should we find a way to check those wikis as well? I mean, before removing all those aliases MediaWiki-wide?

These Russian aliases won't be removed, they'll simply be reordered so that the Ukrainian ones are first and the Russian ones are below them. I don't think any notification to third-party users is required for that.

I re-ran the query and added highlights this time; indeed Спеціальна:Журнали is the most common result. I've adjusted the query to remove the aliases Stas mentioned, and changed Журнал to Журнал[^и] to remove the false positives. The highlights should make it easier to review. The paste from before (P4542) has been updated. I don't know Ukrainian, but it looks like there are still a variety of false positives. The list is smaller (39 items) and easier to review now at least. It's probably worth mentioning though that the highlighting is not exhaustive: some piece of text is chosen as a representative sample and does not include all matches.
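The effect of changing Журнал to Журнал[^и] can be illustrated with a small Python sketch. The two patterns below are cut-down versions of the query containing only the Журналы/Журнал alternatives, they use Python's `re` rather than the search engine's regex syntax, and the sample wikitext strings are made up for the example.

```python
import re

# Cut-down versions of the query, keeping only the log-related aliases.
NAIVE = re.compile(r"(?:Special|Спеціальна):(?:Журналы|Журнал)")
ADJUSTED = re.compile(r"(?:Special|Спеціальна):(?:Журналы|Журнал[^и])")

# Made-up wikitext snippets for illustration.
UKRAINIAN_LINK = "[[Спеціальна:Журнали|журнали]]"   # Ukrainian alias, should be ignored
RUSSIAN_LINK = "[[Special:Журнал/block|журнал]]"     # Russian alias, should be found
RUSSIAN_PLURAL = "[[Спеціальна:Журналы]]"            # Russian alias, should be found

if __name__ == "__main__":
    for text in (UKRAINIAN_LINK, RUSSIAN_LINK, RUSSIAN_PLURAL):
        print(text, bool(NAIVE.search(text)), bool(ADJUSTED.search(text)))
```

The naive pattern matches the Ukrainian Журнали because Журнал is a prefix of it; the adjusted one rejects it. One caveat of `[^и]` is that it requires some character after Журнал, so an occurrence at the very end of the text would be missed.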

For actually removing the aliases, I have to refer back to the language team on what they would consider a complete enough solution.

Also, I could certainly come up with something similar for magic words, but again I would need help actually coming up with the list of tokens to search for. Most entries appear to have 3 items, which I would guess are [uk, ru, en], but some have more. Because magic words are used in a variety of ways there will probably be more false positives.

@EBernhardson Thanks for the adjustment. I've checked them all, found only 10 positives, 2 of which I didn't fix due to the context. I'll post the result on the "remove the Russian fallback" thread. If the language engineers agree to remove those aliases, the current patch wouldn't be needed anymore.

@Deskana

These Russian aliases won't be removed, they'll simply be reordered

In this case yes, but our primary intention is to remove the aliases completely, it's just that the devs aren't ready to do this yet. But considering the results we've got (see above), we might be close to achieving our primary goal.

@EBernhardson This is the list of magic words in Russian to search for:

(#ПЕРЕНАПР|#перенапр|#перенаправление|__БЕЗ_ОГЛАВЛЕНИЯ__|__БЕЗ_ОГЛ__|__БЕЗ_ГАЛЕРЕИ__|__ОБЯЗАТЕЛЬНОЕ_ОГЛАВЛЕНИЕ__|__ОБЯЗ_ОГЛ__|__ОГЛАВЛЕНИЕ__|__ОГЛ__|__БЕЗ_РЕДАКТИРОВАНИЯ_РАЗДЕЛА__|ТЕКУЩИЙ_МЕСЯЦ|ТЕКУЩИЙ_МЕСЯЦ_2|ТЕКУЩИЙ_МЕСЯЦ_1|НАЗВАНИЕ_ТЕКУЩЕГО_МЕСЯЦА|НАЗВАНИЕ_ТЕКУЩЕГО_МЕСЯЦА_РОД|НАЗВАНИЕ_ТЕКУЩЕГО_МЕСЯЦА_АБР|ТЕКУЩИЙ_ДЕНЬ|ТЕКУЩИЙ_ДЕНЬ_2|НАЗВАНИЕ_ТЕКУЩЕГО_ДНЯ|ТЕКУЩИЙ_ГОД|ТЕКУЩЕЕ_ВРЕМЯ|ТЕКУЩИЙ_ЧАС|МЕСТНЫЙ_МЕСЯЦ|МЕСТНЫЙ_МЕСЯЦ_2|МЕСТНЫЙ_МЕСЯЦ_1|НАЗВАНИЕ_МЕСТНОГО_МЕСЯЦА|НАЗВАНИЕ_МЕСТНОГО_МЕСЯЦА_РОД|НАЗВАНИЕ_МЕСТНОГО_МЕСЯЦА_АБР|МЕСТНЫЙ_ДЕНЬ|МЕСТНЫЙ_ДЕНЬ_2|НАЗВАНИЕ_МЕСТНОГО_ДНЯ|МЕСТНЫЙ_ГОД|МЕСТНОЕ_ВРЕМЯ|МЕСТНЫЙ_ЧАС|КОЛИЧЕСТВО_СТРАНИЦ|КОЛИЧЕСТВО_СТАТЕЙ|КОЛИЧЕСТВО_ФАЙЛОВ|КОЛИЧЕСТВО_УЧАСТНИКОВ|КОЛИЧЕСТВО_АКТИВНЫХ_УЧАСТНИКОВ|КОЛИЧЕСТВО_ПРАВОК|НАЗВАНИЕ_СТРАНИЦЫ|НАЗВАНИЕ_СТРАНИЦЫ_2|ПРОСТРАНСТВО_ИМЁН|ПРОСТРАНСТВО_ИМЁН_2|ПРОСТРАНСТВО_ОБСУЖДЕНИЙ|ПРОСТРАНСТВО_ОБСУЖДЕНИЙ_2|ПРОСТРАНСТВО_СТАТЕЙ|ПРОСТРАНСТВО_СТАТЕЙ_2|ПОЛНОЕ_НАЗВАНИЕ_СТРАНИЦЫ|ПОЛНОЕ_НАЗВАНИЕ_СТРАНИЦЫ_2|НАЗВАНИЕ_ПОДСТРАНИЦЫ|НАЗВАНИЕ_ПОДСТРАНИЦЫ_2|ОСНОВА_НАЗВАНИЯ_СТРАНИЦЫ|ОСНОВА_НАЗВАНИЯ_СТРАНИЦЫ_2|НАЗВАНИЕ_СТРАНИЦЫ_ОБСУЖДЕНИЯ|НАЗВАНИЕ_СТРАНИЦЫ_ОБСУЖДЕНИЯ_2|НАЗВАНИЕ_СТРАНИЦЫ_СТАТЬИ|НАЗВАНИЕ_СТРАНИЦЫ_СТАТЬИ_2|СООБЩЕНИЕ:|СООБЩ:|ПОДСТАНОВКА:|ПОДСТ:|ЗАЩПОДСТ:|СООБЩ_БЕЗ_ВИКИ:|мини|миниатюра|справа|слева|обрамить|сверхусправа|граница|основание|сверху|текст-сверху|посередине|снизу|текст-снизу|ссылка|НАЗВАНИЕ_САЙТА|ПИ:|ПИК:|ЛОКАЛЬНЫЙ_АДРЕС:|ЛОКАЛЬНЫЙ_АДРЕС_2:|ПУТЬ_К_СТАТЬЕ|ИДЕНТИФИКАТОР_СТРАНИЦЫ|НАЗВАНИЕ_СЕРВЕРА|ПУТЬ_К_СКРИПТУ|ПУТЬ_К_СТИЛЮ|ПАДЕЖ:|ПОЛ:|__БЕЗ_ПРЕОБРАЗОВАНИЯ_ЗАГОЛОВКА__|__БЕЗ_ПРЕОБРАЗОВАНИЯ_ТЕКСТА__|ТЕКУЩАЯ_НЕДЕЛЯ|ТЕКУЩИЙ_ДЕНЬ_НЕДЕЛИ|МЕСТНАЯ_НЕДЕЛЯ|МЕСТНЫЙ_ДЕНЬ_НЕДЕЛИ|ИД_ВЕРСИИ|ДЕНЬ_ВЕРСИИ|ДЕНЬ_ВЕРСИИ_2|МЕСЯЦ_ВЕРСИИ|МЕСЯЦ_ВЕРСИИ_1|ГОД_ВЕРСИИ|ОТМЕТКА_ВРЕМЕНИ_ВЕРСИИ|ВЕРСИЯ_УЧАСТНИКА|МНОЖЕСТВЕННОЕ_ЧИСЛО:|ПОЛНЫЙ_АДРЕС:|ПОЛНЫЙ_АДРЕС_2:|ПЕРВАЯ_БУКВА_МАЛЕНЬКАЯ:|ПЕРВАЯ_БУКВА_БОЛЬШАЯ:|МАЛЕНЬКИМИ_БУКВАМИ:|БОЛЬШИМИ_БУКВАМИ:|НЕОБРАБ:|ПОКАЗАТЬ_ЗАГОЛОВОК|__ССЫЛКА_НА_НОВЫЙ_РАЗДЕЛ__|__БЕЗ_ССЫЛКИ_НА_НОВЫЙ_РАЗДЕЛ__|ТЕКУЩАЯ_ВЕРСИЯ|ЗАКОДИРОВАННЫЙ_АДРЕС:|КОДИРОВАТЬ_МЕТКУ|ОТМЕТКА_ТЕКУЩЕГО_ВРЕМЕНИ|ОТМЕТКА_МЕСТНОГО_ВРЕМЕНИ|НАПРАВЛЕНИЕ_ПИСЬМА|#ЯЗЫК:|ЯЗЫК_СОДЕРЖАНИЯ|СТРАНИЦ_В_ПРОСТРАНСТВЕ_ИМЁН:|КОЛИЧЕСТВО_АДМИНИСТРАТОРОВ|ФОРМАТИРОВАТЬ_ЧИСЛО|ЗАПОЛНИТЬ_СЛЕВА|ЗАПОЛНИТЬ_СПРАВА|служебная|СОРТИРОВКА_ПО_УМОЛЧАНИЮ|КЛЮЧ_СОРТИРОВКИ|ПУТЬ_К_ФАЙЛУ:|метка|тэг|__СКРЫТАЯ_КАТЕГОРИЯ__|СТРАНИЦ_В_КАТЕГОРИИ|РАЗМЕР_СТРАНИЦЫ|__ИНДЕКС__|__БЕЗ_ИНДЕКСА__|ЧИСЛО_В_ГРУППЕ|__СТАТИЧЕСКОЕ_ПЕРЕНАПРАВЛЕНИЕ__|УРОВЕНЬ_ЗАЩИТЫ|форматдаты|ПУТЬ|ВИКИ|ЗАПРОС|страницы|подкатегории|файлы)

I'm not sure if the pattern is correct. Perhaps something needs to be fixed.
Besides, an exception for "#ПЕРЕНАПРАВЛЕННЯ" should be added, otherwise we'll get thousands of false positives. And "справа" should be searched for with a pipe character (|) before or after it, since Ukrainian has this word too, but it often has a different meaning and is often used in Wikipedia texts.
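The two exceptions described above could look roughly like this in Python. It is only a sketch under assumptions: it covers just the redirect keywords and "справа" from the list, it uses Python's `re` (lookaheads may not be available in the search engine's regex dialect), and the `find_russian_magic_words` helper is a hypothetical name.

```python
import re
from typing import List

# Only two items from the list are covered here, as an illustration.
MAGIC_WORDS = re.compile(
    r"#перенапр(?!авлення)\w*"   # Russian redirect keywords, but not #ПЕРЕНАПРАВЛЕННЯ
    r"|\|\s*справа\b"            # "справа" right after a pipe
    r"|\bсправа\s*\|",           # "справа" right before a pipe
    re.IGNORECASE,
)


def find_russian_magic_words(wikitext: str) -> List[str]:
    """Return the matched fragments, in order of appearance."""
    return [m.group(0) for m in MAGIC_WORDS.finditer(wikitext)]


if __name__ == "__main__":
    samples = [
        "#ПЕРЕНАПРАВЛЕННЯ [[Київ]]",             # Ukrainian keyword, ignored
        "#ПЕРЕНАПРАВЛЕНИЕ [[Київ]]",             # Russian keyword, reported
        "[[Файл:A.jpg|справа|200px]]",           # image parameter, reported
        "Це зовсім інша справа, ніж здається.",  # ordinary Ukrainian word, ignored
    ]
    for text in samples:
        print(text, find_russian_magic_words(text))
```

Requiring a pipe next to "справа" keeps ordinary prose out of the results, at the cost of missing any usage that is not written as a pipe-separated parameter.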

Considering Deskana's comment, I should also mention that you don't have to do this. If you're busy or just not interested in doing this, it's OK and I think I'll be able to find someone else to run the search.

Change 324332 merged by jenkins-bot:
Rank aliases in search in order they appear in the messages file.

https://gerrit.wikimedia.org/r/324332

Since the patch is merged, I'm resolving this. If/when we're ready to remove the Russian aliases completely, we may want to open a new task. I think this patch makes sense in any case, but please comment if you object.

Deskana renamed this task from Don't show special page aliases in Russian in the search suggestions on ukwiki to Reorder aliases in the search suggestions on ukwiki so that Ukrainian aliases always come first. Dec 1 2016, 6:55 PM

Thanks Stas! I retitled the task to document what actually happened for posterity.