
Create a list of extra common languages for ULS
Closed, Resolved · Public

Description

We are getting repeated complaints about T123171 ("First languages in the Latin alphabet like ace and af are stuck in sidebar"): languages that appear seemingly at random. The real reason is usually that there are not enough languages to suggest, so the software picks the first few languages alphabetically.

This is usually not useful to most readers, because the languages indeed appear to be random and are unlikely to be clicked. While there are several ways to suggest languages more cleverly, such as T70071 and T70077, they may take time to implement, and they are not immediately urgent.

However, there is one simple thing that can be done to improve this: Create a list of 20 or so "universally common languages", which will be shown in the initial compact links list if the usual algorithm runs out of suggestions (for example, prioritizing other Scandinavian languages is useful in the Norwegian Wikipedia even when not browsing from Norway).

So, for example, if French doesn't appear in the CLDR suggestions for the reader's country, Aramaic may be shown there, but the chance that French is more useful to the reader than Aramaic is much higher.

The easiest thing to do is to combine these two lists:

... and this will give us:
[ 'zh', 'en', 'hi', 'ur', 'es', 'ar', 'ru', 'id', 'ms', 'pt', 'fr', 'de', 'bn', 'ja', 'pnb', 'pa', 'jv', 'te', 'ta', 'ko', 'mr', 'tr', 'vi', 'it', 'fa', 'sv', 'nl', 'pl' ]

This should be a variable, named something like $wgUniversalLanguageSelectorFallbackCommonLanguages, because it might be useful to customize it for some sites.

To clarify, these languages must only be added to the list as the last fallback. If Afrikaans is actually one of the suggested languages, e.g. from CLDR, and French is not, then Afrikaans must take precedence over French.
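The precedence rule above can be sketched roughly as follows. This is only an illustration in Python, not the actual ULS code: the function name, its parameters, and the final alphabetical fill step are assumptions made for the example; only the fallback list itself is taken from this task.

```python
from typing import Iterable, List

# The list proposed in this task description.
FALLBACK_COMMON_LANGUAGES = [
    'zh', 'en', 'hi', 'ur', 'es', 'ar', 'ru', 'id', 'ms', 'pt', 'fr', 'de',
    'bn', 'ja', 'pnb', 'pa', 'jv', 'te', 'ta', 'ko', 'mr', 'tr', 'vi', 'it',
    'fa', 'sv', 'nl', 'pl',
]


def compact_list(available: Iterable[str], suggested: Iterable[str],
                 limit: int) -> List[str]:
    """Pick up to `limit` languages for the compact list (hypothetical helper).

    Sources are consulted in strict order, so a real suggestion always
    wins over a fallback language:
      1. suggestions from the usual algorithm (e.g. CLDR),
      2. the common-languages fallback, in its own order,
      3. whatever is left, alphabetically (assumed to mirror the current behaviour).
    """
    available_set = set(available)
    picked: List[str] = []
    for source in (suggested, FALLBACK_COMMON_LANGUAGES, sorted(available_set)):
        for code in source:
            if len(picked) >= limit:
                return picked
            # Only languages the page actually has, and no duplicates.
            if code in available_set and code not in picked:
                picked.append(code)
    return picked
```

For a page available in ace, af, arc, de, fr and ja, where only af is suggested, `compact_list([...], ['af'], 3)` keeps af first and then fills from the fallback list (fr, de) instead of falling back to ace and arc.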

Criteria for inclusion
Languages with:

  • Over 50 mln speakers
    • With "Lahnda" converted to two varieties of Punjabi, which are its written versions.
  • Has a Wikipedia with
    • 20,000 articles
    • depth of at least 5

Event Timeline

Amire80 triaged this task as High priority. May 16 2016, 9:54 AM
Amire80 moved this task from Backlog to Prioritised languages on the ULS-CompactLinks board.

Is there a way we could support this with the current info of langdb?

We can consider the languages that belong to Worldwide and at least one more region. The last requirement is to avoid languages that were classified as Worldwide just because they don't belong to a particular region (such as constructed languages).

Currently, that list would be [fr,es,pt,en] but it can be extended by adding the "Worldwide" region to some of the languages mentioned in the description.

This would make the Common languages inside the selector consistent with the short list outside of it.
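As a rough sketch of the "Worldwide plus at least one more region" rule proposed above, in Python: the sample mapping below is illustrative only and is not the real langdb content; the region codes and the helper name are assumptions for the example.

```python
from typing import Dict, List

# Illustrative sample, NOT the actual langdb data.
SAMPLE_REGIONS: Dict[str, List[str]] = {
    'en': ['WW', 'EU', 'AM'],
    'es': ['WW', 'EU', 'AM'],
    'fr': ['WW', 'EU', 'AF'],
    'pt': ['WW', 'EU', 'AM'],
    'eo': ['WW'],   # constructed language: Worldwide only, so excluded
    'sv': ['EU'],   # regional only, so excluded
}


def worldwide_plus_regional(regions_by_language: Dict[str, List[str]],
                            worldwide: str = 'WW') -> List[str]:
    """Return codes listed under Worldwide AND at least one other region."""
    return sorted(
        code
        for code, regions in regions_by_language.items()
        if worldwide in regions and any(r != worldwide for r in regions)
    )
```

With this sample, the helper returns en, es, fr and pt in alphabetical order and skips eo and sv. A list inferred this way carries no priority order of its own.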

Yes, possibly; this would be a nice reuse of current infrastructure. I do expect more requests for project-wide customization, for which we'll need a variable, but reusing WW can be a sensible beginning. Other thoughts?

Amire80 updated the task description.

This would make the Common languages inside the selector consistent with the short list outside of it.

Not fully: T135487#2304882.

So what is the verdict? As it is, this task is not ready for development.

If we go with the languages classified in the worldwide section, they will come in alphabetical order by default. Hence I am supportive of adding a separate ordered list. If we add this to the frequent/common languages method of ULS, they will be used as a last resort for both the sidebar list and the common languages section in the selector (making it much longer than it usually is now). Or we can just have it for the sidebar only.

Hence I am supportive of adding a separate ordered list.

If having a specific list seems useful, I'm ok with any internal organisation.

Or we can just have it for the sidebar only.

My proposal was to consider them only as fallback, not adding them as Common languages section.

The final goal is to use, as the last fallback, a given list of languages that can be literal or inferred from an existing one (e.g., the "worldwide" section, hopefully without extending it too much).

... and this will give us:
[ 'zh', 'en', 'hi', 'ur', 'es', 'ar', 'ru', 'id', 'ms', 'pt', 'fr', 'de', 'bn', 'ja', 'pnb', 'pa', 'jv', 'te', 'ta', 'ko', 'mr', 'tr', 'vi', 'it', 'fa', 'sv', 'nl', 'pl' ]

For the list, it would be good to have a clearer definition of the criteria, so that it is not considered arbitrary. For example, Cebuano is number 3 in the list of top Wikipedias but does not appear in the proposed list. Would it make sense to base it only on the Ethnologue lists?

I am adding the proposed list now. It is easy to change if the criteria change. Also, I am not making it configurable for now, until there is a clear need for that.

Change 289855 had a related patch set uploaded (by Nikerabbit):
Add some global fallbacks to compact language links

https://gerrit.wikimedia.org/r/289855

Change 289855 merged by jenkins-bot:
Add some global fallbacks to compact language links

https://gerrit.wikimedia.org/r/289855

From what I was told, many articles on the Cebuano Wikipedia, as well as on some other Wikipedias with a very high article-per-speaker ratio, were created by bots from a database. For instance, such a bot could automatically create a million articles, one for each of the first million asteroids, just by copying from a database according to a user-defined format. See https://ceb.wikipedia.org/w/index.php?limit=50&title=Espesyal%3AMga+Tampo&contribs=user&target=Lsjbot&namespace=&tagfilter=&newOnly=1&year=2016&month=-1 for example. I don't think this should be taken into consideration when deciding which languages would be useful to visitors.

From what I was told, many articles on the Cebuano Wikipedia, as well as on some other Wikipedias with a very high article-per-speaker ratio, were created by bots from a database

I think the key concept is the number of speakers. That is why I was proposing to use just the Ethnologue list, which captures that. Adding the list of Wikipedias, based on article counts, seems to just introduce potential noise and the need for an evaluation process that may be subjective.

That's precisely why I didn't add Cebuano to the list.

All I'm asking is for a clear definition of the criteria of inclusion.
If it is "top X languages by speakers + top Y Wikipedias by article count, given a depth value > Z", I'd be happy. If it is based on "we just remove some languages we have heard are populated by bots", I think that is a weaker criterion.

Public service reminder: article count has not been a sorting criterion since 2008. https://meta.wikimedia.org/wiki/Top_Ten_Wikipedias

OK, I did go for some intuition, but I can do something retroactive:

  • Over 50 mln speakers
    • With "Lahnda" converted to two varieties of Punjabi, which are its written versions.
  • Has a Wikipedia with
    • 20,000 articles
    • depth of at least 5

By these criteria we'll have to add 'yue'. I have nothing against it; I just thought that it had far fewer articles for some reason and didn't bother to check. My bad.

Raising the depth to 10 will remove 'sv', and raising it to 20 will remove 'nl'. I'd be OK with it, too.

If anybody wants to make these changes, just go for it.

Arrbee moved this task from QA to Done on the Language-Q4-2016-Sprint 3 board.

Thanks @Amire80, I added the criteria to the ticket description in case we need to check new requests in the future.