We are interested in a ranking of language pairs as a function of the number of editors who know both languages proficiently, per Extension:Babel.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | leila | T171224 [Objective 9.1.1] Article expansion recommendations |
| Resolved | | leila | T183039 Gather labels as ground truth for translation and synonym section classifiers |
| Resolved | • | bmansurov | T184212 Gather labels as ground truth for section translation |
| Resolved | • | bmansurov | T185160 Gather basic statistics on languages spoken by editors |
Event Timeline
Public datasets contain the babel table. Here is a Quarry query that shows the languages of uzwiki users. Here's the query itself, in case the link above stops working:

```sql
USE uzwiki_p;

SELECT user.user_name,
       babel.babel_lang AS lang,
       babel.babel_level AS level
FROM babel
LEFT JOIN user ON user.user_id = babel.babel_user
ORDER BY user.user_name ASC;
```
Excerpt from the output:
| user_name | lang | level |
|---|---|---|
| Araz Yaquboglu | az | N |
| Araz Yaquboglu | tr | 2 |
| Araz Yaquboglu | ru | 1 |
| Araz Yaquboglu | en | 1 |
| Basiyra | uz | N |
| Basiyra | ru | 2 |
And here is a list of languages for uzwiki users who speak more than one language, formatted as JSON. Below is the query:

```sql
SELECT user.user_name,
       CONCAT('[',
              GROUP_CONCAT('{', '"lang": "', babel.babel_lang, '",',
                           '"level": "', babel.babel_level, '"', '}'),
              ']') AS babel
FROM babel
LEFT JOIN user ON user.user_id = babel.babel_user
GROUP BY babel.babel_user
HAVING COUNT(babel.babel_user) > 1
ORDER BY user.user_name ASC;
```
Excerpt from the output:
| user_name | babel |
|---|---|
| Araz Yaquboglu | [{"lang": "az","level": "N"}, {"lang": "en","level": "1"}, {"lang": "ru","level": "1"}, {"lang": "tr","level": "2"}] |
| Basiyra | [{"lang": "en","level": "2"}, {"lang": "ky","level": "2"}, {"lang": "ru","level": "2"}, {"lang": "uz","level": "N"}] |
| Dostojewskij | [{"lang": "de","level": "N"}, {"lang": "en","level": "1"}, {"lang": "es","level": "2"}, {"lang": "ru","level": "N"}] |
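The babel column produced by the query above is a JSON array, so it can be parsed directly. A minimal sketch (the row string below is taken from the sample output; the "advanced" level set anticipates the discussion further down):

```python
import json

# One value of the "babel" column from the query above.
row = '[{"lang": "az","level": "N"}, {"lang": "en","level": "1"}]'

entries = json.loads(row)
# Keep only languages known at an advanced level (3, 4, 5, or native).
advanced = [e["lang"] for e in entries if e["level"] in ("3", "4", "5", "N")]
```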
@leila, do you want me to generate the list of users and their languages for the languages we identified? What fields do you want included?
this is great @bmansurov !
it would be great to have the overlap between all pairs of languages, something like
en uz X
en fr Y
fr uz Z
etc.
X is the number of people who speak 'en' and 'uz', Y is the number of people who speak 'en' and 'fr', etc.
@bmansurov I think X, Y, and Z in Diego's note are the number of people who have specified those two as languages they are proficient in. I think we should be a bit more specific about how you count those numbers though, as there are different levels of proficiency. So let's start with:
lan_1, lan_2, n
where n is the number of users who have specified their proficiency in lan_1 and lan_2 as 3, 4, 5, N, or NULL (it seems xx-N is the same as xx, if I read the template documentation correctly) for both languages (basically, we want to see how many people speak both languages at an advanced or higher level). Looking at all language pairs may be too much, so you may want to restrict the search to a smaller set. Limiting ourselves to the set of languages we discussed on Tuesday may be too restrictive, though. Let me know if you want us to be more specific about the language set. :)
Thanks, both, for the clarifications. Here's a script that does what we're looking for:
```python
from collections import defaultdict
from itertools import combinations
import csv

ACCEPTABLE_LEVELS = set(('3', '4', '5', 'N', None))

people = {}  # {'name': {'lang': 'level'}}
with open('uzwiki-babel.tsv', 'r') as infile:
    reader = csv.reader(infile, delimiter='\t')
    next(reader)  # skip the header row
    for user, lang, level in reader:
        if user in people:
            people[user][lang] = level
        else:
            people[user] = {lang: level}

language_pairs = defaultdict(lambda: 0)  # {'en-uz': 5}
for person, langs in people.items():
    fluent_in = [k for k, v in langs.items() if v in ACCEPTABLE_LEVELS]
    for pair in combinations(fluent_in, 2):
        language_pairs[frozenset(pair)] += 1

# sort by count, descending
sorted_language_pairs = sorted(
    [(sorted(list(k)), v) for k, v in language_pairs.items()],
    key=lambda x: x[1],
    reverse=True
)

# save to tsv as "en-uz	5"
with open('uzwiki-babel-out.tsv', 'w') as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    writer.writerow(['language pair', '# of users'])
    for pair, count in sorted_language_pairs:
        writer.writerow(("-".join(pair), count))
```
The script takes the output of the first query above and outputs the following data:
| language pair | # of users |
|---|---|
| ru-uz | 6 |
| en-uz | 5 |
| en-ru | 4 |
| ru-tr | 3 |
| tr-uz | 3 |
| en-fr | 3 |
| de-en | 2 |
| en-tr | 2 |
| ky-uz | 2 |
| ky-ru | 2 |
| de-ru | 1 |
| en-sv | 1 |
| de-sv | 1 |
| it-pt | 1 |
| de-pt | 1 |
| en-pt | 1 |
| es-pt | 1 |
| fr-pt | 1 |
| de-it | 1 |
| en-it | 1 |
| es-it | 1 |
| fr-it | 1 |
| de-es | 1 |
| de-fr | 1 |
| en-es | 1 |
| es-fr | 1 |
| en-kaa | 1 |
| en-kk | 1 |
| en-tt | 1 |
| en-uk | 1 |
| en-ky | 1 |
| kaa-uz | 1 |
| kaa-kk | 1 |
| kaa-tt | 1 |
| kaa-uk | 1 |
| kaa-ky | 1 |
| kaa-ru | 1 |
| kk-uz | 1 |
| tt-uz | 1 |
| uk-uz | 1 |
| kk-tt | 1 |
| kk-uk | 1 |
| kk-ky | 1 |
| kk-ru | 1 |
| tt-uk | 1 |
| ky-tt | 1 |
| ru-tt | 1 |
| ky-uk | 1 |
| ru-uk | 1 |
| ky-tr | 1 |
| tg-uz | 1 |
| ru-tg | 1 |
The data is taken from uzwiki only. I wonder if it makes more sense to consider only those language pairs in which one of the languages is Uzbek. If we wanted en-ru translators, we'd look for them in enwiki or ruwiki instead.
@bmansurov Diego and I discussed a bit more what you showed above, and did a few tests on our end.
I don't have a good sense of how long what I'm describing below will take, but let me start from the ideal case. Ideally, we check all projects (Wikipedia languages and other projects), find all of their users who have used the Babel template on their user pages, and get information about the languages they know and their proficiency in them. So we will have something like the following as output:
(project, username, language, proficiency)
We use project for sanity checking later if needed. We can use username to combine redundant rows. For example, I may have declared the same information about my language proficiency on multiple wikis. I also suspect we may end up with (username, language) pairs that have different levels of proficiency. We can take the max proficiency if that happens and drop the other rows. If we collect all this data in one place, we can compute co-occurrences of people across languages and count the number of users shared by each language pair. We can compute this for all of the almost 45K pairs, or if that's too much, keep only the top n (n=30?) Wikipedia languages.
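The counting step described above can be sketched as follows. The rows here are hypothetical, and treating levels 3, 4, 5, and N as "advanced" follows the earlier discussion:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (project, username, language, proficiency) rows.
rows = [
    ("enwiki", "Alice", "en", "N"),
    ("uzwiki", "Alice", "en", "3"),  # redundant row from another project
    ("uzwiki", "Alice", "uz", "4"),
    ("ruwiki", "Bob", "ru", "N"),
    ("ruwiki", "Bob", "uz", "3"),
]

ADVANCED = {"3", "4", "5", "N"}

# Collapse to one set of advanced languages per user; the project column
# is kept only for sanity checking and is ignored here.
langs_by_user = defaultdict(set)
for project, user, lang, level in rows:
    if level in ADVANCED:
        langs_by_user[user].add(lang)

# Count the users shared by each unordered language pair.
pair_counts = defaultdict(int)
for langs in langs_by_user.values():
    for pair in combinations(sorted(langs), 2):
        pair_counts[pair] += 1
```

Merging duplicate rows per user before pairing ensures each person is counted once per language pair, regardless of how many wikis they declared their Babel info on.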
I'm fine doing it for all language pairs or just the top 30 — your call. Either way, the effort on my end will be almost the same. We can also keep all proficiencies for analysis purposes only. I'll get back to you soon with data/code.
Here are the results for T185160#3913387.
All babel data
Sample
| project | username | language | proficiency |
|---|---|---|---|
| aawiki | Waihorace | en | 3 |
| aawiki | Waihorace | yue | N |
| aawiki | Waihorace | zh-hant | N |
| abwiki | Koavf | ab | 0 |
| abwiki | Njardarlogar | ab | 1 |
| abwiki | Kedth2108863 | ab | 1 |
| abwiki | Poti Berik | ab | 2 |
| abwiki | Bukamember1012 | ady | N |
| abwiki | Bukamember1012 | ae | 0 |
| abwiki | Velocitas | af | 3 |
Language pairs for users with proficiency 3 and up
Sample
| language pair | # of users |
|---|---|
| en-fr | 211 |
| en-it | 200 |
| de-en | 197 |
| en-es | 186 |
| de-ru | 184 |
| en-pl | 182 |
| en-ru | 175 |
| es-it | 164 |
| es-fr | 159 |
Lists of Wikis used in analysis
(taken from mediawiki-config)
SQL generator from dblist above
SQL used in generating all Babel data
(this is the output of sql-generator.py)
Script to generate the language pairs
Notes
- To generate the language pairs, I didn't enumerate all 47,278 possible language pairs (given 308 wikis). Instead, I considered the language pairs for each user and summed the counts afterwards.
- Some users may have the same Babel info in multiple wikis. This may have resulted in duplicate counts in the final result. Let me know how we should deal with identical usernames in multiple databases: should we consider them different users or the same?
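The 47,278 figure above is simply the number of unordered pairs of 308 languages, which can be checked directly. Enumerating pairs per user is much cheaper, since editors rarely list more than a handful of languages:

```python
import math

# Unordered language pairs across 308 wikis: C(308, 2).
n_pairs = math.comb(308, 2)  # 308 * 307 / 2
```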
They should be considered the same user. Since April 2015 an account covers the username on all Wikimedia projects.
An interesting dataset would be users with different proficiency levels between wikis (probably the result of outdated Babel information on some of them), although I'm not sure what to do with those cases.
Please do check the other case Platonides is referring to, where a user may have multiple levels of proficiency in a language. In that case, take the max (in a logical sense) over the set of proficiencies available for a (username, language) pair. This approximation is okay for our use case. The idea is that it's harder to forget a language than to learn it in the first place, and for our purposes we can tolerate the errors such an assumption may cause. If people no longer know a language that well, they will tell us.
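The "logical max" can be implemented with an explicit ordering of Babel levels. Treating N (native) as the highest value is an assumption consistent with the discussion above:

```python
# Assumed ordering of Babel proficiency levels, lowest to highest;
# "N" (native) is treated as the maximum, unknown values as the minimum.
ORDER = {"0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5, "N": 6}

def max_proficiency(levels):
    """Pick the highest proficiency among duplicate (username, language) rows."""
    return max(levels, key=lambda lvl: ORDER.get(lvl, -1))
```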
I've found a bug in the previous implementation and fixed it. I've also taken into account the suggestions from T185160#3913802 and T185160#3913885 and generated new data.
Sample:
| language pair | # of users |
|---|---|
| en-es | 5991 |
| en-ru | 5324 |
| en-it | 4261 |
| en-fr | 2796 |
| de-en | 2060 |
| ca-es | 1820 |
| ru-uk | 1803 |
| en-pt | 1644 |
| es-fr | 1181 |
To summarize the changes:
- Identical usernames in multiple wikis are considered the same;
- Identical languages of a user in multiple wikis are considered the same and the highest proficiency indicated is considered as that language's proficiency.
The code has also been moved to Github.
@diego Here's the list of languages ranked by the number of users whose proficiency is 3 and up.
Sample
| language | # of users |
|---|---|
| en | 21783 |
| ru | 13363 |
| es | 11007 |
| it | 9002 |
| fr | 4122 |
| de | 2959 |
| pt | 2648 |
| cs | 2073 |
| ca | 1985 |
@bmansurov super useful, and we already discussed this. From my POV, this task is done. If I'm wrong, feel free to move it back to in-progress. Thanks.