Identify most useful languages for future OpusMT models
Open, Medium, Public

Description

The OpusMT project supports many languages. However, some languages relevant to Wikimedia may lack pretrained models, or the existing models may be outdated. For example, several outdated models could not be integrated into MinT (T333969).

This ticket compiles a list of relevant languages for which translation models based on the Opus project would be most useful. The list is based on multiple criteria.

Languages supported by only one translation service

Supported only by Google Translate:

  • Corsican (co)
  • Divehi (dv)
  • Western Frisian (fy)
  • Hawaiian (haw)
  • Gan Chinese (gan): currently supported by using Traditional Chinese with Google Translate (T258919)

Supported only by Yandex:

  • Chuvash (cv)
  • Eastern Mari (mhr)
  • Western Mari (mrj)
  • Yakut (sah)
  • Udmurt (udm)

Languages where MT has signs of low quality

MT available but not used often

Based on the MT usage report, these languages have a significant percentage of translations created without using machine translation, which may be a sign of low quality:

  • Indonesian (id)
  • Māori (mi)

Community-reported low quality

Languages supported only by OpusMT that lack updated pretrained models

More details in T333969: Enable Opus models for languages lacking other Machine Translation options

  • Cantonese (yue): other models support only the wrong variant, as per T304865#8684437
  • Moroccan Arabic (ary): other models support only the wrong variant, as per T339926#9071464
  • Gun (guw)
  • Sranan Tongo (srn)
  • Venda (ve)
  • Tahitian (ty)
  • Bislama (bi)
  • Manx (gv)
  • Walloon (wa)
  • Breton (br)
  • Finnish (fi)

Languages with translation activity not supported by any MT

There are Wikipedias in 152 languages without machine translation, and some of these communities have been using Content Translation despite the lack of MT. The list below shows the languages where Content Translation has been used the most despite the lack of MT. This can help prioritize the languages for which enabling MT would have the most impact.

  1. Cornish (kw)
  2. Inuktitut (iu)
  3. Saraiki (skr)
  4. Kara-Kalpak (kaa)
  5. Cantonese (yue) Uses zh-yue code on Wikipedia
  6. N’Ko (nqo)
  7. Mazanderani (mzn)
  8. Breton (br)
  9. Tuvinian (tyv)
  10. Cree (cr)
  11. Aromanian (rup) Uses roa-rup code on Wikipedia
  12. Ossetic (os)
  13. Min Nan Chinese (nan) Uses zh-min-nan code on Wikipedia
  14. Scots (sco)
  15. Võro (vro) Uses fiu-vro code on Wikipedia
  16. Serbo-Croatian (sh)
  17. Ladin (lld)
  18. Nāhuatl (nah)
  19. Aragonese (an)
  20. Inari Sami (smn)
  21. Tachelhit (shi)
  22. Western Armenian (hyw)

This list has been compiled based on a query covering the last 2 years of activity. A screenshot with the number of translations published is shown below:

turnilo.wikimedia.org_ (3).png (3×2 px, 402 KB)

ab
ady
ak
als
alt
ami
an
ang
arc
atj
av
avk
bar
bat-smg
be-x-old
bi
bpy
br
bxr
cbk-zam
cdo
ce
ch
chy
cr
csb
cu
dag
diq
dsb
dty
eml
ext
fiu-vro
frp
frr
gag
gan
gcr
glk
gor
got
guw
gv
hak
hif
hsb
hyw
ia
ie
ik
inh
io
iu
jam
jbo
kaa
kbd
kcg
kl
koi
krc
ksh
kv
kw
lad
lbe
lez
lfn
lld
mad
map-bms
mdf
mnw
mwl
myv
mzn
na
nah
nap
nds
nds-nl
new
nia
nov
nqo
nrm
nv
olo
os
pam
pcd
pdc
pfl
pi
pih
pms
pnb
pnt
pwn
rm
rmy
roa-rup
roa-tara
rue
sco
se
sh
shi
skr
smn
srn
stq
szy
tay
tcy
tet
trv
ty
tyv
ve
vep
vls
vo
wa
wuu
xal
xmf
za
zea
zh-classical
zh-min-nan
zh-yue
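
Several entries in the ranked list use a Wikipedia code that differs from the language code (for example, zh-yue for yue), so comparing the two lists requires a mapping between them. The following is a minimal Python sketch of that lookup, not an existing tool: the names WIKI_TO_LANGUAGE_CODE and to_language_code are hypothetical, and the mapping covers only the four cases noted in this task.

```python
# Hypothetical helper; covers only the four code differences noted in this task.
WIKI_TO_LANGUAGE_CODE = {
    "zh-yue": "yue",      # Cantonese
    "roa-rup": "rup",     # Aromanian
    "zh-min-nan": "nan",  # Min Nan Chinese
    "fiu-vro": "vro",     # Võro
}


def to_language_code(wiki_code: str) -> str:
    """Return the language code for a Wikipedia code.

    Codes without a known difference are returned unchanged.
    """
    return WIKI_TO_LANGUAGE_CODE.get(wiki_code, wiki_code)


if __name__ == "__main__":
    for wiki_code in ["zh-yue", "roa-rup", "br", "kw"]:
        print(wiki_code, "->", to_language_code(wiki_code))
```

Any other Wikipedia code that differs from its language code would need to be added to the mapping before it is used for a full comparison.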

Details

Other Assignee
KCVelaga_WMF

Event Timeline

Pginer-WMF renamed this task from "Identify OpusMT languages to re-train" to "Identify most useful languages for future OpusMT models". Aug 2 2023, 3:03 PM
Pginer-WMF triaged this task as Medium priority.

@KCVelaga I compiled a list of languages that would benefit the most if new OpusMT models were released. Feel free to share any thoughts about other criteria that could help identify languages where new or better models would have the biggest impact.

@Pginer-WMF the list and the criteria used look good to me. If the languages have high usage of cx despite having no MT support, enabling MT will benefit the editors.