Verify that only valid languages are accepted
Closed, Resolved · Public

Description

In the API modules for Wikibase there is an entry for valid languages, and it is defined like this:

languages - By default the internationalized values are returned in all available languages. This parameter allows filtering these down to one or more languages by providing one or more language codes.

Values (separate with '|'): aa, ab, ace, af, ak, aln, als, am, an, ang, anp, ar, arc, arn, ary, arz, as, ast, av, avk, ay, az,
azb, ba, bar, bat-smg, bcc, bcl, be, be-tarask, be-x-old, bg, bh, bho, bi, bjn, bm, bn, bo, bpy,
bqi, br, brh, bs, bug, bxr, ca, cbk-zam, cdo, ce, ceb, ch, cho, chr, chy, ckb, co, cps, cr, crh,
crh-latn, crh-cyrl, cs, csb, cu, cv, cy, da, de, de-at, de-ch, de-formal, diq, dsb, dtp, dv, dz, ee,
egl, el, eml, en, en-ca, en-gb, eo, es, et, eu, ext, fa, ff, fi, fit, fiu-vro, fj, fo, fr, frc, frp,
frr, fur, fy, ga, gag, gan, gan-hans, gan-hant, gd, gl, glk, gn, got, grc, gsw, gu, gv, ha, hak,
haw, he, hi, hif, hif-latn, hil, ho, hr, hsb, ht, hu, hy, hz, ia, id, ie, ig, ii, ik, ike-cans,
ike-latn, ilo, inh, io, is, it, iu, ja, jam, jbo, jut, jv, ka, kaa, kab, kbd, kbd-cyrl, kg, khw, ki,
kiu, kj, kk, kk-arab, kk-cyrl, kk-latn, kk-cn, kk-kz, kk-tr, kl, km, kn, ko, ko-kp, koi, kr, krc,
kri, krj, ks, ks-arab, ks-deva, ksh, ku, ku-latn, ku-arab, kv, kw, ky, la, lad, lb, lbe, lez, lfn,
lg, li, lij, liv, lmo, ln, lo, loz, lt, ltg, lus, lv, lzh, lzz, mai, map-bms, mdf, mg, mh, mhr, mi,
min, mk, ml, mn, mo, mr, mrj, ms, mt, mus, mwl, my, myv, mzn, na, nah, nan, nap, nb, nds, nds-nl,
ne, new, ng, niu, nl, nl-informal, nn, no, nov, nrm, nso, nv, ny, oc, om, or, os, pa, pag, pam, pap,
pcd, pdc, pdt, pfl, pi, pih, pl, pms, pnb, pnt, prg, ps, pt, pt-br, qu, qug, rgn, rif, rm, rmy, rn,
ro, roa-rup, roa-tara, ru, rue, rup, ruq, ruq-cyrl, ruq-latn, rw, sa, sah, sat, sc, scn, sco, sd,
sdc, se, sei, sg, sgs, sh, shi, shi-tfng, shi-latn, si, simple, sk, sl, sli, sm, sma, sn, so, sq,
sr, sr-ec, sr-el, srn, ss, st, stq, su, sv, sw, szl, ta, tcy, te, tet, tg, tg-cyrl, tg-latn, th, ti,
tk, tl, tly, tn, to, tokipona, tpi, tr, tru, ts, tt, tt-cyrl, tt-latn, tum, tw, ty, tyv, udm, ug,
ug-arab, ug-latn, uk, ur, uz, ve, vec, vep, vi, vls, vmf, vo, vot, vro, wa, war, wo, wuu, xal, xh,
xmf, yi, yo, yue, za, zea, zh, zh-classical, zh-cn, zh-hans, zh-hant, zh-hk, zh-min-nan, zh-mo,
zh-my, zh-sg, zh-tw, zh-yue, zu
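
To make the parameter format concrete, here is a minimal sketch of splitting a '|'-separated `languages` value and checking each code against an allowed set. The function name `parse_languages` and the small `ALLOWED_CODES` excerpt are assumptions for illustration only; this is not the Wikibase implementation.

```python
# Hypothetical excerpt of the list above; the real list is much longer.
ALLOWED_CODES = frozenset({"de", "en", "nb", "nn", "no"})


def parse_languages(param, allowed=ALLOWED_CODES):
    """Split a '|'-separated value into (accepted, rejected) code lists."""
    # Drop empty segments such as those produced by a trailing '|'.
    codes = [c.strip() for c in param.split("|") if c.strip()]
    accepted = [c for c in codes if c in allowed]
    rejected = [c for c in codes if c not in allowed]
    return accepted, rejected


accepted, rejected = parse_languages("en|nb|xx")
print(accepted, rejected)
```

Returning the rejected codes separately lets the caller decide whether to fail the request or just warn.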

In this list there are entries that should not be used for language-specific entries, like the Norwegian (no) entry, which is a metalanguage for Bokmål (nb) and Nynorsk (nn). I guess there are several others that are also wrong. Some of them could be redirected, but some should not be used at all. If the entry is used as a site prefix, or in the site id, we should probably set up a redirect even if it is not strictly correct.


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=42153

Details

Reference
bz46455

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 1:40 AM
bzimport set Reference to bz46455.
bzimport added a subscriber: Unknown Object (MLST).

Note that this is mainly about removing invalid entries from our own use of the languages, not about a more general solution. It's kind of a stopgap solution to avoid data being uploaded for non-existent languages.

Note that this comes from Utils::getLanguageCodes, which uses \Language::fetchLanguageNames(). That only creates a list of valid names and says nothing about how each code functions.

Off the top of my head, these should not be used at all:

  • simple (labels and descriptions on Wikidata should all be pretty simple by default; this is more of a community/policy issue, but technically I don't see the need for separate simple and en descriptions. It should be discussed by the community on Wikidata)
  • tokipona (why is this still available at all?)
  • ug (Wikipedia uses both Arabic and Latin scripts; labels/descriptions should specify which one is used)

And these should be redirects:

  • als -> gsw (correct language code)
  • bat-smg -> sgs (correct language code)
  • be-x-old -> be-tarask
  • fiu-vro -> vro (correct language code)
  • no -> nb (for legacy Wikipedia reasons)
  • roa-rup -> rup (correct language code)
  • zh-classical -> lzh (correct language code)
  • zh-min-nan -> nan (correct language code)
  • zh-yue -> yue (correct language code)
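
The redirects listed above could be expressed as a plain lookup table. This is only a sketch: the pairs are copied from the list, while the names `REDIRECTS` and `canonical_code` are made up for illustration and do not exist in Wikibase.

```python
# Legacy code -> preferred code, taken from the list above.
REDIRECTS = {
    "als": "gsw",
    "bat-smg": "sgs",
    "be-x-old": "be-tarask",
    "fiu-vro": "vro",
    "no": "nb",
    "roa-rup": "rup",
    "zh-classical": "lzh",
    "zh-min-nan": "nan",
    "zh-yue": "yue",
}


def canonical_code(code):
    """Return the redirect target for a legacy code, or the code unchanged."""
    return REDIRECTS.get(code, code)


print(canonical_code("no"))  # redirected
print(canonical_code("en"))  # passed through unchanged
```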

Borderline cases, not sure which way these should redirect:

  • hif-latn <-> hif (Wikipedia seems to only use Latin script)
  • kbd-cyrl <-> kbd (Wikipedia seems to only use Cyrillic script)
  • ku-latn <-> ku (Wikipedia seems to only use Latin script)
  • tt-cyrl <-> tt (Wikipedia seems to only use Cyrillic script)

Not sure:

  • crh / crh-latn / crh-cyrl: probably crh should redirect to crh-latn, or the other way around. crhwiki seems to use Latin only.
  • gan / gan-hans / gan-hant: some sort of automatic conversion should be made available on Wikidata, as with other languages written in Han script(s)
  • kk / kk-arab / kk-cyrl / kk-latn / kk-cn / kk-kz / kk-tr: the Wikipedia has automatic conversion (from Latin?), should be made available on Wikidata.
  • ks / ks-arab / ks-deva: ks should probably redirect to ks-arab, or the other way around. kswiki seems to use Arabic only. Probably no automatic conversion available.
  • ruq / ruq-cyrl / ruq-latn: maybe ruq should be disabled and only ruq-cyrl and ruq-latn accepted as inputs?
  • shi / shi-tfng / shi-latn: don't know which one is more common
  • sr / sr-ec / sr-el: same as above
  • tg / tg-cyrl / tg-latn: automatic conversion?
  • zh and variants (except those mentioned earlier): automatic conversion exists on Wikipedia, should be reusable on Wikidata

In most of these "Not sure" cases, the main language code should probably be disabled, and input should be specifically in either of the two/more variants, though the presence of automatic conversion for some may make things a bit more complicated.
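
One way to picture "disable the main code, require a variant" is a check that rejects the bare code and names the variants to use instead. The sketch below is hypothetical: `SCRIPT_VARIANTS`, `AmbiguousLanguageError` and `require_variant` are invented names, and the table covers only a few of the cases above, leaving out those where automatic conversion may apply.

```python
# Bare codes that would be disabled, mapped to their script variants
# (a partial, illustrative selection from the "Not sure" list).
SCRIPT_VARIANTS = {
    "ks": ("ks-arab", "ks-deva"),
    "ruq": ("ruq-cyrl", "ruq-latn"),
    "shi": ("shi-tfng", "shi-latn"),
    "tg": ("tg-cyrl", "tg-latn"),
}


class AmbiguousLanguageError(ValueError):
    """Raised when a bare code is given where a script variant is required."""


def require_variant(code):
    """Return the code unchanged unless it is a disabled bare code."""
    if code in SCRIPT_VARIANTS:
        options = ", ".join(SCRIPT_VARIANTS[code])
        raise AmbiguousLanguageError(
            f"'{code}' is ambiguous; use one of: {options}"
        )
    return code


print(require_variant("ruq-latn"))
```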

*** Bug 44379 has been marked as a duplicate of this bug. ***

I don't see why various scripts would redirect to each other; this should rather be done using language fallback (bug 36430).
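
For comparison with redirects, a language fallback lookup might look roughly like the following. This is a sketch under assumptions: the chains in `FALLBACKS` are invented for illustration and are not taken from bug 36430 or from MediaWiki's actual fallback configuration, and `resolve_label` is a hypothetical helper.

```python
# Hypothetical fallback chains: requested code -> codes to try next, in order.
FALLBACKS = {
    "crh": ["crh-latn", "crh-cyrl"],
    "ks": ["ks-arab", "ks-deva"],
}


def resolve_label(labels, code, fallbacks=FALLBACKS):
    """Return (code_used, label) for the first code in the chain that has a
    label, or None if no code in the chain has one."""
    for candidate in [code, *fallbacks.get(code, [])]:
        if candidate in labels:
            return candidate, labels[candidate]
    return None


labels = {"crh-latn": "example label"}
print(resolve_label(labels, "crh"))
```

Unlike a redirect, the stored data keeps its original code; only the lookup changes.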

*** This bug has been marked as a duplicate of bug 37459 ***
Restricted Application added a subscriber: StudiesWorld.