
Many language wiki templates (pl, it, en, cs) don't accept xx-XX style language codes
Closed, Declined | Public | 0 Estimated Story Points

Description

IT wiki:
You can see it here.
(I also put the same citations at en.wiki in case it's useful or interesting for you to see the different outcome.)
A solution to a similar issue is discussed in https://phabricator.wikimedia.org/T97256#1248815 .

PL wiki:
https://www.mediawiki.org/w/index.php?title=Topic:Sgikbv81nxsv7oy6&topic_showPostId=sqt0thxnx0gx69xc#flow-post-sqt0thxnx0gx69xc

EN wiki:
VE is setting the language in {{cite}} templates. It's setting it to en-US, en-GB and other flavours which are not recognized by {{cite}}. Also, these shouldn't be set if the language and wiki are the same languages. On enwiki, these errors end up in [[Category:CS1 maint: Unrecognized language]]

CS wiki:
Per community discussion here, spotted also with another problem described in T156548

Examples are:
https://en.wikipedia.org/w/index.php?title=Chris_Harris_(Automotive_Journalist)&action=edit&oldid=688324259
https://en.wikipedia.org/w/index.php?title=Adam_Waito&type=revision&diff=685374219&oldid=685373324
https://en.wikipedia.org/w/index.php?title=Aijia&type=revision&diff=681819929&oldid=681818020

Event Timeline

Elitre raised the priority of this task from to Needs Triage.
Elitre updated the task description. (Show Details)
Elitre added a project: Citoid.
Elitre subscribed.
Elitre set Security to None.
Elitre added a subscriber: Mvolz.

So, this is basically the result of us now scraping more data. Our language validator has always allowed xx-XX style language codes; we just weren't getting them as often, so it wasn't as noticeable.

We're currently not entirely certain how to resolve this; each template has its own way of validating language codes, and we don't want to overfit to a particular template. We'd like to conform to a given standard but are not sure what that would be. We're currently basically using https://en.wikipedia.org/wiki/IETF_language_tag (as noted by @mobrovac in chat) but not very strictly.

Mvolz renamed this task from "Unknown language" error on it.wp for sources in Italian to Many language wiki templates (pl, it, en) don't accept xx-XX style language codes. Nov 3 2015, 2:49 PM
Mvolz updated the task description. (Show Details)
Mvolz updated the task description. (Show Details)

@Mvolz, so who can help you to move this forward, anything I can do here? Is "not scraping more data until a fix is found" a possible solution? Any advice we can give to communities to "fix" this on their side if possible, other than the workaround linked above? Thank you!

@Elitre, re moving things forward: I think we are basically still undecided on what to do.

A possible fix, which is safe probably for most templates, is to stick to two-three letter language codes, but there have been complaints about the existing language codes being too limiting. But that's something I'm willing to do- @mobrovac?
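For illustration only, reducing a tag to its two- or three-letter primary subtag could look roughly like the sketch below. The function name to_base_code and the regular expression are assumptions made for this example; this is not Citoid's actual validator.

```python
import re
from typing import Optional

# Primary language subtag: two or three ASCII letters, optionally followed by
# further hyphen-separated subtags (region, script, ...) that are dropped here.
_TAG = re.compile(r"^([A-Za-z]{2,3})(?:-[A-Za-z0-9]{1,8})*$")


def to_base_code(tag: str) -> Optional[str]:
    """Return the lowercase primary subtag, or None if the tag is malformed.

    Hypothetical helper for illustration; not part of any existing service.
    """
    match = _TAG.match(tag.strip())
    return match.group(1).lower() if match else None


if __name__ == "__main__":
    for example in ("en-US", "pt-BR", "ace", "english"):
        print(example, "->", to_base_code(example))
```

Note that this throws away the distinction between, say, pt-BR and pt, which is exactly the limitation people have complained about.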

Mvolz triaged this task as Medium priority.

A possible fix, which is safe probably for most templates, is to stick to two-three letter language codes, but there have been complaints about the existing language codes being too limiting.

Let's pick a standard and enforce it?

I think we agree on that; the issue is just which standard.

nl.wp doesn't recognize xx-XX language codes like nl-NL; the language templates accept only two-letter codes (and sometimes three-letter codes).

Mvolz removed Mvolz as the assignee of this task. Sep 30 2016, 2:39 PM
Jdforrester-WMF lowered the priority of this task from Medium to Low. Oct 4 2016, 7:14 PM
Jdforrester-WMF subscribed.

So, the options are:

  • Ignore this.
  • Modify the citoid service to send less information, except magically when it's wanted (like pt-BR vs. pt); no idea how we'd all agree on a shared set for everyone.
  • The above, but inside the Citoid extension, so all clients of the service would have to replicate the same logic (but more flexible to adjust on a per-wiki basis).
  • Fix the templates to work with these valid codes.

Or am I missing something? Option 4 seems the obvious winner…
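As a rough sketch of what options 2 and 3 would involve, the per-wiki variant could look something like the following. The PER_WIKI_VARIANTS table and the reduce_for_wiki function are hypothetical names for this example; nothing like this is known to exist in the citoid service or the Citoid extension.

```python
from typing import Dict, Set

# Hypothetical per-wiki opt-in lists: region variants a wiki wants to keep.
PER_WIKI_VARIANTS: Dict[str, Set[str]] = {
    "ptwiki": {"pt-BR"},
    "nlwiki": set(),
}


def reduce_for_wiki(tag: str, keep_variants: Set[str]) -> str:
    """Keep the full tag only if the wiki opted in; otherwise send the base code."""
    # Compare case-insensitively, but return the wiki's own spelling of the tag.
    wanted = {variant.lower(): variant for variant in keep_variants}
    cleaned = tag.strip()
    if cleaned.lower() in wanted:
        return wanted[cleaned.lower()]
    # Fall back to the primary subtag (everything before the first hyphen).
    return cleaned.split("-", 1)[0].lower()


if __name__ == "__main__":
    print(reduce_for_wiki("pt-br", PER_WIKI_VARIANTS["ptwiki"]))
    print(reduce_for_wiki("nl-NL", PER_WIKI_VARIANTS["nlwiki"]))
```

The sketch makes the maintenance cost visible: someone has to own and update that table for every wiki, which is the "no idea how we'd all agree" problem in code form.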

On fiwiki we have {{IETF-kielisymboli}}, which renders "en-EN" the same way "en" would be rendered. If the language is Finnish, i.e. the same as the site language, nothing is shown at all. This uses the codeLangue3 function in Module:FrLangue.
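The behaviour described for fiwiki can be approximated as follows. This is a Python sketch of the idea only, not a port of {{IETF-kielisymboli}} or Module:FrLangue; the function name language_note, the sample name table, and the choice to show nothing for unknown codes are all assumptions.

```python
from typing import Dict

# Illustrative name table only; a real wiki would take names from its own data.
SAMPLE_NAMES: Dict[str, str] = {"en": "englanniksi", "de": "saksaksi"}


def language_note(tag: str, site_language: str, names: Dict[str, str]) -> str:
    """Return a display note for the source language, or "" if none is needed."""
    base = tag.strip().split("-", 1)[0].lower()
    # Same language as the wiki itself: accept the value but display nothing.
    if base == site_language.lower():
        return ""
    name = names.get(base)
    # Assumption: unknown codes are silently dropped rather than flagged.
    return f"({name})" if name else ""


if __name__ == "__main__":
    print(repr(language_note("en-EN", "fi", SAMPLE_NAMES)))
    print(repr(language_note("fi-FI", "fi", SAMPLE_NAMES)))
```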

German WP has no problems at all with any language code.

If we are told explicitly that a book is written in German, we store this information and expose it in microformats, but we don't show it in articles, so readers are not bothered with it.

May I advertise the Multilingual Lua library, e.g. its getBase function? It falls back to the root language for those who cannot deal with extended codes right now. Later it may be configured to support variants unknown to CLDR. Publications written in multiple languages are supported, too.

Mvolz renamed this task from Many language wiki templates (pl, it, en) don't accept xx-XX style language codes to Many language wiki templates (pl, it, en, cs) don't accept xx-XX style language codes. Jan 28 2017, 1:12 PM
Mvolz added a subscriber: Dvorapa.

Czech Wikipedia users complained again (details)

This can be fixed, community-side, by editing the citation templates. There are a few references to this in this ticket and in at least another one. You can contact the people who left such comments if you need further details.

@Elitre There is one problem. The Czech Wikipedia community follows the ISO 639-1 standard, which only defines two-letter language codes (also used in Wikipedia subdomains).

  • Text sequences, on the web and in HTML, are tagged with codes according to RFC 5646, Tags for Identifying Languages.
  • There, the section on the primary language subtag declares “Three-character primary language subtags in the IANA registry were defined” etc.
  • The IANA Language Subtag Registry has listed more than 8100 three-letter codes (search for Subtag: aaa) since 2009.
  • HTML refers to “BCP 47”, which is nothing other than RFC 5646 (the successor to the obsoleted RFC 4646), with the same story on three-letter codes.
  • HTML is the definitive specification for how wikitext is resolved on the client side.

Conclusion: Any limitation to two-letter codes is not appropriate and needs to be lifted.

  • https://ace.wikipedia.org/ is the first of many Wikipedia subdomains with three letters. The two-letter-code story is nonsense today; it had a certain importance a decade ago.

The English Wikipedia CS1/2 modules currently support by default what MediaWiki supports (mw.language.fetchLanguageNames). The module trims the value to the first 2/3-letter code in language_parameter in the module proper. You can see this by experimenting on an English Wikipedia page: it does accept e.g. nl-NL, with an output of "(in Dutch)". There are some overrides listed in lang_name_remap in the configuration file, but that's not directly relevant here.
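For wikis wanting to reproduce this, the trim-then-look-up behaviour can be sketched as below. This is Python rather than the module's Lua, and it is only an approximation of the description above: LANGUAGE_NAMES stands in for whatever mw.language.fetchLanguageNames returns, NAME_REMAP is a made-up stand-in for lang_name_remap, and describe_language is a hypothetical name.

```python
from typing import Dict, Optional

# Stand-in for the name table MediaWiki would supply (tiny sample only).
LANGUAGE_NAMES: Dict[str, str] = {"nl": "Dutch", "en": "English", "de": "German"}

# Stand-in for local overrides; the entry here is invented for illustration.
NAME_REMAP: Dict[str, str] = {"bn": "Bengali"}


def describe_language(value: str) -> Optional[str]:
    """Trim a tag to its leading 2-3 letter code and render "(in <name>)".

    Returns None for unrecognized codes, which a caller could route to a
    maintenance category instead of displaying anything.
    """
    code = value.strip().lower().split("-", 1)[0]
    if not (2 <= len(code) <= 3 and code.isascii() and code.isalpha()):
        return None
    # Overrides win over the default name table.
    name = NAME_REMAP.get(code) or LANGUAGE_NAMES.get(code)
    return f"(in {name})" if name else None


if __name__ == "__main__":
    print(describe_language("nl-NL"))
    print(describe_language("xx-XX"))
```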

I generally agree that the correct fix is for the communities to get their modules up to date with the English modules if they are seeing errors for longer codes. (There may be other issues with that of course that come with what are likely severely out-of-date modules. [No, not a problem fixed by global modules--we'd just kill all development of the English modules that way as everyone would need to agree that certain parameters were deprecated or not deprecated or.......])

I would recommend that this task be declined entirely or at best an issue for the CommTech team to handle, and for wikis still affected to come to the English Wikipedia talk page. @Trappist_the_monk is very helpful with use-of-CS1/2-on-other-wikis kinds of questions/issues.

What the module does not do at this time is display that this is Dutch Dutch (nl-NL) (or, er, Dutch--better example is en-GB = British English). Is that what is being requested? There are some TODO comments in the module code if this is what is being requested. There's probably some work that could be done to hook into Module:Lang which would support this better. However, I think that request is a different task.

The description includes "Also, these shouldn't be set if the language and wiki are the same languages." This is not true any longer (see end comments in the linked T156548). English Wikipedia at least will take the value but do nothing with it rather than dump it into a maintenance category. This is done automatically without any need for configuration.

What the module does not do at this time is display that this is Dutch Dutch (nl-NL) (or, er, Dutch--better example is en-GB = British English). Is that what is being requested? There are some TODO comments in the module code if this is what is being requested. There's probably some work that could be done to hook into Module:Lang which would support this better. However, I think that request is a different task.

My understanding was that that actually *isn't* wanted...

The description includes "Also, these shouldn't be set if the language and wiki are the same languages." This is not true any longer (see end comments in the linked T156548). English Wikipedia at least will take the value but do nothing with it rather than dump it into a maintenance category. This is done automatically without any need for configuration.

Yes, my understanding is that this is the preferred option. This ticket is so old that it has since been fixed on-wiki. I agree that we should decline it, since it seems to be a problem that, if left to sit long enough, encourages better practice through the use of more modern language codes, as per @PerfektesChaos ;).