
Language code "sms" not recognized in Commons
Closed, ResolvedPublic

Description

A user was attempting to add a file caption in the Skolt Sami language (ISO 639-3 code sms). The Universal Language Selector (ULS) does not offer sms as a language option; the closest match found is ses.

Can sms be added to the ULS?

Event Timeline

Zache added a comment.May 2 2019, 8:10 AM

I believe it was @Amire80 who suggested that we should instead be using language-data (which is also used by ULS) for all these cases. I don't know how one would go about doing that though.

This means to add them to CLDR? Based on Wikidata-related discussions, the problem with adding new languages to CLDR is that not all languages are accepted into CLDR. The process also seems to be pretty slow. These are the reasons why they are defined via wmgExtraLanguageNames in the first place.

Language-data does not depend on languages being in CLDR. It aims to be a comprehensive database of languages (those having a standard language code), their autonyms and scripts.

I don't think there is a way around the fact that we need to explicitly define the list of languages appropriate for any given context. The list is different for MediaWiki interface language selection, page content language selection, Wikibase labels and so on. Often we also need to define a language code map to ensure that legacy codes and other discrepancies work correctly. From this point of view, the problem is that it is not clear where the list of valid languages for a given context is configured or, conversely, which contents a given language setting affects. There should be a bijection from context to configuration setting. Currently there isn't one, and that is very confusing. If we add new context-specific configuration variables, we also need to keep the documentation up to date (such as the instructions for adding a new wiki) so that new languages get added to the appropriate places.

Keegan added a comment.May 2 2019, 6:00 PM

I'm not sure where this feature request will go, but the user points out that smn, Inari Sami, likely does not work either and would like it to.

Raymond added a subscriber: Raymond.May 2 2019, 6:40 PM
Yupik added a subscriber: Yupik.EditedMay 3 2019, 10:52 AM

I'm not sure where this feature request will go, but the user points out that smn, Inari Sami, likely does not work either and would like it to.

I tried adding in smn today and it didn't work. It doesn't show up in the caption at all and it only shows up as the language code in the summary.

Adding Sami language support to Wikidata was T217430 which is marked as Resolved, so I would expect it to Just Work™ in WBMI.

Short test: None of the languages (besides nys for whatever reason) added via wmgExtraLanguageNames can be used as language in WBMI.
Furthermore they do not work together with the parser function {{#language:xx}}:

But they work on Test-Wikidata:

@Jdforrester-WMF ; Would it be a solution to add commonswiki as another project in wmgExtraLanguageNames?

Nikki added a comment.May 3 2019, 7:48 PM

Short test: None of the languages (besides nys for whatever reason) added via wmgExtraLanguageNames can be used as language in WBMI.

nys works because it was later added to Names.php, making the entry in wmgExtraLanguageNames redundant.

oh, wow, yes, ExtraLanguageNames is not an appropriate disposition for this stuff (as it only works on one wiki, Wikidata, so e.g. clients of Wikidata won't recognise it). New language support for actual content should always go in MediaWiki itself in the normal manner.

oh, wow, yes, ExtraLanguageNames is not an appropriate disposition for this stuff (as it only works on one wiki, Wikidata, so e.g. clients of Wikidata won't recognise it). New language support for actual content should always go in MediaWiki itself in the normal manner.

That means putting them in LocalNames in the CLDR extension?

oh, wow, yes, ExtraLanguageNames is not an appropriate disposition for this stuff (as it only works on one wiki, Wikidata, so e.g. clients of Wikidata won't recognise it). New language support for actual content should always go in MediaWiki itself in the normal manner.

That means putting them in LocalNames in the CLDR extension?

This was already done by me weeks ago. But this does not enable them as allowed language on wikis.

Then that something probably should be changed to use codes coming from the CLDR extension. It might be as easy as changing one parameter to fetchLanguageNames.

Yupik added a comment.May 7 2019, 5:54 AM

Is it safe to assume that everything I've tried to upload before this issue gets fixed will need to be reentered?

As I have been and will be uploading a number of Saami-related photos in upcoming months, the sooner this gets resolved so I can enter this info when uploading, the better since I'd prefer not to have to go back and retype all the lost information.

Yupik added a comment.EditedMay 7 2019, 6:01 AM

The subtask of renaming sää´mǩiõll to the proper term has been moved to T223544, but the main task is still not working properly.

Nikki added a comment.May 7 2019, 9:14 AM

On a side note, who can I ask to fix the palatalization marker in Wikidata in various places:

I think it's coming from language-data.

Yupik added a comment.May 7 2019, 11:52 AM

On a side note, who can I ask to fix the palatalization marker in Wikidata in various places:

I think it's coming from language-data.

Thanks, I fixed it there and put in a pull request for it. Hopefully that was the problem.

Yupik added a comment.EditedMay 10 2019, 10:53 PM

On a side note, who can I ask to fix the palatalization marker in Wikidata in various places:

I think it's coming from language-data.

Fixing it there didn't fix it on Wikidata, unfortunately. No clue where it's coming from.

EDIT: Actually, never mind. My correction to that file has not made its way to the master branch for whatever reason so the master branch file still has the wrong marker in it, which is why it doesn't work yet.

Then that something probably should be changed to use codes coming from the CLDR extension. It might be as easy as changing one parameter to fetchLanguageNames.

Would it be possible to get this fixed asap? We have a Saami edit-a-thon coming up next month as part of IYIL2019 and it'd be mighty embarrassing to have to say that yeah, in principle you can use your own language, but in reality, you can't.

Then that something probably should be changed to use codes coming from the CLDR extension. It might be as easy as changing one parameter to fetchLanguageNames.

Would it be possible to get this fixed asap? We have a Saami edit-a-thon coming up next month as part of IYIL2019 and it'd be mighty embarrassing to have to say that yeah, in principle you can use your own language, but in reality, you can't.

Seconding @Yupik's comment. We need this fixed and tested for an upcoming event. I will add a link to the event board as soon as I create it.

Yupik moved this task from Incoming to In progress on the WMFI board.
Susannaanas added a comment.EditedMay 22 2019, 8:25 AM

Summary

We would like to ask the SDC team to make the following Saami languages available in SDC and preferably also create documentation of how languages are configured for SDC.

  • The language list and their current status on Wikidata can be found in the related ticket. Of these we kindly request to add all those with adequate data (autonyms). If the process gets documented it will be then easier to add the remaining ones once the language info is complete.
  • There has been wide attention on this task during the hackathon. People referenced in this and the related ticket can be asked for further information about actions taken to enable them on Wikidata.

Status
We can add a status report of what is and is not functioning in SDC currently, if needed.

Use case
The capability is needed in an upcoming workshop on 5 June where participants will be asked to caption and tag Commons images about Saami culture in Saami languages.

According to @Ramsey-WMF this is not something that can be fixed by the StructuredDataOnCommons team as the issue lies elsewhere. Pinging a few other people: @Ladsgroup @Amire80 @Nikerabbit who may have a lead on this.

Yupik added a comment.May 22 2019, 9:28 AM

There are multiple issues with this ticket. In addition to the screenshot in https://phabricator.wikimedia.org/T222309#5163388, we also have this issue, where it's not loading the name of the language for either sms or smn in the summary box:

Yupik added a comment.May 22 2019, 9:32 AM

Similar to the issue in T222309#5163388, but trying to upload a caption once the image has been uploaded does not work for sms or smn when trying to select the language by ISO 639-3 code, its autonym, or any of its translations into other languages:

sms

smn

Hey @Lydia_Pintscher could you have a look at this and see if we can help?

According to @Ramsey-WMF this is not something that can be fixed by the StructuredDataOnCommons team as the issue lies elsewhere. Pinging a few other people: @Ladsgroup @Amire80 @Nikerabbit who may have a lead on this.

The discussion about the parts with which I can help takes place at T223524.

Yupik added a comment.May 22 2019, 9:35 AM

If I'm looking at the correct and most up-to-date file, both sms and smn are included in the ULS. There smn is correctly written, but sms is not. The same change as in T223544 needs to be made to the name of the language there.

I have found a new issue: Language::fetchLanguageNames( Language::AS_AUTONYM, Language::ALL ) does not return sms, while for example Language::fetchLanguageNames( 'en', Language::ALL ) does.

The former is usually used to get a list of all known languages. This can be fixed in the cldr extension.
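A discrepancy like this can be spotted by comparing the key sets of the two result maps. The following is only a minimal sketch: the two objects are hypothetical sample data standing in for the PHP results, and missingCodes is an illustrative helper, not part of MediaWiki.

```javascript
// Hypothetical sample data standing in for the two PHP results; not real output.
const autonymNames = { en: 'English', fi: 'suomi', smn: 'anarâškielâ' };
const englishNames = { en: 'English', fi: 'Finnish', smn: 'Inari Sami', sms: 'Skolt Sami' };

// Return the codes present in `reference` but absent from `candidate`.
function missingCodes( candidate, reference ) {
	return Object.keys( reference )
		.filter( function ( code ) {
			return !Object.prototype.hasOwnProperty.call( candidate, code );
		} )
		.sort();
}

const missing = missingCodes( autonymNames, englishNames );
console.log( missing ); // [ 'sms' ]
```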

Yupik added a comment.May 22 2019, 9:52 AM

That's a good catch @Nikerabbit.

Yupik renamed this task from Language code "sms" not recognized to Language code "sms" not recognized in Commons.May 22 2019, 9:53 AM

I've looked a bit too deeply into the WikibaseMediaInfo code, and maybe I see the area where the problem happens, but it should be checked by @Cparle and @egardner who appear to have written the relevant parts. It's also possible that I'm totally off the mark.

The first language from which this task started is sms. It has been in ULS for more than two years, so this is not a problem (if its name is spelled incorrectly, it's also not a blocker). The key here appears to be which languages ULS shows in the panel. This is controlled by the languages: parameter in ULS initialization, which appears in resources/filepage/UlsWidget.js:

	this.uls = this.dropdown.$handle.uls( {
		onSelect: function ( language ) {
			ulsWidget.setValue( language );
			// eslint-disable-next-line no-jquery/no-event-shorthand
			ulsWidget.dropdown.$handle.focus();
		},
		languages: languages,
		onVisible: function () {
			// Re-position the ULS *after* the widget has been rendered, so that we can be
			// sure it's in the right place
			var offset = ulsWidget.$element.offset();
			if ( this.$menu.css( 'direction' ) === 'rtl' ) {
				offset.left = offset.left - parseInt( this.$menu.css( 'width' ) ) + ulsWidget.$element.width();
			}
			this.$menu.css( offset );
		}
	} );

The variable languages is initialized in resources/filepage/CaptionsPanel.js:

CaptionsPanel.prototype.getAvailableLanguages = function (
	excludeLanguages, includeLanguage
) {
	var languages = {};
	$.extend( languages, mw.config.get( 'wgULSLanguages' ) );
	( excludeLanguages || [] ).forEach( function ( languageCode ) {
		if ( languageCode !== includeLanguage ) {
			delete languages[ languageCode ];
		}
	} );
	return languages;
};

Unfortunately, that's where my WikibaseMediaInfo code digging hit the limit, because I don't have it installed and I'm too busy with Other Stuff, however, please do check that whatever you use here doesn't exclude sms.

Thanks for digging @Amire80!

FYI: wgULSLanguages is a JavaScript variable generated by the ULS extension. It lists the languages that are available as interface languages in MediaWiki. sms doesn't have enough translations to be available as an interface language. The question is how and where includeLanguage (singular?) is defined. Alternatively, a new variable could be introduced that lists the languages allowed for captions.

FYI: wgULSLanguages is a JavaScript variable generated by the ULS extension. It lists the languages that are available as interface languages in MediaWiki. sms doesn't have enough translations to be available as an interface language. The question is how and where includeLanguage (singular?) is defined. Alternatively, a new variable could be introduced that lists the languages allowed for captions.

Is there any reason not to get all the languages that are available in langdb? :)

Is there any reason not to get all the languages that are available in langdb? :)

Differences in language code use in langdb vs. mediawiki (als<->gsw, simple<->en-simple). Filtering out redirects. Aggregate language codes (I forgot the right term) that cover multiple languages or areas (e.g. es-419).
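The three concerns above (code mapping, redirects, aggregate codes) could be handled in one pass over the language list. This is a rough sketch only: the data shape, the redirect field, the mapping direction and the helper name toContentLanguages are all assumptions made for illustration and do not reflect the real langdb format.

```javascript
// Illustrative data only; the real langdb format differs.
const langdbSample = {
	gsw: { autonym: 'Alemannisch' },
	'en-simple': { autonym: 'Simple English' },
	sms: { autonym: 'nuõrttsääʹmǩiõll' },
	'es-419': { autonym: 'español de América Latina' },
	old: { redirect: 'sms' } // hypothetical redirect entry
};

// Assumed direction: langdb code -> MediaWiki code.
const CODE_MAP = { gsw: 'als', 'en-simple': 'simple' };
// Codes covering multiple languages or areas that should not be offered.
const EXCLUDED = [ 'es-419' ];

function toContentLanguages( langdb, codeMap, excluded ) {
	const result = {};
	Object.keys( langdb ).forEach( function ( code ) {
		const entry = langdb[ code ];
		// Skip redirects and explicitly excluded codes.
		if ( entry.redirect || excluded.indexOf( code ) !== -1 ) {
			return;
		}
		const mapped = Object.prototype.hasOwnProperty.call( codeMap, code ) ?
			codeMap[ code ] : code;
		result[ mapped ] = entry.autonym;
	} );
	return result;
}

const contentLanguages = toContentLanguages( langdbSample, CODE_MAP, EXCLUDED );
console.log( Object.keys( contentLanguages ).sort() ); // [ 'als', 'simple', 'sms' ]
```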

Cparle added a comment.EditedMay 22 2019, 2:13 PM

The question is how and where includeLanguage (singular?) is defined.

includeLanguage and excludeLanguages are just mechanisms for limiting the languages in each ULS so that the user can only enter one caption value for each language. It doesn't have anything to do with the set of all languages
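To illustrate, here is a standalone sketch of the same filtering logic. It takes the full language list as a parameter (allLanguages, an assumption made so the example runs without mw.config), and the sample data is hypothetical.

```javascript
// Standalone version of the filtering logic; `allLanguages` replaces the mw.config lookup.
function getAvailableLanguages( allLanguages, excludeLanguages, includeLanguage ) {
	const languages = Object.assign( {}, allLanguages );
	( excludeLanguages || [] ).forEach( function ( code ) {
		// Keep the language of the row being edited, drop the other used ones.
		if ( code !== includeLanguage ) {
			delete languages[ code ];
		}
	} );
	return languages;
}

const all = { en: 'English', fi: 'suomi', se: 'davvisámegiella' };
// Captions already exist in en and fi; the row being edited is the fi one.
const forFiRow = getAvailableLanguages( all, [ 'en', 'fi' ], 'fi' );
console.log( Object.keys( forFiRow ) ); // [ 'fi', 'se' ]
```

Whatever goes into the full list is therefore what decides whether sms can be offered at all; the two parameters only narrow it per caption row.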

wgULSLanguages is a JavaScript variable generated by the ULS extension. It lists the languages that are available as interface languages in MediaWiki. sms doesn't have enough translations to be available as an interface language.

Ok this then is the problem - the ULS in this case is being used to select a content language from a list of interface languages. Is a ULS not an appropriate tool to use here? Or is there some way to get the content languages from it instead?

ULS is perfectly appropriate, but as Nikerabbit says, what needs to be done is to take almost all languages in langdb, filter out a few unneeded ones, and pass that as the languages parameter in ULS initialization. Perhaps this deserves a new langdb function, I'll take a closer look.

Cool, thanks @Amire80 let me know if I can do anything

Zache added a comment.May 24 2019, 6:40 AM

btw, there is already wbTermsLanguages variable in mw.config.get() variables initialized in WikibaseMediaInfoHooks.php which contains languages supported by the backend. Not sure though if it is gonna be deprecated as it seems not to be in use anymore.

Change 512351 had a related patch set uploaded (by Cparle; owner: Cparle):
[mediawiki/extensions/WikibaseMediaInfo@master] Get languages for ULS from allowed languages for WB terms

https://gerrit.wikimedia.org/r/512351

Cparle added a comment.EditedMay 24 2019, 11:30 AM

Good spot @Zache - changed to use wbTermsLanguages and sms now appears on my local dev environment

edit: or maybe not ... it's ses, not sms. However wbTermsLanguages is set to WikibaseRepo::getDefaultInstance()->getTermsLanguages()->getLanguages(), which looks like it returns (or is intended to return) content languages, which is what we're looking for. Don't know why sms isn't in the list though @Amire80 ?

Change 512351 abandoned by Cparle:
Get languages for ULS from allowed languages for WB terms

https://gerrit.wikimedia.org/r/512351

Zache added a comment.EditedMay 24 2019, 11:49 AM

sms also needs to be configured in the backend as a supported language. For quick testing, defining it via wmgExtraLanguageNames or wgExtraLanguageNames is the fastest way.

However, wbTermsLanguages contains autonyms, and for the select box we want to prefer the language name in the user interface language and, if it is missing, fall back to something else (i.e. the label in ULS).
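The fallback described here could look roughly like the sketch below. The function name pickLanguageLabel, the order of the last two fallbacks and all sample data are assumptions for illustration, not existing WikibaseMediaInfo code.

```javascript
// Resolve the label shown for `code`, trying progressively less specific sources.
function pickLanguageLabel( code, uiLanguageNames, ulsLabels, autonyms ) {
	const sources = [ uiLanguageNames, ulsLabels, autonyms ];
	for ( let i = 0; i < sources.length; i++ ) {
		if ( Object.prototype.hasOwnProperty.call( sources[ i ], code ) ) {
			return sources[ i ][ code ];
		}
	}
	// Last resort: show the bare language code.
	return code;
}

// Hypothetical sample data for a user whose interface language is Finnish.
const uiNamesFi = { en: 'englanti', smn: 'inarinsaame' };
const ulsLabelsSample = { en: 'English' };
const autonymsSample = { en: 'English', smn: 'anarâškielâ', sms: 'nuõrttsääʹmǩiõll' };

console.log( pickLanguageLabel( 'smn', uiNamesFi, ulsLabelsSample, autonymsSample ) ); // inarinsaame
console.log( pickLanguageLabel( 'sms', uiNamesFi, ulsLabelsSample, autonymsSample ) ); // nuõrttsääʹmǩiõll
```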

Ok, so if I replace this

$.extend( languages, mw.config.get( 'wgULSLanguages' ) );

with this

$.extend( languages, mw.config.get( 'wbTermsLanguages' ), mw.config.get( 'wgULSLanguages' ) );

and add this to LocalSettings.php

$wgExtraLanguageNames['sms'] = 'some new language';

and then, on a File page, click 'Add a caption' and type 'sms' into the language input this is what I see

Is that what we want?
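For reference, the order of the arguments matters in that merge: later sources override earlier ones, so a label from wgULSLanguages wins whenever both lists contain the same code. The sketch below uses Object.assign as a stand-in for the shallow $.extend call, with hypothetical sample values.

```javascript
// Hypothetical sample values; the real variables come from mw.config.
const wbTermsLanguages = { en: 'English (terms list)', sms: 'some new language' };
const wgULSLanguages = { en: 'English (ULS label)' };

// Shallow merge, left to right: later sources override earlier ones.
const languages = Object.assign( {}, wbTermsLanguages, wgULSLanguages );

console.log( languages.en ); // English (ULS label)
console.log( languages.sms ); // some new language
```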

Change 512351 restored by Cparle:
Get languages for ULS from allowed languages for WB terms

https://gerrit.wikimedia.org/r/512351

Zache added a comment.May 24 2019, 5:00 PM

Is that what we want?

Yes

Yupik added a comment.May 24 2019, 5:11 PM

(In the parent task, we have the same problem for other Saami languages too. If those could be done as well, I'd really appreciate it.)

Ok - there's a patch in there ready for review now, but I don't know what needs to be done to add the language to commons itself. @Nikerabbit @Amire80 maybe one of you guys does? Or maybe it'll just appear when the patch is deployed?

Yupik moved this task from In progress to Patch for Review on the WMFI board.May 25 2019, 2:42 AM

Let's check if I have understood the system correctly @Nikerabbit

I think that it just needs to be added to Names.php. The requirement for adding it is a verified autonym, which is nuõrttsääʹmǩiõll per T223544. An alternative to Names.php would be adding the language to wmgExtraLanguageNames.

Other places that are required but have already been updated (per Siebrand's commits for Romani languages in T223524):

Change 512351 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Get languages for ULS from allowed languages for WB terms

https://gerrit.wikimedia.org/r/512351

The policy has been that only languages that have sufficient interface translations are added to Names.php.

So if we do not want to use wmgExtraLanguageNames for SDC (per T222309#5156677) then it needs to be fixed in code too?

One solution could be to change WikibaseContentLanguages::getDefaultTermsLanguages() to return the same language codes as WikibaseContentLanguages::getDefaultMonolingualTextLanguages(), minus the language codes which we want to use in monolingual texts but not in terms, such as 'und', 'mis', 'mul' and 'zxx'.

This works for SDC's WikibaseMediaInfo. However Wikibase's (=Wikidata) terms box shows only lang codes even in the cases where the CLDR have labels and labels are working in item creation form.
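The idea of deriving the terms languages from the monolingual text languages amounts to a simple set difference. The sketch below shows that logic in JavaScript for illustration only; the real change would be in PHP, and the function name and sample list are hypothetical.

```javascript
// Codes that are valid for monolingual text but make no sense for terms.
const SPECIAL_CODES = [ 'und', 'mis', 'mul', 'zxx' ];

// Derive the terms languages from the monolingual text languages.
function deriveTermsLanguages( monolingualTextLanguages ) {
	return monolingualTextLanguages.filter( function ( code ) {
		return SPECIAL_CODES.indexOf( code ) === -1;
	} );
}

// Hypothetical sample list.
const monolingualSample = [ 'en', 'sms', 'und', 'mul', 'smn' ];
const termsSample = deriveTermsLanguages( monolingualSample );
console.log( termsSample ); // [ 'en', 'sms', 'smn' ]
```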

Thinking out loud: I wonder how many sets of supported languages we need? It's of course desirable that each context can be defined separately, but that is also error prone.

Contexts I can think of quickly:

  • User interface language
  • Wiki (default) content language (or rather list of sister language projects)
  • Page (source) content language
  • Page (target) content languages for translations
  • Wikibase labels
  • Wikibase monolingual texts
  • SDC captions
  • Babel extension user boxes

Are there more? Which of these could share the same set? Which of these could share the same set + a few special additional languages? E.g. languages for translation targets would be the same as available page content languages + Message documentation (qqq).

Where would the sets be defined? User interface languages are currently defined as languages in Names.php which have localisation.
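One way to picture a single place of definition is a registry keyed by context, where derived sets are built from a shared base set plus a few extras. This is only a thought-experiment sketch: the context names, the sample codes and the isAllowed helper are all hypothetical.

```javascript
// Hypothetical base set shared by several contexts.
const pageContentLanguages = [ 'en', 'fi', 'smn', 'sms' ];

// One entry per context; derived sets reuse the base set plus special codes.
const languageSets = {
	pageContent: pageContentLanguages,
	translationTargets: pageContentLanguages.concat( [ 'qqq' ] )
};

// Check whether `code` is valid in the given context.
function isAllowed( context, code ) {
	if ( !Object.prototype.hasOwnProperty.call( languageSets, context ) ) {
		throw new Error( 'Unknown context: ' + context );
	}
	return languageSets[ context ].indexOf( code ) !== -1;
}

console.log( isAllowed( 'translationTargets', 'qqq' ) ); // true
console.log( isAllowed( 'pageContent', 'qqq' ) ); // false
```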

Zache added a comment.May 28 2019, 6:03 PM

AFAIK, SDC captions are the same thing as Wikibase labels. It would be a good thing if the supported languages for monolingual texts and terms were the same. Terms = captions, labels, descriptions, aliases.

Nikki added a comment.May 28 2019, 7:17 PM

AFAIK, SDC captions are the same thing as Wikibase labels. It would be a good thing if the supported languages for monolingual texts and terms were the same. Terms = captions, labels, descriptions, aliases.

If they were the same, this ticket wouldn't exist. sms works for Wikidata labels but not for Commons captions.

Thinking out loud: I wonder how many sets of supported languages we need? It's of course desirable that each context can be defined separately, but that is also error prone.
Contexts I can think of quickly:

  • User interface language
  • Wiki (default) content language (or rather list of sister language projects)
  • Page (source) content language
  • Page (target) content languages for translations
  • Wikibase labels
  • Wikibase monolingual texts
  • SDC captions
  • Babel extension user boxes

Are there more? Which of these could share the same set? Which of these could share the same set + a few special additional languages? E.g. languages for translation targets would be the same as available page content languages + Message documentation (qqq).

There's also Wikibase lexemes.

Which ones could share the same set depends on who you ask. As far as I can tell, the Wikidata community would like fewer restrictions on allowed languages and don't really understand why labels, monolingual text and lexemes are separate sets, while the Wikidata development team want to keep them separate (see T210293#5155097 for example).

There are some interface languages which Wikidata doesn't want as content languages (e.g. T51024) so it would be nice to separate UI language from content language.

On Wikidata, it would make sense for people to be able to pick the UI language from more than just languages with UI translations, because the UI language also affects which labels are displayed. For example, https://www.wikidata.org/wiki/Q1089774?uselang=sms displays the country name in Skolt Sami even though the UI isn't translated, but there's no way (as far as I can tell) to make that setting persistent. Alternatively, it would be nice to select the preferred label languages independently of the UI language, but I suspect the developers wouldn't want that because it probably makes caching difficult.

If they were the same, this ticket wouldn't exist. sms works for Wikidata labels but not for Commons captions.

This is the patch that changes the languages for Commons captions to be the same as the languages for Wikidata labels https://gerrit.wikimedia.org/r/512351

It should make it to live Commons on May 30

Zache added a comment.May 30 2019, 5:59 PM

Hmm, the code should be live on Commons. sms is still missing from the user interface because the language is not configured on Commons.

Would somebody like to test adding the wmgExtraLanguageNames parameters from Wikidata's setup to the beta Commons settings so we can see that they are working there?

Change 513565 had a related patch set uploaded (by Cparle; owner: Cparle):
[operations/mediawiki-config@master] Add 'sms' langcode to beta commons

https://gerrit.wikimedia.org/r/513565

Yupik added a comment.Jun 1 2019, 10:18 PM

I'm still getting the sms and smn version in the summary box in Commons:

And the caption box still doesn't recognize sms or smn at all.

There's a bug in core where, if a language code is not passed to Language::fetchLanguageNames(), not all languages are returned (because a hook gets skipped). Mentioned here https://gerrit.wikimedia.org/r/c/mediawiki/core/+/510705/4/includes/api/ApiQueryLanguageinfo.php#84

This seems to be the underlying cause of the problem.

Change 513565 merged by jenkins-bot:
[operations/mediawiki-config@master] Add 'smn' and 'sms' langcodes to beta commons

https://gerrit.wikimedia.org/r/513565

Now that sms has been added to the beta Commons config, it works in captions on the file info page. However, it still doesn't work in UploadWizard (it is missing in both the caption and the description language selectors).

UploadWizard languages are cached for 24 hours, hopefully it'll show up tomorrow

@Zache smn and sms are both appearing on UploadWizard on beta now

Yupik added a comment.Jun 13 2019, 2:29 AM

@Cparle: thank you, this is great news! Any idea how long it will take for them to make it to Commons proper?

Change 516760 had a related patch set uploaded (by Cparle; owner: Cparle):
[operations/mediawiki-config@master] Add 'sms' and 'smn' langcodes to commons for use in captions

https://gerrit.wikimedia.org/r/516760

@Yupik hoping we can get the patch above deployed next week

Yupik added a comment.Jun 13 2019, 2:10 PM

Wonderful, thank you (en), takkâ (smn) ja späʹsseb (sms)! :)

Happy to be able to help! :)

Change 516760 merged by jenkins-bot:
[operations/mediawiki-config@master] Add 'sms' and 'smn' langcodes to commons for use in captions

https://gerrit.wikimedia.org/r/516760

Mentioned in SAL (#wikimedia-operations) [2019-06-18T11:12:54Z] <awight@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:516760|Add 'sms' and 'smn' langcodes to commons for use in captions (T222309)]] (duration: 00m 48s)

Live on commons now. Ok to close the ticket @Yupik @Zache @Keegan ?

Zache added a comment.Jun 19 2019, 9:04 AM

Live on commons now. Ok to close the ticket @Yupik @Zache @Keegan ?

Just a question to @Nikerabbit and @Jdforrester-WMF first. Is it ok to copy the rest of the Wikidata wmgExtraLanguageNames values to Commons values as it works now? So that the supported Wikidata and SDoC term languages would be the same. If so, I will create a new ticket for that.

Yupik added a comment.Jun 19 2019, 4:54 PM

This is wonderful, thank you all so much! This opens up so many new possibilities for us!

Yupik awarded a token.Jun 19 2019, 5:15 PM
Yupik closed this task as Resolved.Jul 21 2019, 3:16 AM