[Story] Never show anything from a language that is not currently Wikidata conform
Open, High · Public · 8 Story Points

Description

Motivation
There are two cases in which the Wikibase interface is tempted to show fingerprints in the termbox for languages that are not supported:

Case 1

  • The ULS extension sends the user's preferred languages, not all of which are included in the language set allowed for the wiki. This can happen, e.g., because someone has an unsupported language in their babel box. In this case, there is never any content for the language.

Example of current buggy behavior
A user has fil as a language in their babel box, resulting in a weird-looking and oddly behaving first line of the "in more languages" section. Note that the red text for the missing label and description is shifted to the left because the language is totally missing.

Case 2

  • A language was supported at some point, but was then removed from the list of expected languages. As of now, current revisions contain no content in such languages. Thus, in the most recent revision the language can only appear through case 1, while previous revisions from the time the language was still supported may contain content for it.

Example (and the only one so far) of a language that is no longer supported
Support for Toki Pona stopped quite a while ago. In spring 2019, all of its content was deleted from the items where some existed.

As a Wikidata reader or editor
I want to only see languages that are actually supported by Wikidata
so that I can edit everything I see

Acceptance Criteria

  • Item and property pages (including diff views) do not show labels, descriptions, or aliases for language codes currently not recognized by Wikibase
  • Data returned by the action API (wbgetentities, wbsearchentities) and by the Special:EntityData API for items and properties does not include labels, descriptions, or aliases for language codes currently not recognized by Wikibase, for all revisions of the item/property
  • Dumps in all formats (JSON, RDF, all flavours) do not include item and property labels, descriptions, or aliases for language codes currently not recognized by Wikibase

Notes

More info about how the selection of languages for users currently works in Wikibase
Termbox shows languages considered "preferred by the user" (T213720) in the "more languages" sections.
These languages are sourced from the user's babel box (config.get( 'wbUserSpecifiedLanguages' )) or from ULS (uls.getFrequentLanguageList()), the latter being influenced by e.g. the country you are surfing the web from, user agent languages, and languages previously used on MediaWiki projects. They can, apparently, contain language codes that do exist but are not considered to be full-fledged MediaWikiContentLanguages (wb terminology) and delegate to another language code instead.

Details

Related Gerrit Patches:
mediawiki/extensions/UniversalLanguageSelector : wmf/1.34.0-wmf.19 : Revert "Return target of redirect languages in mw.uls.getFrequentLanguageList"
mediawiki/extensions/UniversalLanguageSelector : master : Revert "Return target of redirect languages in mw.uls.getFrequentLanguageList"
mediawiki/extensions/UniversalLanguageSelector : master : Return target of redirect languages in mw.uls.getFrequentLanguageList
wikibase/termbox : master : LanguageNameInUserLanguage: lose in favor of a getter

Event Timeline

Lea_WMDE updated the task description.
Lea_WMDE removed the point value for this task.
Lea_WMDE moved this task from Incoming to Ready to estimate on the Wikidata-Campsite board.
Tarrow added a subscriber: Tarrow. May 7 2019, 11:56 AM

@Lea_WMDE Is this meant to have been added to the campsite? As I read it, this task is only about changing the new (currently mobile) Termbox that the hike is working on.

Tarrow updated the task description. May 7 2019, 12:08 PM
WMDE-leszek updated the task description. May 7 2019, 12:26 PM
Tarrow updated the task description. May 7 2019, 12:49 PM

I'm unable to replicate this bug in the desktop "classic" termbox. Perhaps you could have a look, @Pablo-WMDE?

My steps were:

  • add {{#babel: fil}} to my user page
  • load an item

Outcome: I see a correct Tagalog line in the termbox. No problems adding content to the Tagalog fingerprint.

@Lydia_Pintscher @Lea_WMDE Could we please have a quick chat about this, if this is supposed to get any attention at all, to avoid people repeatedly trying to figure out what this is all about. Mind you this is affecting termbox v1 and v2 alike.

yes please, I'm not sure yet if I will be able to join the daily, but if not I'll try to find you later.

Tarrow updated the task description.
Tarrow added a project: Wikidata-Termbox.
Pablo-WMDE updated the task description. May 8 2019, 1:08 PM

Change 488090 merged by jenkins-bot:
[wikibase/termbox@master] LanguageNameInUserLanguage: lose in favor of a getter

https://gerrit.wikimedia.org/r/488090

Moved to "in prep" in task breakdown because we decided it wasn't ready to pick up.

Sorry for not putting this in here earlier.
@Lydia_Pintscher and I had a brief chat about the desired behaviour of the termbox (or, more broadly, Wikibase in general), and concluded the following:

  1. If the ULS or any other source of language codes provides a language code (and possibly related data in this language) that is not recognized by Wikibase, this entry must not be presented to the user.
  2. When data loaded from storage contains entries using a language code that is not recognized by Wikibase, those entries must not be presented to the user (e.g. if for reason X there is a label for the invalid language code stored in the DB, JSON data returned by the API should never expose this invalid label).
  3. When data containing a language code that is not recognized by Wikibase is requested to be stored (e.g. an edit made via the UI, the bot API, etc.), the API receiving the request should refuse to save the data and return an error about the invalid language code in the request.
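The three rules above could be sketched roughly as follows. This is only an illustration, not Wikibase code: the function names, the shape of the terms object (language code mapped to a label) and the contents of `recognizedLanguages` are all assumptions made for the example. The error text mirrors the API message quoted later in this task.

```javascript
// Hypothetical stand-in for the set of language codes Wikibase recognizes.
const recognizedLanguages = new Set(['en', 'de', 'tl']);

// Rule 1: drop codes coming from ULS (or any other source) that are not recognized.
function filterLanguageCodes(codes, recognized) {
  return codes.filter((code) => recognized.has(code));
}

// Rule 2: hide stored terms in unrecognized languages.
// Returns a new object; the stored data itself is not modified.
function filterTermsForDisplay(termsByLanguage, recognized) {
  return Object.fromEntries(
    Object.entries(termsByLanguage).filter(([code]) => recognized.has(code))
  );
}

// Rule 3: refuse to save terms in unrecognized languages.
function assertSavable(termsByLanguage, recognized) {
  for (const code of Object.keys(termsByLanguage)) {
    if (!recognized.has(code)) {
      throw new Error(`Unrecognized value for parameter "language": ${code}`);
    }
  }
}

// Usage with made-up data: "fil" is not in the recognized set.
const storedLabels = { en: 'example label', fil: 'example label' };
console.log(filterLanguageCodes(['en', 'fil', 'tl'], recognizedLanguages)); // [ 'en', 'tl' ]
console.log(filterTermsForDisplay(storedLabels, recognizedLanguages)); // { en: 'example label' }
```

Note that rule 2 is a read-side filter only, so the stored entry stays untouched.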

@Lydia_Pintscher could you please confirm that I got all of the above right and am not making things up?

@Lea_WMDE in the termbox case points 1 and 3 are particularly relevant.
It seems that for this very task there is an acceptance criterion missing (although it actually is the current title of this task): if the ULS provides a language code which is not recognized by Wikibase, it should be ignored, and there should not be a row in the termbox for this language code. This would fulfill behaviour requirement 1.
Re requirement 3, termbox v2 is compliant as long as the API providing item data is compliant with requirement 2. This is considered to be the case now that T200432 has been resolved. Some changes to the Wikibase API presenting data would still be recommended, but these are out of scope of the termbox work, and also not considered critically urgent given the current state of the language data in the Wikidata storage layer. Hence, there are no changes needed to termbox v2 with regard to editing/persistence.

Lea_WMDE renamed this task from Don't show fingerprints with unidentifyable language codes in termbox to Filter out languages from ULS that Wikibase doesn't know about. May 29 2019, 3:04 PM
Lea_WMDE updated the task description.
Lea_WMDE updated the task description. Jun 3 2019, 12:42 PM
Lea_WMDE updated the task description.
Lea_WMDE renamed this task from Filter out languages from ULS that Wikibase doesn't know about to Never show anything from a language that is not currently Wikidata conform. Jun 14 2019, 10:40 AM
Lea_WMDE edited projects, added Wikidata-Campsite; removed Wikidata-Termbox.
Lea_WMDE updated the task description.
alaa_wmde set the point value for this task to 8. Jun 25 2019, 12:10 PM

Change 523176 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/UniversalLanguageSelector@master] Return target of redirect languages in mw.uls.getFrequentLanguageList

https://gerrit.wikimedia.org/r/523176

I would rather fix ULS; this patch does that.

It isn't clear how the ULS redirect language patch is related here. Can we add a more elaborate commit message and/or a comment here? I also believe that that patch alone won't be enough, right? And it is just a prerequisite for the later work on this task?

Sorry for the confusion. The underlying issue here is that when there is a redirect language (like "fil", which redirects to "tl"), mw.uls.getFrequentLanguageList(), which is called in our codebase, returns the redirect and not its target, causing Wikibase to show the wrong and invalid language.
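To illustrate what "return the target instead of the redirect" means, here is a minimal sketch. It is not the actual ULS patch: `resolveRedirects` and the `languageRedirects` table are hypothetical names, and the table only contains the "fil" to "tl" example from this task.

```javascript
// Assumed redirect table; "fil" -> "tl" is the example discussed in this task.
const languageRedirects = new Map([['fil', 'tl']]);

// Replace each redirect code with its target and drop duplicates,
// keeping the order of first appearance.
function resolveRedirects(codes, redirects) {
  const resolved = codes.map((code) =>
    redirects.has(code) ? redirects.get(code) : code
  );
  return [...new Set(resolved)];
}

console.log(resolveRedirects(['en', 'fil', 'tl'], languageRedirects)); // [ 'en', 'tl' ]
```

Resolving a redirect does not by itself guarantee that the target is a code Wikibase accepts, so a validity check on the Wikibase side would still be needed.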

Change 523176 merged by jenkins-bot:
[mediawiki/extensions/UniversalLanguageSelector@master] Return target of redirect languages in mw.uls.getFrequentLanguageList

https://gerrit.wikimedia.org/r/523176

@Ladsgroup is the ULS fix alone enough for this task? If yes, should we move this to test/verify? If no, should we move it to ready to pick up (in case no one will be working on it today)?

alaa_wmde renamed this task from Never show anything from a language that is not currently Wikidata conform to [Story] Never show anything from a language that is not currently Wikidata conform. Jul 22 2019, 10:04 AM

My patch fixes Case 1, since uls.getFrequentLanguageList() won't return invalid languages after the deployment, but I don't know what to do regarding Case 2. It won't show up at the top of the termbox (old and new), but it will probably show up if you click on "Show all entered language" in the old termbox.

  1. When data load from the storage contains entries using a language code that is not recognized by Wikibase, those entries must not be presented to the user (e.g. for reason X there a label related to the invalid language code stored in the DB,

I seem to remember us talking about this at length but I didn't see it clearly here. Does this mean that we expect historic revisions of entities that contain languages that were valid at the time to now silently not show them?

My two cents to the recent discussion:

  1. The change https://gerrit.wikimedia.org/r/523176 is interesting, in particular the fact that it has been accepted by the ULS maintainers. In T222790 we've been asking them whether the "old" behaviour was intended or a mistake, and haven't got a clear answer from the respective team yet. I also note (as an "interesting" observation, I do not claim it is correct and have no opinion on this whatsoever) that "reverse-interpreting" the changes to the said ULS method I referred to in T222790#5198012 seemed to lead to the conclusion that the "old" behaviour ("fil" included in the result) was actually intended (at some point in time). It now looks like the Language team might have identified the desired behaviour. In this case it would be great to have an answer posted in T222790 as well. Apologies from my side for not mentioning T222790 in the task description here, it would have been helpful.
  2. Re fixing ULS and not changing Wikibase: I do not oppose such an approach. When this story was being defined, the approach considered was a bit different though. Please bear with me while I try to rephrase the approach to the issue that I had in mind when writing the task. I do not claim it is better, just bringing it up for your consideration here @Ladsgroup and @alaa_wmde. One could look at ULS, as used by Wikibase, as a black box which provides language codes (and possibly other language information). In design terms one could think of a "language information source" interface, which Wikibase knows and uses. There could be multiple services serving as a "language information source" for Wikibase; one based on ULS could be easily imagined. Services like ULS should in my eyes be completely agnostic to the fact that there is software like Wikibase that uses them for a particular need. Wikibase does have its own definition of a "valid/allowed/correct" language code, which, again in my opinion, should stay inside Wikibase, i.e. other services, be it ULS or anything else, should not have to conform to Wikibase conditions, as they simply couldn't even know they exist. Regardless of what service is used as a "language information provider", I'd argue Wikibase should not just accept its output but always apply its own filtering etc.

The point I am trying to make is that fixing ULS might solve the particular "fil" vs "tl" language code issue, but it gives Wikibase no guarantee that a similar problem won't arise in the future. Building a "filter" into Wikibase could, in my opinion, provide such security.
Again, I am not arguing that my perspective on this is the right one. This comment is my retroactive attempt to communicate what we've found out so far in the Termbox work, with the hope it is useful for people working on solving this very task.
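A rough sketch of the "language information source" idea, purely for illustration: every name below (`createStaticSource`, `collectDisplayLanguages`, `getLanguages`) is hypothetical and does not exist in Wikibase or ULS, and the static sources merely stand in for the babel box and ULS.

```javascript
// Hypothetical "language information source": any object with getLanguages().
function createStaticSource(codes) {
  return { getLanguages: () => [...codes] };
}

// Wikibase-side collection: ask every source, then apply Wikibase's own
// validity check, so that no source needs to know about Wikibase's rules.
function collectDisplayLanguages(sources, isRecognized) {
  const seen = new Set();
  const result = [];
  for (const source of sources) {
    for (const code of source.getLanguages()) {
      if (isRecognized(code) && !seen.has(code)) {
        seen.add(code);
        result.push(code);
      }
    }
  }
  return result;
}

// Usage with made-up data: "fil" comes from a source but is filtered out.
const babelSource = createStaticSource(['de', 'fil']);
const ulsSource = createStaticSource(['en', 'de']);
const wikibaseLanguages = new Set(['en', 'de', 'tl']);
console.log(
  collectDisplayLanguages([babelSource, ulsSource], (code) => wikibaseLanguages.has(code))
); // [ 'de', 'en' ]
```

With this shape, a change in what a source returns can add or remove candidate languages, but it cannot make an unrecognized code reach the termbox.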

I think historic revisions and languages that are no longer valid / existing in the Wikibase codebase / enabled on Wikidata should have the same behavior as statements or even wikitext links to pages or entities that no longer exist.

For example, P132 in this revision https://www.wikidata.org/w/index.php?title=Q64&oldid=91705947

Just because the property doesn't exist any more, we are not hiding the data. We don't want to hide history and how the data of the current item evolved; that just makes things confusing.
The same should probably apply to terms in a fingerprint; we just need to define a uniform way to display terms in a language that is no longer supported.

@Addshore agreed that we should ultimately not hide data in older revisions. There is T225789 for handling that behavior based on what @Lea_WMDE and I discussed.

@WMDE-leszek 100% agree. I actually was just asking out of curiosity (whether that small change would fix the issues mentioned here). We should definitely go down the route of having Wikibase manage a layer of supported languages on top of the language providers that it integrates with.

We just spoke in the office and it seems that the decision (as I heard it from @WMDE-leszek and @Lea_WMDE) is to remove currently invalid languages even from historic revisions. We won't change the database, but we will stop showing any invalid language, e.g. in API requests, in the UI (perhaps in non-XML dumps?), in Special:EntityData, in ttl, etc.

They will perhaps be reintroduced after T225789 is done, once it has been decided what we will do.

alaa_wmde raised the priority of this task from Normal to High. Aug 20 2019, 2:52 PM

Clarification: "remove" as in: do not show, do not allow to read, write, change, etc. No altering of the "stored" data of historic revisions is intended here.

@WMDE-leszek just wanted to check that this story only applies to labels, descriptions and aliases.

Or do we also need to do something for monolingual texts in invalid languages?
What about Lexemes or EntitySchemas?

Fair point, will adjust the description.

@WMDE-leszek just wanted to check that this story only applies to labels, descriptions and aliases.

Only labels, descriptions, and aliases here.

Or do we also need to do something for monolingual texts in invalid languages?

out of scope here

What about Lexemes or EntitySchemas?

ditto

Great! That's what we thought in the room.

I guess there was an intention at task breakdown time of having other subtasks of this? E.g. to remove invalid languages from all the places we can think of, if that doesn't come for free from what we do for termbox, e.g. in the ttl/json etc. I won't go ahead and make those unless it turns out we need them (i.e. hoping that it comes for free with the termbox solution).

WMDE-leszek updated the task description. Aug 21 2019, 7:31 AM

Wikidata gets the language codes to display inside the collapsed view from the ULS method getFrequentLanguageList(), which utilizes country code languages, among other things. There is a separate task (T222790) related to this one, where Language-Team was asked to clarify the expected behavior around languages with redirects. Returning the target of language redirect codes was introduced as a fix for this ticket, in 76551ed4a7fccbaf87cd850674406ae316f4f956.

In the past, editors from Serbia would either not see Serbian as an option at all, or see variant codes for Cyrillic and Latin (T121747). This was mitigated if you had the sr language code in your list of previously selected languages in ULS. Now, after the patch, sr redirects to sr-cyrl and we end up with options to enter entity details (label, description, aliases) in both Cyrillic and Latin. That means we no longer have the ability to work around the limitation for Serbian.


When you try inserting any data for those two language codes, you get an error message: Could not save due to an error. Unrecognized value for parameter "language": sr-cyrl.
In the past, there was an option to expand the list of languages and show all of them. Now, we see 4 selected languages and have the option to see all others which have a label entered. That means new entries cannot be added in Serbian.

Do you have statistics about the number of edits per language on Wikidata? Did contributions in Serbian drop significantly in the past month? Are there other ways of entering Wikidata labels that I don't know of?

Change 532276 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/UniversalLanguageSelector@master] Revert "Return target of redirect languages in mw.uls.getFrequentLanguageList"

https://gerrit.wikimedia.org/r/532276

I reverted the patch for now (the revert is not merged yet).

Change 532276 merged by jenkins-bot:
[mediawiki/extensions/UniversalLanguageSelector@master] Revert "Return target of redirect languages in mw.uls.getFrequentLanguageList"

https://gerrit.wikimedia.org/r/532276

Change 532341 had a related patch set uploaded (by Alaa Sarhan; owner: Ladsgroup):
[mediawiki/extensions/UniversalLanguageSelector@wmf/1.34.0-wmf.19] Revert "Return target of redirect languages in mw.uls.getFrequentLanguageList"

https://gerrit.wikimedia.org/r/532341

This comment was removed by alaa_wmde.

Change 532341 merged by jenkins-bot:
[mediawiki/extensions/UniversalLanguageSelector@wmf/1.34.0-wmf.19] Revert "Return target of redirect languages in mw.uls.getFrequentLanguageList"

https://gerrit.wikimedia.org/r/532341

Mentioned in SAL (#wikimedia-operations) [2019-08-26T11:34:25Z] <ladsgroup@deploy1001> Synchronized php-1.34.0-wmf.19/extensions/UniversalLanguageSelector: SWAT: [[gerrit:532341|Revert "Return target of redirect languages in mw.uls.getFrequentLanguageList" (T217770 T121747)]] (duration: 00m 46s)

I confirm that adding new labels for Serbian is possible again. Like I wrote earlier, one needs to have sr in their list of previous ULS languages.