Add termbox language code mul to reduce redundancy in Wikidata Labels and Aliases
Open, High | Public | Goal

Description

Milestones:

User story:
As a Wikidata editor,
I want to avoid repeating identical labels in hundreds of languages
in order to reduce the amount of redundant content that needs to be maintained on Wikidata.

Problem:
We have many labels that are in principle identical across different languages (see examples section). This has some bad consequences:

  • editors having to create and maintain redundant content (copying the same thing to most/all languages creates massive amounts of edits and is a huge waste of resources)
  • the need to store redundant information, which burdens our systems (e.g. the Query Service)

Solution:
Introduce a new language code that all languages fall back to. This will be particularly helpful for Unicode characters, Scientific articles, and Codes as well as for Names in Latin script (as we do not have an elaborate fallback system for that script yet). We will test if this solution (only one new language code) is good enough, or if we need more specific language codes after all to model a useful fallback chain.

This task

  • Adding "mul" as a new monolingual language code.
  • Have other languages fall back to it (Translatewiki fallback chain > "mul" > "en")

Community takes over

  • Community creates guidelines and help pages on how to use the new code, e.g.
    • What if one Latin-script language prefers one form (e.g. "Philip L. Brown") and another Latin-script language prefers another form (e.g. "Philip Larry Brown" or "Philip Brown")?
    • In what cases should the Latin-language label be used for "mul" instead of the native label (while still making sure that re-users can identify the native label via property)?
    • etc.
  • Community gives feedback after some months about how the new code and guidelines work
    • Based on the feedback we might iterate on the approach if necessary.

Ideas for the future

  • start to show a warning if someone wants to add the mul-label in a different language
  • include the experience in a possible future solution for multilingual descriptions (Abstract Descriptions)
  • re-evaluate if the final fallback to “en” is still appropriate

Mockup:

image.png (537×1 px, 170 KB)

Examples:
This will be useful in many different places:

Names

Unicode characters

Codes

Scientific articles

Translatewiki fallback chain:

Examples:
ami > zh-tw, zh-hant, zh-hans
zh-tw > zh-hant, zh-hans
zh-hant > zh-hans
zh-hans > []

de-at > de
de > []

en-gb > en
en > []

Hard-coded fallback chain:

old

  • Translatewiki fallback chain > "en"

new

  • Translatewiki fallback chain > "mul" > "en"
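
A minimal sketch of how the old and new chains differ, using the Translatewiki examples listed above. The names `TRANSLATEWIKI_FALLBACKS` and `effective_chain` are hypothetical and this is not Wikibase code. The sketch assumes that the requested language comes first and that a code already present in the explicit chain is not appended a second time.

```python
# Hypothetical illustration; not the actual Wikibase implementation.
# Explicit fallbacks copied from the Translatewiki examples above.
TRANSLATEWIKI_FALLBACKS = {
    "ami": ["zh-tw", "zh-hant", "zh-hans"],
    "zh-tw": ["zh-hant", "zh-hans"],
    "zh-hant": ["zh-hans"],
    "zh-hans": [],
    "de-at": ["de"],
    "de": [],
    "en-gb": ["en"],
    "en": [],
}


def effective_chain(lang, use_mul=True):
    """Return the language itself, its explicit fallbacks, then the hard-coded tail."""
    chain = [lang] + TRANSLATEWIKI_FALLBACKS.get(lang, [])
    tail = ["mul", "en"] if use_mul else ["en"]
    for code in tail:
        if code not in chain:  # assumption: never list a code twice
            chain.append(code)
    return chain


print(effective_chain("de-at", use_mul=False))  # old: ['de-at', 'de', 'en']
print(effective_chain("de-at"))                 # new: ['de-at', 'de', 'mul', 'en']
```

Under these assumptions a language with an explicit fallback to "en", such as "en-gb", ends up with "mul" after "en", because "en" is already part of its Translatewiki chain.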

Community communication:

  • The interested Community needs to be aware of the new code and of the necessity to create guidelines and help pages on how to use it.
  • We need to be available for the Community when they create guidelines and to collect feedback.

Original:
This task is to add support for a "mul" language code for labels and aliases. For any benefits of this code to be properly reaped, all language codes should ultimately fall back to "mul"—which I believe would be achieved by adding it as a fallback for the "en" code.

(If it is more desirable, codes for "mul-latn", "mul-cyrl", etc. could be created, in which case e.g. only those codes using the Latin script would fall back to "mul-latn".)

Possibly related tasks: T258242 T256003 T43807


Event Timeline

For clarity's sake, maybe we should hold discussions about editorial choices on Wikidata directly. Also, langcom members should state that they are expressing the langcom view and reference the policy or procedure they are basing their argument on. It may also be preferable if a user who is a langcom member and requests a new code doesn't express any langcom view on that code. Otherwise, it may seem that the process is somewhat random or arbitrary.

At this stage, it seems clear that:

  • the request for the new code is complete with samples
  • langcom is aware of the request and hasn't raised any concerns within the scope of their review activity

To clarify, this is not Langcom approval. What I'm saying is that it's not relevant to the usual kind of thing that Langcom is doing. Precisely because of this, it's the product developers and designers that should review this and not Langcom. By any account, this is not a request for a language, but for a special case.

So langcom review isn't needed here? Ok.

A reason we started adding identical labels to different languages for family name / given name items is to avoid that users try to translate the label, e.g. family name "Taylor"@en shouldn't have "Tailleur"@fr "Schneider"@de. In languages with different scripts, transliterations are ok.

For these items, if there is a constraint, it should definitely not suggest adding different labels.

So if identical labels were to be deleted (or no longer added), there should be a feature to prevent (re-)addition of any label to the languages.

A solution I proposed to this once was monostring_item (2015). Not sure how that could practically work out today though.

While I've undone edits on the last three of Nikki's examples (since those examples are frankly quite ridiculous), I otherwise agree with what is stated in that comment.

Why? In absence of the proposed mul language code, having the same label duplicated in all languages that use it is the expected behaviour.

> Why? In absence of the proposed mul language code, having the same label duplicated in all languages that use it is the expected behaviour.

The assumption with the original edits appears to have been that all languages used those three Latin-script text values when referring to the objects described by those items. My undoings to those examples are an attempt to dispel that assumption.

The following is a comment made in another forum regarding this ticket by User:Nikki, who has allowed me to repost it here after some copy-editing in good faith:

Regarding the fallback chain: English should fall back to mul too (as Mahir originally wrote), or otherwise we would have to duplicate everything from mul under English as well (and having everything except English fall back to mul has an icky "English is special" vibe).

Regarding which script subtags to add: I think it would make sense to start with only mul and revisit whether (and which) script-specific codes would be useful later. I think there is a clear use case for a script-independent code which applies to any language (e.g. all the examples I provided are things which are by definition that string regardless of language—some use Latin characters but they're still valid for all languages, e.g. the ISO country code for Switzerland is still "CH" in Arabic or Russian, in the same way that the symbol for pi is π even in English) but it's less clear how useful individual script-specific codes would be.

I did a bit of analysis and there are 521 language codes usable for labels. 343 are for Latin, 55 for Cyrillic, 34 for Arabic, 16 for Devanagari, 10 for Traditional Chinese and 7 for Simplified Chinese. The other 36 scripts are each associated with fewer than 5 language codes. About 2/3 of all the language codes are thus for Latin.

Regarding mul-latn: I'm not sure what the point of having both mul and mul-latn would be. Doesn't that imply that there's a situation where mul would be different from mul-latn and that people would be specifically requesting mul and nothing else (since anything else would fall back to English and mul-latn first)?

Regarding mul-cyrl and mul-arab: in theory there's enough language codes using those scripts that mul might make sense, but I looked up a bunch of leaders of countries which use Cyrillic or Arabic script (as items I thought would be likely to have pages in those languages) and the majority only had 3-5 identical Cyrillic or Arabic labels, so I haven't yet found evidence that it's common for a lot of languages to share the same Cyrillic or Arabic label in practice.

Regarding mul-deva: I didn't look into Devanagari, but I would be very surprised if the situation there is any different from mul-cyrl and mul-arab.

Regarding mul-hans and mul-hant: I don't think there's any obvious benefit to having those. All of the Simplified/Traditional Chinese language codes are for languages in the zh macrolanguage. zh-classical and zh-yue shouldn't be used—a bot already replaces them with lzh and yue. zh, zh-cn, zh-my, zh-sg, gan-hans and wuu already fall back to zh-hans and gan, gan-hant, lzh, zh-hk, zh-mo and zh-tw already fall back to zh-hant. Only nan-hani and yue don't have any fallbacks defined…I don't know why yue doesn't, but nan-hani is a code added for Wikidata and should really fall back to nan or zh-hant. Either way, zh-hans/zh-hant are approximately equal to mul-hans/mul-hant and even if we define a distinction between them, I'm very sceptical that we could maintain a distinction.

Interesting data. The question is what conclusions to draw from it for practical use cases at Wikidata.

Initially the property "unit symbol" had string datatype, possibly because of a similar reasoning as the above. (Latin script) unit symbols were deemed useful for any language.

After property proposal for what would be "unit symbol" (P5061), this was changed to monolingual string as the units for languages written in Cyrillic don't use Latin script symbols. Eventually, we would have "t"@en for all Latin script languages and "т"@ru, "т"@uk (sample from Q191118).

If "mul" will be "mul-latn", as other languages won't generally have a use for "mul", the question is whether "mul" should include non-Latin script text at all.

  • If it does include non-Latin script text, it may include strings that are not useful for Latin script languages.
  • If it shouldn't include non-Latin script text, it's not clear what the use should be for languages that aren't written in Latin script if any. Also, it may be similar to the current use of "en" and it's not entirely clear how to explain to contributors that "mul" isn't "mul-cyrl".

The conclusion for the use cases above may be that we should create "mul-latn" only.

Maybe some insight could be gained from languages where we currently have three codes, e.g. sr, sr-Latn, sr-Cyr where "sr" can contain either. I wonder how and where it's actually used.

@Lucas_Werkmeister_WMDE and I read through everything and spent some more time thinking this through. Thank you everyone for providing all the input, especially @Nikki. That really helped.
One thing still to clarify for the discussion (which I also wasn't aware of): the language fallback chains do not have to make up a tidy tree. They can (and in places do) have cycles, and a language can have several fallback languages. For example, avk falls back to fr, es, ru; mdf and myv fall back to one another (before both fall back to ru).
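
Because the fallback relation is a graph rather than a tree, anything that expands it has to guard against visiting a language twice. A minimal sketch, using only the fallbacks named above. The function name `expand` and the breadth-first ordering are assumptions for illustration, and the ordering actually used by MediaWiki or Wikibase may differ.

```python
from collections import deque

# Only the fallbacks mentioned in the comment above; everything else is omitted.
FALLBACKS = {
    "avk": ["fr", "es", "ru"],
    "mdf": ["myv", "ru"],
    "myv": ["mdf", "ru"],
}


def expand(lang):
    """Expand fallbacks breadth-first, skipping languages already visited."""
    seen = [lang]
    queue = deque(FALLBACKS.get(lang, []))
    while queue:
        code = queue.popleft()
        if code in seen:  # this check is what makes the mdf <-> myv cycle terminate
            continue
        seen.append(code)
        queue.extend(FALLBACKS.get(code, []))
    return seen


print(expand("mdf"))  # ['mdf', 'myv', 'ru']
print(expand("myv"))  # ['myv', 'mdf', 'ru']
print(expand("avk"))  # ['avk', 'fr', 'es', 'ru']
```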

Based on @Nikki's analysis it makes a lot of sense to us to only do mul for now and see how that goes. Then the remaining question is about the exact fallbacks we want. The options we see right now are:

  • the fallback to mul happens after the fallback to en at the very end of all fallback chains (example chain de-at -> de -> en -> mul)
  • the fallback to mul happens before the implicit fallback to en but after any explicit fallback (even explicit fallback to en from e.g en-gb) (example: en-gb -> en -> mul or de-at -> de -> mul -> en)

The question boils down to if for non-English languages, a fallback to e.g. Amphispiza bilineata is preferable to “Black-throated Sparrow” or not.
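
The difference between the two options can be sketched as a first-match lookup along the chain. The item data is an assumption for illustration: an item with a "mul" label and an "en" label but no German label. The helper `first_label` is a hypothetical name.

```python
def first_label(labels, chain):
    """Return the first label found along the chain, or None if there is none."""
    for code in chain:
        if code in labels:
            return labels[code]
    return None


# Assumed item data: no "de" or "de-at" label exists.
labels = {"mul": "Amphispiza bilineata", "en": "Black-throated Sparrow"}

option_1 = ["de-at", "de", "en", "mul"]  # mul at the very end
option_2 = ["de-at", "de", "mul", "en"]  # mul before the implicit en

print(first_label(labels, option_1))  # Black-throated Sparrow
print(first_label(labels, option_2))  # Amphispiza bilineata
```

In both options a chain with an explicit English fallback, such as en-gb -> en -> mul, still reaches the English label first.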

Do you have thoughts on these options?

I think "the fallback to mul happens before the implicit fallback to en but after any explicit fallback" is preferable (since I think Amphispiza bilineata is preferable to “Black-throated Sparrow”).

Just one more point (probably already expressed): I think that

  • when "mul" has a certain X label, it should be technically impossible to add the same X label to any other language code
  • when "mul" has a certain X label and another language code, e.g. "de", has a Y label, it should be technically impossible to add the alias X to "de" (= in general, if a language code has a label different from "mul" label, the "mul" label should be automatically considered an alias for this language)
  • when a certain label X gets added to "mul", all identical X labels should be removed from other language codes
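
The three rules above could be sketched as follows. This is an illustration of the proposal only, not existing Wikibase behaviour; the class `Terms` and its methods are hypothetical. As a simplification, the alias rule is applied to every non-"mul" language, whether or not it has its own label. The sample labels reuse the "Taylor" and "Tailleur" strings from an earlier comment.

```python
class Terms:
    """Labels and aliases of one item, with the proposed mul rules applied."""

    def __init__(self):
        self.labels = {}   # language code -> label
        self.aliases = {}  # language code -> set of aliases

    def set_label(self, lang, text):
        if lang == "mul":
            self.labels["mul"] = text
            # Rule 3: remove labels that merely repeat the new mul label.
            duplicates = [c for c, v in self.labels.items() if c != "mul" and v == text]
            for code in duplicates:
                del self.labels[code]
            return
        # Rule 1: a label identical to the mul label cannot be added elsewhere.
        if self.labels.get("mul") == text:
            raise ValueError(f"{text!r} is already the mul label")
        self.labels[lang] = text

    def add_alias(self, lang, text):
        # Rule 2: the mul label already counts as an alias for every language.
        if lang != "mul" and self.labels.get("mul") == text:
            raise ValueError(f"{text!r} is already implied by the mul label")
        self.aliases.setdefault(lang, set()).add(text)


terms = Terms()
terms.set_label("de", "Taylor")
terms.set_label("fr", "Tailleur")
terms.set_label("mul", "Taylor")  # removes the identical "de" label
print(terms.labels)               # {'fr': 'Tailleur', 'mul': 'Taylor'}
```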

@Epidosis What should happen when mul is in Cyrillic? You seem to assume mul-latn.

I'm somewhat worried that all these deletions will lead to the people adding "Giovanni"@it to "John"@en and we are back to where we were years ago.

If mul is in Cyrillic (e.g. it will probably be in https://www.wikidata.org/wiki/Q29652874), it will happen as follows in my opinion:

  • "mul" label = Афанасий
  • if a language code has no label, it will default to "Афанасий"
  • if a language code has a transliterated label (e.g. "en" label = Afanasy), for this language code one of the aliases is by default "Афанасий"

This seems fine to me.
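
A minimal sketch of the behaviour described in this comment, with the "mul" label acting as the default label and as an implicit alias. The function `display_terms` is a hypothetical name, and the sketch ignores the rest of the fallback chain for brevity.

```python
def display_terms(labels, aliases, lang):
    """Return (label, aliases) for lang, treating the mul label as described above."""
    mul_label = labels.get("mul")
    label = labels.get(lang, mul_label)  # no own label: default to the mul label
    shown = set(aliases.get(lang, ()))
    if mul_label is not None and label != mul_label:
        shown.add(mul_label)             # own label differs: mul label is an alias
    return label, sorted(shown)


labels = {"mul": "Афанасий", "en": "Afanasy"}

print(display_terms(labels, {}, "en"))  # ('Afanasy', ['Афанасий'])
print(display_terms(labels, {}, "bg"))  # ('Афанасий', [])
```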

Regarding deletions: of course there might be some misuses, but in fact I think that deleting a lot of duplicate data in labels (and aliases, in my opinion) is the reason for which "mul" is most useful, so in my opinion it is worth trying.

@Epidosis for that sample, it seems clear-cut, but for Q191118 one could imagine Cyrillic "т" as "mul" label.

Interesting question about the fall-back chain from Lydia. Maybe we should give more thought about the language code applicable to taxon names. Its code could then be part of the fall-back chain and the value displayed when available. There is some discussion on Wikidata what the code should be. Let's assume it's "la-x-taxon". We could have languages fall back to that and it could easily be read.

Personally, I'd rather see a feature that prevents re-additions for items with specific types (P31=..) before we experiment with deletions. Maybe some sort of protection setting could work too. Sample: if there is mul-latn, there shouldn't be en, it, es, ... This would work for name items, but also disambiguation items, possibly others.

In a case like https://www.wikidata.org/wiki/Q191118, I tend to think a general label is not applicable (each language has its name for "tonne"); having "mul-lat" and "mul-cyr", I would set "t" as "mul-lat" alias and "т" as "mul-cyr" alias; having only "mul", I would set both "t" and "т" as "mul" aliases. Anyway, it is reasonable to start applying "mul" (or similar) with specific types of items (I tried to list them above; starting with some of them, as disambiguations, could work well).

The idea of an apposite code for Latin name of taxa seems very interesting to me.

There are at least 3 categories of items which strongly need this:

  1. persons (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q5, as of now 9.2M): in most cases the same label and the same aliases are repeated in different languages (e.g. in wikidata.org/w/index.php?title=Q19667413&action=history I can count 6 same-label additions: fr, nl, sl, ca, ast, sq; many other items are similar)
    • in the case of people, "mul-<script>" is required: names are the same only considering languages with the same alphabet, I'm mostly thinking about Latin alphabet
    • in some cases there could be the following problem: one Latin-script language may prefer one form (e.g. "Philip L. Brown"), another Latin-script language another form (e.g. "Philip Larry Brown" or "Philip Brown"); while the group of labels and aliases is the same for all same-script languages, which is the label and which is the alias may vary from language to language; of course, this problem occurs only when there is more than one form of the name, but in many cases this doesn't happen
  2. given names and family names (https://w.wiki/3zWT, which counts Q202444 and Q101352 including subclasses, as of now 590k): in all cases the same labels are repeated in different same-script languages (e.g. https://www.wikidata.org/wiki/Q21448867)
  3. scientific articles (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q13442814, as of now 37.3M): in most cases the same label is repeated in different languages (e.g. https://www.wikidata.org/wiki/Q27860672)

Considering also

  1. asteroids (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q3863, as of now 247k)
  2. galaxies (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q318, as of now 2.1M)
  3. taxa (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q16521, as of now 3.1M)

and still leaving out disambiguation pages (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q4167410, as of now 1.3M), we obtain 52.6M items, out of 94.9M items, and probably the count could be further increased.
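
The total can be checked from the rounded counts quoted in the two lists above. Summing the rounded figures gives about 52.5M, so the quoted 52.6M presumably comes from the unrounded counts. Either way the six categories cover roughly 55% of all items.

```python
# Rounded counts in millions, as quoted in the lists above.
counts_millions = {
    "persons": 9.2,
    "given and family names": 0.59,
    "scientific articles": 37.3,
    "asteroids": 0.247,
    "galaxies": 2.1,
    "taxa": 3.1,
}
all_items_millions = 94.9

total = sum(counts_millions.values())
share = total / all_items_millions

print(f"{total:.1f}M items, {share:.0%} of all items")  # 52.5M items, 55% of all items
```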

Ideally, as I noted, "mul" and "mul-<script>" should allow exceptions in some way, but their necessity is very clear. I would suggest at least medium priority.

@Epidosis

The problem with "mul" is that we couldn't easily have the software test if there should be ru/uk label (as for "John"), but no la/fr/en label on items.

In other words, when we have "mul-latn" and name item, we could avoid adding la/fr/en, but not ru/uk.

I don't mind testing it with disambiguation pages, but in general, I still think T139912 is worthwhile. Maybe asteroids could do (not too many).

https://www.wikidata.org/wiki/Q59238742 might be tricky. @en and @mul-latn would have the same label (unless we decide that such items should only have @it and @en)

In any case, I like the idea that all these items would get much lighter.

OTOH, name items only need one label and transliterations could go into statements. So the setting would be "mul" + name > no other labels.

@Esc3300 please refrain from editing the task description while the discussion is ongoing. It is inappropriate to masquerade your personal opinion as the hard-won consensus that we are trying to achieve here.

Mahir256 removed a subscriber: Esc3300.
Mahir256 added a subscriber: Esc3300.
This comment was removed by Mahir256.

@LucasWerkmeister Can you outline which of the deleted points you consider problematic? Lydia generally wants the summary in task descriptions to be up to date. If you just remove stuff, we are missing out on information.

@Lucas_Werkmeister_WMDE that should have been. I removed Mahir256 as they seem to be doing nonsensical edits to subscribers.

Odd that some of the known issues shouldn't be stated in the task description.

> @Lucas_Werkmeister_WMDE that should have been. I removed Mahir256 as they seem to be doing nonsensical edits to subscribers.

The presence of the user I am removing—not just here, but in any other discussion forum really—has made @Nikki (and others, both actually and potentially) sufficiently uncomfortable directly opining here and in those other fora that others like myself relay their opinions here for them. Unless that user wishes to impugn the credibility or emotional strength of Nikki and those other individuals, I contend that my actions in this regard are entirely sensical.

Changes for task description are:

| Before my edit | After my edit (later deleted without an explanation or justification) | Reason |
| --- | --- | --- |
| editors having to create and maintain redundant content (copying the same thing to most/all languages creates massive amounts of edits and is a huge waste of resources) | editors having to create and maintain redundant content (copying the same thing to most/all languages could create massive amounts of edits and is a huge waste of resources) | all descriptions and labels can be added in a single edit |
| Problem | Problem | [header repeated for clarity] |
| | users tend to fill in empty label fields, especially when a description in the language is present | |
| | empty label fields may result in suboptimal string additions | It should be easy to find diffs for such edits on name items. |
| | fall-back is generally ill understood | people wouldn't fill in labels if the fallback was understood |
| Example | Example | [header repeated/expanded for clarity] |
| persons (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q5, as of now 9.2M) have in most cases the same label and the same aliases repeated in different languages, e.g. https://www.wikidata.org/wiki/Q42 | persons (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q5, as of now 9.2M) have in most cases the same label and the same aliases repeated in different languages, e.g. https://www.wikidata.org/wiki/Q42 Labels generally differ by script (Latin script and all others) | |
| given names and family names (https://w.wiki/3zWT, which counts Q202444 and Q101352 including subclasses, as of now 590k): in all cases the same labels are repeated in different same-script languages, e.g. https://www.wikidata.org/wiki/Q21448867. | given names and family names (https://w.wiki/3zWT, which counts Q202444 and Q101352 including subclasses, as of now 590k): in all cases the same labels are repeated in different same-script languages, e.g. https://www.wikidata.org/wiki/Q21448867 . This is to avoid translations being added (e.g. "John"@en and "Giovanni"@it shouldn't be on the same item). | |
| taxa (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q16521, as of now 3.1M) the species "Neotrogla curvata" - has "Neotrogla curvata" as the label 411 times. | taxa (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q16521, as of now 3.1M) the species "Neotrogla curvata" - has "Neotrogla curvata" as the label 411 times. Latinized names should be generally available as fallback. | |
| Codes | Codes and abbreviations | [header repeated for clarity] |
| | metric ton - should have "t" as alias in Latin script languages, "т" as alias for Cyrillic languages | is there an issue with this sample? |
| Scientific articles | Scientific articles | [header repeated for clarity] |
| (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q13442814, as of now 42M): in many cases the same label is repeated in different languages (e.g. https://www.wikidata.org/wiki/Q27860672). In some cases, there could be articles with parallel titles in different languages (e.g. https://www.wikidata.org/wiki/Q59238742) | (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q13442814, as of now 42M): in many cases the same label is repeated in different languages (e.g. https://www.wikidata.org/wiki/Q27860672). Generally the original title is available (or a translation to English). Original non-English titles are frequently missing. In some cases, there could be articles with parallel titles in different languages (e.g. https://www.wikidata.org/wiki/Q59238742: one title for @en, one for @it) | |
| Open questions | Open questions | [header repeated for clarity] |
| What are all the mul-<script> codes that we should start with? | What are all the mul-<script> codes that we should start with? mul-latn seems the most frequent | I think everybody agrees about the frequency (except Mahir) |
| | Can items still be found when no label is present in the language? | A developer statement that this won't be an issue should be sufficient to delete it from the task description. @Lucas_Werkmeister_WMDE can you confirm? |
| | Search results are currently (also) ranked by the number of labels, how to ensure ranking still works? | A developer statement that this won't be an issue should be sufficient to delete it from the task description. @Lucas_Werkmeister_WMDE can you confirm? |
| | Should the "mul-latn" label be displayed in a grayed out form when a description is present? | |
| | How will this work in LUA infoboxes? Currently users copy en labels to ca/cs/da/es/nb even when the fallback works. | A developer statement that this won't be an issue should be sufficient to delete it from the task description. @Lucas_Werkmeister_WMDE can you confirm? |
| | How to prevent now-empty label fields from being filled with inappropriate labels (loss of data quality)? | |

Can you explain the remainder of your deletions? Above is what you deleted from the description.

@Esc3300 For clarification: Lucas and I spent a lot of time yesterday on getting everything to a point where we believe it is sensible and the remaining questions are clarified. It'd be good to concentrate the discussion on those remaining points now because otherwise we can not move this forward. As there is a strong desire from several editors to get this done I want to push this to the point where we can actually pick it up.

> please refrain from editing the task description while the discussion is ongoing. It is inappropriate to masquerade your personal opinion as the hard-won consensus that we are trying to achieve here.

This message (especially the latter part) is enough of a reason to undo the changes made to the ticket since Nikki's comments were added, irrespective of what opinions I may hold of any of it (which should not be assumed as was done in the diff that stains this task). The sea lion I am removing from this ticket is also free to impugn Lucas's or Lydia's credibility or emotional strength as well.

> @Esc3300 For clarification: Lucas and I spent a lot of time yesterday on getting everything to a point where we believe it is sensible and the remaining questions are clarified. It'd be good to concentrate the discussion on those remaining points now

Ok. What's the proposal for the various points in how it may backfire? And finally which script do you want to start with?

>> please refrain from editing the task description while the discussion is ongoing. It is inappropriate to masquerade your personal opinion as the hard-won consensus that we are trying to achieve here.

> This message (especially the latter part) is enough of a reason

@Mahir Can you explain which parts the latter part covers? If not, please refrain from making such comments in Phabricator or elsewhere.

For those who would like a clarification,

> please refrain from editing the task description while the discussion is ongoing.

this is the former part of Lucas's message

> It is inappropriate to masquerade your personal opinion as the hard-won consensus that we are trying to achieve here.

and this is the latter part.

(More on the "sea lion" term.)

Apparently there is a disagreement between Lucas and his manager about description editing.

Can you at least explain which parts you consider my personal opinion and which ones are not supported by a consensus (ideally with a link to the relevant discussion)?

There is no disagreement.
We are spending a lot of time discussing things that currently don't move this forward and do not help get to a meaningful consensus. So one final try. We need input on the final remaining discussion points as I laid out in T285156#7384455. Let's please concentrate on those now so that we can then update the task description once we heard everyone.

If this is the only open point, can you summarize how the open points mentioned in the task description have been addressed?

Sure.

  • Could this solution somehow backfire? -> several answers in this thread that we will weigh and see if they warrant any action
  • What are all the mul-<script> codes that we should start with? -> none, we are just going with mul for now as I said in my comment
  • How exactly should be the fallback chain for these mul codes? -> no fallback within the mul codes because we only have one. fallback to and from other languages is in my remaining questions

Can you propose something?

Step #3 mentions constraints. What will they be?

I understand that you are keen to get this done, but compared to other new language codes, we are still moving quite fast. I think none of us wants this to end up in a dead end.

Thank you all for your input on this! We will put this in development right after the no deploy weeks. Special thanks go to @Nikki and @Mahir256, for driving and enlightening this issue, and to @Amire80 and @Epidosis, for your valuable input!

@Esc3300: You also gave helpful input and we appreciate the effort! At the same time, your style of engagement and your continued disagreement with the direction that we took in the deliberation seems to have ultimately led to some demotivating arguments and loops in the discussion. I am sad to see that all of this resulted in a bad climate and a frustrating experience for some of the discussion's participants. It is essential for Lydia and me that - especially for hard decisions like these - we still maintain an open and welcoming climate for all people involved, as well as a worthwhile and productive discussion. This is why we would like to ask you for your help in fostering more open and welcoming discussions that respect our process in the future.

Manuel changed the subtype of this task from "Task" to "Goal". Jul 14 2022, 12:47 PM
Manuel renamed this task from "Add termbox language code mul" to "Add termbox language code mul to reduce redundancy in Wikidata Labels and Aliases". Jul 14 2022, 12:49 PM
Manuel updated the task description.