
[RFC] Unit Localization
Open, High, Public

Description

This RFC gathers input on how units for quantities should be represented in the user interface, where the information needed for that should come from, and how unit input during editing should work.

The status quo is:

  • Quantity values use an arbitrary string to identify the unit of the quantity ("no unit" is represented as "1")
  • Wikibase restricts units to HTTP(S) URIs.

The intention is to allow any Wikidata Items as a unit on Wikidata. Third party installations should be free to:

  • use Wikidata (!) Items
  • use some other vocabulary for Units, like QUDT (as a fixed list in the config).

When formatting quantities with units, we need (at least) two options:

  1. use the full, localized name of the unit ("meter", "ounce")
  2. use the unit symbol ("m", "oz"). These also need to be localizable (e.g. Russian uses км/ч for km/h).
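To make the two options concrete, here is a minimal sketch of a formatter with a switch for the display style. It is not the actual QuantityValueFormatter; `UNIT_TERMS`, `format_quantity` and the `style` parameter are hypothetical names, and the term table is hard-coded only for illustration.

```python
# Hypothetical localization data; a real installation would read this from
# item labels or statements rather than from a hard-coded table.
UNIT_TERMS = {
    "http://www.wikidata.org/entity/Q11573": {
        "name": {"en": "metre", "de": "Meter"},
        "symbol": {"en": "m", "ru": "м"},
    },
}


def format_quantity(amount, unit_uri, lang, style="symbol"):
    """Render an amount with its unit; style is 'name', 'symbol' or 'none'."""
    # "1" is the status-quo marker for "no unit".
    if style == "none" or unit_uri == "1":
        return str(amount)
    terms = UNIT_TERMS.get(unit_uri, {}).get(style, {})
    # Fall back to English, then to the raw URI, when no term is known.
    text = terms.get(lang) or terms.get("en") or unit_uri
    return f"{amount} {text}"
```

With this table, `format_quantity(5, uri, "ru")` uses the Russian symbol, while `style="name"` switches to the full name in the requested language.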

When using Wikidata items, we could use the item's label for the "full" name, and P558 for specifying the "symbol" - but P558 has type "string", so it's not localizable. We would need a multilingual text property. Also, we would have to base this functionality on a user defined property, which seems a bit brittle.

QuantityValueFormatter will need an option that controls if and how the unit is shown. Another option will be needed for unit conversion later.

When inputting a unit, we need to consider (at least) the two modes I mentioned above (Wikidata item or fixed list from config). It would probably be fine to only implement the case we need for Wikidata right now, but it would be great if that were usable by third party installs. This means, I think:

  • Include an EntitySelector in the ValueView for Quantity values (The EntitySelector should pick Wikidata items, not local entities, on third party installs)
  • Make the EntitySelector generate full URIs based on the selected entity
  • Pass the URI from the Unit/EntitySelector to parsevalue as an option (just like we do for the reference globe calendar)
  • Make QuantityValueParser use units passed as options. Make parsing of "inline units" optional (or remove it).
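The flow in the list above could be sketched roughly as follows. This is an illustration only, not the real QuantityValueParser or parsevalue API: `entity_to_uri`, `parse_quantity` and the `allow_inline_unit` flag are hypothetical names, and the URI prefix is assumed to be Wikidata's concept URI base.

```python
import re

# Assumed concept URI prefix for Wikidata items.
ENTITY_BASE = "http://www.wikidata.org/entity/"


def entity_to_uri(entity_id):
    """What the selector would emit for a picked item ID such as 'Q11573'."""
    return ENTITY_BASE + entity_id


def parse_quantity(text, unit=None, allow_inline_unit=False):
    """Return (amount, unit); the unit defaults to "1", meaning "no unit"."""
    match = re.fullmatch(r"\s*([-+]?\d+(?:\.\d+)?)\s*(\S.*?)?\s*", text)
    if match is None:
        raise ValueError(f"not a quantity: {text!r}")
    amount, inline = match.group(1), match.group(2)
    if inline is not None:
        if not allow_inline_unit:
            raise ValueError("inline units are disabled")
        # A unit passed as an option wins over an inline unit.
        if unit is None:
            unit = inline  # raw text; mapping it to a URI is out of scope
    return amount, unit or "1"
```

The point of the sketch is that the unit travels as a separate option, so the text field only has to carry the number; inline units become an opt-in extra.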

The text in the EntitySelector would primarily be the entity label, as usual. However, it would be nice if we could show the unit symbol (see above) alongside the label. It would also be nice to have some indication about whether a unit is convertible or not (i.e. if it's a "standard" unit).

Event Timeline

daniel raised the priority of this task from to Needs Triage.
daniel updated the task description. (Show Details)
daniel subscribed.
Lydia_Pintscher added a project: Wikidata.
Lydia_Pintscher set Security to None.
Lydia_Pintscher added a subscriber: aude.

Also, we would have to base this functionality on a user defined property, which seems a bit brittle.

How is that more brittle than having units as items?

Just some thoughts on that:

In my opinion, the "full" name should be defined by a property as well. Using the label would add an assumption on top of the concept of label, whose original purpose is nothing more than to identify an entity. Additionally, if there are units with alternative names, those would be hard to retrieve from the list of aliases. Now that we are discussing configuring special properties: Will we please convert label, description and aliases to statements then?

The concept explains that third-party installations may use Wikidata items for units. That would require configuration and logic specific to Wikidata. I am pretty sure that could be done in a more generic way. Still, for example, I would actually like to be able to set up and configure entity types using settings. That would allow setting up an entity type "unit" per installation. Having such an entity configured on Wikidata, external entity selectors would not require a special API endpoint but could just query for that entity type. External installations would either use their own "unit" entity type or query Wikidata if no local "unit" type is configured.
I know that would create constraints on entities (which have been forbidden by PM) but defining special statements would just do the same in a more unstructured way.

Regarding UI: I would really like to be able to just enter a string (including value and unit) and have the "system" recognize my input. Having to select a unit using some drop-down is vintage. However, if the recognition fails or the preview is not as expected, the user should be allowed to select a unit from some entity selector derivative. Anyway, I imagine the UI concept being really tricky and, personally, I would not start implementing without having any prototypes (Source: experiences from mostly still unfinished implementations of previous data types).

I know I made this point several times, but I haven't seen it addressed or debunked yet, so I'll continue to do so:

I think we limit ourselves by not allowing us to write PHP code specific for Wikidata. I think we make Wikibase worse by doing so. I think, by doing so, we make Wikidata not as great as it could be. This also leads us to not follow the open/closed principle on a module level. Instead, we introduce a vast amount of configuration options which allow modification of Wikibase while at the same time limiting users to what we imagined and bothered to implement.

From this follows (for Wikibase):

  • QuantityValue parsing above a base line ( explode( ' ', … ) ) should be supplied by Wikidata
  • QuantityValue validation above what Wikibase itself needs to work should be supplied by Wikidata
  • QuantityValue formatting above a base line ( wfMessage( 'wikibase-value-with-unit', $dv->amount, $dv->unit )->escaped() ) should be supplied by Wikidata
  • QuantityValue editing above a base line ( '<input name="valueAndUnit" />' ) should be supplied by Wikidata (if required)
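To show how small such a baseline could be, here is a rough Python rendering of the parsing and formatting baselines. It only mirrors the idea of the PHP snippets above; `parse_baseline`, `format_baseline` and the default pattern "$1 $2" are assumptions for illustration.

```python
def parse_baseline(text):
    """Split 'amount unit' on the first space, roughly like explode(' ', ...)."""
    amount, _, unit = text.strip().partition(" ")
    # "1" stands for "no unit".
    return amount, unit or "1"


def format_baseline(amount, unit, pattern="$1 $2"):
    """Fill a message-style pattern with the amount ($1) and the unit ($2)."""
    if unit == "1":
        return amount
    return pattern.replace("$1", amount).replace("$2", unit)
```

Anything beyond this, such as looking up items or choosing between names and symbols, would be supplied by the installation.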

In terms of Wikidata, I suggest the following:

  • Translatable statements on items (using two specific properties) specify the short and long unit name for displaying in quantity values
  • A statement on items marks them as being a unit (It takes a lot of work to implement filtering on top of this, but validating is simple)

This would lead to the following implementations:

  • WikidataItemQuantityValueUnitParser: fetch unit item (by long unit name propval, short unit name propval, then label), store item URI
  • WikidataItemQuantityValueUnitValidator: fetch unit item (by URI), make sure it is a unit item
  • WikidataItemQuantityValueUnitFormatter: fetch unit item (by URI), show most appropriate name (long / short / label)
  • For editing, Wikibase baseline seems to be enough
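The three implementations could look roughly like this. The sketch uses an in-memory dictionary as a stand-in for fetching items; `ITEMS` and the field names `long_name`, `short_name` and `is_unit` are hypothetical placeholders for the proposed statements.

```python
# Hypothetical stand-in for item lookups; keys are item URIs.
ITEMS = {
    "http://www.wikidata.org/entity/Q11573": {
        "label": "metre",
        "long_name": "metre",
        "short_name": "m",
        "is_unit": True,
    },
    "http://www.wikidata.org/entity/Q42": {"label": "Douglas Adams"},
}


def parse_unit(text):
    """Find an item by long name, then short name, then label; return its URI."""
    for field in ("long_name", "short_name", "label"):
        for uri, item in ITEMS.items():
            if item.get(field) == text:
                return uri
    raise ValueError(f"unknown unit: {text!r}")


def validate_unit(uri):
    """Accept only items that carry the 'is a unit' marker."""
    return bool(ITEMS.get(uri, {}).get("is_unit"))


def format_unit(uri, prefer="long_name"):
    """Show the preferred name, falling back to the other names, then the URI."""
    item = ITEMS.get(uri, {})
    for field in (prefer, "long_name", "short_name", "label"):
        if item.get(field):
            return item[field]
    return uri
```

Note that parsing alone would accept any item with a matching label; it is the validator that rejects items not marked as units.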

In my opinion, specifically developing for Wikidata would be opening Pandora's box. We already have rather awkward Wikidata-specifics in the code (Calendar types reference Wikidata items, for example) and, from my point of view, dealing with gadgets is already a critical aspect (we should, for example, find a way to reflect the AuthorityControl gadget's functionality in Wikibase core as soon as possible which, actually, points in a similar direction). I am not sure whether every third party would like to access Wikidata (or any web service at all) for running their Wikibase installation and I remain advocating having an independent and flexible stand-alone software. When configuring Wikidata, why should those configuration options and the corresponding logic be separated from Wikibase? That points in the direction of having a Wikidata release and a Wikibase release, eventually.

The idea of localized units is great; however, I wonder about the following. Let's say we express quantities in all kinds of different units, so if we have a property for the area of a country, one country would have it in square meters, another in square kilometers, yet another in square feet or hectares. Now we want to see which country has the biggest area. That would be quite a challenge. So if we want Wikidata to be a machine-friendly data collection that can be processed by automated tools, how do we ensure that quantities of the same kind have units that can be handled together automatically and efficiently? That is, even if we know how to convert square kilometers to square feet and vice versa, how do we efficiently index such things?
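One possible answer, sketched here only to make the question concrete, is to normalize every value to a single base unit before comparing or indexing. The table and function names are hypothetical, and units are keyed by plain names rather than item URIs to keep the example short.

```python
from decimal import Decimal

# Conversion factors to square metres; each is exact by definition.
TO_SQUARE_METRES = {
    "square metre": Decimal("1"),
    "square kilometre": Decimal("1000000"),
    "hectare": Decimal("10000"),
    "square foot": Decimal("0.09290304"),  # (0.3048 m) squared
}


def normalize_area(amount, unit):
    """Convert an area given as a decimal string to square metres."""
    return Decimal(amount) * TO_SQUARE_METRES[unit]


def largest(areas):
    """areas maps a name to (amount, unit); return the name with the largest area."""
    return max(areas, key=lambda name: normalize_area(*areas[name]))
```

An index could then store the normalized value next to the value as entered, so that sorting and range queries never have to convert on the fly.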

In my opinion, specifically developing for Wikidata would be opening Pandora's box. We already have rather awkward Wikidata-specifics in the code (Calendar types reference Wikidata items, for example) and, from my point of view, dealing with gadgets is already a critical aspect (we should, for example, find a way to reflect the AuthorityControl gadget's functionality in Wikibase core as soon as possible which, actually, points in a similar direction). I am not sure whether every third party would like to access Wikidata (or any web service at all) for running their Wikibase installation and I remain advocating having an independent and flexible stand-alone software. When configuring Wikidata, why should those configuration options and the corresponding logic be separated from Wikibase? That points in the direction of having a Wikidata release and a Wikibase release, eventually.

Just to be clear: I want to have Wikibase independent of Wikidata. I think my proposal is an improvement in that regard, since we could move Wikidata-specific code out of Wikibase into that Wikidata code base. For example, we would not have to implement AuthorityControl in WikibaseRepo; instead, we could make Wikibase flexible enough to allow Wikidata to implement AuthorityControl in a sane way in PHP and JS.

Will we please convert label, description and aliases to statements then?

Yes, good idea. Although that would be a quite involved change, so it is not likely. Converting site links to statements would be a bit more difficult.

That would allow setting up an entity type "unit" per installation. Having such an entity configured on Wikidata, external entity selectors would not require a special API endpoint but could just query for that entity type. External installations would either use their own "unit" entity type or query Wikidata if no local "unit" type is configured.
I know that would create constraints on entities (which have been forbidden by PM) but defining special statements would just do the same in a more unstructured way.

You could still use an item as unit that does not have those special statements.

For example, we would not have to implement AuthorityControl in WikibaseRepo; instead, we could make Wikibase flexible enough to allow Wikidata to implement AuthorityControl in a sane way in PHP and JS.

How is adding links based on values of certain properties to external URLs specific to Wikidata? Except which property on properties is used to do the mapping, which is configuration.

@JanZerebecki If you just want to printf a URL from a string value, there's not much to it. However, there's so much more we can do: validate the input, provide specific input methods, add a preview to the output (for example for commons images), …

@adrianheine All these examples are not specific to Wikidata but would work differently for different targets (no matter if the statement is on Wikidata or somewhere else) and for the same target like e.g. Commons it would be nice to have it work even on Wikibase installations outside of the WMF cluster. But yes it is probably wise to have such a thing as some form of components to be able to add other targets, without need to merge them in the same git repository.

Outcomes of the discussion with @daniel @JanZerebecki @Lydia_Pintscher @adrianheine and @thiemowmde

First use case: Units are Wikidata Items
AGREEMENT: use URIs, no free-text units for now (Units are represented as URIs internally (usually URIs of Wikidata items))
AGREEMENT: no new entity-type for units
Example:
item for “meter” (Q11573) with labels (“metre”), etc., and it has a statement with a special property “unit symbol” (P558, currently string, should later be multilingual)
AGREEMENT: put value of “unit symbol” into terms table (handling for special property should not be tangled to Wikibase too much)
NO AGREEMENT: have separate module for dealing with units VS. have it in Wikibase -> further discussion needed

Question: link unit-symbol to unit-item?
Via terms table? Via PropertySuggester?

comment by @aude: neither seems very nice, but don’t have another suggestion right now and perhaps the question isn’t very clear to me.

Question: where to put formatting rules?
first version: one rule per language (format string in message file, e.g. “$1 $2”)

comment by @aude: the messages (i18n) system is well-suited for this (and think also performance-wise).

Question: How to render units?
e.g. m/s vs. <math>\frac{m}{s}</math>: have a simple version (= Unicode string) first

AGREEMENT: PropertySuggester needs to be adjusted/extended.

comment by @aude: details please...

Question: Requirements for a new derivation of the entity selector:
should return URIs
selector should be able to deal with non-local sources

Would be awesome to document the reasoning behind the agreements. So, instead of flexible entity types, there will be "special" properties? As that is a very fundamental decision, it should be evaluated properly as to other use cases (spontaneously, T74524 comes to my mind).

Btw. there are already special items configured in: $wgWBClientSettings['badgeClassNames']

Jonas renamed this task from RFC: Unit Localization to [RFC] Unit Localization. Aug 28 2015, 2:07 PM

I'm not sure how many units this is supposed to cover, but CLDR provides localisation for units: http://www.unicode.org/cldr/charts/latest/by_type/index.html (Measurement Systems | Duration | Length | Area | Volume | Speed and Acceleration | Mass and Weight | Energy and Power | Electrical and Frequency | Weather | Digital | Coordinates | Other Units | Compound Units).

Maybe worth trying to get a bot that would import at least the obvious cases from CLDR?

Why a bot? We import CLDR data in the code, usually. Mostly with the cldr extension.

I mean, if you desire to keep the units on the wiki you could for instance add a parser function for value and unit rendering to the CLDR extension (and allow using wikitext in labels or whatever). There are many templates to handle this rather tedious task, so it would be used anyway.

Or you could add some specific property type or something, that Wikibase will then handle in a special case using CLDR data.

As a pet project I've been very slowly migrating grammar data from PHP and JS code to generic JSON: T115217. (I haven't touched it in a few months, and thanks to @Smalyshev's email from today I recalled it.)

Since the strings for unit names are not MediaWiki messages, but data, the grammar transformations for unit names cannot be in MediaWiki itself and probably not in Wikibase either, but maybe the JSON format can be added as data to the Wikidata installation, and the code that processes them in MediaWiki core can be used for actual display. It's not the most robust solution, and of course it's very far from being full lexical data, but it should cover this particular use case in a rather generic way, and then it can be developed further.

But first T115217 has to be resolved—the patches have been in for a few months.

T115217 looks interesting but it covers only a small subset of the grammar rules - namely, the ones needed to name the languages, most of which follow a small number of patterns. If we want names of units - which, in our case, may be names of virtually any objects - the rules would have to be expanded considerably, and I'm not sure regexps are going to cut it anymore...

Of course it's not only for names of languages. Names of languages are just a first step. The same format can be used for any word, including names of units.

Yes, it's possible that regular expressions won't cut it for some words or some languages, but if this can be done in a way that will cover most words in most languages before we have a beautiful morphological engine that generates all grammar forms for all languages, it can be good enough.

Sorry for going off topic, feel free to skip:

Of course it's not only for names of languages. Names of languages are just a first step. The same format can be used for any word, including names of units.

Actually, no. It's not currently usable for language names outside your existing efforts. It is barely sufficient for {{SITENAME}} inflection, as is highlighted by the fact that we allow site admins to easily override the inflections when they are wrong.

This will not cover most of the world's languages in any adequate quality anytime soon. Researchers are spending years to build morphological engines which are far from perfect. Regular expressions are not a tool that allows creating such complex systems (but finite state methods in general are) in a maintainable way.

It is good that we are moving our existing grammatical rules out of PHP code, but I think we are currently at a sweet spot between the complexity of the system and the benefit it provides. Extending its usage further will make it increasingly difficult to use (languages are not equal here) until we start interfacing with purpose-built morphological tools that hide the complexity in a more maintainable way.


CLDR is a good source of localisation data and many projects will benefit when CLDR data is used, and more importantly, improved.

This will not cover most of the world's languages in any adequate quality anytime soon. Researchers are spending years to build morphological engines which are far from perfect. Regular expressions are not a tool that allows creating such complex systems (but finite state methods in general are) in a maintainable way.

Actually, what you need is a regular language, which can be a FSM in some form - or a regex. Still note that some regex languages are not regular languages.

@Nikerabbit I think you are right that we won't be able to design a regexp formula that grammatically displays an arbitrary unit in an arbitrary language any time soon.

However, there is a middle ground here - we don't have to go all the way from a unit label to a unit label that is grammatically attached to the number programmatically. We can let people help us.

CLDR takes us part of the way - but it seems to be not enough. I.e. it has translation for {0} kilometers into Russian. The problem is, this translation is different depending on the number inside {0}. And CLDR templates on that page you quoted do not cover that. They have long-few and long-many, but how many is few and how many is many? It looks like it's not enough. And of course what if it's apples, not kilometers?

There is another approach, more flexible: ICU message formats, http://userguide.icu-project.org/formatparse/messages
I'm not sure whether CLDR has language rules in accordance with these, but these seem to be able to cover most of the complexities I know of. And, if we support those, we can let people add rules for units not covered by CLDR.

P.S. it'd be really nice to get this specific discussion to T141597, but I don't mind any discussion as long as we can get some progress :)

And CLDR templates on that page you quoted do not cover that. They have long-few and long-many, but how many is few and how many is many?

CLDR is certainly hard when you first look at it, but "few" and "many" have a specific meaning, see http://cldr.unicode.org/index/cldr-spec/plural-rules#TOC-Choosing-Plural-Category-Names ; each locale applies it as appropriate.
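As an illustration of what "few" and "many" mean for Russian, here is a small sketch of the cardinal plural rule for non-negative integers, applied to "kilometre". The function names are hypothetical and the word forms are written out by hand for illustration, not copied from CLDR data; fractional numbers (CLDR category "other") are left out.

```python
def russian_plural_category(n):
    """CLDR cardinal plural category for a non-negative integer in Russian."""
    if n % 10 == 1 and n % 100 != 11:
        return "one"
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"
    return "many"


# Hand-written forms in the "{0} unit" pattern style; illustrative only.
KILOMETRE_FORMS = {
    "one": "{0} километр",
    "few": "{0} километра",
    "many": "{0} километров",
}


def format_kilometres(n):
    """Pick the form matching the plural category of n."""
    return KILOMETRE_FORMS[russian_plural_category(n)].format(n)
```

So 21 falls under "one" and 112 under "many": the category depends on the last digits of the number, not on how large it is.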

There is another approach, more flexible: ICU message formats, http://userguide.icu-project.org/formatparse/messages
I'm not sure whether CLDR has language rules in accordance with these

Nowadays, ICU is mostly an interface to CLDR data. If ICU provides a library that we can use instead of elaborating CLDR data ourselves, that's of course a good thing.

I think automatic transformation based on regular expressions can only work for a very narrow set of applications. We need something that is more robust and scales better in terms of community involvement.

I think we could sidestep the grammar issue by using unit symbols ("m" for meter, "s" for second, etc - see T77983). These still need to be localized to some degree, but don't need plurals (as far as I know).

We would have to get the symbols from statements, and they would have to be multilingual values (or multiple mono-lingual values), but that is still much less complicated than trying to apply plural rules.

An alternative is to use MediaWiki i18n messages instead of entity labels. E.g. if the unit is Q11573, we could check if MediaWiki:wikibase-unit-Q11573 exists, and if it does, use it. We'd get internationalization including support for plurals for free.

We could actually combine all of these approaches in a fallback chain:

  • first check for a system message
  • then check for a symbol statement
  • then use the label
  • and if all fails, use the ID.
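A minimal sketch of that fallback chain, assuming the three sources are available as plain dictionaries keyed by language code. The function and parameter names are hypothetical; only the message key pattern wikibase-unit-Q11573 comes from the proposal above.

```python
def unit_text(unit_id, messages, symbols, labels, lang):
    """Walk the chain: system message, symbol statement, label, then the ID."""
    sources = (
        messages.get(f"wikibase-unit-{unit_id}", {}),
        symbols.get(unit_id, {}),
        labels.get(unit_id, {}),
    )
    for source in sources:
        if lang in source:
            return source[lang]
    # Nothing is known in this language, so show the bare ID.
    return unit_id
```

The sketch does not attempt language fallback within a source; whether a missing Russian symbol should fall back to another language before moving on to the label is left open here.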

Will you use the same form when measuring the distance to the sea as the distance to a mountain? (I know, it is a trick question.)

Yeah, Daniel's proposal could work for most of the cases.