Full stop in messages such as Wikibase-time-precision-century is incorrect in English
Open, Lowest, Public

Description

The full stop in messages such as "$1. century" is useful in German, because its spelling rules mandate a full stop after numbers in some cases, but it is probably not needed in English.

Because of Wikibase's nonstandard date handling, some dates labelled "20. century" may actually be interpreted as a time in the period 2000–2099 by Wikibase components other than the user interface, and by software other than Wikibase.

Event Timeline

kaldari reopened this task as Open. Edited Oct 19 2017, 7:16 AM
kaldari subscribed.

This bug still seems to be present on Wikidata. For example, at https://www.wikidata.org/wiki/Q28790416 I see "1. millennium", "6. century", "7. century" (with my language set to English). In English, this format doesn't make any sense and looks like an error. It should say "1st millennium", "6th century", "7th century", etc. As Daniel mentioned, "5 August" is fine in English. "5 century", however, is never correct. And "5. century" is even worse.

kaldari renamed this task from "Full stop in messages such as Wikibase-time-precision-century is not needed in English" to "Full stop in messages such as Wikibase-time-precision-century is incorrect in English". Oct 19 2017, 7:25 AM
thiemowmde lowered the priority of this task from Medium to Lowest. Oct 19 2017, 7:45 AM

The message for this is https://translatewiki.net/wiki/MediaWiki:Wikibase-time-precision-century/en. We could remove the dot if this makes the situation better. We can not simply change it to "$1st century" because it would then say "2st century". We do have the PLURAL function, but we would need to list all cases (e.g. 1 and 21 and 31 and so on must end with "st", but not 11). PLURAL simply can not do this. This must be special-case code just for the English language. And what do the other 300 languages do then?

Please suggest a workaround we can justify in terms of resources it needs.
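To make the suffix rule described above concrete (1, 21 and 31 take "st", but 11 does not), here is a minimal sketch in plain PHP. The function name `englishOrdinal` is hypothetical and is not part of MediaWiki or Wikibase; it only illustrates why a fixed list of PLURAL cases is not enough for English.

```php
<?php
// Hypothetical helper for illustration only; not a MediaWiki or Wikibase API.
function englishOrdinal(int $n): string
{
    $lastTwo = abs($n) % 100;
    // 11, 12 and 13 are irregular: "11th", not "11st".
    if ($lastTwo >= 11 && $lastTwo <= 13) {
        return $n . 'th';
    }
    switch (abs($n) % 10) {
        case 1:
            return $n . 'st';
        case 2:
            return $n . 'nd';
        case 3:
            return $n . 'rd';
        default:
            return $n . 'th';
    }
}

foreach ([1, 2, 3, 4, 11, 12, 13, 21, 22, 101, 111] as $n) {
    echo englishOrdinal($n) . " century\n";
}
```

The rule depends on the last two digits, which is why it has to be code (or a dedicated ordinal rule set) rather than an enumerated list of numbers in the message.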

PLURAL simply can not do this

The problem is that this is an ordinal number instead of a cardinal, and we have no tooling for that in our translations (as far as I know). PHP has:

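// Requires the PHP intl extension.
$locale = 'en_US';
$number = 21; // example value; in practice this is the century number
$nf = new NumberFormatter($locale, NumberFormatter::ORDINAL);
echo $nf->format($number); // prints the ordinal form of $number for $locale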

Not sure if there are languages that have ordinals but don't use them for century notation. If not, then it might be good enough to use this straight from PHP, specifically for the date messages that require it.

CLDR has mappings for ordinal numbers in at least 85 languages. I'm pretty sure that's what the PHP intl library is using.

So in other words, we should change it to $1 century, and eventually pass $numberFormatter->format( $number ) as the parameter (instead of just $number).

@thiemowmde: Does this still need further discussion or investigation, or can it be moved to "ready to go" now?

What difference does this make for you, or for anybody else? If a volunteer wants to start working on this, he should feel free to do so, no matter in which column a ticket is. The Wikidata board is just a rough overview anyway, and does not really dictate what people do.

The concern I do have with utilizing PHP's built-in NumberFormatter (http://php.net/manual/en/class.numberformatter.php) is that it might not be suitable for all languages we need to support.

What difference does this make for you, or for anybody else?

The values are nonsense to anyone who doesn't read German.

The concern I do have with utilizing PHP's built-in NumberFormatter (http://php.net/manual/en/class.numberformatter.php) is that it might not be suitable for all languages we need to support.

Understood, but right now, we only support 1 language. Surely 85 languages is an improvement.

Sure, improvements are welcome.

However, I kindly ask you to watch your tone. Neither do we only support 1 language, nor are the values "nonsense". They might be formatted in an awkward, unusual way in many languages. But most readers do have enough imagination to make sense of the string "15. century" if they see it, even with this formatting.

However, I kindly ask you to watch your tone.

Sorry, I legitimately did not intend to express any rudeness. To me, "15. century" is nonsense. I would not guess that it meant "15th century" without context. When I first saw examples of this I thought it was some kind of date parsing bug. I guess the more important issue is that non-German speakers cannot enter century and millennium values into claims, as it will only allow you to publish them if you use the German notation.

Neither do we only support 1 language

I meant specifically for the century and millennium notation. Are there other languages that use the dot notation? Sorry for my ignorance. I wasn't trying to be rude.

Anyway, I'm happy to help work on this. I just wanted to see if it's OK to move forward or if it needs more discussion. Didn't mean to start an argument :)

Thanks a lot for the kind response. I believe the suggested solution is definitely a way forward: Remove the dot from the message, and use some code that fills the $1 placeholder with "1." or "1st" or whatever is needed. Said code could be NumberFormatter, something from MediaWiki core (if it exists), or even something handcrafted. As long as it makes the situation better for more than one language, it's an improvement.

I want to say that this bug caused me to refrain from using centuries in this property for years because I was uncertain whether “12. century” really meant 12th century or 1200s ... and I studied German in school, so I am familiar with the German notation. Maybe I’m an edge case here, but if I struggled with this there must be other new users who struggle as well.

Would the 'simple' solution here not be to allow the '.' to be a configurable part of the message, or a separate message?
This could be slightly more complex: if we want 'st', 'nd', 'rd' and 'th' for English, for example, it's not quite as simple as appending the message to all numbers, but it is still very achievable.

A very flexible solution that should work with many languages would be to pass an ordinal version of the number as an additional parameter $2 to all messages (no matter if the message needs an ordinal number or not). This way translators are free to use either the regular or ordinal number whenever their language requires it. For example, a German translator can either use $1. Jahrhundert or $2 Jahrhundert, and both would produce the same output.
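The two-parameter idea can be sketched as follows. This is only an illustration under stated assumptions: `fillTimeMessage` is a hypothetical helper, the ordinal string is supplied by the caller, and real message handling in MediaWiki is more involved than a plain substitution.

```php
<?php
// Hypothetical helper: substitutes $1 (cardinal) and $2 (ordinal) in a message.
// Not the MediaWiki message API; a plain string substitution for illustration.
function fillTimeMessage(string $template, int $number, string $ordinal): string
{
    return strtr($template, ['$1' => (string)$number, '$2' => $ordinal]);
}

// German: both message variants give the same output.
echo fillTimeMessage('$1. Jahrhundert', 20, '20.') . "\n";
echo fillTimeMessage('$2 Jahrhundert', 20, '20.') . "\n";

// English: the translator picks the ordinal parameter.
echo fillTimeMessage('$2 century', 20, '20th') . "\n";
```

Because every message receives both parameters, a translation that does not need the ordinal simply ignores `$2`.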

The biggest concern here is that whatever code is added to MwTimeIsoFormatter, the same feature set must be added to MwTimeIsoParser. In other words, we can only use the existing NumberFormatter if we are able to parse/unlocalize every single string it outputs. If this is not possible, the best we can do for the moment is to introduce our own OrdinalNumberFormatter and OrdinalNumberParser. They don't need to support all languages right from the start. English, German, and maybe French might be a good start for a first version.

Here is something to start with: http://www.typophile.com/node/42577. Look out for Russian. Numbers have a gender there. Uh.
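The formatter/parser pairing described above could look roughly like this. The class names are taken from the comment, but the classes themselves do not exist in Wikibase; the method signatures, the language codes handled, and the restriction to non-negative integers are all assumptions made for this sketch. The point it shows is the round-trip requirement: every string the formatter emits must be accepted by the parser.

```php
<?php
// Hypothetical classes; only the names come from the discussion.
final class OrdinalNumberFormatter
{
    public function format(int $n, string $lang): string
    {
        switch ($lang) {
            case 'de':
                return $n . '.';
            case 'en':
                return $n . $this->englishSuffix($n);
            default:
                return (string)$n; // unsupported language: plain cardinal
        }
    }

    private function englishSuffix(int $n): string
    {
        $lastTwo = $n % 100;
        if ($lastTwo >= 11 && $lastTwo <= 13) {
            return 'th';
        }
        return [1 => 'st', 2 => 'nd', 3 => 'rd'][$n % 10] ?? 'th';
    }
}

final class OrdinalNumberParser
{
    // Returns the integer, or null if the text is not a recognized ordinal.
    public function parse(string $text, string $lang): ?int
    {
        $patterns = [
            'de' => '/^(\d+)\.$/',
            'en' => '/^(\d+)(st|nd|rd|th)$/',
        ];
        $pattern = $patterns[$lang] ?? '/^(\d+)$/';
        if (preg_match($pattern, $text, $m) !== 1) {
            return null;
        }
        return (int)$m[1];
    }
}

// Round-trip check: parse(format(n)) must give n back for each language.
$formatter = new OrdinalNumberFormatter();
$parser = new OrdinalNumberParser();
foreach (['en', 'de'] as $lang) {
    for ($n = 0; $n <= 120; $n++) {
        if ($parser->parse($formatter->format($n, $lang), $lang) !== $n) {
            throw new RuntimeException("Round trip failed for $n ($lang)");
        }
    }
}
echo "round trip ok\n";
```

Note that this English parser is deliberately forgiving about which suffix follows the digits (it would accept "2st"); whether input should be that lenient is a separate design decision.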

To be honest, these examples in the link you posted make me skeptical about the suggested solution… how can we produce some code that produces “whatever is needed”, when there is no such thing as the ordinal form of a number in certain languages, and it can depend on gender, declension, and other things? I suspect the best we can do is to have one function for each word that we need an ordinal for – for instance, the words for “century” and “millennium” might have different genders in a language where the ordinal depends on the gender, so we would have separate functions for these ordinals (in languages where it doesn’t matter, most of these functions would then delegate to each other).

Some French contributors are very perturbed by this notation, even thinking that the entry (5. siècle) was wrong, and trying to add Q8095 instead...

A solution that would allow translating the ordinal part of the value would be good.

MediaWiki core holds some JavaScript code for making ordinal numbers in many languages.

@Lucas_Werkmeister_WMDE, indeed. My idea is to start simple. Whatever we do, basically everything will be an improvement compared to the current situation. So why not start with the obvious, low-hanging fruit, and take care of the more complicated details later?

This is a slightly modified post from the project chat on March 18, 2018, see
Wikidata:Project_chat#interface_for_data_entering_is_confusing_for_time_value_with_century_precision

As I see it, it is a much bigger problem than the "18. century", "18 century" or "18st century" that seems to be the focus here.

On Help:Dates#Precision it is very clearly stated that a time value of '+1800-00-00T00:00:00Z' with a precision 7 should be interpreted as the 1800s or a time between years 1800 and 1899. But when you edit Wikidata "18. century" is shown for that time value. As "the 18:th century" (very similar to "18. century", "18 century", "18st century" or "18th century") in English (as well as some other languages) means 1700–1799 it is very likely that you enter a time value of '+1900-00-00T00:00:00Z' with precision 7 just to get it displayed as "19. century" e.g. for a person born in the 19th century, i.e. 1800-1899. Today I see that user &beer&love has entered a birth date (P569) with "20. century" (i.e. time value '+2000-00-00T00:00:00Z') for a lot of people born between 1900 and 1999. See e.g. this edit of Secundino González (Q5639303).
https://www.wikidata.org/w/index.php?title=Q5639303&diff=652282000&oldid=625543088

This leads to my suggestion: the interface should be changed so that it says "1800s" (like on the help page) instead of "18. century", in order to reduce the risk of people entering this value for dates between 1700 and 1799. Maybe something else, like "1800-1899", should be chosen to make clear whether 1800s should be read as 1800–1809 or 1800–1899. --Larske (disk) 21:10, 18 March 2018 (UTC)

I think Help:Dates#Precision (permalink) is actually incorrect in this case. There is no year 0 – the first century starts with the year 1 and ends with the year 100, the second century starts with the year 101 and ends with the year 200, and so on. Therefore, the year 1800 still lies within the 18th century (which spans from 1701 until 1800 – not 1700 until 1799!), and Wikidata is completely correct in displaying this as “18. century”.

Furthermore, “1800s” already has a different meaning: it refers to the decade from 1800 to 1809 (or perhaps 1801 to 1810? I’m not sure), just like “1810s”, “1820s”, “1980s” etc.

Screenshot-2018-3-22 Wikidata Sandbox.png (136×448 px, 9 KB)

I disagree with Lucas_Werkmeister_WMDE. Help:Dates#Precision is correct. It agrees with the RDF Dump Format documentation (https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format) and the JSON documentation.

Also, the date shares quite a few ideas with ISO 8601, and that reduces precision by just dropping characters. For example, in ISO 8601, "19" means the century from 1900 to and including 1999. Having a century in Wikidata begin with a year like 1901 would reduce the opportunity to share code with anything that handles ISO 8601.

As for the idea that "1800s" means the decade from 1800 to 1809, the Chicago Manual of Style 16th ed. page 476 states

Note that the first decade of any century cannot be treated in the same way as any other decades. "The 1900s," for example, could easily be taken to refer to the whole of the twentieth century.

This is being discussed at Help talk:Dates on Wikidata.

I agree with Jc3s5h. Wikidata dates are represented in JSON with ISO 8601 strings with associated precisions, and a "century" precision value looks like:

time: "+2000-00-00T00:00:00Z"
precision: 7

The insignificant digits are always set to zero, apparently, and I'd expect the range of this value to be 2000-01-01 to 2099-12-31. Formatting it as "20. century" is confusing, to say the least. I suggest displaying it instead as "20XX", or "2000-2099".
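The disagreement in this part of the thread comes down to two ways of turning a year with century precision into a 100-year range. The sketch below contrasts them for positive (AD) years only. All function names are hypothetical, and neither function is claimed to be what Wikibase itself implements; the first follows the "drop the insignificant digits" reading, the second the ordinal-century reading in which the 18th century runs from 1701 to 1800.

```php
<?php
// Extracts the year from a timestamp such as "+2000-00-00T00:00:00Z".
function yearFromTimestamp(string $time): int
{
    if (preg_match('/^([+-]\d+)-\d{2}-\d{2}T/', $time, $m) !== 1) {
        throw new InvalidArgumentException("Unrecognized timestamp: {$time}");
    }
    return (int)$m[1];
}

// Truncating reading: 2000 -> [2000, 2099]. Assumes $year >= 1.
function truncatedCenturyRange(int $year): array
{
    if ($year < 1) {
        throw new InvalidArgumentException('Sketch handles AD years only');
    }
    $start = intdiv($year, 100) * 100;
    return [$start, $start + 99];
}

// Ordinal reading: 2000 -> [1901, 2000], i.e. the 20th century. Assumes $year >= 1.
function ordinalCenturyRange(int $year): array
{
    if ($year < 1) {
        throw new InvalidArgumentException('Sketch handles AD years only');
    }
    $century = intdiv($year + 99, 100); // 1..100 -> 1, 1901..2000 -> 20
    return [($century - 1) * 100 + 1, $century * 100];
}

$year = yearFromTimestamp('+2000-00-00T00:00:00Z');
[$from, $to] = truncatedCenturyRange($year);
echo "truncating reading: {$from}-{$to}\n";
[$from, $to] = ordinalCenturyRange($year);
echo "ordinal reading: {$from}-{$to}\n";
```

The same stored value therefore maps to two ranges that overlap in only one year (2000), which is the source of the confusion reported above.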

The insignificant digits are not always set to zero – you can manually set any date you like and then adjust the precision (example edit). And when you do that, the input shows you how your edit will be interpreted:

Screenshot-2018-6-5 Wikidata Sandbox.png (163×912 px, 17 KB)

It shows how it will be displayed by the user interface, which is different from how it should be interpreted by the specs, and potentially different from how it will be interpreted by all the other software that's been created inside and outside of Wikimedia Foundation projects.

Yeah, I was wrong about the values getting set to zero, the insignificant digits are retained internally.

Can y'all please open separate bugs for these other issues and not just dump them all in this bug? Thanks.

It looks like y'all are actually discussing T73459, so please feel free to continue there.

I think the solution for both T73459 and this bug would have to involve avoiding the word "century" on both input and output, so the discussion applies to both bugs.

Can y'all please open separate bugs for these other issues and not just dump them all in this bug? Thanks.

A solution must output an expression that will accurately indicate the possible 100 year periods that can be represented in the data model. Using the notation that an "(" excludes the endpoint of an interval and "[" includes the endpoint of an interval:

.
.
.
[1 Jan 101 BC, 31 Dec 2 BC]
[1 Jan 1 BC, 31 Dec AD 99]
[1 Jan AD 100, 31 Dec AD 199]
.
.
.

I have not seen a solution proposed that unambiguously expresses these ranges.

@Jc3s5h: This bug has nothing to do with accurately expressing ranges. This bug is about the fact that "5. century" is not valid English and is confusing to English speakers and should be changed to "5th century" (or if that's too hard, "5 century"). If the word "century" is abandoned, then this bug is moot, but that doesn't mean that this is the right place to talk about accurately representing 100-year ranges.

I disagree with the notion that one can't discuss, within this bug, whether any solution with the word "century" in it is wrong.

@Jc3s5h: I created a new bug for you here: T196674. Now I respectfully ask that you stop spamming this bug with unrelated discussion. Thank you.

No.

The topic of this task is problems in the English language due to using full stops in messages. Messages which use the word century were used as one example. The topic of this task is not the use of the word century in messages. Please follow the etiquette to keep discussion constructive and respectful. Thanks for your understanding.

I agree that we should have correct translations of phrases like "19th century" into English and other languages. On Commons we dealt with this over a decade ago, and you can find translations into various languages at c:Module:I18n/complex_date.

(My earlier description edits were incorrect and were based on a bad interpretation of the project chat discussion with complaints about external tools. The thing that causes the issue is that some software doesn't realize that Wikibase has nonstandard date handling and has slightly different ranges for centuries and millennia.)

I would say rather that most of Wikibase follows much of ISO 8601:2004, and in particular, follows that standard with respect to centuries, for example, when 4 digit years are being used, the precision is century, and the first two digits of the year are 20, then the date could be anywhere from 1 January 2000 to and including 31 December 2099. So most of Wikibase follows a widely used standard. But the user interface is inconsistent with the rest of the Wikibase code and documentation.

External code does not interact through the user interface. Whether the external code correctly follows the Wikibase standards, as documented in the data models, would depend on whether the external code was developed by reading the data models, or empirically by looking at data present in the database.

If the external developers used the empirical method, and the data they happened to look at had incorrect information due to the faulty user interface, the external code might be faulty as well.

Phabricator tickets are not a good place to debate the definitions of phrases like "20th century". The meaning of that term was defined a long time ago and should not depend on Wikidata or other software. I am no expert on the matter, but articles on individual centuries on Wikipedias (I checked the English and German ones) list "20th century" as the time period from January 1, 1901 to December 31, 2000, and that has not changed in the last 15 years. It might be counterintuitive and not compatible with some software, but that seems to be the understanding of the term.

when 4 digit years are being used, the precision is century, and the first two digits of the year are 20, then the date could be anywhere from 1 January 2000 to and including 31 December 2099

@Jc3s5h The user interface says "20. century" for +2000-00-00T00:00:00/7, and if a user types that phrase into a time field, that is the value that gets stored. A similar thing is done for millennia. I haven't read the Wikibase documentation, so I am only aware of these things from observation.

In any case, this appears to be the wrong ticket for discussing this problem.

I would say rather that most of Wikibase follows much of ISO 8601:2004, and in particular, follows that standard with respect to centuries, for example, when 4 digit years are being used, the precision is century, and the first two digits of the year are 20, then the date could be anywhere from 1 January 2000 to and including 31 December 2099. So most of Wikibase follows a widely used standard.

I am strongly convinced that this is simply not true. You claim that “the date could be anywhere from 1 January 2000 to and including 31 December 2099” – where in the Wikibase software do you see this? I am not aware of any place where Wikibase currently tells you the earliest and latest calculated time for a timestamp – as far as Wikibase core is concerned, it simply stores a timestamp and a precision, and the user interface is the only place that interprets the precision.

user interface is the only place that interprets the precision

Is the standard behaviour documented anywhere? Wikibase/DataModel does not appear to contain any information about how the behaviour of the user interface is different for centuries and millennia.

I would say the data model must be obeyed, and every aspect of the user interface must obey the data model. The presence of the word "century" in the user interface is a lie. This ticket should be closed as requesting the developers to refine a lie.

Phabricator tickets are not a good place to debate the definitions of phrases like "20th century". The meaning of that term was defined a long time ago and should not depend on Wikidata or other software. I am no expert on the matter, but articles on individual centuries on Wikipedias (I checked the English and German ones) list "20th century" as the time period from January 1, 1901 to December 31, 2000, and that has not changed in the last 15 years. It might be counterintuitive and not compatible with some software, but that seems to be the understanding of the term.

We should not debate the meaning of "century" except to decide if it is compatible with the data model, which must be obeyed. Since many (not everyone, but many respectable sources) interpret the 21st century to begin 1 January 2001 and end 31 December 2100, the word "century" is not compatible with the data model and should not be used anywhere in the user interface.

the word "century" is not compatible with the data model and should not be used anywhere in the user interface.

@Jc3s5h: As I've mentioned before, there are separate tickets for discussing both the century data model problem (T73459) and the use of "century" in the interface (T196674). Please stop spamming this ticket with those discussions. This ticket is purely about how to properly localize the century strings.

Did someone do something?

At least in French, we now have nice "9e siècle" but not in English or Breton :(

I don't know if you're talking about that, but I changed translations some time ago on TranslateWiki. See https://translatewiki.net/wiki/Template:Related/Wikibase-time-precision

Note that this demonstrates the problem Thiemo mentioned in T95553#3916063: we currently can’t parse «9e siècle».

Note that the presentation "9e siècle" in French is readable but definitely not the preferred one:

  • it should really use roman digits (preferably in small-capitals for centuries, but plain-capitals like for millennia are OK)
  • the ordinal suffix "e" (or "er" only for 1st) is normally in superscript.
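As a rough illustration of the two points above, the following sketch builds a French century label with roman digits and the "e"/"er" suffix. Both function names are hypothetical. The output is plain text only: small capitals and the superscript suffix are presentation concerns that a plain string cannot express, so they are left out here.

```php
<?php
// Converts 1..3999 to roman digits using the usual subtractive notation.
function toRoman(int $n): string
{
    if ($n < 1 || $n > 3999) {
        throw new InvalidArgumentException('Supported range is 1..3999');
    }
    $map = [
        'M' => 1000, 'CM' => 900, 'D' => 500, 'CD' => 400,
        'C' => 100, 'XC' => 90, 'L' => 50, 'XL' => 40,
        'X' => 10, 'IX' => 9, 'V' => 5, 'IV' => 4, 'I' => 1,
    ];
    $out = '';
    foreach ($map as $symbol => $value) {
        while ($n >= $value) {
            $out .= $symbol;
            $n -= $value;
        }
    }
    return $out;
}

// Hypothetical helper: "er" only for the first century, "e" otherwise.
function frenchCentury(int $century): string
{
    $suffix = $century === 1 ? 'er' : 'e';
    return toRoman($century) . $suffix . ' siècle';
}

foreach ([1, 5, 9, 21] as $c) {
    echo frenchCentury($c) . "\n";
}
```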

Also note that not all languages are counting or representing centuries with ordinals, some just use the starting year.

You should look at Wikimedia Commons which includes templates for correctly presenting a Millennium, Century, or Decennium (i.e. Decade: the template name Decade is used for something else, i.e. generating a horizontal navigation bar for the ten decades roughly composing the same century, or for navigating to other decades in the previous or next century).
These template pages already show examples for many languages, including the use of ordinal numbers or roman digits where appropriate, and their correct formatting. Note that for some languages (e.g. Swedish), the ordinal numbering of decades, centuries, or millennia is not used; instead a year number comes with an extra qualifier or suffix, just like "1980s" for a decade in English.

And compare them as well to what is used equivalently in each Wikipedia (look at their navbox templates and calendars), and also for naming their pages and categories (roman digits are just using plain ASCII letters, and there's no superscript for these titles, but these same pages may fix this presentation using {{TITLEPAGE:...}} by adding formatting elements and custom styles and eventually change the initial lettercase if capitalization is undesired, e.g. for "iOS").

For inputting these dates in an input field, it should be acceptable to enter them without the style formatting (such as superscripts); however, the correct styling should be applied once displayed (i.e. without the input focus for editing inside the input field).

So the formatting rules for dates are more complex than everything discussed above (which oversimplifies the problem with many false assumptions)!


As well, if a precision is entered for any input date, the effective value to store should be automatically rounded to the middle of the range containing that date.

So if you enter any date between "1901" and "2000" (inclusive), and then select the "century" precision, any entered value will be reinterpreted as a date somewhere in the 20th century.

That date will then be rounded to "1951-01-01T00:00:00Z" for storing the actual value in Wikidata, because this is the exact middle of the 20th century (ignoring only the small deviations caused by the number of leap years and leap seconds, which can be at most one day and two seconds between the two parts of the century, this deviation being far below the given precision). The stored precision will be ± 50 years.

Alternatively the storage could use a "datemin/datemax" like for ranges (this could apply as well for any other numeric quantities), in that case we'd get:

  • datemin =1901-01-01T00:00:00Z (inclusive)
  • datemax = 2001-01-01T00:00:00Z (exclusive)

However, this is not specifying a range: a normal range of dates actually specifies two dates, not one, each bound with its own precision (each bound then defines its own fuzzy range; the two fuzzy ranges are of course allowed to wrap to represent the normal range with its precision; each bound may have a different precision, e.g. a starting date known only with century precision but an ending date known with day precision or better; all that is assumed then is that the starting date falls in a fuzzy range that cannot go beyond the precisely known ending date).

Note that years 1901-1950 fall entirely in the first half of the 20th century, and years 1951-2000 fall entirely in the second part.

Rendering that date could then be "1951-01-01T00:00:00Z ± 50 years", or better as "20th century" if properly formatted according to the target language.

  • The precision "±500years" means that we can format using the millennium format, but normally only if the central date is the exact middle expected for the millennium, i.e. the central date is "n501-01-01Z". (This is also valid for the 1st millennium BC or AD; neither includes a "year 0", unless you use the astronomical calendar, which has a single era and where "astronomical year 0" actually maps to "year 1 BC" within the proleptic Gregorian calendar.)
  • The precision "±50years" means that we can format using the century format, but normally only if the central date is the exact middle expected for the century, i.e. the central date is "nn51-01-01Z".
  • The precision "±5years" means that we can format using a decennium format, but normally only if the central date is "nnn6-01-01Z".
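The midpoint proposal in this comment can be sketched as two small functions. This is an illustration of the commenter's suggestion only, not of how Wikibase stores dates; the function names and the returned labels are hypothetical, and only positive (AD) years are handled.

```php
<?php
// Midpoint year of the century containing $year, e.g. 1901..2000 -> 1951.
// Assumes $year >= 1.
function centuryMidpointYear(int $year): int
{
    if ($year < 1) {
        throw new InvalidArgumentException('Sketch handles AD years only');
    }
    $century = intdiv($year + 99, 100);
    return ($century - 1) * 100 + 51;
}

// Chooses a display unit from the central year and the +/- half-width in years,
// following the three bullet points above. Assumes $centerYear >= 1.
function displayUnit(int $centerYear, int $halfWidthYears): string
{
    if ($halfWidthYears === 500 && $centerYear % 1000 === 501) {
        return 'millennium';
    }
    if ($halfWidthYears === 50 && $centerYear % 100 === 51) {
        return 'century';
    }
    if ($halfWidthYears === 5 && $centerYear % 10 === 6) {
        return 'decennium';
    }
    return 'plain'; // fall back to "date ± N years"
}

$mid = centuryMidpointYear(1987);
echo $mid . ' ± 50 years -> ' . displayUnit($mid, 50) . "\n";
```

A central date that is not the exact middle of a century (for example 1950 with ±50 years) falls through to the plain "date ± N years" rendering.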

The message for this is https://translatewiki.net/wiki/MediaWiki:Wikibase-time-precision-century/en. We could remove the dot if this makes the situation better.

@thiemowmde - Yes, removing the dot would make it look less like an error. "4 century" is slightly better than "4. century".

This comment was removed by Verdy_p.

Change 678307 had a related patch set uploaded (by Jon Harald Søby; author: Jon Harald Søby):

[mediawiki/core@master] Add ordinal transformation as GRAMMAR for English

https://gerrit.wikimedia.org/r/678307

Test wiki created on Patch Demo by Jon Harald Søby using patch(es) linked to this task:

https://patchdemo.wmflabs.org/wikis/34a3f67b1f/w/

Test wiki on Patch demo by Jon Harald Søby using patch(es) linked to this task was deleted:

https://patchdemo.wmflabs.org/wikis/34a3f67b1f/w/

From today's bug triage hour: It seems that we can make this work by using @jhsoby's patch and adding a Wikibase patch to support parsing the new GRAMMAR feature. It will be similar to the transformation of PLURAL into a regex in MwTimeIsoParser, but because we don't have all the possible alternatives (1st, 2nd, etc.) available in the message, the parsing will need to be more lenient (perhaps based on some Unicode categories around the number). We only intend to support ordinals that still show a number, not word ordinals (e.g. first, second, ...).
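One way the lenient parsing mentioned above might look is sketched below. It is not the MwTimeIsoParser implementation: the function names are hypothetical, the message is assumed to contain a single `$1` placeholder, and the choice of Unicode categories (letters, marks, punctuation directly after the digits) is only one plausible reading of "some Unicode categories around the number".

```php
<?php
// Hypothetical helper: turns a message such as "$1 century" into a lenient regex.
function lenientOrdinalPattern(string $message): string
{
    $parts = explode('$1', $message, 2);
    if (count($parts) !== 2) {
        throw new InvalidArgumentException('Message must contain $1');
    }
    $quoted = array_map(function (string $part): string {
        return preg_quote($part, '/');
    }, $parts);
    // Digits, then any run of letters, marks or punctuation ("th", "e", ".").
    $number = '(\d+)[\p{L}\p{M}\p{P}]*';
    return '/^' . implode($number, $quoted) . '$/u';
}

// Returns the number found in $input, or null if the input does not match.
function parseCentury(string $input, string $message): ?int
{
    if (preg_match(lenientOrdinalPattern($message), trim($input), $m) !== 1) {
        return null;
    }
    return (int)$m[1];
}

var_dump(parseCentury('20th century', '$1 century'));
var_dump(parseCentury('20. century', '$1 century'));
var_dump(parseCentury('9e siècle', '$1 siècle'));
var_dump(parseCentury('twentieth century', '$1 century'));
```

The pattern does not check that the suffix is the right one for the number, so "20xyz century" would also be accepted; word ordinals such as "twentieth" are rejected because no digits are present.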