Internationalise citoid dates
Open, High, Public, 1 Estimated Story Points

Description

  • Put out years in year-only format (i.e. YYYY)
  • Put out all dates in a readable format (i.e. May 2010) in the date field to address the polluted data issue ASAP.

This is a possible way forward for internationalising dates:

  • Translate dates on our end.

OR

Note: Discussion is also happening on ENWP here: https://en.wikipedia.org/wiki/Help_talk:Citation_Style_1#ISBNs_in_mw:Citoid

Event Timeline

For the YYYY-MM date format to be accepted at en.wiki, you will have to get it accepted at MOS:NUM.

That may be true for output; but is not true for input. Per Postel's Law, templates should accept YYYY-MM dates (valid in ISO 8601) when entered, regardless of how we decide they should display such values.

Some comments on the status of EDTF.

  • The most recent Library of Congress page about EDTF describes itself as "Official Web Site" and the main heading is "EDTF Background". It states "EDTF functionality has now been integrated into ISO 8601-2019, the latest revision of ISO 8601, published in March 2019." Based on what I've read on the various Library of Congress pages about EDTF, and having contributed to the drafts over time, I think this means that all the things you could do with EDTF can be done with ISO 8601-2019, but the syntax may be different.

[[ https://www.loc.gov/standards/datetime/ | The 2019 EDTF specification ]] mentioned above is, of course, older. It's unclear if it has much of a future, even in terms of staying on the Library of Congress website.

Using ISO 8601-2019 has three problems.

  1. It's expensive, so the volunteer editors and developers associated with the Wikimedia Foundation are very likely to rely on unreliable summaries.
  2. It only supports the Gregorian calendar, so using it for Julian calendar dates would be an approximation. Since the standards discussed are external to the Wikimedia Foundation, we can't modify them to say this approximation is acceptable.
  3. Usages in one area of Wikimedia Foundation stuff tend to creep into other areas. This process of Citoid's non-standard YYYY-MM creeping into the English Wikipedia is an example of this creep. Nobody in this discussion has the power to prevent YYYY-MM-XX being used to represent a Julian calendar death date, where the difference of a dozen or so days is more important than when representing a magazine publication date.

Year and month specified, day unspecified in a year-month-day expression (day precision)
Example 4 ‘1985-04-XX’

Does that not provide a way to do month year dates that don't require English?

1985-04-XX means "an unknown individual day in April 1985"; 1985-04 means "the month of April 1985".

I'd be happy to consider switching to a different way of representing partial dates that doesn't require the use of English, but unfortunately no one has suggested a standard that avoids it, as far as I've noticed. @Trappist_the_monk, you suggested IETF, but I had a look and couldn't find a partial date representation there. Did I miss it, maybe?

Isn't that what most of the whole long discussion has been about? The current IETF at Level 1 §Unspecified digit(s) from the right at item 3 reads:

Year and month specified, day unspecified in a year-month-day expression (day precision)
Example 4 ‘1985-04-XX’

Does that not provide a way to do month year dates that don't require English?

Yes, but that quote is from EDTF - That's EDTF level 1. https://www.loc.gov/standards/datetime/

(I was just confused by the IETF statement because I couldn't find anything like that in IETF, i.e. https://tools.ietf.org/html/rfc3339 )

I'm happy to do EDTF level 1 abridged dates instead if that's preferable to the level 0 format.

For the YYYY-MM date format to be accepted at en.wiki, you will have to get it accepted at MOS:NUM.

That may be true for output; but is not true for input. Per Postel's Law, templates should accept YYYY-MM dates (valid in ISO 8601) when entered, regardless of how we decide they should display such values.

Nobody reads Help:Citation Style 1 or related documentation. So editors will inevitably write 2011-12 when they mean 2011-2012. There is no way to tell if 2011-12 means 2011-2012 or December 2011. So it is an error and should always be flagged as such.

Year and month specified, day unspecified in a year-month-day expression (day precision)
Example 4 ‘1985-04-XX’

Does that not provide a way to do month year dates that don't require English?

1985-04-XX means "an unknown individual day in April 1985"; 1985-04 means "the month of April 1985".

In EDTF 2019, 1985-04-XX means the individual day in April 1985 is unspecified, not unknown. In the case of a newer magazine issue, in the form of a PDF, if the cover said it was the April 2021 issue and the PDF properties said it was last changed on 6 March 2021, the correct date, for citation purposes, would be April 2021.

Reputedly ISO 8601-2019 uses the same syntax. I'll tell you exactly what it means in that specification after you buy a copy and send it to me.

(I was just confused by the IETF statement because I couldn't find anything like that in IETF, i.e. https://tools.ietf.org/html/rfc3339 )

Yeah, my error; I type IETF much more often than I type EDTF...

I'm happy to do EDTF level 1 abridged dates instead if that's preferable to the level 0 format.

Let us stick to a single term: 'unspecified digits'; 'abridged' implies a truncation or shortening which isn't the case. The date 2021-02-XX is just as long as 2021-02-24.

There is a very good reason for NOT generating and not accepting dates like 2009-10 in MOS:DATE and in citation template input: Because we don't know whether that means October 2009 or 2009–2010. If the citoid software has a date in this format for which it does know the correct disambiguation, then it is in that software that the conversion to a valid format must be made, before the disambiguation information is lost.

Change 674692 had a related patch set uploaded (by Mvolz; author: Mvolz):
[mediawiki/services/citoid@master] Change ambiguous days to XX

https://gerrit.wikimedia.org/r/674692

@Mvolz Thanks for adding the user-notice tag. Is this just about the ambiguous days (the latest patch), or the task in general?

Or, rather: How would you suggest phrasing it in a couple of simple sentences for Tech News?

Just about the patch!

Draft for when it gets deployed:

"The citoid api will now use Extended Date and Time Format level 1 instead of level 0 for publication dates where there is a month but no day available. So, for example, for December 2008 it will now return 2008-12-XX instead of 2008-12. The change was made because in some cases, the level 0 dates could be confused with year ranges (i.e. 2008-2012, rather than December 2008). Dates with only the year will be continued to returned as just the year (i.e. 2008, and not 2008-XX-XX) More information about the Extended Date and Time Format is available from the Library of Congress here: https://www.loc.gov/standards/datetime/"

Change 674692 merged by jenkins-bot:
[mediawiki/services/citoid@master] Change ambiguous days to XX

https://gerrit.wikimedia.org/r/674692

More information about the Extended Date and Time Format is available from the Library of Congress here: https://www.loc.gov/standards/datetime/"

Phabricator mangled the link; try: https://www.loc.gov/standards/datetime/

editors will inevitably write 2011-12 when they mean 2011-2012. There is no way to tell if 2011-12 means 2011-2012 or December 2011. So it is an error and should always be flagged as such.

I don't think this is correct. It might be a potential error, but it is not necessarily an actual error.

Additionally, I think this potential error affects approximately one in sixteen possible dates in the YYYY-MM format. Consider:

YYYY-MM   Interpretation
YYYY-00   not a valid month number, therefore a multi-year range
YYYY-13   MM ≥13 = not a valid month number, therefore a multi-year range
YY01-01   MM 01 to 12 are ambiguous when YYYY ends in 00 to 12
YY13-01   year–month, because 13 > 01

I don't think that the citation template should be trying to spot typos.
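
The table can be read as a small classification rule. The sketch below is one possible reading, with assumptions: the function names are made up, and a string counts as "ambiguous" when the suffix is a valid month number that is not smaller than the last two digits of the year (so YY01-01 is ambiguous and YY13-01 is a month, as in the table). It does not try to reconstruct the "one in sixteen" figure, since that depends on how the edge cases are counted.

```javascript
// Sketch of the table's rule; names and the exact edge-case handling are assumptions.
function classifyYearDashTwoDigits(value) {
  const m = /^(\d{4})-(\d{2})$/.exec(value);
  if (!m) return 'other';
  const yy = Number(m[1]) % 100; // last two digits of the year
  const nn = Number(m[2]);       // the two digits after the hyphen
  if (nn < 1 || nn > 12) return 'range';  // not a month number
  if (nn >= yy) return 'ambiguous';       // could be a month or the end of a year range
  return 'month';                         // e.g. 2013-01, because 13 > 01
}

// Count ambiguous strings among all valid YYYY-MM values in one century.
function countAmbiguous(startYear) {
  let ambiguous = 0;
  let total = 0;
  for (let year = startYear; year < startYear + 100; year++) {
    for (let month = 1; month <= 12; month++) {
      total++;
      const s = `${year}-${String(month).padStart(2, '0')}`;
      if (classifyYearDashTwoDigits(s) === 'ambiguous') ambiguous++;
    }
  }
  return { ambiguous, total };
}

console.log(classifyYearDashTwoDigits('2011-12')); // ambiguous
console.log(countAmbiguous(2000));                 // { ambiguous: 90, total: 1200 }
```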

This is a discussion for en.WP's MOS talk page, not phabricator. I think the most recent discussion (one happens every year or so) is at https://en.wikipedia.org/wiki/Wikipedia_talk:Manual_of_Style/Dates_and_numbers/Archive_160#ISO_8601_YYYY-MM_Calendar_Date_Format

Also, your "one in 16" argument would be reasonable if we were not currently in the first quarter of the century. Statistically, there will be a disproportionate number of 2000 through 2012 years cited in Wikipedia.

editors will inevitably write 2011-12 when they mean 2011-2012. There is no way to tell if 2011-12 means 2011-2012 or December 2011. So it is an error and should always be flagged as such.

I don't think this is correct. It might be a potential error, but it is not necessarily an actual error.

Additionally, I think this potential error affects approximately one in sixteen possible dates in the YYYY-MM format. Consider:

YYYY-MM   Interpretation
YYYY-00   not a valid month number, therefore a multi-year range
YYYY-13   MM ≥13 = not a valid month number, therefore a multi-year range
YY01-01   MM 01 to 12 are ambiguous when YYYY ends in 00 to 12
YY13-01   year–month, because 13 > 01

I don't think that the citation template should be trying to spot typos.

I agree it's not a common error, but given that Module:CS1 has been throwing citation errors for 3 years rather than being willing to accept it as valid input as a result of this reasoning, I think YYYY-MM-XX represents an acceptable compromise. At this late point, I'm more interested in hearing if there was a compelling reason /not/ to do that. (We also continue to have relaxed validation which means this won't be all dates, just the ones currently represented in YYYY-MM format. If this works for people we could potentially tighten up validation later.)

@Mvolz Thanks for adding the user-notice tag. Is this just about the ambiguous days (the latest patch), or the task in general?

Or, rather: How would you suggest phrasing it in a couple of simple sentences for Tech News?

This could potentially be deployed tomorrow (Thursday) or we could wait a week for it to go into Tech News first - it's been a while, how is this generally done?

@Mvolz As a general rule we prefer to announce things beforehand, but it's not a blocker for minor changes that won't affect editing much. If things risk being confusing or create issues for editors, it's generally better to wait a week so we get the chance to announce it (next issue will be delivered on Monday).

Ok, I'll wait to deploy. I think it's better to do it before as well. Plan on next Thursday (April 8th) then.

This is now deployed. It does not make anything worse, because dates that cause CS1 errors still cause CS1 errors :).

@Trappist_the_monk any chance of unearthing the old CS1 code to make these more human readable? Here's an example DOI for testing: 10.1016/S0305-0491(98)00022-4

As always, this can be undone if we don't like it. If we do like it, we could consider doing this to all dates instead of returning dates in English (which is currently the case for some dates, since we don't do much validation.)

@Trappist_the_monk any chance of unearthing the old CS1 code to make these more human readable? Here's an example DOI for testing: 10.1016/S0305-0491(98)00022-4

Already done in cs1|2 sandbox; module suite update is scheduled for this weekend. See https://en.wikipedia.org/wiki/Help_talk:Citation_Style_1#edtf_date_formats_as_cs1|2_date_parameter_values_(2)

Are you sure? I tried that doi in the 2010 wikitext editor. The first time I got |date=1 May 1998. The second try (and subsequent tries) give me |date=undefined NaN.

JFYI: MediaWiki parser function #time does not support 2020-03-XX format (and would return Invalid time error from it). If any template uses it for formatting time in citations, they would have to remove the -XX manually (in template or by hand) to get it to work.

@Trappist_the_monk any chance of unearthing the old CS1 code to make these more human readable? Here's an example DOI for testing: 10.1016/S0305-0491(98)00022-4

Already done in cs1|2 sandbox; module suite update is scheduled for this weekend. See https://en.wikipedia.org/wiki/Help_talk:Citation_Style_1#edtf_date_formats_as_cs1|2_date_parameter_values_(2)

Great, thanks!

Are you sure? I tried that doi in the 2010 wikitext editor. The first time I got |date=1 May 1998. The second try (and subsequent tries) give me |date=undefined NaN.

Well, that's troubling. I'm not sure what wikitext 2010 editor uses for dois. (@kaldari might know more?)

This is from the api:

https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/10.1016%2FS0305-0491%2898%2900022-4

And I confirmed it works with VE.

JFYI: MediaWiki parser function #time does not support 2020-03-XX format (and would return Invalid time error from it). If any template uses it for formatting time in citations, they would have to remove the -XX manually (in template or by hand) to get it to work.

Interesting, does it work with 2020-03?

I'm facepalming now that I didn't test this in other language wikis more thoroughly first because I've just discovered German templates for example accept 2020-03 but not 2020-03-XX :/. I'm thinking the tech news announcement wasn't quite adequate @Whatamidoing-WMF - thoughts?

Interesting, does it work with 2020-03?

Yes. That’s why I noted about the potential for removal of -XX.

I'll try to fix the RefToolbar code (for the Wikitext 2010 editor) later today...

@Mvolz - Also here is my opinion on what the correct solution to the original problem is:

  • Have Citoid return all dates in regular ISO 8601 format, e.g. 2020-12 for December 2020.
  • Have VisualEditor convert the date into a standard JavaScript date object, e.g. new Date('2020-12'), which unlike CS1 does adhere to ISO 8601 AFAICT.
  • Have VisualEditor convert the JavaScript date object into a localized date for that wiki (which should be relatively easy since it has access to all the local MediaWiki JavaScript functions).
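
The three steps above could be sketched roughly as follows. This is only an illustration: `localizeYearMonth` is a hypothetical name, the standard `Intl.DateTimeFormat` stands in for whatever localisation functions MediaWiki provides, and in VisualEditor the locale would presumably come from the wiki's configuration rather than being passed in by hand.

```javascript
// Sketch only: hypothetical helper, using the standard Intl API in place of
// MediaWiki's own localisation functions.
function localizeYearMonth(isoYearMonth, locale) {
  // Only handle plain YYYY-MM; anything else is returned untouched.
  if (!/^\d{4}-(0[1-9]|1[0-2])$/.test(isoYearMonth)) {
    return isoYearMonth;
  }
  // Date-only ISO strings are parsed as UTC, so format in UTC as well
  // to avoid the month shifting in negative-offset time zones.
  const date = new Date(isoYearMonth);
  return new Intl.DateTimeFormat(locale, {
    year: 'numeric',
    month: 'long',
    timeZone: 'UTC',
  }).format(date);
}

console.log(localizeYearMonth('2020-12', 'en-US')); // December 2020
console.log(localizeYearMonth('2020-12', 'hu'));    // Hungarian month name and year
```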

I'm not sure why everyone in this Phabricator discussion is thinking that Citoid interacts directly with CS1 as that isn't true.

JFYI: MediaWiki parser function #time does not support 2020-03-XX format (and would return Invalid time error from it). If any template uses it for formatting time in citations, they would have to remove the -XX manually (in template or by hand) to get it to work.

The error is actually coming from the conversion into a standard JavaScript Date object, not MediaWiki parser function #time:

var DT = new Date(data.date);

The above code errors with 2020-03-XX, but is fine with 2020-03.
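
For illustration, a defensive version of that conversion might strip the unspecified day before handing the string to `Date`. This is a sketch under assumptions, not the actual RefToolbar fix; `parseCitoidDate` is a hypothetical name.

```javascript
// Sketch only: hypothetical helper, not the actual RefToolbar code.
function parseCitoidDate(value) {
  // Treat "YYYY-MM-XX" (unspecified day) like "YYYY-MM", which Date accepts.
  const m = /^(\d{4})-(0[1-9]|1[0-2])-XX$/.exec(value);
  const normalized = m ? `${m[1]}-${m[2]}` : value;
  const date = new Date(normalized);
  // Return null instead of an Invalid Date so callers can fall back to the raw string.
  return Number.isNaN(date.getTime()) ? null : date;
}

const DT = parseCitoidDate('2020-03-XX');
console.log(DT.getUTCFullYear(), DT.getUTCMonth() + 1); // 2020 3
```

Callers would still need to remember that the day of such a value is not meaningful, so it should not be printed.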

To be clear, I wasn’t talking anywhere about RefToolbar, just about the possibility that if some template code uses it for date formatting, those templates would break. So, my comment wasn’t related to Trappist the monk’s, I left it because of a tech news item.

@Mvolz - Also here is my opinion on what the correct solution to the original problem is:

  • Have VisualEditor convert the JavaScript date object into a localized date for that wiki (which should be relatively easy since it has access to all the local MediaWiki JavaScript functions).

I'm not sure why everyone in this Phabricator discussion is thinking that Citoid interacts directly with CS1 as that isn't true.

The Citoid extension treats all parameters completely agnostically, and relies entirely on template data. Template data does have a "time" type, but the template data here says it's a string, because validation for CS1 is different. It is possible to hard-code CS1 specific things into the extension but we haven't done anything like that, I think the intention was for it to be as flexible as possible (i.e. not ref toolbar) because that's fragile. Of course we are hard coding things into the citoid service, but this is in the goal of eventually standardising it. Right now it does not return a consistent format, a lot of the dates are in English and so forth, and that's because we hadn't settled on a standard.

In my ideal world en wiki would have accepted YYYY-MM as a valid date format like (apparently) everywhere else but we didn't see that happen over the course of 3 whole years so...

That said we could easily just roll this change back if the consensus on it changes.

The error is actually coming from the conversion into a standard JavaScript Date object, not MediaWiki parser function #time:

var DT = new Date(data.date);

The above code errors with 2020-03-XX, but is fine with 2020-03.

To be clear, I wasn’t talking anywhere about RefToolbar, just about the possibility that if some template code uses it for date formatting, those templates would break. So, my comment wasn’t related to Trappist the monk’s, I left it because of a tech news item.

In any case, we can't guarantee the date will be valid at all for this field in citoid. This is all scraped data, we try to clean up some common errors, but there will inevitably be invalid things in this field like "Summer 2010" and whatnot.

@stjn - Sorry, that makes sense!

I put in an edit request to fix RefToolbar on en.wiki: https://en.wikipedia.org/wiki/Wikipedia_talk:RefToolbar#Interface-protected_edit_request_on_19_April_2021. Since I don't work at the WMF any more I don't have permissions to make the change myself.

@Mvolz - Note that the fix linked to above will also need to be made on all wikis that use RefToolbar (there are at least 20). Or we can roll back https://gerrit.wikimedia.org/r/674692, which would be my preference.

@Mvolz - Also here is my opinion on what the correct solution to the original problem is:

  • Have Citoid return all dates in regular ISO 8601 format, e.g. 2020-12 for December 2020.

The reasons we have issue with it on en.WP are well-founded.

  • Have VisualEditor convert the date into a standard JavaScript date object, e.g. new Date('2020-12'), which unlike CS1 does adhere to ISO 8601 AFAICT.

CS1 also does because this -XX form is for unspecified day in the latest standard (see "Level 2", page 11 in the URL above), which even better matches the intent of Year-Month dates found in the wild that Citoid is pulling from, anyway. VisualEditor and/or Date should be updated to adhere to the new standard as well. (I think Date is out of ECMAScript so that's a different story I suppose, but local Javascript can probably hack around it until such time as ECMAScript is updated).

  • Have VisualEditor convert the JavaScript date object into a localized date for that wiki (which should be relatively easy since it has access to all the local MediaWiki JavaScript functions).

Reasonable later-solution.

I'm not sure why everyone in this Phabricator discussion is thinking that Citoid interacts directly with CS1 as that isn't true.

I... guess? That's like saying Wikipedia doesn't have a white background: it doesn't (pretty sure it's not #fff but I haven't checked lately), but no-one cares because the end effect is that some systems mostly-powered by Citoid do. Either Citoid does the work or the other systems do.

Summer 2010

This is directly valid in en.WP's CS1. It also has a representation in the new EDTF (which amusingly has the same issues as the ambiguous months). See page 14 in the link above.

On Kaldari's suggestion to substitute a localized date, I think that harms translation efforts. It might be nice to see 2021. április 19. in the wikitext for your article, but it is not nice to have to turn that into 19 April 2021 or 19 avril 2021 or 2021年4月19日 later.

I wonder whether this is as big of a problem as it seems to me personally. I checked a dozen pages in https://en.wikipedia.org/wiki/Special:RandomInCategory/CS1_errors:_dates and found none of these. There were several of the MM/DD/YYYY format, but nothing that looked like YYYY-MM.

I... guess? That's like saying Wikipedia doesn't have a white background: it doesn't (pretty sure it's not #fff but I haven't checked lately), but no-one cares because the end effect is that some systems mostly-powered by Citoid do. Either Citoid does the work or the other systems do.

Please see https://en.wikipedia.org/wiki/Separation_of_concerns. Citoid should not be trying to accommodate the peculiarities of every wiki. That is the job of VisualEditor and RefToolbar, which sit between Citoid and the citation templates at the local user interface level.

On Kaldari's suggestion to substitute a localized date, I think that harms translation efforts. It might be nice to see 2021. április 19. in the wikitext for your article, but it is not nice to have to turn that into 19 April 2021 or 19 avril 2021 or 2021年4月19日 later.

I wonder whether this is as big of a problem as it seems to me personally. I checked a dozen pages in https://en.wikipedia.org/wiki/Special:RandomInCategory/CS1_errors:_dates and found none of these. There were several of the MM/DD/YYYY format, but nothing that looked like YYYY-MM.

Well it does seem to happen: https://en.wikipedia.org/w/index.php?search=%22undefined+NaN%22&title=Special%3ASearch&go=Go&ns0=1

On Kaldari's suggestion to substitute a localized date, I think that harms translation efforts. It might be nice to see 2021. április 19. in the wikitext for your article, but it is not nice to have to turn that into 19 April 2021 or 19 avril 2021 or 2021年4月19日 later.

I would argue it might be more Anglo-centric, since in Russian Wikipedia, for example, English Wikipedia’s ‘enter whatever’ dates are frowned upon in favour of ISO format. At the same time, all of this could’ve been some local script for enWP that somehow reacts to Citoid citation insertions in VE and converts automatic dates of 2021-12 form into 2021-12-XX.

  • Have Citoid return all dates in regular ISO 8601 format, e.g. 2020-12 for December 2020.

The reasons we have issue with it on en.WP are well-founded.

@Izno - As I mentioned above, the Citoid service never interacts directly with English Wikipedia, so that's not a reason for Citoid to not use 2020-12. Accommodation of CS1 (and any potential localization) should happen at the VisualEditor/RefToolbar level, not at the service level.

CS1 also does because this -XX form is for unspecified day in the latest standard (see "Level 2", page 11 in the URL above), which even better matches the intent of Year-Month dates found in the wild that Citoid is pulling from, anyway. VisualEditor and/or Date should be updated to adhere to the new standard as well. (I think Date is out of ECMAScript so that's a different story I suppose, but local Javascript can probably hack around it until such time as ECMAScript is updated).

So you're suggesting we just ignore the fact that the JavaScript Date object doesn't support '-XX'? That's just asking for bugs.

  • Have VisualEditor convert the JavaScript date object into a localized date for that wiki (which should be relatively easy since it has access to all the local MediaWiki JavaScript functions).

Reasonable later-solution.

Then for the short-term, have VisualEditor add the '-XX', not Citoid.

@Mvolz - For the short term, what do you think about moving the '-XX' addition from the Citoid service to the Citoid extension (i.e. VisualEditor)? Otherwise, we need to fix RefToolbar on all the wikis ASAP per https://en.wikipedia.org/w/index.php?search=%22undefined+NaN%22&title=Special%3ASearch&go=Go&ns0=1 (which I can't do since I don't have permissions).

Please see https://en.wikipedia.org/wiki/Separation_of_concerns. Citoid should not be trying to accommodate the peculiarities of every wiki. That is the job of VisualEditor and RefToolbar, which sit between Citoid and the citation templates at the local user interface level.

The separation of concerns principle may not trump the interest of having a centralized sanitation of (arbitrary) citation data. Are you arguing that Citoid should not sanitize any data retrieved? That would seem to be much different than how the extension has functioned and subsequently extends to other input forms and data that Citoid manages.

I think that harms translation efforts...
I would argue it might be more Anglo-centric...

Indeed.

At the same time, all of this could’ve been some local script for enWP that somehow reacts to Citoid citation insertions in VE and converts automatic dates of 2021-12 form into 2021-12-XX.

VE is hard to interact with (as you particularly were moaning about on Discord today ;) ), never mind that you would need to convince local users that there shouldn't be the centralized sanitation mentioned above. As mentioned, we have at least one other tool that relies on Citoid returning certain data in a certain fashion. Fix the problem in one place, not two, and better yet, at the source that we can control (Citoid).

The reasons we have issue with it on en.WP are well-founded.

@Izno - As I mentioned above, the Citoid service never interacts directly with English Wikipedia, so that's not a reason for Citoid to not use 2020-12. Accommodation of CS1 (and any potential localization) should happen at the VisualEditor/RefToolbar level, not at the service level.

When I say "well-founded", that's "well-founded for everyone, not just en.WP". Simply because other wikis are fine with their dates possibly being imprecise shouldn't stop us from sanitizing that imprecision away at the earliest point possible. Which in this case is Citoid.

So you're suggesting we just ignore the fact that the JavaScript Date object doesn't support '-XX'? That's just asking for bugs.

No, actually. I am suggesting that the approach you personally have settled on is opinionated (which is fine) but seems to diminish or ignore how the other issues related to this are treated. I happen to agree that hacking around Date is bad, but I also know that Citoid shouldn't give us what we consider to be garbage (and which other wikis should too).

Then for the short-term, have VisualEditor add the '-XX', not Citoid.

Maybe even a reasonable long-term solution, but I have a suspicion that if I tagged this task for VE the task would be put immediately into the Freezer. (Which IMO is a discouraging column name to anyone who cares about any of the meaningful tasks in that column, even if it accurately reflects that no-one will work on it. Oh [not to be sarcastic], it's actually in the External and Administrivia board already - no more impressive on that point.)

Unfortunately, that still misses the point of having a centralized sanitation and which is now two software modules that would be impacted (one of which is onwiki and one of which is not, so you can't even control when the changes are made simultaneously much less the consistency of the changes).

If there's a correct argument to my line, it's about how the citation modules on some wikis either don't exist and so they're handcrafting a validation suite if at all in wikitext, OR they're so out of date as to have many other issues with how they format data. EN.WP has always led the way on this particular point at least for this module, but that doesn't mean the other wikis have followed. Which is arguably a different issue altogether (c.f. global modules which en.WP certainly wouldn't opt into from the repo and which would at best donate its reused work to). And even beyond that, you yourself recognize there are some 20 versions of RefToolbar.... It's not going to scale unless we take care of it at the source of the data.

It's a hard problem and at best no-one has really listed the pros and cons out to where and which systems do what regarding citations for this particular task (probably unfairly to Mvolz who seems to discover things the hard way) much less which should be doing what. EN.WP isn't going to change on the point, and not solely because we are curmudgeons who think the wiki world should be run our way. ;)

@Izno - I'm not arguing against Citoid doing sanitation of the date. I just think the Citoid service should sanitize it to YYYY-MM, not YYYY-MM-XX for three reasons:

  • EDTF is too new and I don't know a single browser that supports it yet (via JS).
  • All 20 versions of RefToolbar handle YYYY-MM fine. YYYY-MM-XX breaks them.
  • As stjn mentioned, the #time parser function also doesn't support YYYY-MM-XX (although I'm not sure what that impacts).

While I understand that the idea of fixing the problem from the VisualEditor side (either with '-XX' or full internationalization) may feel like throwing a request into a black hole, I'm pretty confident that @Mvolz could handle it, and I'm curious to get her opinion on that idea.

It is the clear and explicit consensus of en.wiki editors not to allow YYYY-MM dates. Not in what is shown to readers, but also not in the source code, and not in the allowed parameters of citation templates. So whether you think YYYY-MM is better or not is irrelevant: if Citoid did this on en.wiki it would be in error and in violation of the local consensus, and should not be used.

This argument is rather insignificant: if Citoid violates some local consensus, this is something to be fixed locally (provided tools to fix locally if necessary, like some sort of JavaScript hook to modify the data before Citoid sends it into VE interface, some modifications made to RefToolbar etc.).

I went ahead and checked that, for example, Russian Template:Cite web does not support 2008-03-XX format due to relying on #time (but handles the error gracefully by showing the raw string). I don’t doubt the same issue exists in other wikis.

It is the clear and explicit consensus of en.wiki editors not to allow YYYY-MM dates. Not in what is shown to readers, but also not in the source code, and not in the allowed parameters of citation templates. So whether you think YYYY-MM is better or not is irrelevant: if Citoid did this on en.wiki it would be in error and in violation of the local consensus, and should not be used.

@DavidEppstein - The Citoid service does not determine what is shown to readers, nor what is in the source code, nor what is put in the citation templates. That is determined by VisualEditor and RefToolbar. I completely agree with you that we should not violate the local consensus, nor should we violate the MoS, nor should we insert data that is incompatible with CS1. You are arguing against a straw-man.

As a data service, the Citoid service should provide the date in a widely-recognized standardized format that can be utilized by both VisualEditor and RefToolbar (and whoever else consumes Citoid data). VisualEditor and RefToolbar should then translate that date into whatever is appropriate for English Wikipedia (and all the other wikis with their own unique requirements).
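A minimal sketch of the consumer-side translation described above, assuming the service hands over strings shaped like YYYY, YYYY-MM, YYYY-MM-XX or YYYY-MM-DD. The function name `to_enwiki_style` and the day-month-year output order are illustrative assumptions, not anything VisualEditor or RefToolbar actually implements.

```python
import re

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Year, optional month, optional day (digits or the XX placeholder).
_PARTIAL = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}|XX))?)?")


def to_enwiki_style(raw: str) -> str:
    """Render a partial date as e.g. 'May 2010' (hypothetical helper)."""
    match = _PARTIAL.fullmatch(raw)
    if match is None:
        return raw  # unknown shape: pass the string through untouched
    year, month, day = match.groups()
    if month is None:
        return year
    month_index = int(month)
    if not 1 <= month_index <= 12:
        return raw
    month_name = MONTHS[month_index - 1]
    if day is None or day == "XX":
        return f"{month_name} {year}"
    # The day is not range-checked here; a real tool would validate it.
    return f"{int(day)} {month_name} {year}"
```

Each wiki's tooling could ship its own version of such a function, so the service itself never needs to know about local style rules.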

I speculate that the correct move here is to have Citoid provide the most-detailed spec-compliant date format it can, regardless of local wiki support, and then have the per-wiki templatedata maps updated with some kind of annotation of how the date needs to be transformed to comply, which any tools using the data can then easily pick up on.

I don't think that structure currently supports that, but it looks like a fairly logical extension -- the existing "date": "date" could become something like "date": "parserFunctionName(date)"? (I'm suggesting this precise form because it'd be easily extended to any other local transformation that's required, without needing any more tool-specific changes.)
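To make the suggestion concrete, here is a rough sketch of how a tool might read such an annotated map. Everything in it is hypothetical: the `TRANSFORMS` registry, the `yearOnly` transform, and the reading that the map key is the Citoid field while the value names the template parameter, optionally wrapped in a transform name.

```python
import re
from typing import Callable, Dict

# Hypothetical registry of named transforms a wiki could refer to.
TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "identity": lambda value: value,
    "yearOnly": lambda value: value[:4],
}

# Matches the proposed annotation form "transformName(param)".
_CALL = re.compile(r"(\w+)\((\w+)\)")


def apply_map(citoid_data: dict, field_map: dict) -> dict:
    """Build template parameters from Citoid fields using an annotated map."""
    params = {}
    for citoid_field, spec in field_map.items():
        call = _CALL.fullmatch(spec)
        if call:
            transform_name, param = call.groups()
        else:
            transform_name, param = "identity", spec
        if citoid_field in citoid_data:
            # An unknown transform name raises KeyError in this sketch.
            transform = TRANSFORMS[transform_name]
            params[param] = transform(citoid_data[citoid_field])
    return params
```

Entries without an annotation behave exactly as they do today, so existing maps would keep working unchanged under this reading.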

Given EDTF was raised above, I thought I'd note that there is now a EDTF Wikibase data type extension, although it might be a while before it's considered for use in Wikidata.

As for use in Citoid: if Date is an issue, perhaps don't use the regular Date directly? There are JS libraries around for EDTF and it might be possible to have the Date as a private member of a class that is used instead.
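The wrapper idea might look roughly like this. It is sketched in Python with `datetime.date` standing in for the JS Date, and the class name `CitationDate` is made up for illustration; it is not how Citoid or any EDTF library is actually structured.

```python
from datetime import date
from typing import Optional


class CitationDate:
    """A date that remembers how precise it is (illustrative only)."""

    def __init__(self, year: int, month: Optional[int] = None,
                 day: Optional[int] = None):
        if day is not None and month is None:
            raise ValueError("a day needs a month")
        # Private full date: missing parts default to 1 so that the
        # standard type can still validate the parts that were given.
        self._date = date(year, month or 1, day or 1)
        self._has_month = month is not None
        self._has_day = day is not None

    def __str__(self) -> str:
        # Only the parts that were actually supplied are written out.
        if self._has_day:
            return self._date.isoformat()
        if self._has_month:
            return f"{self._date.year:04d}-{self._date.month:02d}"
        return f"{self._date.year:04d}"
```

Callers never see the padded day or month, so a value that was only known to the month cannot silently turn into "the first of the month".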

Of course it might open up issues (but also opportunities) if you want to use the more advanced features of EDTF to more accurately represent the knowledge around dates associated with citations.

Of course, I understand everything, but why are the problems of only English Wikipedia being solved at the global level, again? Breaking down everything around the world.

We are an international community of developers and participants based on world standards. We need to strive for one format, not increase them. For a long time, everyone around us used ISO, now suddenly we are switching to EDTF. Which supports neither JS nor PHP.

Nobody has ever used XX; now one of the many dozens of tools uses it. And why suddenly XX? Why not force English Wikipedia to simply convert to the output format they need?

If you want to globally accept the use of EDTF and XX as a practice, please accept it globally, otherwise it is impossible to parse the dates normally, you have to do a lot of checks. I think in projects we need to configure bots that would clean up the garbage that Citoid will now add.

In general, as always, a very bad decision.

Damn guys, really. I'm doing some work here in my community to reduce the used technical dates to one kind YYYY-MM-DD: to drop 12.12.12, 12 July 2020, Jule 12, 2020, 20-12-2000, 12/12/12 etc and you throw that in here. I am very angry.

Moreover, the parser function (https://www.mediawiki.org/wiki/Help:Extension:ParserFunctions#.23time and https://www.mediawiki.org/wiki/Extension:Scribunto/Lua_reference_manual#mw.language:formatDate) does not support this format. Are you seriously introducing a standard that even MediaWiki cannot handle? I have almost all dates in templates in projects go through this parser function; should I start displaying bare numbers to users, or write a new wrapper around the parser function myself?

The Python community hasn't heard of EDTF either (see Stack Overflow, docs.python.org, and the supported format codes). That community is about as large as, if not larger than, those of JavaScript and PHP (the language MediaWiki is written in). Pywikibot is written in Python.
So we can be fairly sure that mainstream programming languages will not adopt the standard in the coming years.
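For illustration, this is roughly what a bot author runs into with the Python standard library: `datetime.strptime` with the usual `%Y-%m-%d` format codes rejects a value such as 2008-03-XX. The helper names below are hypothetical, and the raw-string fallback mirrors the graceful handling reported earlier for the Russian template.

```python
from datetime import datetime
from typing import Optional


def parse_strict(raw: str) -> Optional[datetime]:
    """Return a datetime for a full YYYY-MM-DD string, else None."""
    try:
        return datetime.strptime(raw, "%Y-%m-%d")
    except ValueError:
        # Raised for anything the format codes cannot match, XX included.
        return None


def display(raw: str) -> str:
    """Format full dates as DD.MM.YYYY; show anything else unchanged."""
    parsed = parse_strict(raw)
    if parsed is None:
        return raw
    return parsed.strftime("%d.%m.%Y")
```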

But if it is going to be used on the wikis as an alternative to Wikidata qualifiers, or to local parameter values like those in the English Template:Cite web#Date, I agree with this comment. Although the corresponding proposal T207705 is already 3 years old ...

...I have almost all dates in templates in projects go through this parser function...

I don't know what kinds of templates Iniquity deals with, or what kind of dates Iniquity's templates deal with. But #time can only be used with Gregorian calendar dates. An example of an English language template where #time ought not be used is Template:Infobox royalty. This thread has mostly been concerned with Citoid, and anything online has a Gregorian calendar publication date.† But that reasoning does not apply to other dates.

† If a publication was published on paper long ago, one could give the publication date of the online version, and use the orig-date parameter for the publication date of the paper version. Whether Citoid could distinguish between the online and paper publication dates in some random website is another matter.

hi y'all – after talking with @DLynch , @Mvolz , and @Whatamidoing-WMF about this and understanding the range of considerations you (plural) have helpfully surfaced in this ticket, we're going to revert change 674692 for now.

Below you will find:

  1. The impact of reverting change 674692
  2. The thinking that led to reverting change 674692
  3. What we see as the lessons we've collectively learned through this latest conversation that ought to inform any future work on this ticket

Although, before that, we'd like to acknowledge people who have shared information to help us more deeply understand the situation and made attempts to cope with change 674692:


1. Impact of reverting change 674692
Some, but not all, partial dates will be returned in YYYY-MM format when using citoid, as they did prior to this change. For partial dates to comply with CS1 at enwiki, individual users will need to manually edit the date.

2. Motivation for reverting change 674692

  • Change 674692 made it clear to us that a workable solution requires additional work.
  • Projects that depend on MediaWiki's #time parser (e.g. German) would not accept the date format change 674692 introduced (YYYY-MM-XX).
  • RefToolBars need to be fixed to handle dates which new Date() doesn’t parse correctly.
    • Note: Marielle Volz has proposed a fix for this particular issue on Wikipedia talk:RefToolbar here.
  • EDTF Level 1 format is not widely accepted

3. New information that surfaced in making and discussing change 674692
Volunteers from a wide range of projects need to reach a consensus on the standardized date format projects across the movement will accept before a singular format can be encoded in Citoid.

Change 688304 had a related patch set uploaded (by Mvolz; author: Mvolz):

[mediawiki/services/citoid@master] Revert "Change ambiguous days to XX"

https://gerrit.wikimedia.org/r/688304

Thank you very much for rolling back the change, forgive me for flaring up above. I am working on the standardization of dates in a project, and your change was at a very wrong time.

I want to open RfC about dates and time on meta (https://meta.wikimedia.org/wiki/Requests_for_comment), do you think this platform will be successful for the issue of standardization of dates?

... I want to open RfC about dates and time on meta (https://meta.wikimedia.org/wiki/Requests_for_comment), do you think this platform will be successful for the issue of standardization of dates?

If someone starts an RFC, I hope on the very first post they will make it clear the scope of the RFC. Just Citoid? Just citations? What calendars are supported? What is the earliest allowable date? What is the greatest allowable date?

I thought to raise the following questions:

  1. Is there a need for a general document regulating the use of dates for technical needs in an international project with 1000 subprojects?
  2. If we need it, which of the standards should we use: ISO, EDTF or something else (a Wikimedia date format?). I ask because I don't like ISO either, since it's paid and not 'open source'.
  3. What else should be regulated in the document besides the recording of dates and time.

It seems very unlikely to me that a developer-initiated proposal on meta to potentially override a local consensus of en on article content (how to format dates) will generate light rather than heat.

Of course, I understand everything, but why are the problems of only English Wikipedia being solved at the global level, again?

The issue of how to represent uncertain or limited-precision dates is not a "problem of only English Wikipedia".

We are an international community of developers and participants based on world standards.

EDTF was developed also by "an international community of developers and participants based on world standards." [Disclosure: I played a part in that community not least expressing the needs, as I saw them, of various Wikimedia projects]

We need to strive for one format, not increase them. For a long time, everyone around us used ISO, now suddenly we are switching to EDTF.

Much of EDTF (AFAICT, all of the parts it is intended to use, in this proposal) are part of ISO 8601-2:2019

I'm doing some work here in my community to reduce the used technical dates to one kind YYYY-MM-DD

How will you use uncertain or limited-precision dates, for which real-world usecases have been demonstrated?

I tend to agree regarding the RFC. Best to work on adapting technology to editors rather than vice versa. We provide options so that projects can pick out what works for them, rather than trying to shoehorn them into our chosen option.

We are an international community of developers and participants based on world standards. We need to strive for one format, not increase them. For a long time, everyone around us used ISO, now suddenly we are switching to EDTF. Which supports neither JS nor PHP.

It's not reasonable to expect PHP and JS to support EDTF directly, especially as libraries in those languages exist; but I agree support should be present at a low level in MediaWiki before it's used heavily within the community.

It'd also be nice if there were rich methods of viewing and entering dates. EDTF's not intended as a display format so much as a common textual data format. Editors might see a few instances of 2012-01-XX, or 2012-01?, or [2000, 2012], or ../1985-04-12 - but they probably shouldn't be displayed that way to readers.

The library the EDTF extension is based on has a "humaniser" for that purpose using TranslateWiki - though I'm sure it could be improved - but as of yet there's no entry form that I'm aware of. I hope technical resources are deployed to make it easier to enter and read rich dates, as potential uses range far beyond citations.

Regarding "world standards": while EDTF is not (just) an ISO Date[Time] that PHP/ECMAScript/JavaScript currently understand, it is within ISO 8601-2:2019 as a profile, after a long standardisation process resulting in changes from the original proposal (e.g. reducing the use of letters representing English words).

It's not done to be different, or to get people to pay for a standard. It resolves issues about how to specify intervals, or uncertainty, or sets of dates, or to clarify "a month" (2012-01) vs. "one unspecified day within a month" (2012-01-XX ) vs. "approximately Spring in the southern hemisphere" (2012-29~) in a standard, concise way - reducing ambiguity and language/culture-specific notation.
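A small sketch of how those three notations can be told apart mechanically. It covers only a hand-picked subset of the syntax, the labels are informal, and treating codes 21 to 41 in the month position as a "sub-year grouping" (seasons and similar) is a simplification; it is not a substitute for a real EDTF parser.

```python
import re

# Checked in order; each pattern must match the whole string.
_PATTERNS = [
    ("unspecified day within a month", re.compile(r"\d{4}-(?:0[1-9]|1[0-2])-XX")),
    ("day", re.compile(r"\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])")),
    ("month", re.compile(r"\d{4}-(?:0[1-9]|1[0-2])")),
    ("sub-year grouping", re.compile(r"\d{4}-(?:2[1-9]|3\d|4[01])")),
    ("year", re.compile(r"\d{4}")),
]


def describe(value: str) -> str:
    """Label a date string from a small subset of EDTF notations."""
    approximate = value.endswith("~")  # trailing tilde marks "approximate"
    core = value[:-1] if approximate else value
    for label, pattern in _PATTERNS:
        if pattern.fullmatch(core):
            return f"approximate {label}" if approximate else label
    return "unrecognised"
```

The point of the distinction is that "2012-01" and "2012-01-XX" land in different branches, so a consumer can decide separately how to display each.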

It's a far more flexible data type that could alleviate many of the issues raised early on in this ticket - if used internally with proper mapping. It's being adopted by institutions that Wikimedia projects and affiliates work with; that's why the extension I mentioned was made, and why it's being pushed towards an official release (T280656).

...Damn guys, really. I'm doing some work here in my community to reduce the used technical dates to one kind YYYY-MM-DD: to drop 12.12.12, 12 July 2020, Jule 12, 2020, 20-12-2000, 12/12/12 etc and you throw that in here. I am very angry.

At Wed, May 12, 07:48 UT Pigsonthewing saw fit to comment on Iniquity's comment.

I have at least two issues with Iniquity's comment.

  1. This thread is about internationalizing Citoid dates. General reduction of the number of different date formats is much broader, and should probably be addressed in a different forum.
  2. Iniquity does not define "technical date", and defining the scope of the discussion is critical. For example, 29 February 1700 was observed in London but cannot be represented in EDTF or ISO 8601.

The issue of how to represent uncertain or limited-precision dates is not a "problem of only English Wikipedia".

I agree that this is not only a problem of the English Wikipedia, but it was solved only for it, the rest of the communities had to fix what was broken. This is not done in normal communities. First, the base is made, then the standard is changed.

EDTF was developed also by "an international community of developers and participants based on world standards." [Disclosure: I played a part in that community not least expressing the needs, as I saw them, of various Wikimedia projects]

I don't argue with that. My proposal only meant that it should be the same for everyone.

Much of EDTF (AFAICT, all of the parts it is intended to use, in this proposal) are part of ISO 8601-2:2019

Yes, I know that, but they are still different formats and names. There must be unity.

How will you use uncertain or limited-precision dates, for which real-world usecases have been demonstrated?

Initially I wanted to use additional parameters, but now I see that more work needs to be done to implement the EDTF format, or the same format for uncertain or limited-precision dates.

  1. This thread is about internationalizing Citoid dates. General reduction of the number of different date formats is much broader, and should probably be addressed in a different forum.
  2. Iniquity does not define "technical date", and defining the scope of the discussion is critical. For example, 29 February 1700 was observed in London but cannot be represented in EDTF or ISO 8601.

Dates from Citoid are inserted into a huge number of sources; users look at their format and try to copy it to other places. This is a completely new mechanic that is being introduced en masse. These formats are also needed in the dates of birth and death, or in the dates of the work. I can't tell users: "Oh sorry, it works for us here, but not there for some unknown reason. How can we do the same? We don’t know. MediaWiki does not support this template." Code first, tools second. We can't do the opposite.

...These formats are also needed in the dates of birth and death, or in the dates of the work...

EDTF and ISO 8601 simply don't work for birth and death dates. This is because such dates, in historical works and primary sources written near the time of the event, were written in the calendar that was in force on the date, and at the location, of the birth or death. It is often extremely laborious to figure out whether these dates were in the Julian or Gregorian calendar. In the case of Julius Caesar, for example, among historians there is an uncertainty of a few days about how the date would be converted into the Gregorian calendar, even though it is known to be 15 March 44 BC in the Julian calendar.

In the case of Julius Caesar, for example, among historians there is an uncertainty of a few days

Which is precisely why we need a method of marking-up dates as imprecise, and where applicable as imprecise within defined limits.

There is a distinction between a date being imprecise in all calendars, vs. being a precisely known day in some non-Gregorian calendar where the conversion to Gregorian is either uncertain or too difficult for the typical Wikipedia editor.

Will the date format have a marker to indicate which calendar is used, as a kind of wiki extension to the EDTF format? For example, Julian date "J44-03-15".
Or will we need to keep specifying the calendar separately, for example as a string description in the text next to the date, an additional parameter in templates, or a switch / qualifier in the Wikidata properties?

Change 688304 merged by jenkins-bot:

[mediawiki/services/citoid@master] Revert "Change ambiguous days to XX"

https://gerrit.wikimedia.org/r/688304

Will the date format have a marker to indicate which calendar is used, as a kind of wiki extension to the EDTF format? For example, Julian date "J44-03-15".
Or will we need to keep specifying the calendar separately, for example as a string description in the text next to the date, an additional parameter in templates, or a switch / qualifier in the Wikidata properties?

That's essentially what the Wikidata date type does; it stores the calendar as a separate URL; current choices are Julian calendar or Gregorian calendar; either is proleptic if necessary. Proleptic means starting at some fairly modern date where it is certain that society was using the calendar correctly, and applying the rules to work backwards in time to the date of interest. It's known that the Romans did not follow the leap year rules correctly before 1 March AD 8, so before then, it's unknown how well the proleptic Julian calendar agrees with the calendar as it was actually observed by the people of Rome.
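A loose sketch of keeping the calendar next to the date rather than inside the date string. The class and function names are hypothetical, the two URIs are, as far as I know, the Wikidata items for the proleptic Gregorian and proleptic Julian calendars, and the leap-year rules are the proleptic ones, valid here for AD years only.

```python
from dataclasses import dataclass

# Calendar identifiers, stored separately from the date parts.
GREGORIAN = "http://www.wikidata.org/entity/Q1985727"
JULIAN = "http://www.wikidata.org/entity/Q1985786"


@dataclass(frozen=True)
class CalendarDate:
    year: int      # AD years only in this sketch
    month: int
    day: int
    calendar: str  # GREGORIAN or JULIAN


def is_leap(year: int, calendar: str) -> bool:
    if calendar == JULIAN:
        return year % 4 == 0
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def is_valid(d: CalendarDate) -> bool:
    """Check the day against the month lengths of the stated calendar."""
    feb = 29 if is_leap(d.year, d.calendar) else 28
    lengths = [31, feb, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    return 1 <= d.month <= 12 and 1 <= d.day <= lengths[d.month - 1]
```

With the calendar carried alongside, 29 February 1700 is a valid value when tagged as Julian and an invalid one when tagged as Gregorian, which is the London case raised earlier in the thread.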

A "Julian date" is something else again. For example, noon today, UT (13 May 2021) is Julian date 2459348. The Julian date for noon UT, 15 May AD 44 is 1737266.
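The two figures can be reproduced with a short sketch. It assumes the AD 44 example is read as a proleptic Gregorian date (that reading is an assumption, but it gives the quoted number), and it uses the common integer formula for Julian calendar dates with astronomical year numbering, where 44 BC is year -43. The function names are illustrative.

```python
from datetime import date


def gregorian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian day number (the Julian date at noon UT) of a Gregorian date.

    date.toordinal() counts days in the proleptic Gregorian calendar,
    with 1 January AD 1 as day 1, so a fixed offset is enough.
    """
    return date(year, month, day).toordinal() + 1721425


def julian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian day number of a proleptic Julian calendar date."""
    a = (14 - month) // 12       # 1 for January and February, else 0
    y = year + 4800 - a          # shift so the year count stays positive
    m = month + 12 * a - 3       # March becomes month 0
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


print(gregorian_to_jdn(2021, 5, 13))  # 2459348
print(gregorian_to_jdn(44, 5, 15))    # 1737266
print(julian_to_jdn(44, 5, 17))       # 1737266
```

The last two lines show the same day number reached from two different calendar labels, two days apart, which is why a bare "44-05-15" is ambiguous without a calendar indicator.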

Sure, calendar algorithms, like Julian day as well as Unix time are beyond the scope of ISO.
It seems to be correct to keep the date as given in the printed source (let it be Julian "44-03-15" BC), without the Gregorian conversion variations. E. g. Wikidata birthdays: d:Q1048, d:Q5592. Will the date format have a code (like "J") about the calendar being used? Current #time supports codes for multiple calendars, excluding Julian.