
Internationalise citoid dates
Open, Stalled, High, Public, 1 Story Points

Description

  • Put out years in year only format (i.e. YYYY)
  • Put out all dates in a readable format (i.e. May 2010) in the date field to address the polluted data issue ASAP.

This is a possible way forward for internationalising dates:

  • Translate dates on our end.

OR

Note: Discussion is also happening on ENWP here: https://en.wikipedia.org/wiki/Help_talk:Citation_Style_1#ISBNs_in_mw:Citoid

Event Timeline


Note that the newly merged T165116 is about ISBN data, not DOI data. Citoid is making up fake months and dates for ISBNs which only have years available. As with the DOI, it should not be generating false data and passing it off as accurate.

Mvolz added subscribers: mobrovac, Mvolz. Edited May 12 2017, 3:40 PM

Hmm, same, I seem to remember another discussion about this elsewhere but also unable to find it.

Anyway, I think the issue is we never decided exactly what to do about it.

We chose to turn all date fields into ISO because it is a standard; when we didn't validate dates, we would get things that CS1 wouldn't accept. Plus they were largely in English, making it an i18n issue.

There's not a great standard for returning partial dates in a single field. We could do:

Partial ISO (which I have seen in the wild in other places) i.e. 2007-05 for May 2007. (Causes CS1 error)

Return each component in separate fields, i.e.

year: 2007
month: 05
day: 12

For books, either partial ISO (which is just the year anyway) or returning it in a separate year field both would work pretty well.

For journals, returning it in separate parts is not ideal because we don't have any easy way to use template data to combine those into a single field. Plus most templates don't have separate month/day fields so we're left concatenating digits. If we do {date: [month, year]} it will translate into "05, 2007" in the field, which causes CS1 errors and is not human readable.

Basically, all of the options for journal that I've laid out don't work in CS1, aren't human readable, or don't work in other languages because they're in English. This is why we've steadfastly stuck to ISO thus far.
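To make the two representations discussed above concrete, here is a minimal sketch. The `PartialDate` class and its method names are hypothetical illustrations, not part of citoid; it only shows how one partial date could be emitted either as truncated ISO or as separate fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartialDate:
    """Hypothetical container for a date whose month or day may be unknown."""
    year: int
    month: Optional[int] = None
    day: Optional[int] = None

    def to_partial_iso(self) -> str:
        """Truncated ISO-style string using only the known components."""
        if self.month is None:
            # A day without a month is meaningless, so it is ignored here.
            return f"{self.year:04d}"
        if self.day is None:
            return f"{self.year:04d}-{self.month:02d}"
        return f"{self.year:04d}-{self.month:02d}-{self.day:02d}"

    def to_fields(self) -> dict:
        """Separate fields, omitting any component that is unknown."""
        fields = {"year": f"{self.year:04d}"}
        if self.month is not None:
            fields["month"] = f"{self.month:02d}"
            if self.day is not None:
                fields["day"] = f"{self.day:02d}"
        return fields
```

Neither output invents a day or month; whether a given template accepts either form is a separate question.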

For a first step, to deal with the book issue, I'd suggest either having a separate year field, or allowing date field to do partial ISO but only for years right now, not the Month Year option, since that would cause CS1 errors. But the choice might depend on what we eventually do about the journal issue. @mobrovac thoughts?

The best way might be to get templates to be able to deal with ambiguous dates; i.e. allow separate fields for each date component in the template itself, or to be able to accept partial ISO. With templates currently doing neither, it's hard to decide which is the best path to take. 2007-05 doesn't look human readable, but a template could display this as "May 2007". But since this is community maintained code I don't really have any control over this :). What do you think @Whatamidoing-WMF?

Mvolz renamed this task from Citoid should not assume published on 1st of a month if DOI only gives "Month and Year". to Figure out how to deal with incomplete dates, i.e. year only or year and month only. May 12 2017, 3:47 PM
Mvolz updated the task description.

I note that citoid is generating fake dates for journal publications that have non-fake dates whose format is beyond what can be represented as YYYY-MM-DD. For instance doi:10.13110/discourse.37.1-2.0003 has an actual date of "Winter/Spring 2015", but when I do 'curl -LH "Accept: application/x-bibtex" http://dx.doi.org/10.13110/discourse.37.1-2.0003' I only get back "year = 2015" and of course citoid turns that into the false 2015-01-01. You can't do anything here about sources that don't give you the full data, of course, but that could help explain where some of these dateless publication dates are coming from.

Re: the "05, 2007" example above, I do not understand why it is a big deal to convert "May" into "Maggio" if you are on the Italian Wikipedia. It's a straight one-to-one conversion for most languages. You must have a table somewhere. BattyBot task 25 uses a set of regexes on en.WP to do date conversions for many languages; the code is right there, ready for reuse.

Have a radio button in the tool asking editors to choose MDY or DMY, then apply the appropriate conversion, leaving out the day and the month if they are not available in the source. This seems like a pretty straightforward piece of code to write. What am I missing?
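A rough sketch of what this suggestion might look like follows. The `MONTHS` table (only English and Italian here) and the `render_date` function are assumptions made for illustration; they are not taken from citoid or from BattyBot.

```python
from typing import Optional

# Hypothetical month-name table; a real one would cover every supported language.
MONTHS = {
    "en": ["January", "February", "March", "April", "May", "June", "July",
           "August", "September", "October", "November", "December"],
    "it": ["gennaio", "febbraio", "marzo", "aprile", "maggio", "giugno", "luglio",
           "agosto", "settembre", "ottobre", "novembre", "dicembre"],
}


def render_date(year: int, month: Optional[int] = None, day: Optional[int] = None,
                lang: str = "en", style: str = "dmy") -> str:
    """Render only the components the source supplied, in the editor's chosen style."""
    if month is None:
        return str(year)                  # year-only source: no month or day invented
    name = MONTHS[lang][month - 1]
    if day is None:
        return f"{name} {year}"           # month and year only
    if style == "mdy":
        return f"{name} {day}, {year}"    # e.g. May 12, 2007
    return f"{day} {name} {year}"         # e.g. 12 May 2007
```

The `style` argument stands in for the radio button. A plain table lookup like this ignores languages whose month names inflect or whose word order differs, so it is only a starting point.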

TheDJ added a subscriber: TheDJ. Edited May 12 2017, 9:24 PM

While this is being discussed, I suggest disabling the ISBN feature for at least en.wp however. Inserting incorrect information is worse than inserting nothing.

BTW, I note that 'just the year' is a valid ISO date. And '2017-02', for year and month, is also a valid ISO date. They are, however, not recognized by the CS1 Lua modules atm. That can easily be done with just two more regexps in its validator (because there are no OR groups in Lua regexes). Whether or not people would want that in the wikicode is another discussion.
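As an illustration of the "one pattern per form" idea, here is a sketch in Python rather than Lua. The function name `iso_precision` is hypothetical, and this is not the CS1 module's actual validator.

```python
import re

# One pattern per accepted form, mirroring the constraint that Lua patterns
# have no alternation operator.
YEAR = re.compile(r"\d{4}")
YEAR_MONTH = re.compile(r"\d{4}-(\d{2})")
FULL_DATE = re.compile(r"\d{4}-(\d{2})-(\d{2})")


def iso_precision(value: str):
    """Return 'year', 'month' or 'day' for an accepted form, else None."""
    if YEAR.fullmatch(value):
        return "year"
    m = YEAR_MONTH.fullmatch(value)
    if m and 1 <= int(m.group(1)) <= 12:
        return "month"
    m = FULL_DATE.fullmatch(value)
    # The day check is deliberately loose: it does not know month lengths.
    if m and 1 <= int(m.group(1)) <= 12 and 1 <= int(m.group(2)) <= 31:
        return "day"
    return None
```

This only recognises the shape of the value; it says nothing about whether a wiki's style guide permits displaying it.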

Ocaasi_WMF updated the task description. May 12 2017, 10:06 PM

Regardless of whether 2017-02 is valid as an ISO date, it is not valid for use on Wikipedia. https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Dates_and_numbers says explicitly "Do not use these formats." So the only acceptable solution for month-year date formats appears to be to bite the bullet and internationalize the month names.

Josve05a added a comment. Edited May 12 2017, 10:24 PM

Regardless of whether 2017-02 is valid as an ISO date, it is not valid for use on Wikipedia.

Correction: English* Wikipedia.

So the only acceptable solution for month-year date formats appears to be to bite the bullet and internationalize the month names.

Wikidata already does this. Put 2005-11 into any date field (such as Property:date of birth) and it will automatically convert it (as +2005-11-00T00:00:00Z in the backend, and) to language-i18n dates. 2005-11 will output November 2005 in English, so the table of month names in different languages already exists in the WMF-sphere; just harvest some code from their codebase.

Mvolz added a subscriber: Jdforrester-WMF. Edited May 12 2017, 11:09 PM

Regardless of whether 2017-02 is valid as an ISO date, it is not valid for use on Wikipedia. https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Dates_and_numbers says explicitly "Do not use these formats." So the only acceptable solution for month-year date formats appears to be to bite the bullet and internationalize the month names.

Citation templates can theoretically produce the correct style from these types of data. But it would require a lot of extra work from the template.

Wikidata already does this. Put 2005-11 into any date field (such as Property:date of birth) and it will automatically convert it (as +2005-11-00T00:00:00Z in the backend, and) to language-i18n dates. 2005-11 will output November 2005 in English, so the table of month names in different languages already exists in the WMF-sphere; just harvest some code from their codebase.

Thanks, that is helpful, I had also considered zeroing out the fields in ISO too; I'm not sure which is more preferred but if wikidata is doing that it seems a good idea to follow suit. The truncated ISO just looks nicer for year only dates.

That said, Citoid is a microservice written in node.js; we don't have access to anything in mediawiki except via the API :). So there's nothing we can really reuse. We do know the requested language from mediawiki so we could attempt to do our own localisation.

I had a look around and https://github.com/abritinthebay/datejs seems like it does internationalisation if that is the preferred option; but I haven't tried it so I'm not sure how good it will be. We also discuss internationalisation here for punctuation: T161963.

That said, I really prefer returning a standard format like +2005-11-00T00:00:00Z or 2005-11-00 and having the template style it to be human readable, since localisation is after all the whole point of localised templates. A local template will be able to do a better job of making these dates the right style for the wiki, better than a random nodejs library will do anyway. But it does require more effort from community maintainers and that might be insurmountable.
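For illustration, a consumer of the zeroed-out form could recover the known components along these lines. The function `parse_zeroed` is a hypothetical sketch, and the zero month/day convention it reads is a local convention rather than something ISO 8601 defines.

```python
import re
from typing import Optional, Tuple

# Accepts 2005-11-00 as well as the longer +2005-11-00T00:00:00Z spelling.
ZEROED = re.compile(r"\+?(\d{4})-(\d{2})-(\d{2})(?:T00:00:00Z)?")


def parse_zeroed(value: str) -> Optional[Tuple[int, Optional[int], Optional[int]]]:
    """Return (year, month, day), with None for components given as 00."""
    m = ZEROED.fullmatch(value)
    if not m:
        return None
    year, month, day = (int(g) for g in m.groups())
    if month == 0:
        # A known day with an unknown month is treated as malformed.
        return (year, None, None) if day == 0 else None
    if not 1 <= month <= 12 or day > 31:
        return None
    return (year, month, None if day == 0 else day)
```

A template could then decide for itself how to display `(2005, 11, None)`, for example as "November 2005".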

This is again a community issue maybe @Whatamidoing-WMF and @Jdforrester-WMF have insight into. I'm not sure if CS1 maintainers would be willing to accept these kinds of changes if we wrote them?

Mvolz raised the priority of this task from Low to High. May 12 2017, 11:09 PM
This comment was removed by Mvolz.

Come on over to https://en.wikipedia.org/wiki/Help_talk:Citation_Style_1 and start a conversation. It can work: @Whatamidoing-WMF has been consistent in engaging with en.WP about Tidy going away, coming back with updates, and that has built some trust, at least within a small community of gnomes. Efforts like that will go a long way toward improving the relationship between en.WP and WMF.

If you come over and explain what you would like to do and make a reasoned case for your proposed method, that will work a lot better than deploying a tool that has a known bug that delivers incorrect data to citations.

Re: "A local template will be able to do a better job of making these dates the right style for the wiki, better than a random nodejs library will do anyway. ": This is not actually true, because (at least on English Wikipedia) multiple date styles are allowed, and "the right style" is the style that is consistent with what has already been used for the other references in the article. The local templates do not generally have this information, any more than citoid does (because it is outside the template parameters and often not recorded with the special "use dmy dates" templates). So the way to make the dates the right style is to ask the user which style to use before inserting them, rather than passing the job off to some other software that doesn't have any better idea than citoid what to do. Jonesey's suggestion above of a radio button would work.

Jc3s5h added a comment. Edited May 13 2017, 12:02 AM

The WMF should provide any of its employees working on this an official copy of ISO 8601 and they should be required to read it. Among other things, they would find that the only correct way to represent the current year is "2017". "2017-00" or "2017-00-00" are just wrong. Likewise, the only correct ways to represent the current year and month are "201705" or "2017-05"; "2017-05-00" is wrong.

Also understand that ISO 8601 in this situation is a one-way protocol. Citoid can produce it internally and use it to produce a cite on Wikipedia, but dates can't go from Wikipedia to other places in ISO 8601 format because Wikipedia contains many Julian calendar dates, and Julian calendar dates are not allowed in ISO 8601. Since ISBNs and that ilk postdate the replacement of the Julian calendar by the Gregorian (c. 1923 or earlier) we can expect the dates in these databases to be Gregorian.

Citation templates can theoretically produce the correct style from these types of data. But it would require a lot of extra work from the template.

The work's (mostly) done (at enwiki and any wiki that's got a semi-current copy of enwiki's templates). The |df= parameter formats ISO into whatever's wanted for an article.

@Trappist_the_monk, do you have any objections to having CS1 support all of the ISO formats? If not, then citoid could report 2017 (which CS1 handles now) for books published this year, and 2017-05 (which CS1 dislikes) for journals published this month, and the template can re-format them however the MOS desires.

@Trappist_the_monk, do you have any objections to having CS1 support all of the ISO formats?

It is not for me to object; I have no power there. However, en.wiki Manual of Style does object. The date validation in cs1|2 tries to adhere to what WP:MOS allows. When WP:MOS permits other forms of year initial numeric dates, cs1|2 will support them.

@Trappist_the_monk, do you have any objections to having CS1 support all of the ISO formats?

It is not for me to object; I have no power there. However, en.wiki Manual of Style does object. The date validation in cs1|2 tries to adhere to what WP:MOS allows. When WP:MOS permits other forms of year initial numeric dates, cs1|2 will support them.

If CS1 accepted the dates and rendered them as written (i.e. 2007-05), then yes, this would violate the manual of style. I don't think anyone is suggesting that.

What I was suggesting is to have the template accept the 2007-05 and display this as "May 2007" on en wiki which would not violate the manual of style; But would require the template to do some extra work.

This is probably the wrong venue for talking about changes to en.WP's MOS or CS1 templates, but the short version is that the "YYYY-MM" format is discouraged because of ambiguity. If an editor (or script) inputs "2004-05" to mean "2004–2005", the template rejects that ambiguous date format rather than convert it to "May 2004".
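The ambiguity can be shown with a small sketch. The function `interpretations` is hypothetical and only illustrates why a validator cannot safely pick one reading of a hyphenated YYYY-NN value.

```python
import re


def interpretations(value: str) -> list:
    """List the plausible readings of a hyphenated YYYY-NN string."""
    m = re.fullmatch(r"(\d{4})-(\d{2})", value)
    if not m:
        return []
    year, suffix = int(m.group(1)), int(m.group(2))
    found = []
    if 1 <= suffix <= 12:
        found.append("month")        # e.g. 2004-05 read as May 2004
    # A year range with a two-digit end year (2004 to 2005) needs an end year
    # later than the start year; ranges crossing a century are not handled here.
    if suffix > year % 100:
        found.append("year range")
    return found
```

For "2004-05" this returns both readings, which is the case where rejecting the input is the only safe choice.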

Mvolz added a comment. May 13 2017, 1:30 PM

This is probably the wrong venue for talking about changes to en.WP's MOS or CS1 templates, but the short version is that the "YYYY-MM" format is discouraged because of ambiguity. If an editor (or script) inputs "2004-05" to mean "2004–2005", the template rejects that ambiguous date format rather than convert it to "May 2004".

Thanks, that's helpful. That suggests that maybe the 00-ed out version that wikidata uses, i.e. 2007-05-00, might be preferable since there's no way to confuse that with a date range.

... have the template accept the 2007-05 and display this as "May 2007" ...

Changing your example from 2007-05 to 2007-08 makes the latter a form that is almost correct for year ranges; should be 2007–08 with an en dash. I suspect that it is because of this permitted year range that yyyy-mm date forms are not permitted.

In general cs1|2 do not attempt to transform the content of their parameters. Exceptions to that general rule are the automatic conversion of hyphens to en dashes in the page and date parameters. This has caused some antagonism because hyphenated page numbers are perfectly legitimate.

I can imagine a date format that is intentionally intended to be transformed. For example, we might use the correct iso8601 form |date=200708 which cs1|2 could transform to August 2007. Additionally, this standard iso8601 form will support date ranges: |date=200708/200709 which cs1|2 could transform to August–September 2007.

But, this form still doesn't answer the issue that DavidEppstein mentioned at T132308#3258883:

... For instance doi:10.13110/discourse.37.1-2.0003 has an actual date of "Winter/Spring 2015", ...

As far as I know, iso8601 doesn't support seasonal or quarterly dates, nor does it support proper noun dates (Christmas 2015). While I would prefer a solution that adheres to some known standard, perhaps that's not possible. We might 'extend' iso8601 as our own 'standard' for these dates that aren't iso8601 compliant. Perhaps |date=2015.Winter/2015.Spring becomes Winter–Spring 2015; |date=2015.Christmas becomes Christmas 2015. cs1|2 does not support quarterly dates because MOSDATE is mute but I can imagine |date=2015.Q2 rendering as Second Quarter 2015.
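A sketch of how a template-side transform for the forms floated above might behave follows. The names `MONTH_NAMES`, `QUARTER_NAMES`, `_parse_part` and `render` are all hypothetical, and the season and proper-name tokens are simply passed through as written.

```python
import re

MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]
QUARTER_NAMES = {"Q1": "First Quarter", "Q2": "Second Quarter",
                 "Q3": "Third Quarter", "Q4": "Fourth Quarter"}
EN_DASH = "\u2013"


def _parse_part(part: str):
    """Return (year, label) for one side of a date or range, else None."""
    m = re.fullmatch(r"(\d{4})(\d{2})", part)            # compact form, e.g. 200708
    if m:
        month = int(m.group(2))
        return (int(m.group(1)), MONTH_NAMES[month - 1]) if 1 <= month <= 12 else None
    m = re.fullmatch(r"(\d{4})\.([A-Za-z][A-Za-z0-9]*)", part)   # e.g. 2015.Winter
    if m:
        token = m.group(2)
        return int(m.group(1)), QUARTER_NAMES.get(token, token)
    return None


def render(value: str):
    """Render a single date or a slash-separated range; None if unrecognised."""
    parts = [_parse_part(p) for p in value.split("/")]
    if len(parts) > 2 or any(p is None for p in parts):
        return None
    if len(parts) == 1:
        year, label = parts[0]
        return f"{label} {year}"
    (y1, l1), (y2, l2) = parts
    if y1 == y2:
        return f"{l1}{EN_DASH}{l2} {y1}"
    return f"{l1} {y1} {EN_DASH} {l2} {y2}"
```

Because the season label is carried verbatim, nothing here guesses which calendar months a publisher meant by "Winter".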

Change 353706 had a related patch set uploaded (by Mvolz; owner: Marielle Volz):
[mediawiki/services/citoid@master] If only year is provided, only put year in date field

https://gerrit.wikimedia.org/r/353706

Mvolz added a comment. May 13 2017, 1:52 PM

... have the template accept the 2007-05 and display this as "May 2007" ...

Changing your example from 2007-05 to 2007-08 makes the latter a form that is almost correct for year ranges; should be 2007–08 with an en dash. I suspect that it is because of this permitted year range that yyyy-mm date forms are not permitted.

In general cs1|2 do not attempt to transform the content of their parameters. Exceptions to that general rule are the automatic conversion of hyphens to en dashes in the page and date parameters. This has caused some antagonism because hyphenated page numbers are perfectly legitimate.

I can imagine a date format that is intentionally intended to be transformed. For example, we might use the correct iso8601 form |date=200708 which cs1|2 could transform to August 2007. Additionally, this standard iso8601 form will support date ranges: |date=200708/200709 which cs1|2 could transform to August–September 2007.

But, this form still doesn't answer the issue that DavidEppstein mentioned at T132308#3258883:

... For instance doi:10.13110/discourse.37.1-2.0003 has an actual date of "Winter/Spring 2015", ...

As far as I know, iso8601 doesn't support seasonal or quarterly dates, nor does it support proper noun dates (Christmas 2015). While I would prefer a solution that adheres to some known standard, perhaps that's not possible. We might 'extend' iso8601 as our own 'standard' for these dates that aren't iso8601 compliant. Perhaps |date=2015.Winter/2015.Spring becomes Winter–Spring 2015; |date=2015.Christmas becomes Christmas 2015. cs1|2 does not support quarterly dates because MOSDATE is mute but I can imagine |date=2015.Q2 rendering as Second Quarter 2015.

These are all great suggestions. I think in iso8601 it is not valid to leave out the dashes except when the time is also included, though.

Worth noting that if you try this in wikidata, i.e. "Fall 2003" you will get "date is malformed..." so they haven't figured it out either. Since iso8601 allows ranges, I can imagine us translating say, "Fall 2003" to be Sept-Nov 2003, which would be 200309/200311 instead of having the special formats. And converting Christmas to December 25th :) (which we could actually do right now with the current set-up.) And quarter 1 is Jan-March?

Thanks, that's helpful. That suggests that maybe the 00-ed out version that wikidata uses, i.e. 2007-05-00, might be preferable since there's no way to confuse that with a date range.

The templates output from Citoid will be mixed with other citations in the article that were typed by hand. Indeed, many of the Citoid generated templates will be imperfect and require manual fixes. Thus, the dates should not be regarded as some hidden format that you can do whatever you want with; rather they should be regarded as human-readable information that must obey the Help:Citation Style 1 documentation (which in turn defers to Wikipedia Manual of Style/Dates and numbers).

Worth noting that if you try this in wikidata, i.e. "Fall 2003" you will get "date is malformed..." so they haven't figured it out either. Since iso8601 allows ranges, I can imagine us translating say, "Fall 2003" to be Sept-Nov 2003, which would be 200309/200311 instead of having the special formats. And converting Christmas to December 25th :) (which we could actually do right now with the current set-up.) And quarter 1 is Jan-March?

We are dealing with seasons and quarters as printed in the publication. Publications located in the southern hemisphere will have different definitions of spring, summer, etc. than northern hemisphere publications. Some publications that mention quarters may be referring to fiscal year quarters, which could be just about anything. The goal should be to allow the reader to look at the Wikipedia article, then look at the publication cover, and determine they are the same, regardless of how seasons or quarters are defined.

Trappist_the_monk added a comment. Edited May 13 2017, 2:34 PM

These are all great suggestions. I think in iso8601 it is not valid to leave out the dashes except when the time is also included as well though.

Section 4.1.2.3 does say that reduced accuracy year and month dates do require the hyphen.

Still, since the suggestion violates iso8601 for seasonal, quarterly, and proper noun dates, dropping the hyphen is simply further extension or adaptation of the standard to suit our needs. Or we don't bother to refer to this thing as iso8601 at all; it becomes a date interchange format used internally to wmf.

Worth noting that if you try this in wikidata, i.e. "Fall 2003" you will get "date is malformed..." so they haven't figured it out either. Since iso8601 allows ranges, I can imagine us translating say, "Fall 2003" to be Sept-Nov 2003, which would be 200309/200311 instead of having the special formats. And converting Christmas to December 25th :) (which we could actually do right now with the current set-up.) And quarter 1 is Jan-March?

Fall 2003 might be Sept-Nov 2003 in the northern hemisphere, but is spring in the southern.

Dates in citations should, as closely as possible within the constraints of MOSDATE, reflect the dates actually used in the sources. A Christmas issue of some periodical may not have 25 December on the cover. The same is true for quarterly dates.

Mvolz added a comment. May 14 2017, 4:10 PM

Thanks all. It seems to me that in a lot of these "dates" that these are actually issue names: like Summer 2003 is the Summer issue from 2003. But Summer sounds date-ish so it ends up in "date" field. So we could try to parse out words like Summer or Q1 into the 'issue' field and put the year in the date field by itself.
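As a sketch of the heuristic being floated here, the split might look like the following. The function name and the returned field names are hypothetical; the pattern only handles a single season or quarter word followed by a year, so compound labels are left untouched.

```python
import re

# Single season or quarter label followed by a four-digit year.
SEASON_OR_QUARTER = re.compile(
    r"(Spring|Summer|Autumn|Fall|Winter|Q[1-4])\s+(\d{4})", re.IGNORECASE
)


def split_issue_and_year(raw: str):
    """Return {'issue': ..., 'date': ...} for 'Summer 2003'-style input, else None."""
    m = SEASON_OR_QUARTER.fullmatch(raw.strip())
    if not m:
        return None
    return {"issue": m.group(1), "date": m.group(2)}
```

Note that this would overwrite any real issue number the source also supplies, so it is only workable if such publications are rare.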

Might that work for these odd cases? Can anyone think of a publication Summer 2003 date or similar that has an issue number in addition?

Can anyone think of a publication Summer 2003 date or similar that has an issue number in addition?

Try this insource search at en.wiki: insource:/\| *date *= *Summer/

In fact, the example I already gave had an issue number as well as a "Winter/Spring" date:

Double Exposures: Derrida and Cinema, an Introductory Séance
James Leo Cahill and Timothy Holland
Discourse
Vol. 37, No. 1-2 (Winter/Spring 2015), pp. 3–21
Published by: Wayne State University Press
DOI: 10.13110/discourse.37.1-2.0003

The issue number is the part where it says "No. 1-2". The date is "Winter/Spring 2015". Also note that "Winter/Spring 2015" is ambiguous, even for northern hemisphere dates: does it mean the period beginning in December 2015 and lasting through the spring of the following year, or does it mean the period that ends in Spring 2015? In this case it's the latter but I had to look at the adjacent issue dates to tell. So it would be a mistake to assume that one can always parse these things and turn them into unambiguous ISO date ranges. The dates are what the publishers give as the dates, and if we want to include a date in a citation (instead of just punting and giving only the year) then those are the dates we need to use.

Headbomb removed a subscriber: Headbomb. May 15 2017, 2:22 PM

Change 353706 merged by Mobrovac:
[mediawiki/services/citoid@master] If only year is provided, only put year in date field

https://gerrit.wikimedia.org/r/353706

Mentioned in SAL (#wikimedia-operations) [2017-05-15T15:07:52Z] <mobrovac@tin> Started deploy [citoid/deploy@3ed34ef]: Better publishing date extraction support - T132308

Mentioned in SAL (#wikimedia-operations) [2017-05-15T15:10:42Z] <mobrovac@tin> Finished deploy [citoid/deploy@3ed34ef]: Better publishing date extraction support - T132308 (duration: 02m 49s)

Mvolz added a comment. May 15 2017, 3:36 PM

We've deployed a fix for the year issue; all dates with only a year should now have just a year.

Please note that we're still working on the partial date issue; I think what we've discovered here is that publishers have a very loose definition of what constitutes a "date" - however, we still have to abide by the style guidelines. We do get back dates which violate CS1 rules like 10-04 and 11/11/2007 so we still need to attempt to validate.
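The behaviour described in the two paragraphs above could be sketched roughly as follows. This is not the deployed citoid code (citoid is written in node.js); `normalise_date` is a hypothetical name, and returning None for anything that cannot be validated is an assumption about what "attempt to validate" would mean.

```python
import re
from datetime import datetime
from typing import Optional


def normalise_date(raw: str) -> Optional[str]:
    """Keep year-only input as a year, accept full ISO dates, reject the rest."""
    raw = raw.strip()
    if re.fullmatch(r"\d{4}", raw):
        return raw                      # year only: do not pad to YYYY-01-01
    try:
        # strptime rejects impossible dates such as 2007-02-30.
        return datetime.strptime(raw, "%Y-%m-%d").strftime("%Y-%m-%d")
    except ValueError:
        return None                     # e.g. "10-04" or "11/11/2007"
```

Partial month-and-year input is deliberately left out, since how to represent it is the open question in this task.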

Having our own modified version of ISO that includes seasons and quarters I think increases the burden on citation templates to too great a degree, so as much as it bothers the standards compliant person in me, I think we just have to do this sort of arbitrarily.

Whatamidoing-WMF added a comment. Edited May 15 2017, 4:10 PM

however, we still have to abide by the style guidelines

For the record: No, you don't. Software and services that are used on hundreds of wikis are not required to abide by the policies or guidelines of any individual wiki. There is even a policy at the English Wikipedia that acknowledges the unreasonability of devs being expected to customize software to fit each community's ever-changing and sometimes contradictory standards. Devs are only required to have a consensus from the MediaWiki community that their software choices are right for the software.

However, this particular style guideline does contain some information and advice that is not project- or language-specific, and it identifies a number of interesting situations. So of course it would be sensible and probably efficient to learn from it, even though citoid is not technically required to abide by it.

We do get back dates which violate CS1 rules like 10-04 and 11/11/2007 so we still need to attempt to validate.

At en.wiki the rules are Manual of Style rules that cs1|2 adhere to; they are not 'CS1 rules'

Having our own modified version of ISO that includes seasons and quarters I think increases the burden on citation templates to too great a degree, so as much as it bothers the standards compliant person in me, I think we just have to do this sort of arbitrarily.

I'm not at all sure I fully understand what you've written here. How is 'our own modified version of ISO' more burdensome on citation templates than a non-standard 'arbitrary' something? If you accept the part of iso8601 for full dates and year-only dates (en.wiki already supports these) and define a year month version of that which zeros out the days (yyyy-mm-00) it isn't too much of a struggle for en.wiki and others to render that as Month YYYY. That same form works for date ranges in the standard iso8601 form yyyy-mm-00/yyyy-mm-00. The hard part is still seasons and proper-name dates. I've offered one possible solution to that dilemma; there may be other and better solutions. Whatever it is that is chosen, document it, adhere to it, and advertise it so that we all know what it is.

Lots of tools and bots examine citations. I'm not really sure which ones examine the rendered HTML, which ones examine the COinS metadata, and which ones examine the wikitext. Anything that examines the wikitext and finds yyyy-mm-00 should reject it as completely non-standard.

If you insist on creating dates that don't follow any standard, I would suggest supplying it in a parameter with a special name, like citoid-date=blahblahblah. But since there is no mechanism to keep human editors from playing with the citoid-date parameter, I don't like this idea.

Anything that examines the wikitext and finds yyyy-mm-00 should reject it as completely non-standard.

which is why I wrote:

In T132308#3266180:
Whatever it is that is chosen, document it, adhere to it, and advertise it so that we all know what it is.

In T132308#3267381, @Jc3s5h wrote:
If you insist on creating dates that don't follow any standard, I would suggest supplying it in a parameter with a special name, like citoid-date=blahblahblah. But since there is no mechanism to keep human editors from playing with the citoid-date parameter, I don't like this idea.

We have these issues:

  1. publishers have non-standard ways of writing publication dates
  2. iso8601 is a standard that is not capable of communicating all dates that publishers commonly write
  3. at en.wiki, Manual of Style dictates which of the myriad available date formats are permissible (presumably this applies to other languages as well)
  4. at en.wiki, editors complain about writing en dashes and therefore often use a hyphen instead
  5. cs1|2 and other citation templates elsewhere must make some sort of sense out of citoid's rendering of publisher's non-standard publication dates

We know that iso8601 cannot represent all dates that publishers write (season, quarter, and proper name are some). Somehow, somewhere, some mechanism must be contrived to allow citoid to do that. We know that the various wikis may have differing notions regarding how certain dates are to be displayed. This is relatively easy when citoid can represent dates in an iso8601 format or a format appropriate to the language but falls apart where the date cannot be represented by iso8601 or, as is most likely, citoid does not (or will not) have support for the plethora of languages (a huge task).

To answer these conflicting issues we can concoct our own standard for dates produced by citoid. Perhaps the first sentence of that standard is:

  • Where possible, dates produced by citoid shall be rendered in accordance with iso8601.

Because of the opening sentence in Our New Standard, editors at en.wiki would need to give up the freedom of writing 2007-08 (with a hyphen) and would need to write 2007–08 (with an en dash) because cs1|2 would need to be modified to render the former as August 2007 (because MOS does not allow for YYYY-MM dates).

Following that, Our New Standard describes how citoid is to render dates that cannot be rendered in a form supported by iso8601. For example, something like this perhaps:

  • Seasonal dates: for single dates: YYYY.<season> for ranges: YYYY.<season>/YYYY.<season>

It would continue to describe what it is that <season> means for an international audience; similarly for <quarter> and <proper name>.

To resolve this date transfer issue, there must be a bit of give and take. It is ok for citoid to depart from iso8601 as long as the departure is itself published and advertised as its own standard. When the iso8601 committee catch up with Wikipedia, Our New Standard becomes obsolete.

I believe that if you went to enwiki and asked the practical question:

"Would you rather that:

  1. we change the Manual of Style to accept more ISO 8601-compliant formats, which will have the side effect of requiring 100% of editors to use en dashes properly for date ranges in all citation templates, even if they don't know what an en dash is or how to type it on their Windows box, or
  2. we use a local standard that doesn't violate the Manual of Style, in which a source published in August 2007 can be unambiguously marked in a citation template as 2007-08-00, the dash-making bots will not mistake it for a date range (the bots could even convert it to August 2007), and the CS1 template will automagically display it as August 2007 so that no reader will ever see the zeroes?"

then their first choice will be that editors type August 2007 by hand, and their second choice will be the "non-standard" 2007-08-00. They will reject the proposal that depends upon every editor using dashes properly.

If you're not going to follow a standard, you should go far away from the standard to avoid confusion. I'd suggest keeping the date and year field just as they are. In the absence of both those fields, the template could look for citoid-month, citoid-year, citoid-day, citoid-season, and whatever else is required.

|date= already doesn't follow ISO 8601. It has never followed ISO 8601. The decision that |date= would not follow ISO 8601 was made years before citoid was created. I'm not sure why "editor doesn't follow ISO 8601 while typing manually" should be separated from "editor still doesn't follow ISO 8601 while using a semi-automated script".

Perhaps this ticket can be split into two tickets: one that ensures that dates such as "2013" or "March 2017" aren't being represented as "2013-01-01" or "2017-03-01", and one that discusses how to format dates such as "Winter 2012" in an MOS/ISO standard way. I don't see why Citoid can't just go to the "lowest common denominator". If only a year is known, then only produce a year; if year and month are known, either produce only year+month or only the year, but not a day. Whether Citoid should do 2013-03 or 2013-03-00 or March 2013 can be discussed, but first, the tool should stop doing 2017-03-01 or 2013-01-01. After the false dates have stopped being produced by Citoid, we can discuss how to make the dates more precise.

|date= already doesn't follow ISO 8601. It has never followed ISO 8601. The decision that |date= would not follow ISO 8601 was made years before citoid was created. I'm not sure why "editor doesn't follow ISO 8601 while typing manually" should be separated from "editor still doesn't follow ISO 8601 while using a semi-automated script".

The |date= parameter has always been supposed to contain some version of the date that would generally be regarded as correct English, and in recent years, has been expected to follow the date formats accepted in MOSNUM. It has never been acceptable to use a format that would never be considered correct English, such as 2017-05-00.

What I wrote was merely a suggestion and should be taken to be just that: a suggestion. It was intended to move the conversation ahead. My point was to show that it is possible to create our own inter-tool date exchange standard so that citoid can transmit dates to the various wikis with their various templates in a consistent and understandable manner. We have already had Editor Jc3s5h declare the YYYY-MM-00 format to be unacceptable; you are suggesting, I think, that requiring that en.wiki editors write an en dash in YYYY–yy year range dates is unacceptable. Are we now stymied? Shall we just up stumps and retire to the pavilion?

I don't understand what you mean by this. At en.wiki, from their earliest days through today, cs1|2 templates have always accepted some form of iso8601 date. Prior to March 2014 cs1|2 simply rendered what they were given so they accepted all forms of iso8601 dates. MOS may not have approved but, in their adolescence, cs1|2 did not care what MOS thought. From March 2014, dates are expected to comply with MOS which still allows the iso8601 yyyy-mm-dd form. And where did the quoted text in your post come from? I can't find it on this ticket.

It has never been acceptable to use a format that would never be considered correct English, such as 2017-05-00.

True, but, until now, we have not considered using such a form sub rosa as an inter-tool date exchange mechanism that unambiguously identifies a month and year date where the template parameter in wiki text is |date=2017-05-00 and where, transformed by the template, the final rendering is May 2017.

Sorry, I do not know how to read the line containing "T132308#3268915" so cannot respond to your question.

[offtopic] @Jc3s5h: If the link does not make your browser jump to that comment, please click "Changes from before your most recent comment are hidden. Show Older Changes" first and then try again. Thanks!

|date= already doesn't follow ISO 8601. It has never followed ISO 8601. The decision that |date= would not follow ISO 8601 was made years before citoid was created. I'm not sure why "editor doesn't follow ISO 8601 while typing manually" should be separated from "editor still doesn't follow ISO 8601 while using a semi-automated script".

Currently, the standard followed by |date= is "use the subset of acceptable English-language dates (including some of ISO 8601 that is allowed in MOS:DATES)". If software is going to create date information that is neither ISO 8601 nor correct English, it won't be following the current standard. If the software-created dates are separated from the human-created dates by giving the software-created dates a different parameter name, the template would transform the software-created dates into the appropriate format for the article before rendering them. If the new parameter and |date= were present in the same citation, the new parameter should be ignored.

Humans should not edit the new parameter. If there is no |date= parameter and the editor knows the new parameter is wrong, the human would create a correct |date= parameter and delete the new parameter.
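The precedence rule described in this comment could be sketched as follows. This is only an illustration in Python, not template or module code; the parameter name `citoid-date` and the `render` callback (standing in for whatever transformation the template would apply) are assumptions, not anything agreed in this task:

```python
from typing import Callable, Mapping, Optional


def effective_date(params: Mapping[str, str],
                   render: Callable[[str], str],
                   machine_key: str = "citoid-date") -> Optional[str]:
    """Pick the date to display for a citation (hypothetical helper)."""
    # A human-entered |date= always wins; the machine parameter is ignored.
    human = params.get("date", "").strip()
    if human:
        return human
    # Otherwise fall back to the software-created value, transformed
    # into a reader-facing form by the caller-supplied renderer.
    machine = params.get(machine_key, "").strip()
    return render(machine) if machine else None
```

Keeping the renderer as a parameter leaves the exchange format open, which matches the state of the discussion: only the precedence between the two parameters is fixed here.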

cs1|2 templates have always accepted some form of iso8601 date.

cs1|2 templates have always accepted some ISO 8601-compliant dates, but perhaps more to the point, enwiki has always rejected some ISO 8601-compliant dates, and enwiki has always accepted some ISO 8601-non-compliant dates. Therefore, "this suggestion doesn't comply with this standard (that we aren't complying with anyway)" [1] does not sound like a logical argument to me.

[1] See, e.g., comments such as "Anything that examines the wikitext and finds yyyy-mm-00 should reject it as completely non-standard." We agree that this isn't the standard presented in ISO 8601. But anything that finds |date=Summer 1942 should equally reject that as completely non-standard, too, because that also isn't the standard presented in ISO 8601.

[1] See, e.g., comments such as "Anything that examines the wikitext and finds yyyy-mm-00 should reject it as completely non-standard." We agree that this isn't the standard presented in ISO 8601. But anything that finds |date=Summer 1942 should equally reject that as completely non-standard, too, because that also isn't the standard presented in ISO 8601.

"yyyy-mm-00" is completely non-standard for the date parameter because it is neither ISO 8601 nor a proper English date. |date=Summer 1942 is standard because it is both accepted in MOS:DATES and it is proper English.

Thanks all! I have decided on a game plan:

  • Put out all dates in a readable format (i.e. May 2010) in the date field to address the polluted data issue ASAP.
  • Possibly translate them on our end as well.

This is a possible way forward for standardising dates which would occur over a longer time scale:

  • Write up a standard format for publishers' dates. Likely in the form 2007-10-00, 2007-summer, 2007-q1.
  • Put the new standard in a new field called publisherDate in citoid for transitioning purposes.
  • See if any wikis are willing to use publisherDate. Each wiki would be able to decide if they want to accept the new format in their 'date' field, or if they prefer to have a separate field for it like 'publisher-date' or 'citoid-date'.
  • Assess if the standard publisherDate is now suitable for the date field and potentially replace it. If not, remove it. This bit is sticky if there is only partial conversion.
Mvolz renamed this task from Figure out how to deal with incomplete dates, i.e. year only or year and month only to Deal with incomplete and non standard dates, i.e. year only, year and month only, or season / quarter.May 18 2017, 11:00 AM
Mvolz removed a project: Patch-For-Review.
Mvolz updated the task description.
Mvolz added a comment.EditedMay 18 2017, 11:06 AM

I have opened up the conversation to the wikicite-discuss group as well: https://groups.google.com/a/wikimedia.org/forum/#!topic/wikicite-discuss/a2kRHayAiyo

Someone there suggested EDTF (extended date time format) https://www.loc.gov/standards/datetime/ - this looks like exactly what we need.

The only issue is that they do represent missing precision in the form YYYY-MM

I know this is ambiguous w.r.t. ranges but

  1. 2010-11 causes a CS1/2 error and tells people they should use 2002–2003 (with em dash). So this is not valid, and technically CS1/2 could accept it and interpret it as Nov 2010. It won't alert the user if they think they're adding a range, but I feel like they should figure it out when it gets rendered "wrong" - and I think the amount of this kind of error would be low anyway.
  2. If CS1/2 doesn't want to allow them despite point 1, it could go into a different field as previously proposed.

IMO we should use this EDTF because I think the last thing the world needs is any more standard formats, so if there's one out there that looks like it could mostly do what we want, we should use it :).

In response to T132308#3272430, @Mvolz:

Wait a minute, citoid can render all dates in readable format and in the appropriate language? Tell me again why we've been having this conversation?

In T132308#3273017, @Mvolz wrote in part:

Someone there suggested EDTF (extended date time format) https://www.loc.gov/standards/datetime/ - this looks like exactly what we need.

I participated in the discussion that led to EDTF. A modification of this has been proposed for the next version of ISO 8601. I have a draft but am not allowed to make it available to others. My concern about ISO 8601 is that copies are so expensive that few editors who are not employed by a relevant institution will have access to a copy, so there is much misinformation about the contents of ISO 8601. I expect this problem to continue with the new edition.

In T132308#3273017, @Mvolz wrote in part:

I have opened up the conversation to the wikicite-discuss group as well: https://groups.google.com/a/wikimedia.org/forum/#!topic/wikicite-discuss/a2kRHayAiyo

I can't edit google groups because of the settings of the organization that I obtain access to gmail through, so I'll comment here. The question was raised about publication date vs. copyright. An apropos example is the Oxford Companion to the Year which has a copyright year of 1999 but was "reprinted with corrections" in 2003. Both dates are listed in WorldCat. The information about reprinting with corrections comes from my paper copy.

I don't know the details of what year a publisher is expected to place in the copyright notice, but I would speculate that the corrections did not involve sufficient creative effort by the authors to restart the copyright period.

The only issue is that they do represent missing precision in the form YYYY-MM

Is that not handled by §5.2.2 Unspecified? That section reads, in part:

  1. Year and month specified, day unspecified.
    • 1999-01-uu
      • some day in January 1999

That form is much the same as the 1999-01-00 form suggested elsewhere and accomplishes the same thing.

Also missing is support for quarterly dates and for what they have chosen to call holiday date formats. These are recognized as issues, but that version of EDTF is silent on those topics.

  1. 2010-11 causes a CS1/2 error and tells people they should use 2002–2003 (with em dash). So this is not valid, and technically CS1/2 could accept it and interpret it as Nov 2010. It won't alert the user if they think they're adding a range, but I feel like they should figure it out when it gets rendered "wrong" - and I think the amount of this kind of error would be low anyway.

cs1|2 emits an error message for 2010-11 not so much because of the hyphen but because of ambiguity; is 11 a month or a year? There is no error when the last two digits are outside the range 00-12. Years in a range are separated with en dash not em dash; see the MOS. As I've suggested elsewhere in this ticket, cs1|2 can interpret the YYYY-MM-uu form: |date=2010-11-uu → November 2010. I'll hack the cs1|2 sandbox to demonstrate this in the next day or two.
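The two behaviours described in this comment can be sketched roughly as below. This is a Python illustration only, not the cs1|2 Lua module; the function names are hypothetical, and the month names are hard-coded in English for simplicity:

```python
import re
from typing import Optional

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def is_ambiguous_yyyy_xx(value: str) -> bool:
    """True for YYYY-xx where xx is 00-12, i.e. it could be a month or a year."""
    m = re.fullmatch(r"([0-9]{4})-([0-9]{2})", value)
    return m is not None and 0 <= int(m.group(2)) <= 12


def render_unspecified_day(value: str) -> Optional[str]:
    """Render YYYY-MM-uu as 'Month YYYY'; return None for any other input."""
    m = re.fullmatch(r"([0-9]{4})-([0-9]{2})-uu", value)
    if m is None:
        return None
    month = int(m.group(2))
    if not 1 <= month <= 12:
        return None
    return f"{MONTHS[month - 1]} {m.group(1)}"
```

Under these assumptions, "2010-11" is flagged as ambiguous while "2010-13" is not, and "2010-11-uu" renders as "November 2010" so that readers never see the "uu" placeholder.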

  1. If CS1/2 doesn't want to allow them despite point 1, it could go into a different field as previously proposed.

It isn't cs1|2. The restrictions on date-parameter values in the form YYYY-xx (where xx may be a year or month) are imposed on cs1|2 by the en.wiki MOS. Please stop blaming cs1|2 for limitations that are imposed on it by the en.wiki MOS.

IMO we should use this EDTF because I think the last thing the world needs is any more standard formats, so if there's one out there that looks like it could mostly do what we want, we should use it :).

I've only read it once but am inclined to agree. The numeric values that it uses for seasons are similar to the internal values that cs1|2 uses to represent seasons. I've made a TODO: note in the cs1|2 date validation code to change to the EDTF values. I notice that cs1|2 supports 'Fall' as a synonym of 'Autumn'; EDTF does not.
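For reference, EDTF encodes seasons in the month position as 21 (spring), 22 (summer), 23 (autumn) and 24 (winter). A minimal sketch of mapping season names onto those codes, with 'Fall' handled as a local synonym because EDTF itself has no such term, might look like this (Python for illustration only; the names `season_to_edtf` and `LOCAL_SYNONYMS` are hypothetical and this is not the cs1|2 code):

```python
# EDTF season codes, used in place of a month number (e.g. 1942-22).
EDTF_SEASON_CODES = {"spring": 21, "summer": 22, "autumn": 23, "winter": 24}

# Synonyms accepted locally but absent from EDTF (assumption: only 'fall').
LOCAL_SYNONYMS = {"fall": "autumn"}


def season_to_edtf(season: str, year: int) -> str:
    """Convert a season name and year to an EDTF-style YYYY-SS string."""
    key = season.strip().lower()
    key = LOCAL_SYNONYMS.get(key, key)
    # Raises KeyError for names that are not recognised seasons.
    return f"{year:04d}-{EDTF_SEASON_CODES[key]}"
```

So "Summer 1942" would be exchanged as 1942-22, and both "Autumn 2012" and "Fall 2012" as 2012-23.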

Jc3s5h added a comment.EditedMay 18 2017, 4:51 PM

It isn't cs1|2. The restrictions on date-parameter values in the form YYYY-xx (where xx may be a year or month) are imposed on cs1|2 by the en.wiki MOS. Please stop blaming cs1|2 for limitations that are imposed on it by the en.wiki MOS.

Dates in citations are controlled by Citing sources, not by Manual of Style/Dates and numbers (abbreviated MOS:DATES). By consensus at Help:Citation Style 1, editors decided to use the date portion of MOS:DATES. The editors could decide to make some exceptions, and make the allowable dates for cs1|2 a bit different from the allowable dates in MOS:DATES.

For citation formats other than cs1|2, which are allowed in the English Wikipedia, other date formats, that don't follow MOS:DATES, could be used. For example, if an article followed APA style for citations, the date 1993, September 30 could be used.

I'll hack the cs1|2 sandbox to demonstrate this in the next day or two.

Easier than I thought. Discussion and simple examples at en.wiki Help talk:Citation Style 1.

Change 354249 had a related patch set uploaded (by Mvolz; owner: Marielle Volz):
[mediawiki/services/citoid@master] Relax date validation significantly

https://gerrit.wikimedia.org/r/354249

Rical added a subscriber: Rical.May 22 2017, 3:36 PM

Change 354249 merged by Mobrovac:
[mediawiki/services/citoid@master] Relax date validation significantly

https://gerrit.wikimedia.org/r/354249

Mentioned in SAL (#wikimedia-operations) [2017-05-31T20:13:22Z] <mobrovac@tin> Started deploy [citoid/deploy@7d69554]: Relaxing date validation - T132308

Mentioned in SAL (#wikimedia-operations) [2017-05-31T20:15:54Z] <mobrovac@tin> Finished deploy [citoid/deploy@7d69554]: Relaxing date validation - T132308 (duration: 02m 32s)

mobrovac changed the task status from Open to Stalled.May 31 2017, 8:31 PM

The patch relaxing date validation is now live on all projects. It incorporates some of the suggestions outlined in this task, so please test it. Setting the task as stalled until further input.

because this conversation has apparently collapsed and died without obvious resolution, I have removed the code that supported edtf transformations from the cs1|2 module sandbox.

Mvolz renamed this task from Deal with incomplete and non standard dates, i.e. year only, year and month only, or season / quarter to Consider using EDTF format to standardise dates.Oct 19 2017, 2:11 PM
Mvolz removed a project: Patch-For-Review.
Mvolz updated the task description.
Mvolz added a comment.EditedOct 19 2017, 2:21 PM

because this conversation has apparently collapsed and died without obvious resolution, I have removed the code that supported edtf transformations from the cs1|2 module sandbox.

The current status is that we have very weak date validation and no one has commented since then, so I guess the relaxed validation was a satisfactory quick fix. We could leave it at that, for the most part.

We still have the issue that all dates are in English. We could still consider implementing https://www.loc.gov/standards/datetime/ISO_DIS%208601-2.pdf or at least moving towards it cautiously, but since we're receiving the data in a non-standard format to begin with, we have to be careful about transforming things into nonsensical data (like what happened before), and in the end it might make sense to not have a standard and just continue to do weak validation. If we were a publisher that could guarantee the format of our own data, a standard would be best, but it just might not be plausible in this scenario, where we're getting dates back in all sorts of non-standard formats.

Mvolz renamed this task from Consider using EDTF format to standardise dates to Internationalise citoid dates.Oct 19 2017, 2:23 PM
Mvolz updated the task description.
Elitre added a subscriber: Elitre.Oct 19 2017, 4:40 PM
Rical removed a subscriber: Rical.Mar 20 2018, 10:36 AM