Citoid should be validating date fields
Closed, Resolved | Public | 40 Estimated Story Points

Description

Forking from T94209.

This concerns the Cite journal template at http://en.wikipedia.beta.wmflabs.org/w/index.php?title=User:Elitest/sandbox&oldid=212341#Testing_DOIs%7C, which displays a "Check date values in: |date= (help)" error message.

DOI: 10.1542/peds.2007-2362

Returns: "date":"11/01/2007" in response, from Zotero

Event Timeline

Elitre raised the priority of this task from to Needs Triage.
Elitre updated the task description. (Show Details)
Elitre added a project: Citoid.
Elitre subscribed.
Mvolz triaged this task as Medium priority.
Mvolz moved this task from Backlog to IO Tasks on the Citoid board.
Mvolz set Security to None.

@mobrovac: yay, date validation.

Obviously I'd like to convert all dates to ISO (yyyy-mm-dd) but we have no idea what's coming back from Zotero here. 11/01/2007 could be either 2007-01-11 or 2007-11-01 :(.

We could delete any date that we can't reliably convert to ISO, or we could just make a best guess...

This indeed is a real impasse. Zotero does include a date-guessing function, but:

  • it does its best to guess the actual date, but relies on the running platform's locale to distinguish between the mm[/-]dd and dd[/-]mm formats, which makes little sense if you ask me (not to mention hacks along the lines of "oh, if month >= 13 then this must be the day portion")
  • only a handful of translators use it, and they do not use it to validate dates, but for special purposes (such as determining the publication year in BibTeX entries)

Hence, we are basically left to our own devices here. In the general case, unfortunately, I don't see a proper solution per se. We might do simplistic checks, e.g. see whether the date starts with a year; if it does, it is safe to assume the format is yyyy[/-]mm[/-]dd. Otherwise, some sites also expose language and/or locale meta headers, which might be used to guess the format (but that still remains an educated guess).
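A minimal sketch of the "starts with a year" check described above. The function name `yearFirstToISO` is hypothetical and not part of Citoid or Zotero; it assumes a four-digit year followed by `/` or `-` separators and returns `null` for anything else rather than guessing.

```javascript
// Hypothetical helper (not Citoid code): convert a date to ISO only when it
// clearly starts with a four-digit year, i.e. yyyy[/-]mm[/-]dd.
function yearFirstToISO(raw) {
  const m = /^(\d{4})[\/-](\d{1,2})[\/-](\d{1,2})$/.exec(raw.trim());
  if (!m) {
    return null; // not year-first; the caller decides what to do with it
  }
  const [, yyyy, mm, dd] = m;
  const month = Number(mm);
  const day = Number(dd);
  // Basic range check only; it does not validate days per month.
  if (month < 1 || month > 12 || day < 1 || day > 31) {
    return null;
  }
  return `${yyyy}-${mm.padStart(2, '0')}-${dd.padStart(2, '0')}`;
}
```

With this sketch, `yearFirstToISO('2007/11/1')` yields `'2007-11-01'`, while `yearFirstToISO('11/01/2007')` yields `null` because the string does not start with a year.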

In this concrete issue, though, it seems it would be more appropriate to write a translator for aapublications.org - the source actually contains a valid date in its DC attributes:

<meta content="2007-11-01" name="DC.Date" />

However, the DC part is completely ignored by the default translator, and citation_* fields are used instead, which contain:

<meta content="11/01/2007" name="citation_date" />
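To illustrate the difference between the two fields, here is a rough sketch that prefers whichever value is already ISO-shaped. The function `pickDate` and the plain object of meta values are assumptions made for illustration; this is not how Zotero translators are actually written.

```javascript
// Hypothetical illustration (not a Zotero translator): given meta tag values
// keyed by name, prefer a value that is already in yyyy-mm-dd form.
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

function pickDate(meta) {
  const candidates = [meta['DC.Date'], meta['citation_date']];
  const iso = candidates.find((v) => typeof v === 'string' && ISO_DATE.test(v));
  if (iso !== undefined) {
    return iso;
  }
  // Fall back to the first available string, which may be ambiguous.
  return candidates.find((v) => typeof v === 'string') || null;
}
```

For the page in question, `pickDate({ 'DC.Date': '2007-11-01', 'citation_date': '11/01/2007' })` would return `'2007-11-01'`.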

Filed: https://github.com/zotero/translators/issues/876

But we also need to do something here, because there's no guarantee what comes out of the translators. I say we discard ambiguous dates as a matter of course, and save any where (yyyy && ( mm or dd is >=13, mm===dd,)) and similar for when it's yy etc (more complicated, can we assume yy-mm-dd doesn't exist and that it will always be mm-dd-yy or dd-mm-yy?)
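The four-digit-year part of that policy could be sketched roughly as follows. The name `disambiguate` is hypothetical, and the sketch only covers strings of the form nn[/-]nn[/-]yyyy; it keeps a date when one of the two leading numbers is >= 13 or when both are equal, and discards everything else.

```javascript
// Hypothetical sketch of the "discard ambiguous dates" policy for
// nn[/-]nn[/-]yyyy input. Returns an ISO string or null (discard).
function disambiguate(raw) {
  const m = /^(\d{1,2})[\/-](\d{1,2})[\/-](\d{4})$/.exec(raw.trim());
  if (!m) {
    return null;
  }
  const a = Number(m[1]);
  const b = Number(m[2]);
  const yyyy = m[3];
  let month;
  let day;
  if (a === b) {
    month = a; // mm === dd: the order does not matter
    day = a;
  } else if (a > 12 && b <= 12) {
    day = a; // the first number cannot be a month, so dd/mm
    month = b;
  } else if (b > 12 && a <= 12) {
    month = a; // the second number cannot be a month, so mm/dd
    day = b;
  } else {
    return null; // both could be months (or neither): ambiguous
  }
  if (month < 1 || month > 12 || day < 1 || day > 31) {
    return null;
  }
  const pad = (n) => String(n).padStart(2, '0');
  return `${yyyy}-${pad(month)}-${pad(day)}`;
}
```

Under this sketch the date from this task, `'11/01/2007'`, is discarded (`null`), whereas `'13/01/2007'` and `'01/13/2007'` both become `'2007-01-13'`.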

Do we do something different for the native scraper (i.e. not Zotero)? Open Graph specifies ISO format for its date fields, so we can probably trust those to a larger degree. Also, this is where technical debt might bite us: I'm not sure there's currently a non-messy way to convert only results coming from Zotero :). All the conversion methods are currently in ZoteroService, but we use those methods from Scraper too.

A human user is not likely to notice a bad date, but might notice and add a missing date, and I have a horror of corrupt data.

> But we also need to do something here, because there's no guarantee what comes out of the translators. I say we discard ambiguous dates as a matter of course, and save any where (yyyy && ( mm or dd is >=13, mm===dd,))

Sounds like a sane policy. In any case, we can be pretty sure we can get at least the year in most cases.

> and similar for when it's yy etc (more complicated, can we assume yy-mm-dd doesn't exist and that it will always be mm-dd-yy or dd-mm-yy?)

IIRC, only mm/dd/yy exists in practice, which is an (informal) US-style date format, so if we encounter it, we can be pretty sure what's what.

> Do we do something different from the native scraper (i.e. not Zotero?) Open Graph specifies ISO format for their date fields, so we can probably trust those to a larger degree. Also this is where technical debt might bite us, not sure there's a non-messy way currently to convert only results coming from Zotero :). All the conversion methods are currently in ZoteroService but we use those methods from Scraper too.

Hence, I think we shouldn't tackle this until we get a proper input/translate/output pipeline in place.

This might not be a good idea.

What will you do for quarterly periodicals, whose correct date is "Fall 2010"? You'd have to set up a list of valid date-related keywords to accept. The list itself might not be very long, but you would have to do it in more than 100 languages.

Also, what will you do when the publication's "date" is actually volume-issue numbers? Look at the "date" scheme we use on https://en.wikipedia.org/wiki/Wikipedia:VisualEditor/Updates/April_2015 It's not as unusual as we could hope. I am concerned that this type of effort might accidentally turn "2–2015" into "February 2015".
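One way to limit that risk, shown only as a sketch under assumptions: attempt conversion solely on strings that look like a complete three-part numeric date, and pass everything else through unchanged. The name `isConversionCandidate` is hypothetical and not taken from Citoid.

```javascript
// Hypothetical guard: only strings made of three numeric parts separated by
// "/" or "-" are considered for conversion. Season names ("Fall 2010") and
// issue-style values ("2–2015") do not match and would be left untouched.
const FULL_NUMERIC_DATE = /^\d{1,4}[\/-]\d{1,2}[\/-]\d{1,4}$/;

function isConversionCandidate(raw) {
  return FULL_NUMERIC_DATE.test(raw.trim());
}
```

Here `isConversionCandidate('11/01/2007')` is `true`, while `'Fall 2010'`, `'2–2015'` and `'2-2015'` all give `false`.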

Mvolz removed Mvolz as the assignee of this task. May 11 2015, 9:50 AM

Change 221369 had a related patch set uploaded (by Mvolz):
Convert all Zotero date fields to ISO

https://gerrit.wikimedia.org/r/221369

Change 221369 merged by Mobrovac:
Convert all Zotero date fields to ISO

https://gerrit.wikimedia.org/r/221369

mobrovac claimed this task.

Deployed, resolving. If the issue still persists, please reopen the ticket.