
Month name and year preceded or followed by a dot or comma is parsed as having a day
Open, Low, Public

Description

This is similar to T151088. The following inputs are all parsed as "1 April 1987":

  • "April 1987."
  • "April 1987,"
  • ", April 1987"
  • ". April 1987"

As are combinations of both, e.g.

  • ", April 1987."

Or multiple characters, e.g.

  • "April 1987..."

The spacing around the characters does not make a difference.

For all of them, I would expect either "April 1987" or an error.

Event Timeline

Vvjjkkii renamed this task from "Month name and year preceded or followed by a dot or comma is parsed as having a day" to "caaaaaaaaa". (Jul 1 2018, 1:01 AM)
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
JJMC89 renamed this task from "caaaaaaaaa" to "Month name and year preceded or followed by a dot or comma is parsed as having a day". (Jul 1 2018, 2:44 AM)
JJMC89 raised the priority of this task from High to Needs Triage.
JJMC89 updated the task description.
JJMC89 added a subscriber: Aklapper.
thiemowmde added a project: patch-welcome.
thiemowmde added subscribers: thiemowmde, Addshore.

This is another situation where none of the (currently half a dozen) custom Wikibase parsers is able to understand an input string, and parsing falls back to PHP's problematic built-in parser (see http://php.net/manual/en/datetime.formats.php).

In my opinion the best option is to improve the existing YearMonthTimeParser. This parser is meant to understand dates with precision "month".

// Before:
'/^(-?[\d\p{L}]+)\s*?[\/\-\s.,]\s*(-?[\d\p{L}]+)$/'

// After:
'/^[\p{P}\p{Z}]*?(-?[\p{L}\p{N}]+)\p{Z}*?[\p{P}\p{Z}]\p{Z}*(-?[\p{L}\p{N}]+)[\p{P}\p{Z}]*$/'

// The same, just documented:
'/^
    [\p{P}\p{Z}]*?     # irrelevant punctuation/whitespace (ungreedy)
    (-?[\p{L}\p{N}]+)  # capture group 1 contains either month or year
    \p{Z}*?            # irrelevant whitespace (ungreedy)
    [\p{P}\p{Z}]       # at least 1 separator
    \p{Z}*             # irrelevant whitespace
    (-?[\p{L}\p{N}]+)  # capture group 2 contains either month or year
    [\p{P}\p{Z}]*      # irrelevant punctuation/whitespace
    $/x'

https://www.regular-expressions.info/unicode.html is a nice cheat sheet for these \p{…} Unicode character classes.
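
To see what the proposed change does with the inputs from the task description, here is a minimal standalone sketch that runs both patterns over them. It is not Wikibase code: the helpers matchTokens() and describe() are hypothetical, and the "u" modifier is an assumption added so that the \p{…} classes are applied to UTF-8 input.

```php
<?php
// Both patterns are copied from the comment above. The trailing "u" modifier
// is an assumption of this sketch (UTF-8 mode for the \p{…} classes).
const PATTERN_BEFORE = '/^(-?[\d\p{L}]+)\s*?[\/\-\s.,]\s*(-?[\d\p{L}]+)$/u';
const PATTERN_AFTER = '/^[\p{P}\p{Z}]*?(-?[\p{L}\p{N}]+)\p{Z}*?[\p{P}\p{Z}]\p{Z}*(-?[\p{L}\p{N}]+)[\p{P}\p{Z}]*$/u';

// Hypothetical helper: returns the two captured tokens,
// or null when the pattern does not match at all.
function matchTokens(string $pattern, string $input): ?array {
    return preg_match($pattern, $input, $m) === 1 ? [$m[1], $m[2]] : null;
}

// Hypothetical helper: human-readable form of a match result.
function describe(?array $tokens): string {
    return $tokens === null ? 'no match' : implode(' | ', $tokens);
}

// The inputs listed in the task description.
$reported = [
    'April 1987.',
    'April 1987,',
    ', April 1987',
    '. April 1987',
    ', April 1987.',
    'April 1987...',
];

foreach ($reported as $input) {
    printf(
        "%-18s before: %-10s after: %s\n",
        '"' . $input . '"',
        describe(matchTokens(PATTERN_BEFORE, $input)),
        describe(matchTokens(PATTERN_AFTER, $input))
    );
}
```

The old pattern rejects every one of these strings, which is consistent with them ending up in the fallback parser. The new pattern captures "April" and "1987" for each of them. The sketch only covers the tokenizing step; how the parser interprets the two tokens afterwards is outside its scope.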

Properly testing this in YearMonthTimeParserTest is a must. Additionally, at least one relevant edge case should be added to TimeParserFactoryTest.
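
As a starting point for such tests, the following standalone sketch lists a few edge cases against the proposed pattern in a table-driven style. It deliberately avoids PHPUnit and the real YearMonthTimeParser so that it runs on its own. The helper splitYearMonth(), the case list, and the "u" modifier are assumptions of this sketch, not existing Wikibase code.

```php
<?php
// Proposed pattern from the comment above; the "u" modifier is an assumption.
const PROPOSED_PATTERN = '/^[\p{P}\p{Z}]*?(-?[\p{L}\p{N}]+)\p{Z}*?[\p{P}\p{Z}]\p{Z}*(-?[\p{L}\p{N}]+)[\p{P}\p{Z}]*$/u';

// Hypothetical helper: the two captured tokens, or null if there is no match.
function splitYearMonth(string $input): ?array {
    return preg_match(PROPOSED_PATTERN, $input, $m) === 1 ? [$m[1], $m[2]] : null;
}

// Input => expected tokens (null means the pattern must not match).
$cases = [
    'April 1987'   => ['April', '1987'],  // plain month and year
    '1987-04'      => ['1987', '04'],     // numeric, hyphen as separator
    '- April 2000' => ['April', '2000'],  // detached hyphen is dropped
    '-April 2000'  => ['-April', '2000'], // attached hyphen stays in token 1
    '1 April 1987' => null,               // three tokens: not a year-month input
    'April'        => null,               // a single token is not enough
    ''             => null,               // empty input
];

$failures = 0;
foreach ($cases as $input => $expected) {
    $actual = splitYearMonth((string)$input);
    $ok = $actual === $expected;
    $failures += $ok ? 0 : 1;
    echo $ok ? 'PASS' : 'FAIL', ' "', $input, '"', "\n";
}
echo $failures === 0 ? "All cases passed\n" : "$failures case(s) failed\n";
```

The two hyphen cases show that the pattern alone treats "- April 2000" and "-April 2000" differently: only the attached hyphen ends up inside a captured token. What the parser then does with that leading hyphen would have to be checked in the real test class.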

It also happens when the input is preceded by a hyphen: "- April 2000" and "-April 2000" both turn into "1 April 2000 BCE".

These don't contain a dot or comma, but while testing I also found that "91-04 bc" and "0091-04 bc" turn into "1 April 91 BCE" (and, for some reason, "0091-04-00 bc" turns into "31 March 91 BCE").