Page MenuHomePhabricator

Month and year are sometimes parsed as the first day of the month
Closed, ResolvedPublic5 Estimated Story PointsBUG REPORT

Description

Steps to Reproduce

  1. Go to an item (e.g. sandbox) with Czech (cs) as the interface language.
  2. Add a new property with time datatype (e.g. P585).
  3. Write a month in Czech (see below) and a year (e.g. 2019).

Actual Results

  • For some months, the preview will show the first day of the month:
    • Namely "březen" (3), "květen" (5), "červen" (6), "červenec" (7), "září" (9), "říjen" (10).
  • The rest is parsed correctly: "leden" (1), "únor" (2), "duben" (4), "srpen" (8), "listopad" (11), "prosinec" (12).
    • Possible explanation: All problematic months contain characters with caron. It's possible that the parser (YearMonthTimeParser?) doesn't handle special characters correctly.
  • The same problem happens with all English months (still with the interface in Czech).

obrazek.png (231×896 px, 14 KB)

obrazek.png (240×802 px, 14 KB)

obrazek.png (214×778 px, 13 KB)

Expected Results

  • The value is always parsed to the "month" precision.

Event Timeline

data-values\time\src\ValueParsers\YearMonthTimeParser.php
$logger = \MediaWiki\Logger\LoggerFactory::getInstance( 'MyCoolLoggingChannel' );
$logger->debug( 'stringParse: {value}', [ 'value' => $value ] );
// Matches year and month separated by a separator.
// \p{L} matches letters outside the ASCII range.
if ( !preg_match( '/^(-?[\d\p{L}]+)\s*?[\/\-\s.,]\s*(-?[\d\p{L}]+)$/', trim( $value ), $matches ) ) {
  throw new ParseException( 'Failed to parse year and month', $value, self::FORMAT_NAME );
}
$logger->debug( 'stringParse: ok' );

obrazek.png (555×600 px, 30 KB)

Works for some months but not for all. The \p{L} matches letters outside the ASCII range. trick apparently doesn't work.
@Lucas_Werkmeister_WMDE @Addshore @Ladsgroup Could you help?

Found https://stackoverflow.com/questions/26611495/regex-pl-problems
Solution:

if ( !preg_match( '/^(-?[\d\p{L}]+)\s*?[\/\-\s.,]\s*(-?[\d\p{L}]+)$/u', trim( $value ), $matches ) ) {
  throw new ParseException( 'Failed to parse year and month', $value, self::FORMAT_NAME );
}

(Add a u as modifier for the regex)

Bingo! I could really have tried Google. Weird that https://regex101.com/ (which I used) still matches without /u.

Patch-For-Review: https://github.com/wmde/Time/pull/146

I tried Regex101 too and after that I tried the same with a small php script which I ran on command line and that didn't work. That's when I started searching for what was wrong.

The patch doesn't fix this for dates BCE. When approved, I'll submit one more PR for this.

Change 620954 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] WIP DNM:All unicode in month names (using new data-values/time)

https://gerrit.wikimedia.org/r/620954

The patch doesn't fix this for dates BCE. When approved, I'll submit one more PR for this.

https://github.com/wmde/Time/pull/148

Change 621524 merged by jenkins-bot:
[mediawiki/vendor@master] Bump data-values/time to 1.0.2

https://gerrit.wikimedia.org/r/621524

Change 620954 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Test all unicode in month names (using new data-values/time)

https://gerrit.wikimedia.org/r/620954

This seems to be fixed as far as I can tell? In which case I assume it could be pulled out of the “prioritized” column :)

I requested one more improvement to improve compat. with BCE dates. But it solves a todo, it isn't really high priority: https://github.com/wmde/Time/pull/148

Screenshot of the issue that the pull request would hopefully fix:

Screenshot from 2021-06-29 10-36-36.png (176×508 px, 13 KB)

String for easy copy+pasting: září 2019 BCE

Addshore set the point value for this task to 5.Jun 30 2021, 10:49 AM

It turns out the pull request introduces some changes that may not be wanted. Currently (since Partial support for thousands separators in DateTimeParsers), we allow thousands separators in BC dates, under the assumption that “negative dates usually don’t have a month”. So you currently get the following parses (API request):

  • 1 BC: -0001-00-00T00:00:00Z, precision 9
  • 10 BC: -0010-00-00T00:00:00Z, precision 9
  • 100 BC: -0100-00-00T00:00:00Z, precision 9
  • 1 000 BC: -1000-00-00T00:00:00Z, precision 9
  • 10 000 BC: -10000-00-00T00:00:00Z, precision 5
  • 100 000 BC: -100000-00-00T00:00:00Z, precision 4
  • 1 000 000 BC: -1000000-00-00T00:00:00Z, precision 3
  • 10 000 000 BC: -10000000-00-00T00:00:00Z, precision 2

(The precision jumps because we “default to year precision for range 4000 BC to 4000” in IsoTimestampParser::getPrecisionFromYear().)

With the change to the YearMonthTimeParser, you instead get:

  • 1 BC: -0001-00-00T00:00:00Z, precision 9
  • 10 BC: -0010-00-00T00:00:00Z, precision 9
  • 100 BC: -0100-00-00T00:00:00Z, precision 9
  • 1 000 BC: -0000-01-00T00:00:00Z, precision 10
  • 10 000 BC: -0000-10-00T00:00:00Z, precision 10
  • 100 000 BC: -100000-00-00T00:00:00Z, precision 4
  • 1 000 000 BC: -1000000-00-00T00:00:00Z, precision 3
  • 10 000 000 BC: -10000000-00-00T00:00:00Z, precision 2

Put differently: do we want to parse “1 234 BC” as 1234 BC, or as January 234 BC?

Hi @Lucas_Werkmeister_WMDE, great that you spotted this! I see no reason to change the current behavior here: Let's continue to parse negative dates as is (e.g. “1 234 BC” as 1234 BC).

Pull request merged; we probably want to put out a new release of the library now, and use it in Wikibase / mediawiki-vendor?

Change 704792 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/vendor@master] Bump data-values/time to 1.0.4

https://gerrit.wikimedia.org/r/704792

Change 704793 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] Update data-values/time to 1.0.4

https://gerrit.wikimedia.org/r/704793

Change 704792 merged by jenkins-bot:

[mediawiki/vendor@master] Bump data-values/time to 1.0.4

https://gerrit.wikimedia.org/r/704792

Change 704793 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Update data-values/time to 1.0.4

https://gerrit.wikimedia.org/r/704793

This should now be fixed in production.
Ping @Manuel for verification

Manuel claimed this task.

This is fixed! Thank you for reporting it @matej_suchanek!