Wikidata web form recognizes Japanese date 2001年8月31日 but not 2016年7月1日
Closed, ResolvedPublic

Description

Steps to reproduce:

  1. Go to any item (I used https://www.wikidata.org/wiki/Q28684004), add an "inception" field, which takes a date.
  2. Enter this date: 2001年8月31日
  3. The tool is clever enough to parse this Japanese date, fortunately.
  4. Enter this date: 2016年7月1日
  5. The tool does not understand this perfectly valid Japanese date, and says "The time value is malformed."

Screenshot from 2019-01-17 11-47-08.png (209×872 px, 23 KB)

Screenshot from 2019-01-17 11-46-44.png (197×867 px, 24 KB)

For your information, I copy-pasted these dates from https://ja.wikipedia.org/wiki/%E8%A1%8C%E5%B7%9D%E3%82%A2%E3%82%A4%E3%83%A9%E3%83%B3%E3%83%89 and https://ja.wikipedia.org/wiki/SGT%E7%BE%8E%E8%A1%93%E9%A4%A8 respectively.

Event Timeline

I can explain this behavior, if it helps. The first date is unambiguous. We do have a parser that ignores all punctuation. It checks whether it can find 3 numbers, and whether there is only one way these 3 numbers can be mapped to a year, month, and day. This works for an input like "2001 8 31", but not "2016 7 1". The latter can be January 7th or July 1st. The current set of parsers can't know and gives up.
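To illustrate the check described above, here is a minimal Python sketch. It is not the actual Wikibase code (which is PHP); the function names and the rule "a year is any number above 31" are assumptions made for this example, and it does not validate the day against the length of the month.

```python
import re
from itertools import permutations


def possible_readings(text):
    """Return every (year, month, day) reading of the three numbers in text."""
    # Everything that is not a digit is treated as a separator and ignored.
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    if len(numbers) != 3:
        return set()
    readings = set()
    for year, month, day in permutations(numbers):
        # Assumption: a number above 31 can only be a year.
        if year > 31 and 1 <= month <= 12 and 1 <= day <= 31:
            readings.add((year, month, day))
    return readings


def parse_if_unambiguous(text):
    """Return the single possible reading, or None if there are zero or several."""
    readings = possible_readings(text)
    return next(iter(readings)) if len(readings) == 1 else None


print(parse_if_unambiguous("2001年8月31日"))  # (2001, 8, 31)
print(parse_if_unambiguous("2016年7月1日"))   # None: July 1st or January 7th
```

Under these assumptions "2016年7月7日" would also parse, because swapping month and day gives the same reading.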

One possible solution is to add a new parser especially for Japanese dates.

年, 月 and 日 aren't punctuation though, they're Chinese characters/words meaning "year", "month" and "day" which are used to make the names of years, months and days (e.g. "8月" means "August"), so both dates are actually unambiguous.

It's not specific to Japanese either, it's the same across East Asia:

Dates are written the same in zh (including zh-hans, zh-hant, zh-cn, zh-hk, zh-mo, zh-my, zh-sg and zh-tw), gan (including gan-hans and gan-hant), hak, hsn and lzh. This format is also used (untranslated) by ami, ii, pwn, szy, tay, trv and za.

wuu and yue are similar, using "YYYY年M月D号" (simplified) and "YYYY年M月D號" (traditional), with the word 号 (simplified)/號 (traditional) instead of 日.

ko (and ko-kp) use "YYYY년 M월 D일", where 년, 월 and 일 are the hangul forms of 年, 月 and 日.

cdo and nan are using Latin orthographies and have "YYYY nièng M nguŏk D hô̤" and "YYYY-nî M-goe̍h D-ji̍t" respectively, where nièng and nî are the pronunciation of 年, nguŏk and goe̍h are the pronunciation of 月, hô̤ is the pronunciation of 號/号 and ji̍t is the pronunciation of 日.
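A parser that treats these characters as field markers rather than punctuation could look roughly like the Python sketch below. The pattern and the name `parse_east_asian_date` are hypothetical. It covers only the Han and hangul forms listed above (not the Latin orthographies of cdo and nan), and it is not how DateFormatParser is implemented.

```python
import re

# Year, month and day markers taken from the formats listed above.
DATE_PATTERN = re.compile(
    r"\s*(\d{1,4})\s*(?:年|년)"        # year
    r"\s*(\d{1,2})\s*(?:月|월)"        # month
    r"\s*(\d{1,2})\s*(?:日|号|號|일)\s*"  # day
)


def parse_east_asian_date(text):
    """Return (year, month, day), or None if text is not in one of the formats."""
    match = DATE_PATTERN.fullmatch(text)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    # Only a coarse range check; month lengths are not validated here.
    if not (1 <= month <= 12 and 1 <= day <= 31):
        return None
    return (year, month, day)


print(parse_east_asian_date("2016年7月1日"))    # (2016, 7, 1)
print(parse_east_asian_date("2016년 7월 1일"))  # (2016, 7, 1)
```

Because the markers say which number is the month and which is the day, "2016年7月1日" has only one reading here.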

Change 886919 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/Wikibase@master] Make DateFormatParser accept more Asian/Chinese date formats

https://gerrit.wikimedia.org/r/886919

Change 886920 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/Wikibase@master] Let DateFormatParser accept and skip redundant day of the week

https://gerrit.wikimedia.org/r/886920

So the first of the above changes, Make DateFormatParser accept more Asian/Chinese date formats, enables Wikibase to parse dates like “2023年3月12日” in Japanese as 2023-03-12. That change seems unproblematic and is good to go, I think.

However, the second one, Let DateFormatParser accept and skip redundant day of the week, is a bit trickier. With that change, all of the following would parse as 2023-03-12 in Japanese:

  1. 2023年3月12日 – as before
  2. 2023年3月12日 (日) – correct weekday (Sunday)
  3. 2023年3月12日 (一) – blank weekday
  4. 2023年3月12日 (whatever) – nonsense weekday
  5. 2023年3月12日 (月) – different weekday (Monday)

Thiemo mentioned some possible alternatives in the commit message:

  1. Stick with what the previous patch did and accept only inputs without a day of the week.
    • This would accept 1 and reject 2-5.
  2. Add an option to the parser similar to the existing "monthNames" option and require callers to list all possible names. Only these will be accepted.
    • This would accept 1, 2 and 5, and reject 3 and 4.
  3. Make an entirely separate parser for Chinese dates.
    • Something like this is required, I think, if we want to accept 2 and reject 5. Though I think it should be possible to handle this in a way that’s not specific to Chinese, but works in any language that contains named days of the week in the date format.
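To compare the alternatives above, here is a rough Python sketch with one mode per behavior. The mode names, the pattern and the Japanese weekday list are assumptions made for this illustration; none of it is taken from the Wikibase patches.

```python
import re
from datetime import date

# Japanese one-character weekday names, indexed like date.weekday() (Monday = 0).
WEEKDAYS_JA = ["月", "火", "水", "木", "金", "土", "日"]

PATTERN = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日(?:\s*\(([^)]*)\))?")


def parse(text, mode):
    """Parse text; mode decides how a parenthesised suffix is treated.

    "skip"   - ignore any suffix (what the second change does, as described)
    "reject" - option 1: accept only inputs without a suffix
    "listed" - option 2: accept only suffixes that are known weekday names
    "match"  - option 3: accept only the weekday that matches the date
    """
    m = PATTERN.fullmatch(text)
    if m is None:
        return None
    try:
        parsed = date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    except ValueError:  # e.g. month 13 or day 32
        return None
    suffix = m.group(4)
    if suffix is None or mode == "skip":
        return parsed
    if mode == "reject":
        return None
    if mode == "listed":
        return parsed if suffix in WEEKDAYS_JA else None
    if mode == "match":
        return parsed if suffix == WEEKDAYS_JA[parsed.weekday()] else None
    raise ValueError(f"unknown mode: {mode}")


inputs = ["2023年3月12日", "2023年3月12日 (日)", "2023年3月12日 (一)",
          "2023年3月12日 (whatever)", "2023年3月12日 (月)"]
for mode in ("skip", "reject", "listed", "match"):
    accepted = [i + 1 for i, text in enumerate(inputs) if parse(text, mode)]
    print(mode, accepted)
```

With the five inputs from the list above, "skip" accepts all of them, "reject" accepts only 1, "listed" accepts 1, 2 and 5, and "match" accepts 1 and 2.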

I think this is a product question (and, IIRC, not one we explicitly discussed in yesterday’s bug triage hour): when users input a date with a named day of the week, and the day of the week doesn’t match the date (as in “2023年3月12日 (月)”), should the date successfully parse (ignoring the named day of the week) or not?

IMHO, it’s at least conceivable that users would specify the day of the week as an intentional redundancy, and expect the software to check that they didn’t make a mistake in either the numeric date or the day of the week. On the other hand, I don’t think it’s realistic for us to return a specific error like “this date doesn’t match the specified day of the week”; the result would just be a parse error, or worse, Wikibase might even fall back to another parser and still successfully parse the date: just like it’s already able to parse “2001年8月31日”, for instance.

So the first of the above changes, Make DateFormatParser accept more Asian/Chinese date formats, enables Wikibase to parse dates like “2023年3月12日” in Japanese as 2023-03-12. That change seems unproblematic and is good to go, I think.

However, the second one, Let DateFormatParser accept and skip redundant day of the week, is a bit trickier. With that change, all of the following would parse as 2023-03-12 in Japanese:

  1. 2023年3月12日 – as before
  2. 2023年3月12日 (日) – correct weekday (Sunday)
  3. 2023年3月12日 (一) – blank weekday
  4. 2023年3月12日 (whatever) – nonsense weekday
  5. 2023年3月12日 (月) – different weekday (Monday)

That sounds reasonable to me. It seems plausible that the bit after the date could sometimes not actually be the weekday (e.g. "the choice we made on 2023-03-12 (Monday) doesn't work so we need to pick a different day") and I would probably be pleased that it extracted the date from the input instead of making me delete all the extra text first.

I think this is a product question (and, IIRC, not one we explicitly discussed in yesterday’s bug triage hour): when users input a date with a named day of the week, and the day of the week doesn’t match the date (as in “2023年3月12日 (月)”), should the date successfully parse (ignoring the named day of the week) or not?

IMHO, it’s at least conceivable that users would specify the day of the week as an intentional redundancy, and expect the software to check that they didn’t make a mistake in either the numeric date or the day of the week. On the other hand, I don’t think it’s realistic for us to return a specific error like “this date doesn’t match the specified day of the week”; the result would just be a parse error, or worse, Wikibase might even fall back to another parser and still successfully parse the date: just like it’s already able to parse “2001年8月31日”, for instance.

I would interpret a parse error in this situation as "we don't support including the day of the week" and delete it from the input, leaving just "2023年3月12日", so I think it would only make sense to check whether the day of the week matches if it also has an error message for that situation.

or worse, Wikibase might even fall back to another parser and still successfully parse the date […]

… which might be good or bad:

  • Unambiguous dates will be parsed just fine by one of the later, more relaxed parsers. These will ignore the day of the week anyway. Making an earlier parser more strict doesn't mean "2023年3月12日 (月)" will be rejected. We would need to change the entire approach to give this guarantee – which I think we shouldn't do.
  • We know the final PHP date parser can sometimes produce weird results. It's always better to make an earlier parser accept a date when we have enough certainty. I think this is the case here. The day of the week at the end of "2023年3月12日 (月)" can be wrong for many reasons. It's very unlikely such a mismatch means the date is entirely wrong and needs to be rejected, in my opinion. And even if it did, we don't have a good way to tell the user.
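The fallback behavior described in the first point can be sketched in Python as a chain of parsers where the first success wins. Both parsers and all names here are simplified stand-ins invented for this example, not the real Wikibase parser chain.

```python
import re
from itertools import permutations


def strict_parser(text):
    """Accept only 'YYYY年M月D日' with nothing after it."""
    m = re.fullmatch(r"(\d{4})年(\d{1,2})月(\d{1,2})日", text)
    return tuple(int(g) for g in m.groups()) if m else None


def relaxed_parser(text):
    """Ignore everything but the numbers; accept if only one reading exists."""
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    if len(numbers) != 3:
        return None
    readings = {
        (y, m, d)
        for y, m, d in permutations(numbers)
        if y > 31 and 1 <= m <= 12 and 1 <= d <= 31  # assumption: year > 31
    }
    return readings.pop() if len(readings) == 1 else None


def parse_with_fallback(text, parsers):
    """Try each parser in order and return the first result that is not None."""
    for parser in parsers:
        result = parser(text)
        if result is not None:
            return result
    return None


chain = [strict_parser, relaxed_parser]
print(parse_with_fallback("2001年8月31日 (月)", chain))  # (2001, 8, 31)
print(parse_with_fallback("2023年3月12日 (月)", chain))  # None
```

In this sketch the strict parser rejects "2001年8月31日 (月)" because of the suffix, but the relaxed parser still accepts it, since the numbers have only one reading. "2023年3月12日 (月)" fails in both: the suffix stops the strict parser, and 3 and 12 are ambiguous for the relaxed one.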

Change 886919 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Make DateFormatParser accept more Asian/Chinese date formats

https://gerrit.wikimedia.org/r/886919

Thiemo mentioned some possible alternatives in the commit message:

  1. Stick with what the previous patch did and accept only inputs without a day of the week.
    • This would accept 1 and reject 2-5.

That option seems not so great to me because according to Wikipedia the most common format does include the day of the week explicitly.

  2. Add an option to the parser similar to the existing "monthNames" option and require callers to list all possible names. Only these will be accepted.
    • This would accept 1, 2 and 5, and reject 3 and 4.
  3. Make an entirely separate parser for Chinese dates.
    • Something like this is required, I think, if we want to accept 2 and reject 5. Though I think it should be possible to handle this in a way that’s not specific to Chinese, but works in any language that contains named days of the week in the date format.

I think this is a product question (and, IIRC, not one we explicitly discussed in yesterday’s bug triage hour): when users input a date with a named day of the week, and the day of the week doesn’t match the date (as in “2023年3月12日 (月)”), should the date successfully parse (ignoring the named day of the week) or not?

I am leaning towards parsing it in this case and assuming that the day name is wrong, not the day number.

Change 886920 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Let DateFormatParser accept and skip redundant day of the week

https://gerrit.wikimedia.org/r/886920

Arian_Bozorg claimed this task.
Arian_Bozorg subscribed.

It looks like option 2 is the one we went with,

So, 2016年7月1日 doesn't work, but 2016年7月7日 works (the weekday must be correct)

Thanks so much for this :)