[Task] Dates written as 9/7/2017 should not always be parsed as American MM/DD/YYYY
Open, Low, Public

Description

When information is entered into a property field with 'time' datatype (eg start time, P580), the system makes an attempt to parse what's been entered and convert it into a coherent standardised date, then shows this to the user so they can change it if it's been misinterpreted.

Dates entered as xx-yy-zzzz are challenging for the parser, as they might be intended as MM-DD or DD-MM. The algorithm seems to check whether xx or yy is greater than 12 and, if so, treats that number as the day element; but if both are twelve or less, the date is still ambiguous. At this point, the parser's behaviour gets a bit strange.

Dates written as 9 7 2017, 9.7.2017, 9,7,2017, or 9-7-2017 - with the digits separated by spaces, dots, commas, or hyphens - are interpreted as 9 July - ie DD-MM-YYYY.

Dates written as 9/7/2017 - with the digits separated by forward slashes - are interpreted as September 7, ie MM-DD-YYYY.

There does not seem to be any clear reason for this inconsistency, and it's a bit confusing for end users. It looks like the intended outcome is for dates to default to DD-MM-YYYY unless obviously MM-DD, but this doesn't seem to be reliably implemented.
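
To make the reported behaviour concrete, here is a minimal Python sketch of the rules as observed from the outside. It is not the actual Wikibase code; the function name `parse_observed` and the regular expression are assumptions made purely for illustration.

```python
import re
from datetime import date


def parse_observed(text: str) -> date:
    """Mimic the behaviour described in this task (illustrative only)."""
    # Two 1-2 digit numbers and a 4-digit year, joined by one repeated separator.
    m = re.fullmatch(r"(\d{1,2})([ .,/-])(\d{1,2})\2(\d{4})", text.strip())
    if m is None:
        raise ValueError(f"unrecognised date: {text!r}")
    first, sep, second, year = int(m[1]), m[2], int(m[3]), int(m[4])

    if first > 12 and second <= 12:
        # Only the first number can be a day.
        day, month = first, second
    elif second > 12 and first <= 12:
        # Only the second number can be a day.
        month, day = first, second
    elif sep == "/":
        # Ambiguous, slashes: treated as MM/DD/YYYY.
        month, day = first, second
    else:
        # Ambiguous, any other separator: treated as DD-MM-YYYY.
        day, month = first, second
    return date(year, month, day)


print(parse_observed("9-7-2017"))  # 2017-07-09
print(parse_observed("9/7/2017"))  # 2017-09-07
```

The same two digits give two different dates, and the only thing that changed is the separator.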

(As an aside, most other punctuation causes an invalid date - the one unusual exception is |, where 9|7|2017 is parsed as the year "9" - apparently everything after the first pipe is ignored. But as this is an incredibly odd way to represent dates, it probably doesn't matter...)

Event Timeline

thiemowmde triaged this task as Low priority. Edited Jun 28 2017, 7:25 PM

I'm impressed by this really comprehensive task description. What's described there is exactly what happens internally. Thanks a lot for putting so much work into this!

Dates written as 9/7/2017 […] are interpreted as […] MM-DD-YYYY.

This describes the current situation. We do have a chain of half a dozen date parsers. There is one that checks if it can find one number that is greater than 31 (that must be the year), and another number between 13 and 31 (that must be the day). This parser ignores punctuation. The last parser takes punctuation into account because slashes are a hint at the American format MM/DD/YYYY, while all other punctuation characters typically hint at DD.MM.YYYY.
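
A rough Python sketch of such a chain, reduced to the two parsers described above, might look like the following. All names here (`unambiguous_parser`, `punctuation_fallback`, `parse_with_chain`, `CHAIN`) are hypothetical, and the real chain is PHP code with more parsers than shown.

```python
import re
from datetime import date
from typing import Callable, Optional, Sequence

Parser = Callable[[str], Optional[date]]


def unambiguous_parser(text: str) -> Optional[date]:
    """Ignore punctuation; succeed only if year and day are identifiable."""
    nums = [int(n) for n in re.findall(r"\d+", text)]
    if len(nums) != 3:
        return None
    years = [n for n in nums if n > 31]        # must be the year
    days = [n for n in nums if 13 <= n <= 31]  # must be the day
    if len(years) != 1 or len(days) != 1:
        return None
    rest = list(nums)
    rest.remove(years[0])
    rest.remove(days[0])
    month = rest[0]
    if not 1 <= month <= 12:
        return None
    try:
        return date(years[0], month, days[0])
    except ValueError:
        return None


def punctuation_fallback(text: str) -> Optional[date]:
    """Last resort: slashes mean MM/DD/YYYY, anything else DD.MM.YYYY."""
    m = re.fullmatch(r"(\d{1,2})([ .,/-])(\d{1,2})\2(\d{4})", text.strip())
    if m is None:
        return None
    a, b, year = int(m[1]), int(m[3]), int(m[4])
    month, day = (a, b) if m[2] == "/" else (b, a)
    try:
        return date(year, month, day)
    except ValueError:
        return None


def parse_with_chain(text: str, chain: Sequence[Parser]) -> Optional[date]:
    """Try each parser in order; the first one that returns a date wins."""
    for parser in chain:
        result = parser(text)
        if result is not None:
            return result
    return None


CHAIN = [unambiguous_parser, punctuation_fallback]
print(parse_with_chain("25/12/2017", CHAIN))  # 2017-12-25
print(parse_with_chain("9/7/2017", CHAIN))    # 2017-09-07
```

In this sketch "25/12/2017" never reaches the punctuation-aware parser, so the slash only matters when the earlier parser gives up.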

This is not an error. I'm afraid there is nothing I can do here, and have to close this ticket as invalid.

Note that we want to take the user's language into account when parsing dates. We are working on code that will allow us to do so. This is already tracked in about a dozen other tickets (see T87764).

This all sounds great until...

...because slashes are a hint at the American format MM/DD/YYYY, while all other punctuation characters typically hint at DD.MM.YYYY.

...which sounds completely weird to me. In fact, I assumed it was obviously not the explanation when I first reported this on WD:PC - "It can't be that using slashes is unique to MDY notation - I've been writing DMY dates this way all my life." :-)

Before this I'd never heard of the idea that slashes are distinctively American - it's true that it's the punctuation Americans most commonly use, but so do a lot of other people. WP's list of common national date styles has about even numbers of dd/mm and dd.mm.

This last parser sounds like it's introducing ambiguity and inconsistency that really doesn't need to be there, and there's no reason to assume that users will expect this behaviour. Can we not just turn this bit off and leave the other parsers to do their work?

I just described how our code behaves right now, just as you did. This is not meant to be set in stone. The real world is complicated, which is why we created Wikidata in the first place.

We know the last PhpDateTimeParser in the chain is bad. But it's still better than nothing. Turning it off right now means you will get an error message when entering "9/7/2017". I don't think this makes the situation better.

I'm afraid there is no way any parser can ever be sure what "9/7/2017" means. Even if we use your IP address or location (something we should never do because of privacy reasons), or the interface language from your preferences, it might still either mean MM/DD/YYYY or DD/MM/YYYY. The only thing we can (and want to) use is the date format from your personal preferences at https://www.wikidata.org/wiki/Special:Preferences#mw-prefsection-rendering. See https://gerrit.wikimedia.org/r/153211.

thiemowmde renamed this task from "Inconsistent parsing of entered date values in Wikidata - punctuation causes problems" to "[Task] Dates written as 9/7/2017 should not always be parsed as American MM/DD/YYYY". Jun 29 2017, 11:05 AM
thiemowmde moved this task from incoming to needs discussion or investigation on the Wikidata board.

[Apologies - forgot to ever put in a response to this. Thanks for looking into it.]

I agree that 07/09/2017 is always a bit ambiguous and we can never reliably say what the user means. But I guess what's confusing me here is that 07-09-2017 or 07 09 2017 are also a bit ambiguous and we can never reliably say what the user means, even if we think it's a little more likely to be one rather than the other.

It feels like we should either treat them all the same way, defaulting to DMY, or reject them all as invalid and ask the user to resubmit. I don't understand what's gained by having a different presumption based solely on the punctuation used. However, I guess that if this weird behaviour is something we're inheriting from PHP, there's not much we can do about it!

Based on https://en.wikipedia.org/wiki/Date_format_by_country, apart from two English-speaking islands, the United States is the only country to interpret 07/09/2017 as MM/DD/YYYY.
So for languages other than English, DD/MM/YYYY seems to be the norm.

The time zone as defined in https://www.wikidata.org/wiki/Special:Preferences#mw-prefsection-rendering can also be used to check if the user is in the United States. If not, DD/MM/YYYY can also be used. No need to use the IP address.

By combining these two approaches, the number of wrongly recognized dates should be minimal.

There are a few other countries in the same time zones...

Those time zones are defined by country, not by hour differential, so that's not a problem.

So far as I am aware, the system only cares about offset from UTC. So no, it is a problem.

Indeed, I thought the time zone setting with city had been automatically set, but it turns out it was me at some point earlier.
So no luck with the time zone setting. :(

I'm afraid there is no way any parser can ever be sure what "9/7/2017" means. Even if we use your IP address or location (something we should never do because of privacy reasons), or the interface language from your preferences, it might still either mean MM/DD/YYYY or DD/MM/YYYY. The only thing we can (and want to) use is the date format from your personal preferences at https://www.wikidata.org/wiki/Special:Preferences#mw-prefsection-rendering. See https://gerrit.wikimedia.org/r/153211.

Actually, we could use the location, as the Compact Language Links do.
Combined if necessary with the user-set language, that should give us a pretty good guess.

We could maybe have a preference, such that the user can specify whether they want date parsing to be based on an assumption of MM/DD/YYYY or DD/MM/YYYY. Stop guessing. Ask the users.
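
As an illustration of that idea, a parser driven by an explicit preference could look roughly like this Python sketch. The function name `parse_with_preference` and the `order` values `"dmy"` and `"mdy"` are made up for this example; no such preference exists today.

```python
import re
from datetime import date


def parse_with_preference(text: str, order: str = "dmy") -> date:
    """Resolve ambiguous dates from a user-chosen order, not from punctuation."""
    if order not in ("dmy", "mdy"):
        raise ValueError(f"unknown order: {order!r}")
    m = re.fullmatch(r"(\d{1,2})[ .,/-](\d{1,2})[ .,/-](\d{4})", text.strip())
    if m is None:
        raise ValueError(f"unrecognised date: {text!r}")
    a, b, year = int(m[1]), int(m[2]), int(m[3])

    if a > 12 and b <= 12:
        # Only one reading is possible: a is the day.
        day, month = a, b
    elif b > 12 and a <= 12:
        # Only one reading is possible: b is the day.
        month, day = a, b
    elif order == "mdy":
        month, day = a, b
    else:
        day, month = a, b
    return date(year, month, day)


print(parse_with_preference("9/7/2017"))         # 2017-07-09
print(parse_with_preference("9/7/2017", "mdy"))  # 2017-09-07
print(parse_with_preference("9-7-2017", "mdy"))  # 2017-09-07
```

With this approach the separator no longer changes the result; only the stated preference does.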

I see this bug is 2.5 years old. Which is fine. Because it only sucks hugely every single time I enter a date where month & day values are below 13. No rush.

As far as I understand the situation, it is not considered "broken". Whatever the user does, they must check the preview to see if the date parser's guess is correct. This won't change, no matter how good the parser's guess is.

But yes, it's absolutely possible to improve the parser even further. Unfortunately, I'm not actively working on this code any more myself.

Personally, I like the idea of checking the user's time zone, if they provided one in their preferences. This might sound strange. After all, the user is not copy-pasting dates that are about the user, right? They might paste from all kinds of sources, American and non-American. But I believe this would still be tremendously helpful, because the expert's behavior would then match the user's expectation much more closely. An American user is used to 9/7/2017 becoming "September", while non-Americans are much more used to it becoming "July". Also, I believe we can safely assume that non-American users are more likely to paste dates from non-American sources. The number of mistakes (e.g. when the user does not check the preview carefully) would be much lower this way, I feel.

As a workaround it might be possible to create a gadget that kicks in when it detects an input with slashes and both numbers below 13, and asks which of the two possible interpretations the user wants. Such a gadget can be written by the community or by the Wikidata team as an opt-in feature for users who struggle with this.
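
The detection step of such a gadget is small. The sketch below shows the logic in Python rather than as an actual gadget (which would be JavaScript running in the browser); `candidate_interpretations` is a hypothetical name, and the prompt shown to the user is left out.

```python
import re
from datetime import date
from typing import List


def candidate_interpretations(text: str) -> List[date]:
    """Return both readings of an ambiguous slash date, else an empty list."""
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", text.strip())
    if m is None:
        return []  # no slashes: leave it to the normal parsers
    a, b, year = int(m[1]), int(m[2]), int(m[3])
    if year < 1:
        return []
    if not (1 <= a <= 12 and 1 <= b <= 12):
        return []  # one number can only be a day, so nothing to ask
    if a == b:
        return []  # both readings give the same date
    # DD/MM/YYYY reading first, MM/DD/YYYY reading second.
    return [date(year, b, a), date(year, a, b)]


print(candidate_interpretations("9/7/2017"))
# [datetime.date(2017, 7, 9), datetime.date(2017, 9, 7)]
print(candidate_interpretations("13/7/2017"))  # []
```

An empty list means there is nothing to ask; a two-element list is the pair of choices to offer the user.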

UI that can mangle input, and that demands the user do something beyond simply entering the date, is unambiguously, categorically and unimpeachably broken. It should be considered as such.

I'm afraid such a comment won't make anybody work on this harder.