Duplication of dates due to different encodings
Open, Low, Public

Description

Problem:
We have two ways to express the same date when the precision is lower than month: one uses 00 for the day and month, the other uses 01. This can lead to an Item having two statements that look identical but are stored as different values.
The issue is currently being fixed by a bot after the fact, but it would be better to fix it before the data is entered.

Example:

BDD
GIVEN
AND
WHEN
AND
THEN
AND

Acceptance criteria:

Open questions:

  • Why do we currently allow both?
  • How much is this actually intentionally used?

Suggestion:
A) We should only have one way of describing a specific precision.

  • If the precision is "year" and the date is -01-01, set it to -00-00 instead.
  • If the precision is "month" and the day is 01, set it to 00 instead.
  • If the precision is "decade" or lower, always set the month and day to -00-00.
  • Ideally we would do this normalization when the edit is made.
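
The rules in option A can be sketched as a small function. This is a minimal illustration, not existing Wikibase code: the name `normalize_time` is hypothetical, and the sketch assumes time strings shaped like `+1950-01-01T00:00:00Z` with the numeric precision codes used in this task (11 = day, 10 = month, 9 = year, 8 = decade, 7 = century). It follows the bullets literally, so a year-precision value with a month and day other than -01-01 is left untouched, since option A does not say what to do with it.

```python
import re

# Precision codes as used in this task: 11 = day, 10 = month, 9 = year,
# 8 = decade, 7 = century (smaller numbers are coarser).
PRECISION_MONTH = 10
PRECISION_YEAR = 9
PRECISION_DECADE = 8

# Assumed shape of the stored value: sign, year, month, day, then the time part.
_TIME_RE = re.compile(r"^([+-]\d+)-(\d{2})-(\d{2})T(.*)$")


def normalize_time(time: str, precision: int) -> str:
    """Apply the option A rules to one time string (hypothetical helper)."""
    match = _TIME_RE.match(time)
    if match is None:
        raise ValueError(f"unrecognized time string: {time!r}")
    year, month, day, rest = match.groups()

    if precision <= PRECISION_DECADE:
        # "decade" or lower: always -00-00.
        month, day = "00", "00"
    elif precision == PRECISION_YEAR and (month, day) == ("01", "01"):
        # "year": only the -01-01 spelling is rewritten.
        month, day = "00", "00"
    elif precision == PRECISION_MONTH and day == "01":
        # "month": only day 01 is rewritten.
        day = "00"

    return f"{year}-{month}-{day}T{rest}"
```

Running this when the edit is saved, rather than afterwards, is what the last bullet asks for; where exactly such a hook would live is not covered by this sketch.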

B) We continue to allow specific dates even if the precision is year to allow more precision in between.

  • Similar to A, but if the precision is "year", setting a different day and month would still be possible (<- what about things that actually happened on -01-01?)
  • Changes to the UI are necessary if we want editors to understand and edit these kinds of intentional deviations (otherwise, the problems of the status quo would remain).

C) The same as A but we introduce precision "quarter" (or something similar) to allow more precision in between.

  • Identical to A but would require us to add an additional precision type.

E) Implement Extended Date/Time Format (EDTF) Specification

Original report:

Dates with precision lower than month (10), mainly dates with year precision (9) but also century (7) and others, can be encoded in different ways, which falsely duplicates values and generates single-value constraint violations. This is a known problem that also affects the use of QuickStatements (https://www.wikidata.org/w/index.php?title=Help:QuickStatements&oldid=1657403335#Removing_statements), whose users have to take both formats into account instead of just one.

For year precision (9): yyyy-01-01 vs yyyy-00-00
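
One way to spot such collisions is to compare only the components that the stated precision makes significant. The sketch below is illustrative only: `significant_key` and `find_duplicate_groups` are hypothetical names, values are assumed to be `(time, precision)` pairs with time strings like `+1950-01-01T00:00:00Z`, and any month or day stored under year precision (or coarser) is treated as insignificant, which is itself an assumption.

```python
from collections import defaultdict


def significant_key(time: str, precision: int) -> tuple:
    """Return the parts of a time string that matter at the given precision."""
    sign = time[0]
    year, month, rest = time[1:].split("-", 2)
    day = rest[:2]
    if precision <= 9:       # year or coarser: month and day are ignored
        return (precision, sign, year)
    if precision == 10:      # month: day is ignored
        return (precision, sign, year, month)
    return (precision, sign, year, month, day)


def find_duplicate_groups(values):
    """Group (time, precision) pairs that differ only in insignificant parts."""
    groups = defaultdict(set)
    for time, precision in values:
        groups[significant_key(time, precision)].add(time)
    # Keep only the groups that contain more than one distinct encoding.
    return [sorted(times) for times in groups.values() if len(times) > 1]


values = [
    ("+1950-01-01T00:00:00Z", 9),
    ("+1950-00-00T00:00:00Z", 9),
    ("+1951-00-00T00:00:00Z", 9),
]
duplicates = find_duplicate_groups(values)
```

Here `duplicates` holds a single group with the two 1950 encodings, while the 1951 value is left alone.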

Other documentation:

As of now, MatSuBot (operated by @matej_suchanek) merges claims that differ only by 00-00 vs 01-01, keeping the claim with sources so as to perform fewer edits; it finds the claims to merge through a query (see https://www.wikidata.org/w/index.php?title=Topic:Wxspna7q8jnn17u8).
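
The "prefer the claim with sources" choice can be illustrated as follows. This is not MatSuBot's actual code; `pick_claim_to_keep` is a hypothetical helper, and it assumes each claim is a dict in which an optional `references` list holds the sources. Keeping the sourced claim means the references do not have to be copied over in a separate edit.

```python
def pick_claim_to_keep(claim_a: dict, claim_b: dict) -> dict:
    """Return the claim with more references; on a tie, keep the first one."""
    refs_a = len(claim_a.get("references", []))
    refs_b = len(claim_b.get("references", []))
    return claim_b if refs_b > refs_a else claim_a
```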

The proposal is to allow only one format and to convert all existing dates with precision lower than 10 to that format.

See also: T221610: Date statements aren’t merged if the displayed text is the same but internal representation is different

Event Timeline

Related task, specifically for the problem in merging items: T221610

This task was discussed in the Bug Triage Hour at the Wikidata Data Quality Days 2022:

  • Considered harmful to the checking of constraint violations (it floods single-value violations) and to date re-use
  • The task was also mentioned in the session about inconsistent data modeling
  • B was suggested in the discussion to improve the sorting of uncertain dates.

This task was also raised several times in different project chats:

  • (all of them, or most, are listed in this task)
Lydia_Pintscher updated the task description.

Note: as of now, MatSuBot (operated by @matej_suchanek) only merges claims for date of birth (P569) and date of death (P570). The range of properties treated by the bot can be extended (see https://www.wikidata.org/w/index.php?title=Topic:Wxspna7q8jnn17u8), but for now duplicates in the other properties with datatype "time" (62 in total, so 60 properties) remain untreated.