[Task] Run bot to mark dates that need checking for calendar model correctness
Closed, ResolvedPublic

Description

We need to run a bot that adds a qualifier to all statements with a potentially wrong date. The qualifier should be something like "instance of: Wikidata date needing calendar model check".

Things to do:

  • open bot approval request on Wikidata
  • look into how far back we care about the calendar model to be checked. Stas brought up things like the start of the universe where it really doesn't matter. Where is the cut-off?
  • get most up-to-date list of problematic dates
  • run bot

Tags added:

Lydia_Pintscher updated the task description. (Show Details)
Lydia_Pintscher raised the priority of this task from to High.
Lydia_Pintscher added a project: Wikidata.
Lydia_Pintscher moved this task to consider for next sprint on the Wikidata board.
Restricted Application added a subscriber: Aklapper. Jul 7 2015, 11:31 PM
Addshore claimed this task. Jul 22 2015, 9:27 AM
Addshore set Security to None.
Jc3s5h added a subscriber: Jc3s5h. Jul 22 2015, 2:00 PM

Old dates might have been determined from historical records, or might have been determined by astronomical calculations (date of an eclipse, for example). Morrison and Stephenson provide an equation showing the difference between counting actual days by sunrises or sunsets, compared to using the equations of motion of the solar system in "Historical Values of the Earth's Clock Error ΔT and the Calculation of Eclipses" in the Journal for the History of Astronomy v. 35 (2004) p. 332. In around 3800 BCE the difference would have been one full day. So I suggest there is no need to check dates earlier than 3800 BC.

@Jc3s5h: Sounds good. Can we maybe go even further?

{{ping|Lydia_Pintscher}} I am not an expert on ancient calendars. Duncan Steel in Marking Time: The Epic Quest to Invent The Perfect Calendar (Wiley, 2000, page 36) mentions Sumerian records around 3500 BC. I don't know if any of these can be connected to a modern calendar. Unless an expert comes along who can tell us about the earliest possible connection between ancient records and the modern calendars, I'd stick with 3800 BC.

Jonas renamed this task from run bot to mark dates that need checking for calendar model correctness to [Task] Run bot to mark dates that need checking for calendar model correctness. Aug 13 2015, 2:46 PM
Addshore changed the task status from Open to Stalled. Aug 14 2015, 3:23 PM
Lydia_Pintscher changed the task status from Stalled to Open.

Proposing for next sprint since the bot has been approved now (well, half a year ago), see https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/Addbot_5

It might be useful to state what range of dates will be checked and what the source of the dates will be, since JSON and RDF state dates in different formats. In the flavor of RDF used in this link, the form of the date depends on whether the software that produces the RDF was able to convert the date from Julian to Gregorian. Such irregularities might cause some dates that should be marked to be missed.

Addshore added a comment. Edited Sep 12 2016, 5:54 PM

The code that will be used to generate the lists can be seen at https://github.com/wmde/wikidata-analysis/pull/9 (although the code is now on gerrit)

The conditions for list 1 are at https://github.com/wmde/wikidata-analysis/blob/6bc2e46bcd2bd92213f1e56b9b0a1604f1f46604/java/analyzer/src/main/java/org/wikidata/analyzer/Processor/BadDateProcessor.java#L38
The conditions for list 2 are at https://github.com/wmde/wikidata-analysis/blob/6bc2e46bcd2bd92213f1e56b9b0a1604f1f46604/java/analyzer/src/main/java/org/wikidata/analyzer/Processor/BadDateProcessor.java#L48

The source of the dates is the values stored in the backend JSON, so what is in the database, and what you can see through the API.

Jc3s5h added a comment. Edited Sep 12 2016, 10:50 PM

Thanks to Addshore for the answer of Sep. 12, 17:54 UT.

I'm not fully up on all the jargon.

Is this link:

https://www.wikidata.org/wiki/Special:EntityData/Q4115189.json

an example of what Addshore referred to in the statement "The source of the dates are the values stored in the backend JSON, so what is in the database, and what you can see through the API"?

Starting with list 2, it appears that any Gregorian calendar date with a year less than 1584 will be on the list (and being on the list means the date is suspect). Not too bad, although we really have to throw up our hands for truly old dates such as the creation of the universe. I really think the maximum safe year for Gregorian should be 1583 rather than 1582, since the change became effective 15 October 1582 at the earliest.

For list 1, it appears that all Julian calendar dates with precision equal to or better than month will be on the list, even if they are before 1584. If the year is 1582 or earlier, isn't it likely that Julian is the most appropriate calendar?

Good news. Some stats:

Dates in Gregorian calendar < "+1582-10-15T00:00:00Z" with precision > 9 (no qualifiers). The query is on the property documentation.

P569: 4899
P570: 18332
P580: 40
P582: 300
P585: 106

Addshore added a comment. Edited Sep 19 2016, 9:08 AM

I have just rerun the script I wrote rather a long time ago, and here are the lists that came out:

Dates marked as Julian that are more precise than year
24945 statements
P4067

Dates marked as Gregorian, before 1584
130388 statements
P4068

Roughly 155,000 statements in total.
This only looks at the main snaks of statements and does not account for dates in qualifiers or references.
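The two list conditions described above can be sketched roughly as follows. This is only an illustration of the criteria as stated in this thread, not the actual BadDateProcessor code; the helper names `year_of` and `classify` are made up, and the input is assumed to be the `value` object of a Wikibase time datavalue as it appears in the entity JSON.

```python
import re

GREGORIAN = "http://www.wikidata.org/entity/Q1985727"  # proleptic Gregorian calendar
JULIAN = "http://www.wikidata.org/entity/Q1985786"     # proleptic Julian calendar
YEAR_PRECISION = 9  # Wikibase precision codes: 9 = year, 10 = month, 11 = day


def year_of(time_string):
    """Extract the signed year from a time string such as '+1582-10-15T00:00:00Z'."""
    match = re.match(r"([+-]\d+)-", time_string)
    if match is None:
        raise ValueError(f"unrecognised time string: {time_string!r}")
    return int(match.group(1))


def classify(time_value):
    """Return 'list1', 'list2' or None for one time value (hypothetical helper)."""
    calendar = time_value["calendarmodel"]
    if calendar == JULIAN and time_value["precision"] > YEAR_PRECISION:
        return "list1"  # marked as Julian, more precise than year
    if calendar == GREGORIAN and year_of(time_value["time"]) < 1584:
        return "list2"  # marked as Gregorian, before 1584
    return None
```

Adding `time_value["precision"] > YEAR_PRECISION` to the second condition would give the narrower variant of list 2 that is asked about in the comments.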

If you limit Gregorian before 1584 (or 1582) to precision > 9, do you get comparable numbers?

That being said, many of the years marked as Gregorian on https://tools.wmflabs.org/reasonator/?q=Q208233&lang=en seem to be off by one year.
Unfortunately not all. For years before year 1, probably all of them.

Maybe a special check for anything before year 1 is needed.

It looks like adding the restriction of precision > 9 to the second list brings the number of matching statements down dramatically (then roughly the same for each list).

You can run this dump scan yourself using:
https://github.com/wikimedia/analytics-wmde-toolkit-analyzer
or use the prebuilt jar:
https://github.com/wikimedia/analytics-wmde-toolkit-analyzer-build
with a command something like:

java -Xmx2g -jar ./toolkit-analyzer.jar --processors BadDate --store ~/data --latest

There was confusion about how to enter years less than one, so each and every year needs to be reviewed. As for Reasonator, it displays the death date of Julius Caesar as 44 even though it is in Wikidata as 44 BCE, so Reasonator is in serious need of repair.
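One plausible source of such off-by-one errors is mixing two year-numbering conventions: historical numbering, where 1 BCE is followed directly by 1 CE, and astronomical numbering, where year 0 stands for 1 BCE. The sketch below only illustrates that arithmetic; the function names are made up, and it makes no claim about which convention any particular tool uses.

```python
def bce_to_astronomical(bce_year):
    """Convert a historical BCE year (1, 2, 3, ...) to astronomical numbering."""
    if bce_year < 1:
        raise ValueError("BCE years start at 1")
    return 1 - bce_year  # 1 BCE -> 0, 2 BCE -> -1, 44 BCE -> -43


def astronomical_to_label(year):
    """Render an astronomically numbered year as a historical label."""
    if year > 0:
        return f"{year} CE"
    return f"{1 - year} BCE"  # 0 -> 1 BCE, -43 -> 44 BCE
```

Under these definitions, a stored value of -44 reads as 44 BCE in historical numbering but as 45 BCE in astronomical numbering, which is exactly a one-year discrepancy.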

Addshore moved this task from Backlog to In Progress on the User-Addshore board. Sep 22 2016, 11:53 AM

Now that the bot has begun to mark items, what is the procedure to follow when a marked item has been reviewed by an editor and found to be correct?

@Jc3s5h This should be as simple as removing the instance of qualifier.
We of course have the lists of all statements that will be touched in this run of the script.
If we do ever do a future run we will still have that list of guids that we can avoid!

Esc3300 updated the task description. Sep 22 2016, 5:23 PM
joeroe added a subscriber: joeroe. Sep 22 2016, 5:38 PM

I see a lower bound was discussed here, but was it implemented? AddBot has been tagging prehistoric dates, which are likely to be estimates based on radiocarbon or other absolute dating methods with error margins of decades or centuries, so the Gregorian/Julian distinction is irrelevant.

The tree seems to be about processing them, not tagging

When the tree was designed, no lower bound was added, as some errors that could often occur while entering dates (manually, not by bot) could result in very unexpected actual dates.

Thus all of the dates tagged should be checked, but there is a high probability that many prehistoric dates are correct.

The bit of extra checking legwork ensures we have not missed anything!

@Addshore Erm, manually checking 155,000 records does not sound like "a bit of extra legwork" to me... I understand that it's hard to think of every edge case, but the lower bound issue was specifically brought up above. What is the point of using a bot if you are going to knowingly introduce such a large number of errors that all of its work needs to be checked by humans?

I suggest marking only the most obvious cases now. We can still discuss marking less obvious cases for manual checking later.

Obvious according to what criteria?
According to the flow chart, this task is only tagging the 2 obvious cases.
I can easily add a lower bound from now on (and remove the qualifier from any statements below the bound that I have already tagged) if that is desired.

But again, this risks leaving a bunch of bad dates that will not get reviewed and may never even be spotted. Adding this qualifier makes the dates that need attention easy to spot and easy to write a tool for, but does not affect any use of the data!
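As a rough idea of how a tool could pick out the tagged statements, the sketch below scans one entity's JSON for statements carrying an "instance of" (P31) qualifier that points at the marker item. The item ID `Q00000000` is a hypothetical placeholder, since the real ID of the marker item is not given in this thread, and `tagged_statement_guids` is a made-up helper name.

```python
MARKER_ITEM = "Q00000000"  # hypothetical placeholder for the marker item's ID


def tagged_statement_guids(entity, marker_item=MARKER_ITEM):
    """Return the GUIDs of statements whose P31 qualifier points at the marker item."""
    guids = []
    for statements in entity.get("claims", {}).values():
        for statement in statements:
            # Qualifiers are grouped by property ID; P31 is "instance of".
            for qualifier in statement.get("qualifiers", {}).get("P31", []):
                value = qualifier.get("datavalue", {}).get("value", {})
                if value.get("id") == marker_item:
                    guids.append(statement["id"])
                    break  # one match per statement is enough
    return guids
```

The input is assumed to be a single entity object as found under `entities` in the output of Special:EntityData.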

So most statements have already been tagged in the first run today:

A SPARQL query for statements with mainsnak date marked as Julian that is more precise than 1 year can be found at http://tinyurl.com/j7f678d (24,956 in the list, 34 ish to be tagged)
And a query for statements with mainsnak date marked as Gregorian before 1584 can be found at http://tinyurl.com/jojf69f (88,402 in the list, 42k ish to be tagged)

The tagging has already proved to be a great help, for example identifying https://www.wikidata.org/wiki/Q19265 which will likely lead to 1500 tags that can easily be removed.
I think I will continue the tagging in the morning (in 12 hours ish) and then we can work on systematically bringing the numbers down.

That reminds me that QuickStatements adds all dates as Gregorian ..

Esc3300 added a comment. Edited Sep 23 2016, 5:32 AM

Would you tag all dates

  • before year 1 (years BC)
  • with a precision of 9 (year) or higher

as well? Obviously, if they are not tagged for some other reason already.

These may (or may not) be off by 1 year.

Esc3300 updated the task description. Sep 23 2016, 5:37 AM
Esc3300 updated the task description. Sep 23 2016, 5:40 AM

Hmm, I could do, could you provide some examples?

Esc3300 added a comment. Edited Sep 23 2016, 9:18 AM

It seems that day precision dates already get marked, so the sample "Cleopatra" in T129823 is taken care of and the dates for Japanese emperors are not in mainsnak so they wouldn't be tagged anyways.

Looking at some of the P570 values (658 items) and P569 values (307 items), it seems that most are not problematic. So it might not be needed.

Addshore updated the task description. Sep 23 2016, 10:40 AM

Okay, dates from the initial 2 planned lists are all tagged.
I have also updated the query links to include some extra information.

List 1 stands at 24956 Results in 4370 ms (statement with mainsnak date marked as Julian that is more precise than 1 year)
List 2 stands at 130714 Results in 16262 ms (statement with mainsnak date marked as Gregorian before 1584)

Addshore moved this task from In Progress to Needs Review on the User-Addshore board.
Jc3s5h added a comment. Edited Sep 23 2016, 2:33 PM

@Addshore asked for some examples of off-by-one years where the year is less than one.

One example is Pacorus I of Parthia who died in 38 BC. The accepted value in Wikidata is 38 BC, but a pending edit shows 37 BC, even though the reference in the pending edit shows 38 BC. This indicates either the editor requesting the pending edit was confused, or the tool the editor was using was faulty.

A similar example is Emperor Ai of Han.

@Addshore I'm not very happy you resumed running the bot after I objected. Wasn't @Ymblanter clear in his approval? My whole watchlist is littered with edits like https://www.wikidata.org/w/index.php?title=Q19335975&curid=20943120&diff=378556950&oldid=378180294 .

The bot followed this tree up to the first level, at which point human intervention is needed.

@Multichill the two edits that you have linked appear to be for statements that have no references. Per the decision tree we should try and find references for these to ensure they are correct.

And I think it is most unlikely that the dates are correct. I have never encountered books, encyclopedias, etc. that state dates as Gregorian dates even though the Julian calendar was in force at the time and place of the events, as is the case for the two edits complained about.

Jc3s5h added a comment. Edited Sep 23 2016, 7:45 PM

@Jc3s5h This should be as simple as removing the instance of qualifier.
We of course have the lists of all statements that will be touched in this run of the script.
If we do ever do a future run we will still have that list of guids that we can avoid!

I notice you said lists of all statements touched. What if a statement is found to be incorrect and is corrected? For example, it is a Gregorian date and, after consulting the supporting source, it is changed to Julian. Can we be assured a future run will not then mark it as a suspicious Julian calendar date with day precision?

Jc3s5h added a comment. Edited Sep 23 2016, 8:07 PM

@Jc3s5h This should be as simple as removing the instance of qualifier.
We of course have the lists of all statements that will be touched in this run of the script.
If we do ever do a future run we will still have that list of guids that we can avoid!

In the posts at https://www.wikidata.org/wiki/User_talk:Succu#Questionable_revert user Succu seems to feel that verifying entries, and memorializing the fact that they have been verified by removing the appropriate properties, is not allowable until some sort of public announcement is made.

Yes please remove them where the statement was checked.

The bot followed this tree up to the first level, at which point human intervention is needed.

Naively, I thought you had found a way to do that ;) Maybe a task for WikiReading?

@Jc3s5h This should be as simple as removing the instance of qualifier.
We of course have the lists of all statements that will be touched in this run of the script.
If we do ever do a future run we will still have that list of guids that we can avoid!

I notice you said lists of all statements touched. What if a statement is found to be incorrect and is corrected? For example, it is a Gregorian date and, after consulting the supporting source, it is changed to Julian. Can we be assured a future run will not then mark it as a suspicious Julian calendar date with day precision?

Yes, so once added, the qualifiers are safe to remove; they will not be re-added by a future run, as I have a list of GUIDs to be skipped in the future! (The lists are essentially the same as the lists published earlier in this ticket.)
The GUIDs will remain the same throughout changes to the statement. The one exception is when items are merged, where statements are moved from one item ID to another.

Jc3s5h added a comment. Edited Sep 25 2016, 12:11 PM

I'm not familiar with how the GUIDs are generated. Will statements be protected from future runs if, for example, the correction is carried out by deleting the incorrect birth date and creating a new birth date property? Or if there are two birth dates, one of which is marked and incorrect, the other is unmarked but incorrect. So the marked incorrect one is deleted and the unmarked incorrect one is fixed, say, by making what was an AD 1600 Gregorian birth date a Julian date.

I would add that this run is suitable enough as a one-time run, but utterly unacceptable as a maintenance task. We want people to add Julian dates with day precision when that is what the source says.

The source of the dates are the values stored in the backend JSON, so what is in the database, and what you can see through the API.

Yes, but the lists that are now provided in the task description use Wikidata Query Service, which in turn relies on RDF Dump Format. I have not experimented to discover if this will hide any instances. It will cause the dates displayed in the list (if the date contained in the database is Julian) to be converted from Julian to Gregorian, if the RDF related software is capable of doing the conversion.
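For readers who want to see what a Julian to Gregorian conversion involves, here is a minimal sketch that goes through the Julian Day Number using standard integer arithmetic. It is not the conversion code used by the RDF export or the query service, whose implementation is not shown in this thread; the function names are made up, and years are assumed to use astronomical numbering.

```python
def julian_to_jdn(year, month, day):
    """Julian Day Number for a date in the proleptic Julian calendar."""
    a = (14 - month) // 12          # 1 for January and February, else 0
    y = year + 4800 - a
    m = month + 12 * a - 3          # March = 0, ..., February = 11
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def jdn_to_gregorian(jdn):
    """Proleptic Gregorian (year, month, day) for a non-negative Julian Day Number."""
    f = jdn + 1401 + (((4 * jdn + 274277) // 146097) * 3) // 4 - 38
    e = 4 * f + 3
    g = (e % 1461) // 4
    h = 5 * g + 2
    day = (h % 153) // 5 + 1
    month = (h // 153 + 2) % 12 + 1
    year = e // 1461 - 4716 + (14 - month) // 12
    return year, month, day


def julian_to_gregorian(year, month, day):
    """Convert a Julian calendar date to the same day in the Gregorian calendar."""
    return jdn_to_gregorian(julian_to_jdn(year, month, day))


print(julian_to_gregorian(1582, 10, 5))  # (1582, 10, 15)
```

The printed example is the day after Julian 4 October 1582, which is Gregorian 15 October 1582, the earliest date on which the change became effective.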

the correction is carried out by deleting the incorrect birth date and creating a new birth date property

Deleting the statement and creating a new one will result in a new GUID.
Changing the value of a statement will result in the same GUID.
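The skip-list behaviour described in this thread can be sketched as a simple set lookup on statement GUIDs. This is an illustration only, not the bot's actual code; the function name and the GUIDs in the example are made up.

```python
def statements_to_tag(candidate_guids, previously_tagged_guids):
    """Return candidate GUIDs that were not touched in an earlier run."""
    skip = set(previously_tagged_guids)
    return [guid for guid in candidate_guids if guid not in skip]


# Made-up GUIDs: the first was tagged in an earlier run, the second is new.
earlier_run = ["Q1$11111111-aaaa-bbbb-cccc-000000000001"]
candidates = [
    "Q1$11111111-aaaa-bbbb-cccc-000000000001",  # value edited in place, same GUID
    "Q1$22222222-aaaa-bbbb-cccc-000000000002",  # deleted and recreated, new GUID
]
print(statements_to_tag(candidates, earlier_run))
```

Only the second GUID is returned: a statement whose value was changed in place keeps its GUID and is skipped, while a statement that was deleted and recreated has a new GUID and would be considered again.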

Or if there are two birth dates, one of which is marked and incorrect, the other is unmarked but incorrect. So the marked incorrect one is deleted and the unmarked incorrect one is fixed, say, by making what was an AD 1600 Gregorian birth date a Julian date.

It sounds like the corrected one (if it met one of the criteria of the script) would be tagged again on a new run.

I would add that this run is suitable enough as a one-time run, but utterly unacceptable as a maintenance task.

I agree this should be a one-time thing. As said above, Wikibase itself previously made the whole calendar area very confusing; as I understand it, this is now mostly resolved, so it should be less of an issue in the future.

Lydia_Pintscher closed this task as Resolved. Oct 4 2016, 1:38 PM
Jarekt added a subscriber: Jarekt. Sep 21 2017, 4:44 PM