
improve handling of dates - calendar models and precision
Closed, Resolved · Public · 8 Estimated Story Points

Description

As a mismatch provider, I want to provide accurate mismatches concerning date values via the CSV import.

Problem:
Mismatch Finder currently does not take the calendar model and precision of dates into account when importing mismatches via CSV, even though both are important for determining the exact meaning of a date.

Current state:

  • We accept ISO-style dates in the wikidata_value column of the CSV – without calendar model or precision

Desired state in the future:

  • To enable different precisions for dates, we would like to be able to infer a date's precision from the format it is given in: e.g. 1950s would imply a precision of decade (8). A sketch follows the precision examples below.
  • To enable different calendar models, we would like to infer the most likely calendar model in the same way Wikibase does. When mismatch providers want to override this, we want to give them the option to do so by providing the calendar model as an additional field in the CSV import.

Example:

Different Precisions:

  • 2022-07-14 -> precision day
  • 1950-05 -> precision month
  • 1950 -> precision year
  • 1950s -> precision decade
  • 19. century -> precision century
  • 2. millennium -> precision millennium
  • 2022-00-00 -> precision year
  • 1950-05-00 -> precision month
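
For illustration, a minimal sketch of such a format-to-precision mapping (a hypothetical helper using Wikidata's precision codes, not the actual implementation – in practice this can be delegated to the wbparsevalue API, see the task breakdown notes below):

import re

# Wikidata precision codes: 6 = millennium, 7 = century, 8 = decade,
# 9 = year, 10 = month, 11 = day
PRECISION_PATTERNS = [
    (re.compile(r"^\d{4}-00-00$"), 9),           # 2022-00-00 -> year
    (re.compile(r"^\d{4}-\d{2}-00$"), 10),       # 1950-05-00 -> month
    (re.compile(r"^\d{4}-\d{2}-\d{2}$"), 11),    # 2022-07-14 -> day
    (re.compile(r"^\d{4}-\d{2}$"), 10),          # 1950-05 -> month
    (re.compile(r"^\d{4}$"), 9),                 # 1950 -> year
    (re.compile(r"^\d{3}0s$"), 8),               # 1950s -> decade
    (re.compile(r"^\d{1,2}\. century$"), 7),     # 19. century -> century
    (re.compile(r"^\d{1,2}\. millennium$"), 6),  # 2. millennium -> millennium
]

def infer_precision(value: str) -> int:
    # Zero-month/zero-day placeholders are checked first, so that
    # 2022-00-00 is not mistaken for a day-precision date.
    for pattern, precision in PRECISION_PATTERNS:
        if pattern.match(value):
            return precision
    raise ValueError(f"unsupported date format: {value}")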

Explicit Calendar Model:

statement_guid,property_id,wikidata_value,meta_wikidata_value,external_value,external_url
Q184746$7814880A-A6EF-40EC-885E-F46DD58C8DC5,P569,1046-04-03,Q12138,3 April 1934,http://fake.source.url/12345

Implicit Calendar Model:

statement_guid,property_id,wikidata_value,meta_wikidata_value,external_value,external_url
Q184746$7814880A-A6EF-40EC-885E-F46DD58C8DC5,P569,1934-04-03,,3 April 1934,http://fake.source.url/12345
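
As an illustration of the intended column semantics, a hedged sketch of reading such rows (hypothetical code, not the actual import validator – an empty meta_wikidata_value means the calendar model should be inferred):

import csv

def read_mismatches(path: str):
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Empty meta_wikidata_value: infer the calendar model from
            # the date, the same way Wikibase does.
            calendar_item = row.get("meta_wikidata_value") or None
            yield {
                "statement_guid": row["statement_guid"],
                "property_id": row["property_id"],
                "wikidata_value": row["wikidata_value"],
                "calendar_model": calendar_item,  # e.g. "Q12138" as above
                "external_value": row["external_value"],
                "external_url": row.get("external_url", ""),
            }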

Acceptance criteria:

  • documentation and example for mismatch providers is updated
  • various date formats are accepted as listed above and the precision is inferred automatically
  • various dates are accepted and the calendar model is inferred in the same way as Wikibase does it, unless the calendar model is explicitly specified in the mismatch

Notes:
The above only applies to the Wikidata side of the mismatch ("wikidata_value" in the uploaded CSV). The external source side ("external_value") may contain whatever the provider wants, including the name of the calendar model.

Event Timeline

I believe it's important to handle Wikidata-compliant date objects, which capture precision and Julian/Gregorian calendars.
Currently, the import API service doesn't accept that format, resulting in flawed mismatches for dates with a precision coarser than day.
This was the case for the MusicBrainz dataset (see https://mismatch-finder.toolforge.org/store/imports).

Example CSV row

Currently accepted format:

Q71706$97C92E54-F3C8-4CB3-B868-B22BBD8E2431,P569,1480-01-01,1500-01-01,https://musicbrainz.org/artist/8b149b02-d9df-40db-a1c6-ed28abd0e496

Expected:

Q71706$97C92E54-F3C8-4CB3-B868-B22BBD8E2431,P569,"{'time': '+1480-01-01T00:00:00Z', 'timezone': 0, 'before': 0, 'after': 0, 'precision': 9, 'calendarmodel': 'http://www.wikidata.org/entity/Q1985727'}","{'time': '+1500-01-01T00:00:00Z', 'timezone': 0, 'before': 0, 'after': 0, 'precision': 9, 'calendarmodel': 'http://www.wikidata.org/entity/Q1985786'}",https://musicbrainz.org/artist/8b149b02-d9df-40db-a1c6-ed28abd0e496

Result in the UI

Note that 01-01 is not a real month and day, just a placeholder used to build the timestamp.

Side effect

Dates that are actually equal appear as different values, as they might differ only in their placeholders.

The Wikidata API seems to return different placeholders. For instance:

These shouldn't be different claims, and look like a bug, by the way.
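
To make the side effect concrete, here is a minimal sketch of comparing two Wikibase time values only up to their stated precision, so that placeholder months and days don't produce spurious differences (hypothetical helper, assuming 4-digit years and day-or-coarser precisions):

def times_equal(a: dict, b: dict) -> bool:
    if a["precision"] != b["precision"]:
        return False
    if a["calendarmodel"] != b["calendarmodel"]:
        return False
    # Timestamp layout is +YYYY-MM-DDTHH:MM:SSZ; compare only the
    # characters covered by the precision (9 = year, 10 = month, 11 = day).
    cutoffs = {9: 5, 10: 8, 11: 11}  # "+YYYY", "+YYYY-MM", "+YYYY-MM-DD"
    n = cutoffs.get(a["precision"], len(a["time"]))
    return a["time"][:n] == b["time"][:n]

# Same year, different placeholders -> considered equal:
x = {"time": "+1480-01-01T00:00:00Z", "precision": 9,
     "calendarmodel": "http://www.wikidata.org/entity/Q1985727"}
y = {"time": "+1480-00-00T00:00:00Z", "precision": 9,
     "calendarmodel": "http://www.wikidata.org/entity/Q1985727"}
assert times_equal(x, y)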

That was so tricky to explain, I hope it's clear.

Lydia_Pintscher renamed this task from "figure out calendar model handling" to "improve handling of dates - calendar models and precision". Jul 5 2022, 1:23 PM
Lydia_Pintscher updated the task description.

Task breakdown notes:

To infer the precision and calendar model, we can probably just use the wbparsevalue API without reimplementing this logic (see example API sandbox).
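
For illustration, a minimal call to that API from Python against the public endpoint (the expected precision follows the examples in the description; the exact integration in Mismatch Finder may differ):

import requests

response = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "wbparsevalue",
        "datatype": "time",
        "values": "1950s",
        "format": "json",
    },
)
value = response.json()["results"][0]["value"]
print(value["precision"])      # expected: 8 (decade)
print(value["calendarmodel"])  # e.g. http://www.wikidata.org/entity/Q1985727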

We also think this needs to go through story writing. What happens with the calendar model and precision? In what way are dates currently stored, and how are we changing that? What does the result look like for users? (Unfortunately, the item linked by @Hjfocs no longer has any mismatches.)

Currently, we don’t have enough information to estimate this task.

We also think this needs to go through story writing. What happens with the calendar model and precision? In what way are dates currently stored, and how are we changing that? What does the result look like for users?

Currently, we don’t have enough information to estimate this task.

I'll try and answer the questions as I understand them.

  • "What Happens with the calendar model and precision?" Can you elaborate on that? I'm not sure what expectation or context you are considering here?
  • "In what way are dates currently stored, and how are we changing that?" Currently, dates (as all wikidata values) are stored as a string (as they can be arbitrarily any type wikidata supports). See 2021_07_19_123858_create_mismatches_table.php. What this change most probably will require us to do, is to add another column to this table to be able to store arbitrary meta information about a value (in this case, it will be calendar model, in the geocoordinates case, it will be globe). Additionally, we will have to adapt the csv import validator to allow for this extra field in the CSV, as described in the task.
  • "What does the result look like for users?" I would assume the result for each Wikidata Value will look like the user inputted it, as specified in the list at the description of this task, there's no manipulation on the display of dates done on our part, only validation. However, this question is important, and probably needs some UI work with regard to properly visualizing the calendar model. I think @Lydia_Pintscher and I discussed this in the past.

I think the first question is more or less the combination of the second and third. We know that we want to change the way dates are stored, but we also need to know what we’ll then do with this information, how we’ll show it to users – otherwise, if there’s no user-observable difference, we might as well save ourselves the effort ;) and I think the ACs in the current task description are kind of glossing over the last part.

It would still be great if we could have a screenshot of the status quo, since the given example no longer works:

image.png (398×732 px, 56 KB)

[...] We know that we want to change the way dates are stored [...]

I don't think that is true. Currently, wikidata values are stored as strings, and we want to keep it that way. Most probably the only thing that needs to change is the addition of a column to the expected CSV format and potentially something in the validation step to take the calendar model information included in that column into account.

As for the display, I agree, we need to specify this a little better.

Sorry, I only meant the added column as the storage change. (The added CSV column probably corresponds directly to an added column in the mismatches table as well?)

Yeah, I think that's the right approach, at least that's the first one I'd go with ;)

As for how to show it: I thought we'd show it the same way as on Wikidata with the same rules? So for example a date that we'd usually parse as Gregorian but is set to Julian would show the calendar model in superscript. Like here:

image.png (158×925 px, 12 KB)

Here’s an example of what Wikidata dates in the Mismatch Finder currently look like (URL):

image.png (571×1 px, 91 KB)

(Found via: SELECT item_id FROM mismatches JOIN import_meta ON mismatches.import_id = import_meta.id WHERE mismatches.property_id = 'P569' AND import_meta.expires > '2022-10-18' LIMIT 10;)

Task breakdown notes:

  • In the input, we only need to support English (e.g. 19. century but not 19. Jahrhundert)
  • During import, we already parse the value to ensure it’s valid (checkValueErrors, and passes parseValue)
    • TODO: Does that mean 19. century should already be accepted, since wbparsevalue would be able to parse it? If not, where does it currently get rejected?
    • The parse result is cached
  • In the database, we should store the original string (e.g. 19. century), just like before: the database has the same content as the CSV file
  • When showing a mismatch, we should parse it again (using the API); replace the calendar model, if an explicit calendar was given in the meta column; then format it (again using the API) in the user’s UI language, and show that
    • The decision of whether to show the calendar model or not is up to Wikibase

Plan of action:

  • parse and then format date values when showing mismatches (T321173)
  • check if 19. century is already accepted as input; if not, figure out where it gets rejected
    • if 19. century is not yet accepted as input: make the necessary changes so it’s accepted (subtask to be created depending on the result of the above bullet point)
  • add meta_wikidata_value column to the CSV file format and database (T321165)
    • should be optional
    • validation: only allowed for “Time” properties, must contain a single item ID
  • use meta_wikidata_value to override parsed value’s calendar model before formatting, when showing mismatches (T321174)
    • parse value (Wikibase API)
    • if meta_wikidata_value given, replace parsed value’s calendar model with meta_wikidata_value
    • format value (Wikibase API)
    • show result
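
A hedged end-to-end sketch of that last bullet group (hypothetical helper name; wbparsevalue and wbformatvalue are existing Wikibase API modules, but the integration shown here is illustrative):

import json
import requests

API = "https://www.wikidata.org/w/api.php"

def render_wikidata_time(raw_value, meta_wikidata_value=None, lang="en"):
    # 1. Parse the raw CSV value (Wikibase API).
    parsed = requests.get(API, params={
        "action": "wbparsevalue", "datatype": "time",
        "values": raw_value, "format": "json",
    }).json()["results"][0]["value"]

    # 2. If meta_wikidata_value was given, override the calendar model.
    if meta_wikidata_value:
        parsed["calendarmodel"] = (
            "http://www.wikidata.org/entity/" + meta_wikidata_value
        )

    # 3. Format the value in the user's UI language (Wikibase API).
    return requests.get(API, params={
        "action": "wbformatvalue", "generate": "text/plain",
        "datavalue": json.dumps({"type": "time", "value": parsed}),
        "datatype": "time", "uselang": lang, "format": "json",
    }).json()["result"]

# 4. Show the result, e.g. render_wikidata_time("1046-04-03", "Q12138");
# whether the calendar model appears in the output is up to Wikibase.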

check if 19. century is already accepted as input; if not, figure out where it gets rejected

I tested it locally – it is indeed accepted. You can already use 19. century (and, I assume, any other value that will pass through wbparsevalue, though I haven’t tested this) as the wikidata_value.

For the sprint 9 planning: we estimate that 5 story points remain.

We should allow CSV files without the meta_wikidata_value column, because we are adding it as optional and requiring it would be bad for backward compatibility. However, we have another optional column in our CSV files, external_url, and unlike meta_wikidata_value, the CSV file must currently have that column even if there is no value for the external_url field. I think we should treat these columns the same because they have similar attributes: we should allow CSV files without the external_url column, just like with meta_wikidata_value. Otherwise we have two optional columns, but one is not as optional as the other. I wonder if you agree @Lydia_Pintscher?

I believe we said the meta_wikidata_value column should always be present but it can be empty. Does that address your concern @HasanAkgun_WMDE?

I believe we said the meta_wikidata_value column should always be present but it can be empty. Does that address your concern @HasanAkgun_WMDE?

When we do that, we break the existing mechanism: as soon as we ship the code, everyone will start to get errors because they don't have the meta_wikidata_value column in their CSV file. I mean, it will cause backward compatibility errors. @Lydia_Pintscher

Yes. The affected people have already been informed about that.

Oh, that's great! I'll proceed like you said then, thanks a lot!
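
A small sketch of the agreed rule (hypothetical validator: all columns, including meta_wikidata_value, must be present as headers, while individual rows may leave the optional values empty):

import csv

REQUIRED_HEADERS = {
    "statement_guid", "property_id", "wikidata_value",
    "meta_wikidata_value", "external_value", "external_url",
}

def validate_headers(path: str) -> None:
    with open(path, newline="", encoding="utf-8") as f:
        headers = set(next(csv.reader(f)))
    missing = REQUIRED_HEADERS - headers
    if missing:
        raise ValueError("missing CSV column(s): " + ", ".join(sorted(missing)))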

Michael subscribed.

This should be verifiable on Wednesday, after the train has reached Wikidata.