
Ability to report mismatches on qualifiers
Closed, Resolved | Public | 13 Estimated Story Points

Authored By
Lydia_Pintscher
Jul 21 2022, 7:39 AM
Referenced Files
F38205434: Screenshot 2023-10-11 at 14.40.55.png
Oct 11 2023, 12:46 PM
F37974560: Screenshot 2023-10-04 at 18.28.36.png
Oct 4 2023, 4:52 PM
F37541501: image.png
Aug 17 2023, 11:06 AM
F35794451: T313469.png
Nov 18 2022, 12:21 PM
F35794447: T313469.png
Nov 18 2022, 12:21 PM
F35794168: T313469.png
Nov 18 2022, 11:05 AM
F35794133: T313469.png
Nov 18 2022, 11:05 AM
F35794129: T313469.png
Nov 18 2022, 11:05 AM

Description

As a mismatch provider I want to be able to report mismatches on Mismatch Finder to data that is stored in qualifiers in order to improve data quality of all data, not just data stored in the main part of a statement.

Problem:
We currently don't allow reporting of mismatches on qualifiers on the Mismatch Finder. These should be accepted, as important data is stored there as well.

Example for how the upload csv could look now:
CSV for a Q42 statement with qualifier, where both the statement and the qualifier are mismatched:

item_id,statement_guid,property_id,wikidata_value,external_value,external_url,type
Q42,Q42$A3B1288B-67A9-4491-A3AA-20F881C292B9,P3373,Q14623673,"Shoshanna Adams",example.com,statement
Q42,Q42$A3B1288B-67A9-4491-A3AA-20F881C292B9,P1039,Q10943095,"cousin",example.com,qualifier

CSV for a Q42 statement with qualifier, where only the qualifier is mismatched:

item_id,statement_guid,property_id,wikidata_value,external_value,external_url,type
Q42,Q42$A3B1288B-67A9-4491-A3AA-20F881C292B9,P1039,Q10943095,"cousin",example.com,qualifier
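
To make the proposed format concrete, here is a minimal Python sketch of how an uploader might check such a file locally before submitting it. It is not the Mismatch Finder's own parser. The names `read_mismatches`, `ALLOWED_TYPES` and `DEFAULT_TYPE` are hypothetical, the set of allowed types is assumed from the examples above, and the fallback to "statement" for an empty or missing type is an assumption based on the note further down in this description.

```python
import csv
import io

# Assumed from the examples in this task; not taken from the real validator.
ALLOWED_TYPES = {"statement", "qualifier"}
DEFAULT_TYPE = "statement"


def read_mismatches(csv_text):
    """Parse upload rows, filling in a missing or empty type with the default."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # row.get() returns None when the column is absent from the header.
        mismatch_type = (row.get("type") or "").strip() or DEFAULT_TYPE
        if mismatch_type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported type: {mismatch_type!r}")
        row["type"] = mismatch_type
        rows.append(row)
    return rows


SAMPLE = (
    "item_id,statement_guid,property_id,wikidata_value,"
    "external_value,external_url,type\n"
    'Q42,Q42$A3B1288B-67A9-4491-A3AA-20F881C292B9,P3373,Q14623673,'
    '"Shoshanna Adams",example.com,statement\n'
    'Q42,Q42$A3B1288B-67A9-4491-A3AA-20F881C292B9,P1039,Q10943095,'
    '"cousin",example.com,qualifier\n'
)

for mismatch in read_mismatches(SAMPLE):
    print(mismatch["property_id"], mismatch["type"])
```

Both rows share the same statement_guid; only the type value tells the statement mismatch apart from the qualifier mismatch.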

For uploaders to understand that qualifiers are now accepted in their csv files, we need to update the Mismatch Finder User Guide documentation here: https://github.com/wmde/wikidata-mismatch-finder/blob/main/docs/UserGuide.md#importing

Screenshots/mockups:

1a:

image.png (2×2 px, 369 KB)

Figma file

BDD
GIVEN a qualifier
AND a mismatch
WHEN a CSV of a mismatch is uploaded to the Mismatch Finder
AND it contains mismatches for a qualifier
THEN they are accepted
AND shown on the Mismatch Finder website

Acceptance criteria:

  • mismatches on qualifiers are accepted from the upload CSV
  • mismatches on qualifiers are correctly shown in the Mismatch Finder website
  • the Mismatch Finder User Guide is updated to reflect that qualifiers are accepted as mismatches

Open questions:
How do we display qualifier mismatches on the results page?

  • Should we show claim as statement in the table, or change the type in the csv to statement? As there is no type available yet, it won't break anything if we make one of the available types statement instead of claim so that it reflects the UI on Wikidata.
  • Do we require two tickets here (one for uploaders and one for reviewers)? For uploaders we would need to update the documentation here: https://github.com/wmde/wikidata-mismatch-finder/blob/main/docs/UserGuide.md#importing
  • Should the title of the first column be Mismatch? No, UX will look into some other options that are more clear

Notes
For previous uploads with no type value, the value should be statement

Event Timeline

Qualifier mismatches: Request for feedback on designs

This task, as well as T313469: Allow reporting of mismatches for labels, descriptions and aliases, requires the redesign of the Mismatch Finder's results page and/or table. These should now allow users to visualize and set the status of the new types of mismatches: terminological data (labels in all languages, aliases, potentially descriptions) and qualifier properties.

The initial design exploration aimed at being explicit and providing as much context and differentiation as possible about the new mismatch types: this generally involved presenting them in separate tables that displayed descriptive information (e.g. which statement value the qualifier applies to), under the umbrella of their respective item.

Initial proposals to visualize qualifiers on the results page (see designs in Figma)

While it seems safe to assume that the volume of qualifier mismatches is going to be quite low in comparison to property mismatches, we thought it was necessary to provide clarity around what these are, and guidance on where to find them. Those intentions motivated the following initial designs:

Option 1: Display mismatches on qualifiers on a separate table
T313467_ Example A, item.png (2×2 px, 235 KB)
Description: A separate table collects mismatches on qualifiers related to the same item. For guidance, the table's title indicates the statement that the qualifier belongs to.

Option 2: Display mismatches on qualifiers in the mismatches table (new column)
T313467_ Example A, item (1).png (1×2 px, 213 KB)
Description: In this version, a new “Qualifier of” column is added to a table iff a qualifier mismatch is present for a particular value of an item’s property.

Nevertheless, after further iteration, product and design decided to try to group all mismatch types (properties, qualifiers and terms) in one table instead of using several. So, for the sake of simplicity, the latest direction uses a single table to display and specify all the mismatch types:

Latest iteration (see designs in Figma)

Option 1: A new column specifies the type of mismatch (two possible placements)
1a: T313469.png (2×2 px, 233 KB)
1b: T313469.png (2×2 px, 233 KB)
Description: A new column is introduced in the mismatch table to document the ‘type’ of the listed mismatches (or 'where' they are). In option 1a, the new ‘mismatch type’ column is placed to the right of the mismatch column. In option 1b, the column is placed at the beginning of the table.

Option 2: Descriptive text provides mismatch type inside the Mismatch column
T313469.png (2×2 px, 236 KB)
Description: In this version, a subtle descriptive text indicates the type of mismatch from within the Mismatch column.

Questions regarding the latest design iteration:

  1. Regarding Option 1: The placement of the mismatch type column (before or after mismatches) and its title (currently: 'type') need validation.
  2. Are the mismatch types clear? Or should we say "statement property", "value qualifier"?
  3. Is this last iteration providing sufficient context regarding qualifiers and their meaning? Or would users require extra information to understand the meaning?
  4. Displaying a fixed column or description could feel redundant when a table only contains a single type of mismatch (i.e. most likely properties).

  5. Overall question about including qualifier mismatches in the mismatch finder: How common is it for external databases to use qualifiers? 
What’s the chance for the same qualifier property to be specified for the same value? 
The most useful/obvious use case I can think of would be comparing WB instances with WD, but is that the purpose of the Mismatch finder?

Looking at these again much much later, I think option 1 would be the best way forward.

Option 2 would be a little too busy and adds an extra layer of hierarchy.

In answer to your questions:

  1. we can discuss validation in a meeting
  2. I think "statement" and "qualifier" are used independently enough to be used here
  3. Yes, with the mismatch column providing context, I think this should be clear
  4. Although this could feel redundant for some people, this would also provide the necessary information for people to validate the mismatches
Arian_Bozorg renamed this task from ability to report mismatches on qualifiers to [SW] Ability to report mismatches on qualifiers. Aug 15 2023, 9:24 AM
Arian_Bozorg renamed this task from [SW] Ability to report mismatches on qualifiers to Ability to report mismatches on qualifiers. Aug 24 2023, 12:32 PM
ItamarWMDE set the point value for this task to 13. Sep 13 2023, 1:49 PM

Task Breakdown Notes

  • This will require us to create a database migration and update all of the seeders and factories used in tests
  • Also we will need to update the Mismatch Model, and the CSV Parser to support the new type column
  • Potentially we will have to update the new type column for all previous unexpired mismatches to the value: "statement"
  • Requires us to change the UI Components to support the new table
  • It seems like the type column on the CSV upload will be mandatory from this change onwards, is that correct? @Arian_Bozorg
  • In case this is true, we need to either ensure that we communicate the change to uploaders in due time / add some sort of backwards compatible solution (for instance a new version number in the new format's route and in place API message that notifies users of the upcoming deprecation of the former route)

Potential Plan of Action:

  1. Migrate the db to add a type column
  2. Update the Mismatch Model to support the type property (this includes factories and seeders for the model)
  3. Update the ValidateCSV & Import CSV Jobs
  4. Modify the MismatchRow component to include the new additional information
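
As an illustration of step 1 and the backfill mentioned in the breakdown notes, here is a minimal sketch using Python's built-in sqlite3 module rather than the project's actual migration tooling. The table name `mismatches`, the reduced column set, and the helper `add_type_column` are assumptions made for the example only. The point it shows is that adding the column with a default of "statement" also covers rows that already exist, so no separate update pass is needed in this setup.

```python
import sqlite3


def add_type_column(conn):
    """Add a 'type' column; existing rows pick up the default value."""
    conn.execute(
        "ALTER TABLE mismatches "
        "ADD COLUMN type TEXT NOT NULL DEFAULT 'statement'"
    )
    conn.commit()


# Hypothetical, heavily simplified schema for demonstration purposes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mismatches (id INTEGER PRIMARY KEY, property_id TEXT)"
)
# A mismatch stored before the migration, without any type.
conn.execute("INSERT INTO mismatches (property_id) VALUES ('P3373')")

add_type_column(conn)

# A mismatch stored after the migration, with an explicit type.
conn.execute(
    "INSERT INTO mismatches (property_id, type) VALUES ('P1039', 'qualifier')"
)
types = [row[0] for row in conn.execute("SELECT type FROM mismatches ORDER BY id")]
print(types)
```

Whether the real database backfills through a column default or through an explicit update statement is a choice for the actual migration; the sketch only demonstrates the first variant.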

Thanks for all the notes Itamar :)

It seems like the type column on the CSV upload will be mandatory from this change onwards, is that correct?

Would it be possible to assign uploads that have no type automatically as a main snak / statement? This may help with the transition

If not, we will inform the current Mismatch Finder users ahead of time so that they can update their workflows accordingly.

Potentially we will have to update the new type column for all previous unexpired mismatches to the value: "statement"

Yes, I think this is a good solution here

@Arian_Bozorg The issue is not about the existing mismatches in the database. It's about the difference in the request format and transparency in what we support. We are now adding a new column to the CSV that will have to be validated, and we should be transparent about what is supported or not. If we accept both with and without the type column as legal inputs indefinitely, then the documentation should update accordingly.

Note for Task Breakdown:
Update docs. User Guide

Yes, that makes sense. Then for simplicity, a type column on the CSV upload will be mandatory, and we will inform current Mismatch Finder users so that they can update their workflows accordingly.

Thank you! An option that could accompany this decision, which @HasanAkgun_WMDE came up with, is that in addition to making the column mandatory, we keep supporting the old format for an intermediary period (a few months) and respond with some kind of deprecation warning. That way even people who are not yet aware of our public notification are informed about it, but their uploads still succeed for a while. I'm sure Hasan can explain this better ofc.
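
The transition idea could look roughly like the following Python sketch. It is only an illustration: `check_header`, `OLD_HEADER` and `NEW_HEADER` are hypothetical names, the column list is copied from the example in the task description (real uploads may carry additional columns), and the warning text is invented.

```python
# Column names as in the task description's example; an assumption here.
OLD_HEADER = [
    "item_id",
    "statement_guid",
    "property_id",
    "wikidata_value",
    "external_value",
    "external_url",
]
NEW_HEADER = OLD_HEADER + ["type"]


def check_header(header):
    """Return (accepted, warnings) for the header row of an upload."""
    if header == NEW_HEADER:
        return True, []
    if header == OLD_HEADER:
        # Old format: still accepted during the transition, but flagged.
        return True, [
            "The 'type' column is missing. Uploads without it are "
            "deprecated and will be rejected after the transition period."
        ]
    return False, []


accepted, warnings = check_header(OLD_HEADER)
print(accepted, warnings)
```

Returning the warnings alongside the result lets the upload succeed while still giving the response something to surface to the uploader.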

Relevant PRs:
Add 'type' column migration and model update: https://github.com/wmde/wikidata-mismatch-finder/pull/678
Accept empty values in type column at import time and populate them with 'statement' value if empty and update docs: https://github.com/wmde/wikidata-mismatch-finder/pull/679
Add type column to GUI: https://github.com/wmde/wikidata-mismatch-finder/pull/687

This is not done, there's still the UI part to complete

omg! this is true!!!

No worries, honest mistake :) I know the rush of dopamine that comes with moving tickets across the board. I miss it a lot :D

The PR containing the GUI was added to the comment listing all relevant PRs for this ticket here: https://phabricator.wikimedia.org/T313467#9183202

This seems to be live already and broke @Mike_Peel's latest upload it seems :( https://mismatch-finder.toolforge.org/store/imports
And it looks like we just added the type to the end of the upload CSV judging from the error message? That seems suboptimal. I would expect it in second place.

Seems like there was a deployment to production 5 days ago, probably by mistake while working on T345857. I will revert it straight away.

As for the column order @Lydia_Pintscher, we are following what is stated to us in the task description by @Arian_Bozorg. If there are any issues with that, I suggest you sync on tickets before bringing them to the developers.

Thanks!
And yeah I now see that we already have it in the task description. @Arian_Bozorg Let's look into it.

@Mike_Peel @Lydia_Pintscher The system was reverted to the stable version, apologies for the mix-up. Could you attempt the upload again, please?

Hey there. I'm checking the latest changes applied to the table in our staging environment. The column proportions look all good with the addition of "Type" 👍🏻 . Nevertheless, I realized that with the table now getting more crowded, some of the information (particularly that included in the "Upload info" column) becomes harder to read in smaller viewports and before the table is linearized. Here's a screenshot at 765px, to show you what I mean:

Screenshot 2023-10-04 at 18.28.36.png (1×1 px, 156 KB)
.

We're currently using the tablet WiKit breakpoint (720px) to trigger the change in the table component layout (from its regular display to linearized). This value is now too small, and we should probably replace it with a discrete value that prioritizes the correct display of information within the component. My proposal is to linearize the table at an independent breakpoint, starting at 800px. This is to, as mentioned, make sure that we keep its content readable under the described circumstances. Happy to hear your thoughts!

Guergana kindly checked the feasibility of my recommendation and dug up the fact that the WiKit Table is configured to strictly only take tokens as breakpoint values. We're thus unable to specify a custom breakpoint without modifying the implementation of the WiKit component at this point (lesson learned for Codex :). With this in mind, and the migration project already in progress, I think we can wait and implement this improvement when the time to replace/migrate the table comes.

Edit: See T348271: [WtC-M3] Port Table component to Mismatch Finder

I can't do the upload again this week, but will try to run it again some time next week.

Reran today, same error according to https://mismatch-finder.toolforge.org/store/imports

Oh, apologies again, thanks for trying, we will look into this asap!

I discussed it more with Arian and we agree that the type makes sense to go first in the CSV. Should I update the task description accordingly?

And sorry. I initially did it as you implemented it without thinking too much about it -.-

I think the issue should be fixed now, I was able to successfully upload a test file.

Re-run this evening, it seems to be stuck at 'Pending'.

It might have just taken a while for the job to complete. As far as I can see the import succeeded

Ah, great, I see the same now. :)

Design verification done on Chrome 117, Safari 17 and Firefox 11. All looking good 👍🏻

As previously mentioned, there's a display issue related to responsiveness that needs fixing (the linearization of the table should happen at a larger breakpoint). In some browsers, even horizontal scroll is triggered:

Screenshot 2023-10-11 at 14.40.55.png (677×757 px, 105 KB)

Since we shouldn't modify the WiKit component, I would recommend adjusting the table's breakpoint as part of the process to port it to the Mismatch finder (see T348271: [WtC-M3] Port Table component to Mismatch Finder). I would understand if we wanted to make this a separate task, though.

Very sorry, but after speaking with @Lydia_Pintscher, would it be possible to change the format of the csv and make type the first column? Sorry this wasn't resolved before it was entered into the sprint.

Example for how the upload csv could look now:
I will update the task description with the new format as well.

The rest looks good to me, once Sarai gives the ok and assuming the Mismatch Finder User Guide will be updated once we're live

EDIT: examples removed to avoid confusion

Is this the breaking change? I'm currently using 'item_id,statement_guid,property_id,wikidata_value,meta_wikidata_value,external_value,external_url'. I'm not sure what a guid is? Will 'type' be optional - in which case it is better at the end rather than the start?

Hi Mike,

Yes, this will be a breaking change when it is released. We will be in touch with you when it is scheduled to go live so that you can update your workflows accordingly.

And thank you so much for your feedback on the type column, with that in mind we can refer back to the original format as we had it. With both the statement_guid and type column as optional.

Example for how the upload csv could look now:
CSV for a Q42 statement with qualifier, where both the statement and the qualifier are mismatched:

item_id,statement_guid,property_id,wikidata_value,external_value,external_url,type
Q42,Q42$A3B1288B-67A9-4491-A3AA-20F881C292B9,P3373,Q14623673,"Shoshanna Adams",example.com,statement
Q42,Q42$A3B1288B-67A9-4491-A3AA-20F881C292B9,P1039,Q10943095,"cousin",example.com,qualifier

CSV for a Q42 statement with qualifier, where only the qualifier is mismatched:

item_id,statement_guid,property_id,wikidata_value,external_value,external_url,type
Q42,Q42$A3B1288B-67A9-4491-A3AA-20F881C292B9,P1039,Q10943095,"cousin",example.com,qualifier

I wanted to express the following concern here again, just in case someone agrees on the need to find a mitigation factor:

If type is optional, and its default value is "statement", then in case type is not provided users of the Mismatch finder will be misguided when trying to correct mismatches on qualifiers and terms: labels, descriptions, aliases (a mismatch type that we were planning to introduce in the future, see T313469: Allow reporting of mismatches for labels, descriptions and aliases).

Mitigation factors I can think of to prevent misleading users: 1) Make type required; 2) If kept optional, then rather indicate it in the table when a value hasn't been specified instead (e.g. "undefined").

We're just talking about the upload data. The system would assume type to be statement for the ones where it is not specified and also show statement to the user. Or are you talking about something else?

I'm talking about that, yes. Unless I'm missing something, if the missing value of type is set by default to "statement" and displayed like so in the results table, then that will be incorrect/ misleading for users trying to fix mismatches that actually are on qualifiers and/or terms.

E.g:

Mismatch       | Type      | Value on WD  | Value on ext
field of work* | Statement | ethology     | anthropology
English        | Statement | Jane Goodall | Jane Goodal

*But this is actually used as a qualifier

Ah. Yeah the uploader will have to specify it if they are not uploading mismatches for statements. If they are not doing that then their mismatches will indeed be wrong and will need to be fixed. We're fairly sure that most mismatches will be on statements though so for convenience we said it can be left empty.
As for mismatches on labels, descriptions, aliases: We'll have to see but they might actually need to go into a separate CSV from the brief look Arian and I had at it. Adding them to the current one would blow up the number of columns quite a lot, most of them being empty most of the time.

If you're going to be supporting mismatches on labels, descriptions, aliases (which would be good!), then making type required might make sense - and making changes like that while the tool is still in early stages is good (or at least, better than later!). There are ~3.6m mismatched English descriptions at https://en.wikipedia.org/wiki/Category:Short_description_is_different_from_Wikidata that I could start automatically uploading when that becomes possible.

Please do let me know when I need to update my code - ideally in the first half of a month since it runs at month-start!

Thanks for the feedback Mike! Looks like we will keep the type column in the last position and it will remain optional for now.

This will be released on 1 November, so should work for your first half of the month schedule :)

Ah, my code runs on the 1st of every month, so that's the perfect storm... should I run it early this month, or could you release a day or two later?

We can definitely release it a day or two later. This will give us a chance to review the release before it goes out.

Thanks!