
Ignore already reviewed mismatches on reupload
Closed, Resolved · Public · 8 Estimated Story Points

Description

As a mismatch reviewer, I don't want to waste my time reviewing mismatches that have already been reviewed by someone else.
As a mismatch provider, I want the Mismatch Finder to ignore mismatches from my previous uploads that have already been reviewed, so that I don't have to remove these mismatches from my upload myself.

Problem:
Currently, previously reviewed mismatches will show up again if they are reuploaded and will have to be reviewed again. This is not a good use of reviewer time. We can solve this issue by simply ignoring mismatches in new uploads that we have already marked as reviewed in previous uploads. This only applies to mismatches that are exactly the same (i.e. only the expiry date may differ) and come from the same mismatch provider.

(Removing the already reviewed mismatches is possible for the mismatch provider now that they have access to the review decisions (T329156). However, this is additional work we would be putting on uploaders that should be avoided.)
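
A minimal sketch of the skip rule described above, assuming a Laravel/Eloquent setup like the existing codebase. The `Mismatch` model, its column names, the `importMeta` relation, and `'pending'` as the un-reviewed status are assumptions for illustration, not the actual schema:

```php
<?php
// Hypothetical sketch: decide whether an incoming row should be skipped.
// Model name, column names and the importMeta->user relation are assumptions.

use App\Models\Mismatch; // assumed model

/**
 * Returns true if an identical mismatch from the same uploader has already
 * been reviewed (any status other than 'pending'). The expiry date is
 * deliberately NOT part of the comparison.
 */
function shouldSkip(array $row, int $uploaderId): bool
{
    return Mismatch::query()
        ->where('item_id', $row['item_id'])
        ->where('statement_guid', $row['statement_guid'])
        ->where('property_id', $row['property_id'])
        ->where('wikidata_value', $row['wikidata_value'])
        ->where('external_value', $row['external_value'])
        ->where('external_url', $row['external_url'])
        ->where('review_status', '!=', 'pending')       // already reviewed
        ->whereHas('importMeta', function ($query) use ($uploaderId) {
            $query->where('user_id', $uploaderId);       // same mismatch provider
        })
        ->exists();
}
```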

BDD
GIVEN Mismatch upload
AND a previously reviewed mismatch
WHEN an exact match of that mismatch is uploaded, differing only in the expiry date
THEN it is not imported into the mismatch store

Acceptance criteria:

  • We only drop mismatches that have previously been reviewed. Mismatches that have not been previously reviewed are imported again, potentially creating duplicates.
  • This should only apply to comparisons between mismatches by the same mismatch provider
  • A benchmark for an upload of 250 mismatches is established, and during Product Verification it is decided whether the import time is something we want to track

Tech notes
Before the change is made, let's measure how long it takes to upload 250 mismatches, as a benchmark for comparison once the change is in place.
Product will review the difference in import time once the change is made.
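
A rough way to get the 250-mismatch baseline. This is a generic timing harness, not code from the repository: the row generation and the `$runImport` placeholder are made up and would need to be swapped for the real upload path.

```php
<?php
// Rough benchmark sketch (not from the repository): time an import of 250 rows.
// Replace the $runImport placeholder with the real import/upload call.

$rows = [];
for ($i = 0; $i < 250; $i++) {
    $rows[] = [
        'item_id'        => 'Q' . (42 + $i),
        'statement_guid' => 'Q' . (42 + $i) . '$' . uniqid(),
        'property_id'    => 'P569',
        'wikidata_value' => '1985-01-01',
        'external_value' => '1986-01-01',
        'external_url'   => 'https://example.org/record/' . $i,
    ];
}

// Placeholder: stands in for the actual import (endpoint or service call).
$runImport = function (array $rows): void {
    usleep(1000); // no-op stand-in so the harness runs on its own
};

$start = hrtime(true);
$runImport($rows);
$elapsedMs = (hrtime(true) - $start) / 1e6;

printf("Imported %d rows in %.1f ms\n", count($rows), $elapsedMs);
```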

Original ticket

Several people have now set up workflows to automate uploading mismatches to the Mismatch Finder every x weeks. If a mismatch has been uploaded in week 1 and is reviewed in week 2 as an intentional mismatch, that means neither the external source nor Wikidata has been adjusted so that the two values match. The mismatch will show up again in the next upload in week 3 and will have to be reviewed again. This is not a good use of reviewer time.

Removing the already reviewed mismatches by the mismatch provider is possible once T329156 is fixed and they have access to the review decisions. However this is additional work we're putting on them that should be avoided.

We can solve this issue by simply ignoring mismatches in new uploads that we already have marked as reviewed in previous uploads.

Event Timeline

Arian_Bozorg renamed this task from Ignore already reviewed mismatches on reupload to [SW] Ignore already reviewed mismatches on reupload. Aug 15 2023, 10:02 AM
Arian_Bozorg updated the task description.

I'd advise that we find a way to notify users (uploaders) about which of the mismatches were ignored and why (either because they already exist or because they have been previously reviewed). Displaying this in the upload summary, if possible, would be a good idea.


If something like that can be done, and the results are accessible in a machine-readable format, then I could add checks to my script to avoid uploading the same ones the next month. It would also be interesting to have access to the previously reviewed and rejected mismatches in order to find data quality issues.
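
One possible shape for such a machine-readable upload summary, purely illustrative: no such response exists in the application today, and all field names and reason codes below are invented.

```php
<?php
// Purely illustrative: a possible machine-readable upload summary.
// None of these field names or reason codes exist in the application today.

$uploadSummary = [
    'imported' => 240,
    'skipped'  => [
        ['line' => 17, 'reason' => 'previously_reviewed'], // reviewed in an earlier upload
        ['line' => 58, 'reason' => 'already_exists'],       // identical unreviewed mismatch in store
    ],
];

echo json_encode($uploadSummary, JSON_PRETTY_PRINT), PHP_EOL;
```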

Arian_Bozorg renamed this task from [SW] Ignore already reviewed mismatches on reupload to Ignore already reviewed mismatches on reupload. Aug 24 2023, 12:31 PM

This makes sense to me. In these instances it would make sense to mark them as 'reviewed' in the 'review status' column of the CSV.

Task Breakdown Notes:

  • the basic requirement of this is to compare the columns from the imported file against previously imported mismatches
  • if the mismatch already exists in the DB in its exact form, for this uploader, it is ignored during upload
  • the existing mismatch gets tagged as "reviewed" in the "review_status" column of the csv summary of the upload
  • we should make sure all developers have the same access rights to the mismatch finder DB and know how to query it

Potential Plan of Action:

  • set up a reliable benchmark to measure upload resources for 250 mismatches. This benchmark will be used to compare the performance trade-off of this change. Subtask to be opened by @HasanAkgun_WMDE
  • The ImportController.php class controls all the services related to the upload (validation, import etc.). We could have it call a dedicated service to perform the required comparison.
  • All mismatches with status "reviewed" are not to be displayed for review in the MSMF results page.
  • Decisions about which DB strategy fits this feature are to be made next week.
  • if the mismatch already exists in the DB in its exact form, for this uploader, it is ignored during upload
  • the existing mismatch gets tagged as "reviewed" in the "review_status" column of the csv summary of the upload

These seem to be contradictory.
We either import and mark as reviewed or we ignore it and don't import the particular mismatch. I highly recommend not importing.

During task breakdown a few questions came up. After discussing those with Lydia we decided to alter the task scope to skipping previously reviewed mismatches on upload. The task description was adjusted accordingly.
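
A hedged sketch of how that skipping step could sit in the import flow, reusing the shouldSkip() idea from the sketch above. The service and method names are invented for illustration and do not reflect the actual ImportController wiring.

```php
<?php
// Illustrative only: a dedicated service that drops previously reviewed
// duplicates before the rows reach the persistence step. Names are invented;
// shouldSkip() refers to the earlier sketch in this task.

class ReviewedMismatchFilter
{
    /**
     * @param array $rows       parsed CSV rows of the new upload
     * @param int   $uploaderId the mismatch provider performing the upload
     * @return array rows that should actually be imported
     */
    public function filter(array $rows, int $uploaderId): array
    {
        return array_values(array_filter(
            $rows,
            fn (array $row) => !shouldSkip($row, $uploaderId)
        ));
    }
}

// Hypothetical call site, e.g. somewhere in the import pipeline:
// $rows = (new ReviewedMismatchFilter())->filter($rows, $upload->user_id);
```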


Sprint 3 Planning
From a technical point of view, keeping track of individual mismatches would be unnecessary overhead, but communicating about the upload is worthwhile and will be a separate task.