
Feature request: Structured way of reporting bad data on Wikidata that comes from external data sources
Open, Needs Triage | Public | Feature

Description

User story:
As a Wikidata editor,
I want to have a simple way of reporting errors in external id values to the respective data sources
in order to fix these errors upstream in an efficient way.

As a curator of an external data source,
I want to get error reports from Wikidata in a structured way
in order to deal with them more systematically and efficiently.

Problem:

  • The status quo for Wikidata editors is very time-consuming and often unproductive:
    • language barriers between user and institution
    • each institution has a different process
    • in practice, such reports often go unanswered
  • The process is often opaque:
    • it might help if everyone could see the number of solved and unsolved errors per external data source (e.g. to avoid duplicate reports, or as an incentive)

Solution:
The primary goal should be a structured way of reporting errors for Wikidata editors.

Ideas:
Maybe the simplest way to solve this (maybe even without the need for coding):

Some more elaborate features of a potential new tool:

  • managing the databases should be allowed for both Wikidata editors and interested institutions (those on which one or more external IDs on Wikidata are based)
    • institutions should be able to resolve the reports addressed to them (each employee of the institution can log in and resolve issues)
  • reports should be possible through the tool interface
    • and ideally also directly from a Wikidata item through a suitable gadget (e.g. an "add mistake report" button for each external ID value)
  • constraint violations are added automatically
    • the automatic system should also include statements deprecated with qualifier P2241: Q29998666 (reason: error in the referenced source)
  • possible improvement: give the institution a way to resolve reports semi-automatically as well
  • maybe we can integrate this into Mismatch Finder (or at least reuse some of its code and infrastructure)
    • Mismatch Finder also has external errors to report back (if something was reported by mistake)
  • ideally, we could also report systematic errors that affect a group of values of the external ID (e.g. in the form of a message)
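
As a rough illustration of the automatic part, the statements deprecated with P2241: Q29998666 could probably be collected with a query generated per property. The sketch below only builds the SPARQL text and sends nothing. The function name and the `limit` parameter are made up for this example, and it assumes the standard prefixes of the Wikidata Query Service (`p:`, `ps:`, `pq:`, `wd:`, `wikibase:`).

```python
import re

# Deprecation reason named in the task description
# ("error in the referenced source").
ERROR_IN_SOURCE = "Q29998666"


def deprecated_error_query(property_id: str, limit: int = 100) -> str:
    """Build a SPARQL query for statements of one external-ID property
    that are deprecated with P2241 = Q29998666 (hypothetical helper)."""
    if not re.fullmatch(r"P[1-9]\d*", property_id):
        raise ValueError(f"not a property ID: {property_id!r}")
    return f"""SELECT ?item ?value WHERE {{
  ?item p:{property_id} ?statement .
  ?statement ps:{property_id} ?value ;
             wikibase:rank wikibase:DeprecatedRank ;
             pq:P2241 wd:{ERROR_IN_SOURCE} .
}}
LIMIT {limit}"""


# P7796 is used only because it is the property mentioned in the comments.
print(deprecated_error_query("P7796"))
```

Each result row would correspond to one automatically created report. Constraint violations would need a separate source and are not covered by this sketch.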

The basic table columns could look similar to those in Mismatch Finder:

  • Property
  • External ID
  • WD Item
  • Status (can be changed by the institution)
  • Comment by the reporter (why do we think this is wrong; automatic or manual)
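
To make the columns above a bit more concrete, here is a minimal sketch of what one report record could look like, together with the per-source tally of solved and unsolved reports mentioned under "Problem". All names here (`ErrorReport`, `Status`, `count_by_status`) and the status values are assumptions for illustration, not an existing schema, and the IDs in the sample data are placeholders.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Hypothetical status values; the real set would be up to the institutions.
    OPEN = "open"
    FIXED_UPSTREAM = "fixed upstream"
    REJECTED = "rejected"


@dataclass
class ErrorReport:
    property_id: str            # external-ID property, e.g. "P7796"
    external_id: str            # the reported identifier value
    item_id: str                # Wikidata item carrying the value
    comment: str                # why the reporter thinks the value is wrong
    automatic: bool = False     # True if created automatically
    status: Status = Status.OPEN  # can be changed by the institution


def count_by_status(reports):
    """Tally reports per (property, status) pair."""
    return Counter((r.property_id, r.status) for r in reports)


# Placeholder data, not real reports.
reports = [
    ErrorReport("P7796", "EXAMPLE-1", "Q4115189", "placeholder comment"),
    ErrorReport("P7796", "EXAMPLE-2", "Q4115189", "placeholder comment",
                automatic=True, status=Status.FIXED_UPSTREAM),
]
print(count_by_status(reports))
```

A public view of such a tally per property would show how many reports are still open for each data source, which is the kind of transparency asked for above.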

Example table from Mismatch Finder

Notes:

Mockups:

Acceptance criteria:

Open questions:

  • Do we need a tool, or could we solve this with a wikipage + properties system?
  • What workflow would work well for curators of external data sources?
  • Should old errors remain on Wikidata (with a suitable deprecation reason) after they have been changed/fixed in the referenced source?
    • maybe they should, especially if the reference has an access date (the tool would need to be aware of this)

Community communication:
Who we need to keep in the loop and in what way:
Who this could be interesting for and in what way:

Original:
Created in a working session at the Wikidata Data Quality Days 2022.

Event Timeline

Manuel renamed this task from Structured way of reporting bad data on Wikidata that comes from external data sources to Feature request: Structured way of reporting bad data on Wikidata that comes from external data sources. Jul 10 2022, 11:43 AM
Jonteemil changed the subtype of this task from "Task" to "Feature Request". Jul 19 2022, 8:46 PM

We need to step up our efforts to convince the most important databases to accept, in a standard way, reports from Wikidata users. Recently, for example, BNF started refusing all mistake reports: https://www.wikidata.org/wiki/User:CaféBuzz/BNF. P.S. Today this ticket turns one!

I was made aware of this ticket at the last office hour through @Epidosis's comment.

Some months ago I wrote the query https://w.wiki/9QXY to get all the deprecated statements sourced with a specific identifier (in this case, P7796), along with the normal or preferred statement that could be used instead. It currently also relies on the presence of a specific P248 in the reference, but that requirement could probably be dropped. However, the query has scaling issues (on P269 it doesn’t work without restricting the items considered).

I believe such a query-based approach could be a good start, if only to get the errors coming from a specific source, along with suggested corrections. The queries can easily be constructed from the property itself, so adding a link to the property talk page should be feasible.
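
To illustrate how such a link could be generated from the property alone, here is a rough sketch. It is not a copy of the query linked above: it only lists deprecated statements whose reference uses the given identifier property, leaves out the suggested replacement statement, and will likely hit the same scaling limits on large properties. The function names are invented for this example, and the link uses the query service's `#<encoded query>` URL form.

```python
from urllib.parse import quote


def sourced_deprecated_query(property_id: str, limit: int = 100) -> str:
    """SPARQL for deprecated statements whose reference contains the
    given external-ID property (hypothetical helper)."""
    return f"""SELECT ?item ?statement ?externalId WHERE {{
  ?reference pr:{property_id} ?externalId .
  ?statement prov:wasDerivedFrom ?reference ;
             wikibase:rank wikibase:DeprecatedRank .
  ?item ?claim ?statement .
}}
LIMIT {limit}"""


def query_service_link(query: str) -> str:
    """Turn a query into a link that opens it in the Wikidata Query Service."""
    return "https://query.wikidata.org/#" + quote(query, safe="")


# One link per property, e.g. for posting on the property talk page.
print(query_service_link(sourced_deprecated_query("P7796")))
```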

It still needs the involvement of third parties to get the errors corrected.