Remove signal: Number of sitelinks
Closed, Resolved | Public | 3 Estimated Story Points

Description

Problem:
The number of sitelinks is taken into account when determining the quality of an Item. However, the number of sitelinks is more representative of the popularity of the concept behind the Item than of the quality of the Item itself. We should remove it.

Acceptance criteria:

  • Sitelink count is no longer taken into account for quality scoring
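
A minimal sketch of what this criterion could look like in a feature-selection step. All names here (`FEATURES`, `select_features`, and the individual feature names) are hypothetical and for illustration only; the actual feature list of the quality model is not shown in this task.

```python
# Hypothetical feature names; the real model's feature list is not part of this task.
FEATURES = ["statement_count", "reference_count", "label_count", "sitelink_count"]


def select_features(item_stats, excluded=("sitelink_count",)):
    """Return the signals used for scoring, leaving out any excluded ones."""
    return {
        name: item_stats[name]
        for name in FEATURES
        if name not in excluded
    }


# Example Item statistics (made-up values).
stats = {
    "statement_count": 12,
    "reference_count": 7,
    "label_count": 30,
    "sitelink_count": 150,
}

features = select_features(stats)
```

With the default exclusion, a heavily sitelinked Item and an otherwise identical Item with no sitelinks would yield the same input to the scorer.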

Event Timeline

Side note: Please keep in mind that data quality is often understood as the "fitness for use" (http://www.semantic-web-journal.net/system/files/swj414.pdf) or "purpose" (https://www.cio.com/article/3124402/ensuring-the-quality-of-fit-for-purpose-data.html) of the data. From the RfC "Data quality framework for Wikidata":

Following Lukyanenko et al., 2014[1], we define data quality as “the extent to which stored information represents the phenomena of interest to data consumers (and project sponsors), as perceived by information contributors”.

As a result, relevance has traditionally been considered a data quality dimension: all things being equal, the data that is most relevant to users (more linked/shared/used) can also be considered of higher quality. The mentioned RfC also proposed that sitelinks were a relevant data quality dimension for Wikidata in particular:

The extent to which data are sufficiently interlinked to other resources, either within or without Wikimedia.

  • Description and examples: Items should be linked to equivalent entities in other knowledge bases; Items should have an appropriate number of links to other Wikimedia projects.

This side note is not meant to force anyone to rethink this task or to change the way ORES is being used. It is only a warning that probably few other signals are being used to determine whether the data ultimately meets people's needs (the decisive indicator and high-level synonym of "data quality"), and that some of the definitions and objectives reflected in different places under the common name of "data quality" are, unfortunately, less and less aligned.