
Regularly purge orphaned sitelink, value and reference nodes
Open, Low · Public · BUG REPORT

Description

The Wikidata Query service appears to retain data for many sitelinks which are not attached to articles.

Additionally, many such "orphaned" sitelinks still appear to carry badges. (For examples of item-less sitelinks with badges, see the results of https://w.wiki/4rxZ.)
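For illustration, a query along these lines (a sketch, not necessarily the exact query behind the short link above) surfaces sitelink nodes that still carry a badge even though the item they point to via schema:about has no data left in the graph:

```
PREFIX schema: <http://schema.org/>
PREFIX wikibase: <http://wikiba.se/ontology#>

# Sitelink nodes that still carry a badge although their target item
# has no triples left (e.g. the item was deleted).
SELECT ?article ?item ?badge WHERE {
  ?article wikibase:badge ?badge ;
           schema:about ?item .
  FILTER NOT EXISTS { ?item ?p ?o }
}
LIMIT 100
```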

Event Timeline

dcausse added a subscriber: dcausse.

Sitelink orphans are not cleaned up at update time for performance reasons; the same applies to orphaned values and references. The database is cleaned up during a full reload, which we try to plan once a year. Declining, as this is done on purpose, but please feel free to re-open if this causes any usability issue on your side.

Bugreporter renamed this task from "Query service retains orphaned sitelinks" to "Regularly purge orphaned sitelink, value and reference nodes". Mar 31 2022, 9:27 PM
Bugreporter reopened this task as Open.
Bugreporter added a subscriber: Bugreporter.

Reopening. Having orphaned sitelink, value, or reference nodes means users can add sensitive information to WDQS with no easy way to clean it up.

@dcausse, is there a reason these changes are not reflected within the 10 min update lag we have for other WD changes to be reflected in WDQS?

The reason is that this data may be referenced by other items and thus cannot be deleted blindly without asking Blazegraph "is this data used by another item?", which would be too costly to do for every edit.
Another approach is to reload Blazegraph from the dumps at regular intervals (TBD: once, twice or four times a year).
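To make the cost concrete: for every touched value or reference node on every edit, the updater would have to ask the triplestore an existence question along these lines (a sketch; the value-node hash is a made-up placeholder):

```
# Is this value node still referenced from anywhere else in the graph?
# The hash in the IRI is a hypothetical placeholder.
ASK { ?anySubject ?anyPredicate <http://www.wikidata.org/value/0123456789abcdef> }
```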

I don't think "once, twice or four times a year" is enough. If someone adds sensitive information to Wikidata, we need to remove it as soon as possible.

A proposed workflow:

  1. Use queries to find all orphaned sitelink, value, and reference nodes, and record their IRIs (a sketch of such a query follows below).
  2. Remove them from the triplestore.
  3. Recheck whether they are still unused; if any are used again, add them back to the triplestore.

I am not sure whether this is enough to prevent race conditions.

(This would be done with a script completely independent of the Stream Updater.)
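A sketch of what step 1 could look like for reference nodes, assuming an orphaned reference is one that no statement links to via prov:wasDerivedFrom (on the full graph this is an expensive scan, which is part of the performance concern raised above):

```
PREFIX prov: <http://www.w3.org/ns/prov#>

# Reference nodes that no statement points to any more.
SELECT DISTINCT ?ref WHERE {
  ?ref ?p ?o .
  FILTER STRSTARTS(STR(?ref), "http://www.wikidata.org/reference/")
  FILTER NOT EXISTS { ?statement prov:wasDerivedFrom ?ref }
}
```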

Has removing sensitive information been a problem so far, or is it something that is currently an issue? I acknowledge that, as an edge case, it should ideally be one that we can take care of quickly.
In the past, we've had one-off tickets to clean up data quickly without waiting for reloads. In the case of a sensitive-information issue, which to the best of my knowledge is a rare edge case, filing a ticket to manually remove the data would seem to solve the problem.

From what I can tell, the other remaining issues are for different use cases: Wikidata quality maintenance, and queries returning faulty information. These seem lower priority than sensitive information and could perhaps be addressed by more frequent reloads, though it is an open question what that frequency should be.

As mentioned above, the proposed workflow for real-time garbage collection is not viable due to the performance load it would put on WDQS: running that many queries for each edit to find orphans and recheck them would likely choke WDQS further and cause many more timeouts for users across the board. Unfortunately, we are in a position where we need to make compromises right now, in this case between having freshly updated edits to Wikidata and having enough stability that users are able to run WDQS queries. Neither extreme is great: up-to-date information that nobody can query, or a severely outdated graph that returns faulty information when queried.

Reloading Blazegraph more regularly (which itself takes human and machine time to do), in addition to specific tickets to remove sensitive information ASAP, is a proposed compromise between data quality and service usability.

Another possible workflow:
(1) Once a sitelink/value/reference node becomes unused, it is put into a queue (deduplicated).
(2) Use something like LDF to quickly filter out value or reference nodes that are still used (no need to do this for sitelinks, as one sitelink can only be used by one item); we need to figure out how to do this once we no longer use Blazegraph.
(3) Remove the ones that are still unused (see the sketch below).
(4) Recheck.
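A sketch of steps (2)-(4) for a single queued value node, assuming SPARQL Update is available on the backend (the hash is a hypothetical placeholder); with LDF, the same "is it still used?" check would be a single triple-pattern lookup with the node in the object position:

```
# (2)/(4) Recheck: does anything still point at the queued node?
ASK { ?s ?p <http://www.wikidata.org/value/0123456789abcdef> }

# (3) Only if the ASK returns false: drop all triples of the node.
DELETE WHERE { <http://www.wikidata.org/value/0123456789abcdef> ?p ?o }
```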

Thanks for the clarification. With regard to orphaned nodes throwing off query results, there should be ways to write SPARQL queries in such a way that they ignore these nodes.
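For example (a sketch, assuming the wikibase:sitelinks count triple on the item can serve as a cheap "item still has data" signal), a badge query can simply require that the linked item still exists in the graph, so orphaned sitelink nodes drop out of the results:

```
PREFIX schema: <http://schema.org/>
PREFIX wikibase: <http://wikiba.se/ontology#>

# Only keep badge-carrying sitelinks whose target item still has data.
SELECT ?article ?item ?badge WHERE {
  ?article wikibase:badge ?badge ;
           schema:about ?item .
  FILTER EXISTS { ?item wikibase:sitelinks ?linkCount }
}
LIMIT 100
```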

In the meantime, we will try to reload Wikidata more often, as discussed above.

MPhamWMF moved this task from Incoming to Operations/SRE on the Wikidata-Query-Service board.