
Regularly purge orphaned sitelink, value and reference nodes
Open, Low, Public, BUG REPORT

Description

The Wikidata Query service appears to retain data for many sitelinks which are not attached to articles.

Additionally, many such "orphaned" sitelinks are still shown as containing badges. (For examples of item-less sitelinks with badges, see the results of https://w.wiki/4rxZ .)
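
Roughly, the kind of query that surfaces these cases might look like the sketch below (the actual query behind https://w.wiki/4rxZ may differ, and the deleted-item heuristic via schema:dateModified is only an assumption; WDQS predefines the schema: and wikibase: prefixes):

```sparql
# Sketch only: sitelink nodes that still carry a badge although the item
# they point to appears to be gone (heuristic: deleted items no longer have
# a schema:dateModified triple in WDQS).
SELECT ?sitelink ?item ?badge WHERE {
  ?sitelink schema:about ?item ;
            wikibase:badge ?badge .
  FILTER NOT EXISTS { ?item schema:dateModified ?modified . }
}
LIMIT 100
```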

Event Timeline

dcausse subscribed.

Sitelink orphans are not cleaned up at update time for performance reasons; the same applies to orphaned values and references. The database is cleaned up during a full reload, which we try to plan once a year. Declining, as this is done on purpose, but please feel free to re-open if this causes any usability issue on your side.

Bugreporter renamed this task from "Query service retains orphaned sitelinks" to "Regularly purge orphaned sitelink, value and reference nodes". Mar 31 2022, 9:27 PM
Bugreporter reopened this task as Open.
Bugreporter subscribed.

Reopening. Having orphaned sitelink, value or reference nodes means users can add sensitive information to WDQS with no easy way to clean it up.

@dcausse, is there a reason these changes are not reflected within the 10 min update lag that other Wikidata changes have in WDQS?

The reason is that this data may be referenced by other items and thus cannot be deleted blindly without asking Blazegraph "is this data used by another item?", which would be too costly to do for every edit.
Another approach is to reload Blazegraph from the dumps at regular intervals (TBD: once, twice or four times a year).

I don't think "once, twice or four times a year" is enough. If someone added sensitive information into Wikidata, we need to remove it as soon as possible.

A proposed workflow:

  1. use queries to find all orphaned sitelink, value and reference nodes, recording their IRIs (see the sketch below)
  2. remove them from the triplestore
  3. recheck whether they are still unused; if any are in use again, add them back to the triplestore

I am not sure whether it is enough to prevent race conditions.

(This is to be done with a script completely independent of Stream Updater)
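
As an illustration of step 1, limited to time values, the orphan check could look something like the sketch below. What counts as "orphaned" here - no incoming link from any http://www.wikidata.org/prop/ predicate - is an assumption, and the reverse lookup inside FILTER NOT EXISTS is exactly the kind of expensive probe mentioned earlier:

```sparql
# Sketch for step 1, time values only: value nodes that still hold data but
# are not reachable from any statement, qualifier or reference
# (psv:/pqv:/prv: predicates all live under http://www.wikidata.org/prop/).
SELECT ?valueNode WHERE {
  ?valueNode a wikibase:TimeValue .
  FILTER NOT EXISTS {
    ?referrer ?link ?valueNode .
    FILTER STRSTARTS(STR(?link), "http://www.wikidata.org/prop/")
  }
}
LIMIT 10000
```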

Has removing sensitive information been a problem so far, or is it something that is currently an issue? I acknowledge that, as an edge case, it should ideally be one that we can take care of quickly.
In the past, we've had one-off tickets to clean up data quickly without waiting for reloads. In the case of a sensitive information issue, which to the best of my knowledge is a rare edge case, filing a ticket to manually remove the data would seem to solve the problem.

From what I can tell, the remaining issues are for different use cases: Wikidata quality maintenance, and queries returning faulty information. These issues seem lower priority than sensitive information and could perhaps be addressed by more frequent reloads, though the right frequency is an open question.

As mentioned above, the proposed workflow for real-time garbage collection is not a viable solution due to the performance load it would put on WDQS: running this many queries for each edit to find orphans and recheck them would likely choke WDQS further and cause many more timeouts for users across the board. Unfortunately we are in a position where we need to make compromises right now -- in this case between having freshly updated edits to Wikidata and having enough stability that users are able to run WDQS queries. Neither extreme is great: up-to-date information that nobody can query, versus a severely outdated graph that returns faulty information when queried.

Reloading Blazegraph more regularly (which itself takes human and machine time to do), in addition to specific tickets to remove sensitive information ASAP, is a proposed compromise between data quality and service usability.

Another possible workflow:
(1) once a sitelink/value/reference is unused, it is put into a queue (deduplicated)
(2) use something like LDF to quickly filter out values or references that are still used (no need to do this for sitelinks, as one sitelink can only be used in one item) - we need to figure out how to do this once we no longer use Blazegraph; see the sketch below
(3) remove them if they are still unused
(4) recheck
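
For step (2), the per-node usage check could be as small as one ASK query per queued IRI. A sketch for a reference node (wdref:PLACEHOLDERHASH stands for whatever hash was queued, and prov:wasDerivedFrom is the statement-to-reference link in the Wikibase RDF model):

```sparql
# Sketch of the step (2) usage check for a single queued reference node.
# A value node would be checked the same way against psv:/pqv:/prv: links.
ASK { ?statement prov:wasDerivedFrom wdref:PLACEHOLDERHASH . }
```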

Thanks for the clarification. With regard to orphaned nodes throwing off query results, there should be ways to write SPARQL queries in such a way that they ignore these nodes.

In the meantime, we will try to reload Wikidata more often, as discussed above.
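
For what it's worth, the exclusion pattern presumably looks something like the sketch below when querying from the value-node side (P569 and the precision value are arbitrary examples; the FILTER EXISTS clause is the part that skips orphaned wdv: nodes, at the cost of an extra join):

```sparql
# Sketch: skip value nodes that no statement points at any more.
SELECT ?valueNode ?time WHERE {
  ?valueNode wikibase:timeValue ?time ;
             wikibase:timePrecision 11 .
  FILTER EXISTS { ?statement psv:P569 ?valueNode . }
}
LIMIT 100
```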


If it's possible to exclude orphaned nodes in a query, it must be possible to write queries which find orphaned nodes. Why not have a maintenance script that queries for orphaned nodes and cleans them up, run on a regular basis without depending on a full reload? It would be more understandable to me if orphaned nodes were cleaned up once a day or so - not in real time, but still within a reasonable time frame.

I'm not able to fix my own queries, though, due to the query timeouts. For example, how can I fix this query? As it is, it runs in under a second but returns incorrect data. When I try to make sure the value node exists (e.g. like this or like this), I get a timeout. I have the same problem with wikibase:timePrecision.

(Here are 10,000 nodes to clean up, the most I could get without it timing out. Here are another 107 where the globe isn't Q2.)

Ran into this again while trying to check whether a property is in use in references, since pr: includes non-existent references - https://query.wikidata.org/#select%20%2a%20%7B%20%3Fs%20pr%3AP2183%20%3Fval%20%7D

Here are 40,000 reference nodes which need cleaning up (again, the most I could get without a timeout).
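
For reference, the attached-only variant of that check would look roughly like this; an extra clause of this kind is what tends to push such queries into timeouts, as described above:

```sparql
# Sketch: same property-usage check, but only for reference nodes that some
# statement still points at, so orphaned wdref: nodes drop out.
SELECT ?ref ?val WHERE {
  ?ref pr:P2183 ?val .
  FILTER EXISTS { ?statement prov:wasDerivedFrom ?ref . }
}
```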

This report of grammatical features is wrong because it includes deleted data. As with the previous queries I mentioned, I'm unable to fix it because that takes it from running in under a second to timing out.

This query returns a form which was deleted 11 months ago.

(Here are 100 forms which need cleaning up.)


The presence of triples like

wd:L643664-F1 wikibase:grammaticalFeature wd:Q109459317

seems like a different problem to me; forms are not considered "shared resources" and thus cannot be orphaned. I suspect a bug in the way we delete Lexemes and will file a specific task for this.
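
For the follow-up task, a query that would surface such leftovers could look roughly like this (assuming, per the lexeme RDF model, that live forms are linked from their lexeme via ontolex:lexicalForm, and that the ontolex: prefix is available as it is on WDQS):

```sparql
# Sketch: form nodes that still carry grammatical features although no
# lexeme links to them any more, i.e. probable leftovers from deleted
# Lexemes rather than "orphans" in the shared-resource sense.
SELECT ?form ?feature WHERE {
  ?form wikibase:grammaticalFeature ?feature .
  FILTER NOT EXISTS { ?lexeme ontolex:lexicalForm ?form . }
}
LIMIT 100
```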