
Regularly purge orphaned sitelink, value and reference nodes
Open, Low, Public, BUG REPORT

Description

The Wikidata Query service appears to retain data for many sitelinks which are not attached to articles.

Additionally, many such "orphaned" sitelinks are still shown as containing badges. (For examples of item-less sitelinks with badges, see the results of https://w.wiki/4rxZ .)
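
Roughly, the kind of query that surfaces these cases might look like the sketch below (the actual query behind https://w.wiki/4rxZ may differ, and the deleted-item heuristic via schema:dateModified is only an assumption; WDQS predefines the schema: and wikibase: prefixes):

```sparql
# Sketch only: sitelink nodes that still carry a badge although the item
# they point to appears to be gone (heuristic: deleted items no longer have
# a schema:dateModified triple in WDQS).
SELECT ?sitelink ?item ?badge WHERE {
  ?sitelink schema:about ?item ;
            wikibase:badge ?badge .
  FILTER NOT EXISTS { ?item schema:dateModified ?modified . }
}
LIMIT 100
```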

Event Timeline

dcausse subscribed.

Sitelink orphans are not cleaned up at update time for performance reasons; the same applies to orphaned values and references. The database is cleaned up during a full reload, which we try to plan once a year. Declining, as this is done on purpose, but please feel free to re-open if this causes any usability issue on your side.

Bugreporter renamed this task from "Query service retains orphaned sitelinks" to "Regularly purge orphaned sitelink, value and reference nodes". Mar 31 2022, 9:27 PM
Bugreporter reopened this task as Open.
Bugreporter subscribed.

Reopening. Having orphaned sitelink, value or reference nodes means users can add sensitive information to WDQS with no easy way to clean it up.

@dcausse, is there a reason these changes are not reflected within the 10 min update lag that other Wikidata changes have in WDQS?

The reason is that this data may be referenced by other items and thus cannot be deleted blindly without asking Blazegraph "is this data used by another item?", which would be too costly to do for every edit.
Another approach is to reload Blazegraph from the dumps at regular intervals (TBD: once, twice or four times a year).

I don't think "once, twice or four times a year" is enough. If someone added sensitive information into Wikidata, we need to remove it as soon as possible.

A proposed workflow:

  1. use queries to find all orphaned sitelink, value and reference nodes, recording their IRIs (see the sketch below)
  2. remove them from the triplestore
  3. recheck whether they are still unused; if any are in use again, add them back to the triplestore

I am not sure whether it is enough to prevent race conditions.

(This is to be done with a script completely independent of Stream Updater)
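
As an illustration of step 1, limited to time values, the orphan check could look something like the sketch below. What counts as "orphaned" here - no incoming link from any http://www.wikidata.org/prop/ predicate - is an assumption, and the reverse lookup inside FILTER NOT EXISTS is exactly the kind of expensive probe mentioned earlier:

```sparql
# Sketch for step 1, time values only: value nodes that still hold data but
# are not reachable from any statement, qualifier or reference
# (psv:/pqv:/prv: predicates all live under http://www.wikidata.org/prop/).
SELECT ?valueNode WHERE {
  ?valueNode a wikibase:TimeValue .
  FILTER NOT EXISTS {
    ?referrer ?link ?valueNode .
    FILTER STRSTARTS(STR(?link), "http://www.wikidata.org/prop/")
  }
}
LIMIT 10000
```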

Has removing sensitive information been a problem so far, or is it something that is currently an issue? I acknowledge that, as an edge case, it should ideally be one that we can take care of quickly.
In the past, we've had one-off tickets to clean up data quickly without waiting for reloads. In the case of a sensitive information issue, which to the best of my knowledge is a rare edge case, filing a ticket to manually remove the data would seem to solve the problem.

From what I can tell, the remaining issues are for different use cases: Wikidata quality maintenance, and queries returning faulty information. These issues seem lower priority than sensitive information and could perhaps be addressed by more frequent reloads, though the right frequency is an open question.

As mentioned above, the proposed workflow for real-time garbage collection is not a viable solution due to the performance load it would put on WDQS: running this many queries for each edit to find orphans and recheck them would likely choke WDQS further and cause many more timeouts for users across the board. Unfortunately we are in a position where we need to make compromises right now -- in this case between having freshly updated edits to Wikidata and having enough stability that users are able to run WDQS queries. Neither extreme is great: up-to-date information that nobody can query, versus a severely outdated graph that returns faulty information when queried.

Reloading Blazegraph more regularly (which itself takes human and machine time to do), in addition to specific tickets to remove sensitive information ASAP, is a proposed compromise between data quality and service usability.

Another possible workflow:
(1) once a sitelink/value/reference is unused, it is put into a queue (deduplicated)
(2) use something like LDF to quickly filter out values or references that are still used (no need to do this for sitelinks, as one sitelink can only be used in one item) - we need to figure out how to do this once we no longer use Blazegraph; see the sketch below
(3) remove them if they are still unused
(4) recheck
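
For step (2), the per-node usage check could be as small as one ASK query per queued IRI. A sketch for a reference node (wdref:PLACEHOLDERHASH stands for whatever hash was queued, and prov:wasDerivedFrom is the statement-to-reference link in the Wikibase RDF model):

```sparql
# Sketch of the step (2) usage check for a single queued reference node.
# A value node would be checked the same way against psv:/pqv:/prv: links.
ASK { ?statement prov:wasDerivedFrom wdref:PLACEHOLDERHASH . }
```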

Thanks for the clarification. With regard to orphaned nodes throwing off query results, there should be ways to write SPARQL queries in such a way that they ignore these nodes.

In the meantime, we will try to reload Wikidata more often, as discussed above.
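
For what it's worth, the exclusion pattern presumably looks something like the sketch below when querying from the value-node side (P569 and the precision value are arbitrary examples; the FILTER EXISTS clause is the part that skips orphaned wdv: nodes, at the cost of an extra join):

```sparql
# Sketch: skip value nodes that no statement points at any more.
SELECT ?valueNode ?time WHERE {
  ?valueNode wikibase:timeValue ?time ;
             wikibase:timePrecision 11 .
  FILTER EXISTS { ?statement psv:P569 ?valueNode . }
}
LIMIT 100
```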


If it's possible to exclude orphaned nodes in a query, it must be possible to write queries which find orphaned nodes. Why not have a maintenance script that queries for orphaned nodes and cleans them up, run on a regular basis without depending on a full reload? It would be more understandable to me if orphaned nodes were cleaned up once a day or so - not in real time, but still within a reasonable time frame.

I'm not able to fix my own queries, though, due to the query timeouts. For example, how can I fix this query? As it is, it runs in under a second but returns incorrect data. When I try to make sure the value node exists (e.g. like this or like this), I get a timeout. I have the same problem with wikibase:timePrecision.

(Here are 10,000 nodes to clean up, the most I could get without it timing out. Here are another 107 where the globe isn't Q2.)

Ran into this again while trying to check whether a property is in use in references, since pr: includes non-existent references - https://query.wikidata.org/#select%20%2a%20%7B%20%3Fs%20pr%3AP2183%20%3Fval%20%7D

Here are 40,000 reference nodes which need cleaning up (again, the most I could get without a timeout).
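
For reference, the attached-only variant of that check would look roughly like this; an extra clause of this kind is what tends to push such queries into timeouts, as described above:

```sparql
# Sketch: same property-usage check, but only for reference nodes that some
# statement still points at, so orphaned wdref: nodes drop out.
SELECT ?ref ?val WHERE {
  ?ref pr:P2183 ?val .
  FILTER EXISTS { ?statement prov:wasDerivedFrom ?ref . }
}
```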

This report of grammatical features is wrong because it includes deleted data. As with the previous queries I mentioned, I'm unable to fix it because that takes it from running in under a second to timing out.

This query returns a form which was deleted 11 months ago.

(Here are 100 forms which need cleaning up.)


The presence of triples like

wd:L643664-F1 wikibase:grammaticalFeature wd:Q109459317

seems like a different problem to me; forms are not considered "shared resources" and thus cannot be orphaned. I suspect a bug in the way we delete Lexemes and will file a specific task for this.
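
For the follow-up task, a query that would surface such leftovers could look roughly like this (assuming, per the lexeme RDF model, that live forms are linked from their lexeme via ontolex:lexicalForm, and that the ontolex: prefix is available as it is on WDQS):

```sparql
# Sketch: form nodes that still carry grammatical features although no
# lexeme links to them any more, i.e. probable leftovers from deleted
# Lexemes rather than "orphans" in the shared-resource sense.
SELECT ?form ?feature WHERE {
  ?form wikibase:grammaticalFeature ?feature .
  FILTER NOT EXISTS { ?lexeme ontolex:lexicalForm ?form . }
}
LIMIT 100
```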