
Rework how value and reference changes are handled
Closed, Resolved · Public

Description

The current workflow of the updater requires loading the triples from the RDF store prior to sending an update to it.
This makes the process very sensitive to update order.
It also prevents further refactoring to introduce a queue holding the triples to change (so that we stop calling Special:EntityData once per node for every update).

Event Timeline

Restricted Application added a subscriber: Aklapper.
dcausse triaged this task as Medium priority.

References and values are identified by a hash computed over their properties. It is not a unique ID as it is always generated on the fly when extracting the entity data.
The current RDF projection makes it a resource that is referenced from other triples. The problem is that when some of the properties of the value/reference change, the hash changes as well, making this process prone to leaving orphans in the RDF store.
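To make this concrete, here is a small sketch of my own (not taken from this task; Q42 and P569/date of birth are arbitrary choices, and the prefixes are the ones pre-declared on the query service) showing how a statement node points at value and reference nodes that are named only by their hash. When the underlying value changes, the statement is repointed to a new wdv:/wdref: node and the old one stays behind unless it is explicitly removed:

SELECT ?statement ?valueNode ?referenceNode WHERE {
  wd:Q42 p:P569 ?statement .                                  # statement node
  ?statement psv:P569 ?valueNode .                            # wdv:<hash> full value node
  OPTIONAL { ?statement prov:wasDerivedFrom ?referenceNode }  # wdref:<hash> reference node
}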
Another complexity is that the hashes may be shared with other entities, making the cleanup even harder since we can only clean up the values/refs when they are no longer used anywhere else. This cleanup is currently done in real time whenever a batch of entities is updated.
The query is as follows:

select ?s ?p ?o WHERE {
  VALUES ?s { list of values/references }
  # Since values are shared we can only clear the values on them when they are no longer used
  # anywhere else.
  
  FILTER NOT EXISTS {
    ?someEntity ?someStatementPred ?s .
    FILTER(?someStatementPred != wikibase:quantityNormalized)
  }
  ?s ?p ?o .
}

and is done once for values and once for references.
We should probably try to monitor how much time is spent trying to clean up orphaned values & references. But also count how many values & references are duplicated in the dump.
This is to answer the following questions:

  • is it worthwhile to continue to do orphan detection during real-time updates?
  • is it worthwhile to investigate an offline method to prune orphaned values & references?
  • is it worthwhile to dedup values & references at import time?

References and values are identified by a hash computed over their properties. It is not a stable ID as it is always generated on the fly when extracting the entity data.

They’re fairly stable in practice, though – I think it’s been a while since the last time we broke the hashes. See also T167759: Reference hash is not stable for more discussion.

But also count how many values & references are duplicated in the dump.

A full ?reference (COUNT(*)) GROUP BY ?reference probably isn’t possible on the live query service, but “imported from English Wikipedia” seems like a good contender for one of the most common references, and is used on 14,021,522 statements. Imports from German (4,987,584) and Russian Wikipedia (3,035,493) are also common, though I suspect they’re beaten by some external database (“stated in”) that I can’t think of right now.
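As a side note, here is a rough sketch of the grouping query alluded to above (my own formulation, in case someone wants to try it against a local copy; on the public query service it would almost certainly hit the timeout):

SELECT ?reference (COUNT(*) AS ?uses) WHERE {
  ?statement prov:wasDerivedFrom ?reference .
}
GROUP BY ?reference
ORDER BY DESC(?uses)
LIMIT 20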

Some numbers extracted from a dump:

  • number of values: 20,659,551
  • number of unique values: 11,028,526
  • number of references: 60,078,314
  • number of unique references: 58,876,057

So to the question:

is it worthwhile to dedup values & references at import time

probably not: given that we have between 3 and 5 triples per value and roughly 9.6M duplicate value nodes (20.7M total vs. 11M unique), this would only save on the order of 27M to 45M duplicate triple inserts out of ~8B triples.
For references it's clearly not worthwhile.

References and values are identified by a hash computed over their properties. It is not a stable ID as it is always generated on the fly when extracting the entity data.

They’re fairly stable in practice, though – I think it’s been a while since the last time we broke the hashes. See also T167759: Reference hash is not stable for more discussion.

Thanks for the link and the context. "Stable ID" was probably misleading; I'll rephrase it as "non-unique ID".

Change 556032 had a related patch set uploaded (by DCausse; owner: DCausse):
[wikidata/query/rdf@master] Track existing values and refs outside of the munger

https://gerrit.wikimedia.org/r/556032

Change 556032 merged by jenkins-bot:
[wikidata/query/rdf@master] Track existing values and refs outside of the munger

https://gerrit.wikimedia.org/r/556032

The munger has been reworked so that it does not deal with this cleanup. The next gen updater will address this cleanup in a different way. For the current updater one thing to keep in mind is that the ref cleanup was disabled some time ago (investigating T194325: https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/437362) and never re-enabled since then. We could imagine disabling values cleanup as well this could give us some room with the current updater.