Changes to RDF formatter URI should be visible in query service
Open, Medium, Public

Description

As a Wikidata editor, I want the query service to reflect the current data in Wikidata. When a formatter URI for RDF resource has been defined, I want it to be consistently applied in the query service, to make it easy to query linked data.

Problem:
When a formatter URI for RDF resource statement is added or edited on an external ID property, the RDF output of all items using that property changes (see Normalized External ID), but the change is not visible in the query service until each affected item is edited, because only then will the query service updater process those items and see the changed value.
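To illustrate why the item revision stays the same while its RDF output changes, here is a minimal sketch of how a normalized value could be derived from a formatter pattern. The `$1` placeholder is the one used in formatter URI statements; the function name, the example pattern and the ID value are hypothetical, and any URL-encoding of the ID is left out.

```python
from typing import Optional

PLACEHOLDER = "$1"  # placeholder used in formatter URI patterns


def normalized_uri(formatter_uri: Optional[str], external_id: str) -> Optional[str]:
    """Return the normalized (wdtn-style) URI, or None if no formatter is defined.

    Simplified assumption: the ID is substituted as-is, without encoding.
    """
    if formatter_uri is None:
        return None
    return formatter_uri.replace(PLACEHOLDER, external_id)


# Hypothetical pattern and ID value, for illustration only.
value = "60585"
before = normalized_uri(None, value)                            # no formatter yet
after = normalized_uri("https://example.org/taxon/$1", value)   # formatter added

print(before)
print(after)
```

The item holding `value` was not edited between the two calls, yet its output differs, which is exactly the case the updater does not notice.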

Example:
A formatter URI was recently added to TAXREF ID (P3186), but the number of wdtn:P3186 triples is significantly lower than the number of wdt:P3186 triples (user report):

SELECT ?wdtn ?wdt (CONCAT(SUBSTR(STR(100*?wdtn/?wdt), 1, 5), "%") AS ?percent)
WITH {
  SELECT (COUNT(*) AS ?wdtn) WHERE {
    [] wdtn:P3186 [].
  }
} AS %wdtn
WITH {
  SELECT (COUNT(*) AS ?wdt) WHERE {
    [] wdt:P3186 [].
  }
} AS %wdt
WHERE {
  INCLUDE %wdtn.
  INCLUDE %wdt.
}

wdtn     wdt      percent
45744    179235   25.52%

Event Timeline

Lucas_Werkmeister_WMDE renamed this task from Changes to RDF formatter URL should be visible in query service to Changes to RDF formatter URI should be visible in query service. Jul 23 2020, 9:50 AM
Lucas_Werkmeister_WMDE added a subscriber: dcausse.

I assume one way to do this would be to teach the importer what RDF formatter URI statements look like (i.e., it would need to know the special property ID formatter URI for RDF resource (P1921)). Then, when the updater sees a change to this value, it gets a list of affected items (either from the query service or from the links table/API of the repo wiki) and schedules updates for all of them, as if they’d been edited. The biggest problem is probably that this could suddenly schedule a huge number of updates, depending on how widely used the affected property is (imagine editing the formatter URI of VIAF ID).
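A rough sketch of that fan-out, assuming the affected items can be listed lazily and the updater accepts batches; `items_using` and `enqueue` are hypothetical callables standing in for the links table/API lookup and the update queue, not existing updater functions. Batching only bounds how much is handed over at once, it does not reduce the total number of updates.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(ids: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` IDs."""
    it = iter(ids)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def schedule_reprocessing(
    property_id: str,
    items_using: Callable[[str], Iterable[str]],  # hypothetical lookup
    enqueue: Callable[[List[str]], None],         # hypothetical queue
    batch_size: int = 500,
) -> int:
    """Schedule every item using `property_id` for an update; return the count."""
    scheduled = 0
    for batch in batched(items_using(property_id), batch_size):
        enqueue(batch)
        scheduled += len(batch)
    return scheduled


# Usage with in-memory stand-ins for the lookup and the queue.
queue: List[List[str]] = []
total = schedule_reprocessing(
    "P3186",
    lambda pid: (f"Q{n}" for n in range(1, 1201)),
    queue.append,
)
print(total, [len(b) for b in queue])
```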

@dcausse do you think this is at all feasible?

It might be feasible but it's not clear yet how hard it would be.

Listing possible solutions (from the least desirable to the ideal):

  • Reloading the dump is obviously a way to mitigate the issue. We want to make reloads more frequent, but given the time they take it's unlikely that we could do them more often than monthly. For the time being this is the sole workaround we have.
  • Trying to reconcile like the current updater: fetch the "current" version of an item and reconcile it with the graph, based on a list of items to update that would have to be fetched from the links table/API (I'd avoid relying on the query service itself). I'm not a big fan of this solution because it's really unclear what the "current" version of an item is.
  • Ideally we'd like to give an "identity" to the RDF projection of an item. The streaming updater currently uses the revision of the item, but this is clearly not sufficient in this case, since the same item revision may result in different RDF output at different points in time. Sadly the only identity I can think of is quite huge, since it would be the item revision + all its property:revision pairs. Storing such state and detecting the creation/update of a property with a P1921 statement might allow creating a side stream that we could then use to generate diffs for all the entities affected by this change (the formatting logic will have to be implemented in the new updater, but hopefully that should not be too hard). Then we'd have to prioritize the streams to make sure that entity updates take precedence over this kind of update. We might need to add a new flavor to EntityData to extract all the property revisions (something between the default flavor and the dump one). Assuming we have enough space to store the additional state in the Flink pipeline, I believe this might be doable in a streaming fashion and still send minimal diffs to Blazegraph. I probably missed a lot of details though :)
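
The "identity" from the last option could be sketched as below, only to make the idea concrete. All names here are hypothetical, and the state is a plain dictionary rather than anything resembling the actual Flink state.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class ProjectionIdentity:
    """Item revision plus the revision of every property the item uses."""
    item_revision: int
    property_revisions: FrozenSet[Tuple[str, int]]


def identity_of(item_revision: int, property_revisions: Dict[str, int]) -> ProjectionIdentity:
    return ProjectionIdentity(item_revision, frozenset(property_revisions.items()))


def affected_items(
    state: Dict[str, ProjectionIdentity], property_id: str, new_revision: int
) -> List[str]:
    """Items whose stored identity refers to an older revision of the property."""
    return sorted(
        item_id
        for item_id, identity in state.items()
        if any(
            pid == property_id and rev != new_revision
            for pid, rev in identity.property_revisions
        )
    )


# Same item revision, different property revision: the identities differ.
assert identity_of(10, {"P3186": 5}) != identity_of(10, {"P3186": 6})

state = {
    "Q1": identity_of(10, {"P3186": 5}),
    "Q2": identity_of(20, {"P214": 7}),
    "Q3": identity_of(30, {"P3186": 6}),
}
print(affected_items(state, "P3186", 6))
```

The items returned by `affected_items` would feed the side stream, at lower priority than regular entity updates.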
Gehel triaged this task as Medium priority. Sep 15 2020, 7:57 AM