
Compute page properties information at munge time
Closed, Duplicate · Public

Description

As a maintainer of WDQS, I want to work around T266999 so that I can properly detect what changed on an entity between two revisions instead of relying on the notion of current state.

Some triples in the RDF output, such as page properties, may not depend entirely on the data stored in the entity revision.
This makes it impossible to reconstruct the RDF output of an entity for a particular revision.
To work around these issues, the munger could generate these values on the fly instead of relying on the ones stored in the page properties (\Wikibase\Repo\Content\EntityContent::applyEntityPageProperties):

  • Items:
    • statements: should be easy, as it is the number of statements and can be counted by reading the RDF output
    • sitelinks: similar, as it depends solely on the data of the entity itself
    • identifiers: more delicate, as it depends on the type of the properties being used
  • Properties
    • statements
  • Lexemes:
    • statements
    • senses: can be inferred from the entity content (not generated currently)
    • forms: can be inferred from the entity content (not generated currently)

Overall most of these values can be inferred directly from the entity content at munge time.
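
As a rough illustration of counting at munge time, here is a minimal Python sketch (the munger itself is not written in Python). It assumes the usual Wikidata RDF conventions: statement nodes are linked from the entity through predicates in the `http://www.wikidata.org/prop/` namespace, and sitelinks appear as `schema:about` triples pointing at the entity. The `Triple` type and the `count_page_props` helper are hypothetical names, and the sketch only counts statements attached directly to the entity.

```python
from typing import Dict, Iterable, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Assumed namespaces, following the Wikidata RDF conventions.
ENTITY_NS = "http://www.wikidata.org/entity/"
CLAIM_NS = "http://www.wikidata.org/prop/"  # p: prefix (entity -> statement node)
SCHEMA_ABOUT = "http://schema.org/about"


def is_claim_predicate(predicate: str) -> bool:
    """True for p:Pxxx only, not for prop/direct/, prop/statement/, etc."""
    if not predicate.startswith(CLAIM_NS):
        return False
    return "/" not in predicate[len(CLAIM_NS):]


def count_page_props(entity_id: str, triples: Iterable[Triple]) -> Dict[str, int]:
    """Count statements and sitelinks of one entity from its RDF triples."""
    entity_uri = ENTITY_NS + entity_id
    statements = set()  # distinct statement nodes
    sitelinks = set()   # distinct article URLs
    for t in triples:
        if t.subject == entity_uri and is_claim_predicate(t.predicate):
            statements.add(t.obj)
        elif t.predicate == SCHEMA_ABOUT and t.obj == entity_uri:
            sitelinks.add(t.subject)
    return {"statements": len(statements), "sitelinks": len(sitelinks)}
```

Using sets keeps the counts stable even if the same triple shows up more than once in the input.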

The sole exception is the number of identifiers, which requires knowing which properties have a wikibase:propertyType equal to wikibase:ExternalId (https://w.wiki/jnk).
Currently 5451 properties are identified as such, and it should be possible to deploy a small dataset within the deploy repo containing this information so that the munger can properly infer the number of identifiers at munge time.
To make it stable, this dataset should be append-only, and the date of generation will be matched against the modification date of the entity being munged.
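
One possible reading of the date matching, sketched in Python under explicit assumptions: each dataset entry records the date at which a property was first listed as an external identifier, and an entry only applies to entity revisions modified on or after that date. The `ExternalIdRegistry` class, `count_identifiers`, and the dates used below are all hypothetical and not part of the ticket.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, Tuple


class ExternalIdRegistry:
    """Append-only list of external-id properties (hypothetical format)."""

    def __init__(self, entries: Iterable[Tuple[str, datetime]]) -> None:
        self._listed_since: Dict[str, datetime] = {}
        for property_id, listed_since in entries:
            # Append-only: a later duplicate never moves the date forward.
            previous = self._listed_since.get(property_id)
            if previous is None or listed_since < previous:
                self._listed_since[property_id] = listed_since

    def is_external_id(self, property_id: str, entity_modified: datetime) -> bool:
        """True if the property was already listed when the entity was modified."""
        since = self._listed_since.get(property_id)
        return since is not None and since <= entity_modified


def count_identifiers(
    statement_properties: Iterable[str],
    entity_modified: datetime,
    registry: ExternalIdRegistry,
) -> int:
    """Count statements whose property is an external id.

    statement_properties holds one property id per statement.
    """
    return sum(
        1
        for property_id in statement_properties
        if registry.is_external_id(property_id, entity_modified)
    )
```

For example, with made-up dates:

```python
registry = ExternalIdRegistry([
    ("P214", datetime(2013, 1, 1, tzinfo=timezone.utc)),
    ("P9999", datetime(2020, 11, 1, tzinfo=timezone.utc)),  # hypothetical property
])
props = ["P214", "P214", "P9999", "P31"]
count_identifiers(props, datetime(2020, 10, 1, tzinfo=timezone.utc), registry)  # 2
count_identifiers(props, datetime(2020, 11, 2, tzinfo=timezone.utc), registry)  # 3
```

Because entries are never removed or re-dated, munging the same revision twice yields the same count regardless of when the dataset was last extended.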

AC:

  • the wikibase:statements, wikibase:sitelinks and wikibase:identifiers triples from the Wikibase dumps are ignored and generated on the fly at munge time
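
The acceptance criterion could be sketched as a filter-then-emit step. This is a minimal Python illustration, not the munger's actual code: triples are plain `(subject, predicate, object)` tuples, `replace_page_props` is a hypothetical helper, and which node carries the regenerated triples is left to the caller through the `subject` argument.

```python
from typing import Dict, Iterable, List, Tuple

RdfTriple = Tuple[str, str, str]

WIKIBASE_NS = "http://wikiba.se/ontology#"
PAGE_PROP_NAMES = ("statements", "sitelinks", "identifiers")
PAGE_PROP_PREDICATES = frozenset(WIKIBASE_NS + name for name in PAGE_PROP_NAMES)


def replace_page_props(
    subject: str,
    triples: Iterable[RdfTriple],
    computed: Dict[str, int],
) -> List[RdfTriple]:
    """Drop page-property triples from the input and emit computed ones."""
    # Ignore whatever the dump says about these three predicates.
    kept = [t for t in triples if t[1] not in PAGE_PROP_PREDICATES]
    # Emit the values computed at munge time, in a fixed order.
    for name in PAGE_PROP_NAMES:
        if name in computed:
            kept.append((subject, WIKIBASE_NS + name, str(computed[name])))
    return kept
```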

Event Timeline

Restricted Application added a subscriber: Aklapper.

@dcausse it sounds like we’re attacking the same problem from two different angles? See my recent comments in T145712 :)

@Lucas_Werkmeister_WMDE indeed, thanks for the link, I was not aware of this ticket! :)

I think we agree that most of this data can be computed from the data available in the entity without relying on page properties. The only one that remains difficult is the number of identifiers, as it depends on the properties.

CBogen triaged this task as High priority. Nov 2 2020, 3:13 PM
CBogen moved this task from Incoming to RDF Model on the Wikidata-Query-Service board.