Figure out how to detect that claim is updated
Closed, Resolved, Public

Description

I was under the impression that each time a claim is updated, a new ID is generated, so I could just match IDs to find the claims that need updating. However, I have discovered the following: for item Q24517, the old claim in my dump is:

"P279":[{"id":"Q24517$7A855A08-5DD0-41A6-9A36-E3AC3DE24B11","mainsnak":{"snaktype":"value","property":"P279","datatype":"wikibase-item","datavalue":{"value":{"entity-type":"item","numeric-id":2095},"type":"wikibase-entityid"}}

However, in the current dump at https://www.wikidata.org/wiki/Special:EntityData/Q24517.json it is:

"P279":[{"id":"Q24517$7A855A08-5DD0-41A6-9A36-E3AC3DE24B11","mainsnak":{"snaktype":"value","property":"P279","datatype":"wikibase-item","datavalue":{"value":{"entity-type":"item","numeric-id":2207288},"type":"wikibase-entityid"}}

As we can see, the claim ID is the same but the claim refers to a different node. That makes it hard to recognize when a claim must be updated. So I'd like to figure out:

  1. Is it intentional or a bug?
  2. If it's intentional, can it be changed to generate new IDs on change?
  3. If not, what would be the best way to recognize when a claim changes?

The items can have many claims, so knowing which ones changed and updating only those would greatly speed up the query service function.

Alternatively, if there's some other format that is better suited to loading updates, we may want to use that instead of JSON data.

Event Timeline

Smalyshev claimed this task.
Smalyshev raised the priority of this task to High.
Smalyshev updated the task description.
Smalyshev changed Security from none to None.

Statement IDs are GUIDs (with the Item ID prefixed), and they do not change when the Statement changes (otherwise, they would be hashes, not IDs - References are currently handled by hash). This is intentional and necessary to be able to discuss Statements as such.

Internally we use hashes for this kind of comparison - I suppose including hashes in the JSON dumps might be nice. But for now, you could just keep a map of GUID -> hash somewhere, and when reading the next dump, re-compute and compare the hash for each statement.
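The GUID -> hash map suggested above could look roughly like the sketch below. It assumes the entity JSON layout shown in the description (a "claims" object keyed by property, each holding a list of statements with an "id"), and it hashes the canonical JSON of each statement with SHA-1. That hash is only an illustration and is not the hash Wikibase computes internally; the function names are hypothetical.

```python
import hashlib
import json


def statement_hash(statement):
    """Hash a statement's content, ignoring its GUID.

    Illustrative only: this is NOT Wikibase's internal statement hash.
    """
    content = {k: v for k, v in statement.items() if k != "id"}
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def index_statements(entity):
    """Map every statement GUID in an entity to its content hash."""
    return {
        statement["id"]: statement_hash(statement)
        for statements in entity.get("claims", {}).values()
        for statement in statements
    }


def diff_statements(old_index, new_index):
    """Return (added, changed, removed) sets of GUIDs between two indexes."""
    old_ids, new_ids = set(old_index), set(new_index)
    changed = {g for g in old_ids & new_ids if old_index[g] != new_index[g]}
    return new_ids - old_ids, changed, old_ids - new_ids


GUID = "Q24517$7A855A08-5DD0-41A6-9A36-E3AC3DE24B11"


def p279_statement(numeric_id):
    """Build a statement shaped like the P279 example in the description."""
    return {
        "id": GUID,
        "mainsnak": {
            "snaktype": "value",
            "property": "P279",
            "datatype": "wikibase-item",
            "datavalue": {
                "value": {"entity-type": "item", "numeric-id": numeric_id},
                "type": "wikibase-entityid",
            },
        },
    }


old_entity = {"claims": {"P279": [p279_statement(2095)]}}
new_entity = {"claims": {"P279": [p279_statement(2207288)]}}

added, changed, removed = diff_statements(
    index_statements(old_entity), index_statements(new_entity)
)
print(changed)  # the GUID is unchanged, but its content hash differs
```

Because the whole statement is hashed, a change to references or qualifiers also counts as a change here; hashing only the main snak and qualifiers would be an alternative if reference edits should be ignored.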

With regard to the Statement's GUID staying stable: there might be some wiggle room here on the data model level: we might change the GUID when the main Snak or Qualifiers (the "claim") change. But adding a Reference shouldn't change the ID. This needs some thought though - the GUID allows a statement to be referenced and discussed across multiple revisions. This is useful for things like tracking which statement violates which soft-constraint, etc.

We will probably have to switch to using content hashes to identify changes.

We'll use content hash instead of claim ID to detect changes. We'll also use lastrevid on the item to track revisions.
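A minimal sketch of how the two checks might combine: compare lastrevid first to skip untouched items cheaply, and only then compare content hashes per statement. The shape of the stored record, the function names, and the SHA-1-over-JSON hash are assumptions made for illustration, not the updater's actual implementation.

```python
import hashlib
import json


def content_hash(statement):
    """Illustrative content hash of a statement (GUID excluded)."""
    content = {k: v for k, v in statement.items() if k != "id"}
    canonical = json.dumps(content, sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def plan_update(stored, entity):
    """Decide which statements need reloading.

    stored is a hypothetical per-item record kept by the updater:
        {"lastrevid": int, "hashes": {guid: hash}}
    Returns (new_record, stale_guids).
    """
    # Same revision: nothing on the item changed, skip hashing entirely.
    if entity["lastrevid"] == stored["lastrevid"]:
        return stored, set()

    hashes = {
        s["id"]: content_hash(s)
        for group in entity.get("claims", {}).values()
        for s in group
    }
    # New or modified statements, plus statements that disappeared.
    stale = {g for g, h in hashes.items() if stored["hashes"].get(g) != h}
    stale |= set(stored["hashes"]) - set(hashes)
    return {"lastrevid": entity["lastrevid"], "hashes": hashes}, stale
```

For example, calling plan_update with a record whose lastrevid matches the entity returns the record unchanged and an empty set, so items that were not edited cost only one integer comparison.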