
Stale reads for WDQS Updater
Closed, Resolved, Public

Description

I suspect that the WDQS Updater may be reading RDF data that is older than the latest edit. I am getting log messages like this:

20:53:35.953 [update 1] WARN  o.wikidata.query.rdf.tool.rdf.Munger - Stale revision on Q5576287: change is 797238340, RDF is 683087222
09:23:21.076 [update 4] WARN  o.wikidata.query.rdf.tool.rdf.Munger - Stale revision on Q132990: change is 797548413, RDF is 791644240
20:14:17.327 [update 8] WARN  o.wikidata.query.rdf.tool.rdf.Munger - Stale revision on Q15277881: change is 802706968, RDF is 736343558

This means that the Updater knew the latest revision was 802706968, and yet when it requested the RDF via https://www.wikidata.org/wiki/Special:EntityData/Q15277881.ttl?flavor=dump it got revision ID 736343558.
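To make the comparison concrete, here is a minimal Python sketch that pulls the entity ID and the two revision IDs out of warning lines in the format quoted above and checks whether the fetched RDF is older than the announced change. The function names are hypothetical, and this is not the Munger's actual code; it only illustrates the check that the log message describes.

```python
import re

# Matches the "Stale revision" warnings in the format shown in this task.
STALE_RE = re.compile(
    r"Stale revision on (?P<entity>Q\d+): "
    r"change is (?P<change>\d+), RDF is (?P<rdf>\d+)"
)


def parse_stale_warning(line):
    """Return (entity_id, change_rev, rdf_rev), or None if the line does not match."""
    match = STALE_RE.search(line)
    if match is None:
        return None
    return (
        match.group("entity"),
        int(match.group("change")),
        int(match.group("rdf")),
    )


def is_stale(change_rev, rdf_rev):
    # The fetched RDF is stale when it is older than the revision
    # that the change event announced.
    return rdf_rev < change_rev
```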

This is problematic because it means that whatever data changed between revisions 736343558 and 802706968 is lost. It is not the result of caching, since each request URL carries an extra timestamp-driven parameter that is different for every request.
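For illustration, a timestamp-driven cache-busting URL could be built roughly like this. The parameter name `nocache` and the helper `entity_rdf_url` are assumptions made for this sketch; the task does not say what the Updater's parameter is actually called.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://www.wikidata.org/wiki/Special:EntityData/"


def entity_rdf_url(entity_id, now_ms=None):
    """Build a Special:EntityData URL that is unique per request."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # "nocache" is a hypothetical parameter name; any value that changes
    # on every request makes the URL unique to URL-keyed caches.
    params = {"flavor": "dump", "nocache": now_ms}
    return f"{BASE_URL}{entity_id}.ttl?{urlencode(params)}"
```

A unique URL only sidesteps caches keyed on the URL. It cannot help if the server answering the request does not yet have the newer revision.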

I wonder if there's any way to ensure we get the freshest RDF and not stale information from Wikidata.

Event Timeline

I wonder if there's any way to ensure we get the freshest RDF and not stale information from Wikidata.

we could add another param to ensure the latest is retrieved?
How often does this happen?
It might be better to only "ensure" the latest is retrieved if it has already failed once? This could possibly be done inside wdqs?
We could also specify a specific rev id to be retrieved as you have that info?
We currently don't enable access to old revisions via any API T40971

We currently don't enable access to old revisions via any API T40971

I don’t follow – isn’t that what https://www.wikidata.org/wiki/Special:EntityData/Q15277881.ttl?flavor=dump&oldid=802706968 does?

Oh, so we do already have that :D

we could add another param to ensure the latest is retrieved?

The problem is that if the DB does not have this revision, it can't be retrieved...

We could also specify a specific rev id to be retrieved as you have that info?

I don't want to get a specific revision, I want to get the latest one. Retrieving a specific revision would cause a significant performance hit, since for 10 quick edits in a row I'd do 10 full updates instead of one.
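The performance argument can be sketched as follows: if a batch of changes is collapsed to one entry per entity, ten quick edits to the same item cost a single fetch of the latest RDF. This is a hypothetical illustration of the idea, not the Updater's actual batching code.

```python
def coalesce_changes(changes):
    """Collapse (entity_id, revision) pairs to the highest revision per entity."""
    latest = {}
    for entity_id, revision in changes:
        # Keep only the newest revision seen for each entity.
        if revision > latest.get(entity_id, -1):
            latest[entity_id] = revision
    return latest
```

Each entity in the result then needs one "give me the latest" fetch, and the recorded revision serves as the minimum revision the response is expected to have.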

Additionally, if the replica does not (yet) have this revision, we get a 404, which is the same response as for a deleted item. That is also not an ideal situation.

Change 478135 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] Repeat updates that come in with stale revision

https://gerrit.wikimedia.org/r/478135

Change 478135 merged by jenkins-bot:
[wikidata/query/rdf@master] Repeat updates that come in with stale revision

https://gerrit.wikimedia.org/r/478135
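As a rough idea of what "repeat updates that come in with stale revision" can look like, here is a minimal Python sketch. It is not the code from the patch above. `process_with_retry`, `fetch_revision` and `max_attempts` are hypothetical names, and a real implementation would presumably wait before retrying so the replica has time to catch up.

```python
from collections import deque


def process_with_retry(changes, fetch_revision, max_attempts=3):
    """Apply changes, re-queueing any whose fetched RDF is older than the change.

    changes: iterable of (entity_id, change_rev)
    fetch_revision: callable returning the revision ID of the fetched RDF
    """
    queue = deque((entity, rev, 1) for entity, rev in changes)
    applied, given_up = [], []
    while queue:
        entity, change_rev, attempt = queue.popleft()
        rdf_rev = fetch_revision(entity)
        if rdf_rev >= change_rev:
            # The RDF is at least as new as the change: safe to apply.
            applied.append((entity, rdf_rev))
        elif attempt < max_attempts:
            # Stale read: put the change back at the end of the queue.
            queue.append((entity, change_rev, attempt + 1))
        else:
            given_up.append((entity, change_rev))
    return applied, given_up
```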

Change 479762 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] Fix delayed updates

https://gerrit.wikimedia.org/r/479762

Change 479762 merged by jenkins-bot:
[wikidata/query/rdf@master] Fix delayed updates

https://gerrit.wikimedia.org/r/479762

Should be resolved now.