Currently the updater picks a date to start updating from based on the data in the RDF repository. It's almost certainly not doing it right. It may be close enough for now, but this task is to make sure it's doing it right and to write a couple of tests around it, with mock dumps and the like.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Manybubbles | T89852 Build a tool for synchronizing Wikidata changes into Blazegraph
Resolved | | Manybubbles | T95194 Validate how the updater determines where it left off
Event Timeline
There's another peculiar problem here. The API is driven by timestamps, but several changes can have the same timestamp. Moreover, if changes a, b and c have the same timestamp, it can happen that we've read a and b but not c. Since we now use the last timestamp in the query, next time we'll get a, b and c, and since b was the last one we saw we'll skip it, but we won't skip a and c. The time after that, we'll skip c (since it was last) but not a and b. In other words, we'll be stuck fetching a, b and c and trying to apply a and b for a while, until a new change comes in. This doesn't look right: we should be able to tell the exact change we saw last and ask the API "give me everything after that". Timestamps are very bad IDs for such things; we should use change IDs instead. If the wiki doesn't have an API for that, we should add one.
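To make the a/b/c scenario concrete, here is a minimal sketch in Python. It is not the updater's code: the `Change` record, the in-memory `FEED`, and the three functions are hypothetical stand-ins for the real API, and it assumes every change carries an ID that increases in the order changes are made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    change_id: int  # assumed: unique and increasing with each change
    timestamp: int  # seconds; several changes may share one value
    title: str


# Hypothetical feed: a, b and c all share timestamp 100.
FEED = [
    Change(1, 100, "a"),
    Change(2, 100, "b"),
    Change(3, 100, "c"),
]


def fetch_from_timestamp(feed, start_ts, limit):
    """Stand-in for a timestamp-driven API: changes with timestamp >= start_ts."""
    return [c for c in feed if c.timestamp >= start_ts][:limit]


def poll_by_timestamp(feed, start_ts, last_seen_id, limit):
    """Query by timestamp and skip only the single change seen last."""
    batch = fetch_from_timestamp(feed, start_ts, limit)
    return [c for c in batch if c.change_id != last_seen_id]


def poll_by_id(feed, last_seen_id, limit):
    """Stand-in for an ID-driven API: everything after the last seen change."""
    return [c for c in feed if c.change_id > last_seen_id][:limit]


def titles(changes):
    return [c.title for c in changes]


# A first poll read a and b, so b (ID 2) was seen last.
second = poll_by_timestamp(FEED, start_ts=100, last_seen_id=2, limit=3)
# b is skipped, but a is applied again; c (ID 3) is now seen last.
third = poll_by_timestamp(FEED, start_ts=100, last_seen_id=3, limit=3)
# c is skipped, but a and b are applied again, and so on until a new change.

# With an ID cursor, each change is returned exactly once.
after_b = poll_by_id(FEED, last_seen_id=2, limit=3)
after_c = poll_by_id(FEED, last_seen_id=3, limit=3)
```

In this sketch the timestamp-based polls keep returning changes that were already applied, while the ID-based polls return c once and then nothing.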
As you pointed out, it's not nice, but if we use timestamps we may only ever skip up to the last seen timestamp - 1. Otherwise we might accidentally skip changes that were made later but are sorted earlier because they happened within the same timestamp.
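One possible way to live with timestamps while still avoiding the re-apply loop is sketched below. This is an assumption, not something either comment proposes: it keeps the rule that only changes strictly older than the last seen timestamp are skipped outright, and additionally remembers the IDs already seen at that timestamp. `Change` and `TimestampCursor` are hypothetical names, and the batch is assumed to be sorted by ascending timestamp.

```python
from typing import Iterable, List, NamedTuple, Set


class Change(NamedTuple):
    change_id: int  # assumed unique per change
    timestamp: int  # seconds; may be shared by several changes


class TimestampCursor:
    """Tracks the last seen timestamp plus the change IDs seen at it."""

    def __init__(self, start_ts: int) -> None:
        self.last_ts = start_ts
        self.seen_at_last_ts: Set[int] = set()

    def filter_new(self, batch: Iterable[Change]) -> List[Change]:
        """Return only changes not yet seen; batch sorted by timestamp."""
        fresh = []
        for change in batch:
            if change.timestamp < self.last_ts:
                # Safe to skip: at most "last seen timestamp - 1".
                continue
            if change.timestamp == self.last_ts:
                if change.change_id in self.seen_at_last_ts:
                    continue  # already applied in an earlier poll
            else:
                # Moved on to a newer second: start a fresh ID set.
                self.last_ts = change.timestamp
                self.seen_at_last_ts = set()
            self.seen_at_last_ts.add(change.change_id)
            fresh.append(change)
        return fresh


cursor = TimestampCursor(start_ts=100)
a, b, c, d = Change(1, 100), Change(2, 100), Change(3, 100), Change(4, 101)

first = cursor.filter_new([a, b])        # a and b are new
second = cursor.filter_new([a, b, c])    # only c is new
third = cursor.filter_new([a, b, c])     # nothing new
fourth = cursor.filter_new([a, b, c, d]) # only d is new
```

The queries still start at the last seen timestamp itself, so a change that lands in the same second after a poll is not lost; the ID set only prevents applying the same change twice.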