
Keep global "last seen revision" map for Updater
Open, Medium, Public

Description

If we kept a map of the latest revision IDs for all items we've recently updated (derived not from events but from the data actually fetched and sent to the database), we could eliminate a lot of stale updates - especially when we're catching up after a lag. The first mention of an item would fetch the latest revision, and all following events for that item that carry an older or equal revision would basically be ignored.

Right now we do something like that within a batch, and then match the revision IDs against the database after the fetches - but this way we could do it across batches and eliminate the unnecessary fetches. That would basically solve the problem of too many fetches (while the cache is active), since each item would be fetched only once per backlog. I think with a proper data structure (like SparseArray maybe?) we could keep a lot of history there relatively cheaply (we just need one 64-bit int per item). It probably won't work for changes that lack a revision ID - like deletes - but we could either ignore those (they are relatively rare) or also use timestamps (dangerous).
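A minimal sketch of the idea, to make the check concrete. This is not the code from the Gerrit patch: the class name `LastSeenRevisions` and the methods `needsFetch` and `recordFetched` are hypothetical, and it assumes that events without a revision ID (such as deletes) arrive with a revision of 0 or less and are never skipped.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical cross-batch map of the latest revision fetched per entity. */
final class LastSeenRevisions {
    // entity ID (e.g. "Q42") -> highest revision actually fetched and sent to the database
    private final ConcurrentMap<String, Long> latest = new ConcurrentHashMap<>();

    /** Returns true if an event with this revision still requires a fetch. */
    boolean needsFetch(String entityId, long eventRevision) {
        if (eventRevision <= 0) {
            // Assumption: no usable revision ID (e.g. a delete), so never skip it.
            return true;
        }
        Long seen = latest.get(entityId);
        return seen == null || seen < eventRevision;
    }

    /** Records the revision that was actually fetched, never moving backwards. */
    void recordFetched(String entityId, long fetchedRevision) {
        latest.merge(entityId, fetchedRevision, Long::max);
    }
}
```

During catch-up, the first event for an item triggers a fetch and `recordFetched` stores the revision that came back; later events for the same item with an older or equal revision then fail `needsFetch` and can be dropped before any fetch is made.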

It's a bit risky, since we'd be basing updates on non-database information (i.e. if the database somehow fails the update but we think it succeeded, we'd wrongly drop later updates), but I think it's acceptable, and since the map would be ephemeral, it would be gone after a restart.

We could optimize it by keeping the map only for Q-ids - we could then probably use integer keys, and 2G of integer space would last us for a while. Or maybe it would be more efficient to use a regular HashMap and benefit from built-in cache eviction support.
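One possible shape for the Q-id-only variant with eviction, as a rough sketch rather than the actual implementation. A plain `java.util.HashMap` does not evict entries on its own, so this example uses `LinkedHashMap` in access order with its `removeEldestEntry` hook. The class name `BoundedQidRevisions`, its methods, and the `maxEntries` limit are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded map of last-seen revisions, keyed by the numeric part of a Q-id. */
final class BoundedQidRevisions {
    private final Map<Integer, Long> latest;

    BoundedQidRevisions(final int maxEntries) {
        // accessOrder = true makes the least recently used entry the eldest one.
        this.latest = new LinkedHashMap<Integer, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Long> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the numeric part of a Q-id ("Q42" -> 42), or -1 for anything else. */
    static int keyOf(String entityId) {
        if (entityId == null || entityId.length() < 2 || entityId.charAt(0) != 'Q') {
            return -1;
        }
        try {
            int n = Integer.parseInt(entityId.substring(1));
            return n > 0 ? n : -1;
        } catch (NumberFormatException e) {
            return -1; // non-numeric suffix or out of int range
        }
    }

    /** Records a fetched revision; IDs that are not Q-ids are not tracked. */
    synchronized void record(String entityId, long revision) {
        int key = keyOf(entityId);
        if (key < 0) {
            return;
        }
        Long seen = latest.get(key);
        if (seen == null || seen < revision) {
            latest.put(key, revision);
        }
    }

    /** Returns the last recorded revision, or -1 if unknown or already evicted. */
    synchronized long lastSeen(String entityId) {
        int key = keyOf(entityId);
        Long seen = key < 0 ? null : latest.get(key);
        return seen == null ? -1L : seen;
    }
}
```

An evicted or untracked item simply looks unknown, so the updater would fall back to fetching it, which costs an extra fetch but never drops an update.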

Details

Related Gerrit Patches:
wikidata/query/rdf : master : Add global revisions map

Event Timeline

Restricted Application added a project: Wikidata. · Mar 8 2019, 10:02 PM
Restricted Application added a subscriber: Aklapper.

To clarify: this would all be internal to the updater, right? Because the database already has this information, as far as I’m aware (schema:version for the revision ID and schema:dateModified for its timestamp), but I assume we don’t want to use that.

To clarify: this would all be internal to the updater, right?

Right.

Smalyshev triaged this task as Medium priority. · Mar 20 2019, 8:51 PM
Smalyshev moved this task from Backlog to Next on the User-Smalyshev board.
Smalyshev moved this task from Next to Doing on the User-Smalyshev board. · Mar 22 2019, 6:10 AM

Change 498512 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] Add global revisions map

https://gerrit.wikimedia.org/r/498512