If we kept a map of the latest revision IDs for all items we've recently updated (derived not from events but from the actually fetched data sent to the database), we could eliminate many stale updates, especially when we're catching up after lag. The first event mentioning an item would trigger a fetch of the latest revision, and all subsequent events for that item carrying older revisions would simply be ignored.
Right now we do something like that within a batch, and we again match the revision IDs against the database after the fetches, but keeping the map would let us do it cross-batch and eliminate the unnecessary fetches entirely. That would largely solve the problem of excessive fetches while the cache is active, since each item would be fetched only once per backlog. With a proper data structure (SparseArray, maybe?) we could keep a lot of history relatively cheaply, since we only need one 64-bit int per item. This probably won't work for changes that lack a revision ID, such as deletes, but we could either ignore those (they are relatively rare) or fall back to timestamps (dangerous).
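A minimal sketch of what the cross-batch filter could look like, assuming Java. All names here (RevisionFilter, shouldFetch, recordWritten) are hypothetical, not existing code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a cross-batch revision filter.
// The map records the latest revision we actually fetched and sent
// to the database, so later events with older revisions can be skipped.
public class RevisionFilter {
    // entity id -> latest revision known to be written to the database
    private final Map<String, Long> latestWritten = new ConcurrentHashMap<>();

    // Returns true if the event may carry newer data and a fetch is needed.
    public boolean shouldFetch(String entityId, long eventRevision) {
        Long known = latestWritten.get(entityId);
        return known == null || eventRevision > known;
    }

    // Record the revision we actually fetched and wrote, keeping the max
    // so out-of-order recordings cannot move the watermark backwards.
    public void recordWritten(String entityId, long fetchedRevision) {
        latestWritten.merge(entityId, fetchedRevision, Math::max);
    }
}
```

With this, the first event for an item passes shouldFetch, and once the fetched revision is recorded, all following events at or below that revision are dropped without touching the database.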
It's a bit risky, since we'd be basing update decisions on non-database information: if the database somehow fails an update that we believe succeeded, we'd wrongly drop the subsequent updates. But I think that risk is acceptable, and since the map would be ephemeral, it would be gone after a restart anyway.
We could optimize it by keeping the map only for Q-ids; we could then use integer keys, and 2G of integer key space would last us for a while. Or it might be more efficient to use a regular map wrapped in a cache structure with built-in eviction support, so the map stays bounded.
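For the bounded-map variant, one cheap option in plain Java is LinkedHashMap in access order with removeEldestEntry, which gives LRU eviction without an external caching library. A sketch, with all names hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a bounded revision map keyed by numeric Q-id,
// evicting the least-recently-used entry once it exceeds maxEntries.
public class BoundedRevisionMap extends LinkedHashMap<Long, Long> {
    private final int maxEntries;

    public BoundedRevisionMap(int maxEntries) {
        // accessOrder=true makes iteration order LRU rather than insertion order
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Long, Long> eldest) {
        // Drop the oldest entry whenever the map grows past the cap
        return size() > maxEntries;
    }
}
```

Since each entry is just two boxed longs, even millions of recently touched Q-ids would stay fairly cheap, and eviction only forgets items we haven't seen lately, which at worst costs us one redundant fetch.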