Create or adapt an API that shows which entities changed since a specific change, instead of supporting only time-based paging. Multiple changes can share the same timestamp.
This might not be needed, as https://www.wikidata.org/w/api.php?action=query&list=recentchanges&rcnamespace=0&rctoponly&rclimit=50 (documentation: https://www.wikidata.org/w/api.php?action=help&modules=query%2Brecentchanges ) already offers part of this functionality.
(Faster access can be gained via MySQL to the recentchanges or wbchanges table directly.)
I think we still need this one (or maybe a different one?). The main remaining problem is that recentchanges is based on time only, not revision. We kind of work around being unable to specify the revision, since trying the same update more than once does not do much harm (we don't do much when we discover we already have the latest revid), but it'd be nicer if we had a proper API that lets us specify the revid where we stopped. A timestamp is not a good ID for revisions since many revisions can happen at the same time.
I think supporting revision for recentchanges would mean two queries, the first one to find the rc_id. So how about using rc_id instead? ( https://www.mediawiki.org/wiki/Manual:Recentchanges_table )
Now it seems that the API already supports that, via rccontinue. Example: https://www.wikidata.org/wiki/Special:ApiSandbox#action=query&list=recentchanges&format=json&rcdir=older&rcprop=timestamp|ids&rclimit=50&rccontinue=20150504185236|214796318&rawcontinue=
Its format is timestamp|rc_id, but the API is designed for that value to be opaque, so I don't know whether we can rely on it.
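For illustration, here is a minimal Python sketch of using the continuation value without relying on its internal format. The helper names `build_rc_params` and `next_continue` are hypothetical, and the sketch assumes the API's default (non-raw) continuation mode, where the response carries a `continue` object that is sent back as-is; the sandbox example above uses `rawcontinue`, which returns the value in a different place.

```python
from typing import Optional

API_URL = "https://www.wikidata.org/w/api.php"


def build_rc_params(cont: Optional[dict] = None, limit: int = 50) -> dict:
    """Build query parameters for list=recentchanges (hypothetical helper)."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "format": "json",
        "rcdir": "older",
        "rcprop": "timestamp|ids",
        "rclimit": str(limit),
    }
    if cont:
        # Send the continuation values back exactly as received.
        # Nothing here parses or constructs the timestamp|rc_id string.
        params.update(cont)
    return params


def next_continue(response: dict) -> Optional[dict]:
    """Return the 'continue' object of a response, or None if there is none."""
    return response.get("continue")
```

The parameters can then be passed to any HTTP client against `API_URL`. Because the value is only ever copied from one response into the next request, a change in its format would not break this code; building the value by hand from a known rc_id would.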
I suspect it's safe to use rccontinue to set the rc_id minimum for now, but the point of these opaque continue operations is that they might change. If we're OK with this just failing at some point, then it's OK.
I'm not super clear on why we can't just do what we were doing, though. For the most part we'd always use rccontinue from the API blindly, but if we restarted the poller we'd just pick up a second or two behind some changes we'd already polled. Getting those changes twice isn't a big deal for us.
The problem is not getting the change twice. The problem is that, since we don't have the ending point, if we ask by time only we can't know whether we have new updates or not. That means that if there are no new updates but the last 5 updates share the same timestamp, the current code will request these 5 updates over and over, instead of sleeping, as it should, until new ones come in.
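One possible client-side mitigation, sketched in Python: remember the highest rc_id already processed and drop anything at or below it, so a batch containing only already-seen changes comes back empty and the poller knows it can sleep. The function name `select_new_changes` is hypothetical, and the sketch assumes that each change dict carries an integer `rcid` (as returned with `rcprop=ids`) and that rc_id values increase as new rows are added.

```python
from typing import Iterable, List, Optional, Tuple


def select_new_changes(
    changes: Iterable[dict], last_rcid: Optional[int]
) -> Tuple[List[dict], Optional[int]]:
    """Return (changes newer than last_rcid, updated last_rcid).

    Assumption: rc_id grows with every new recent-change row, so
    "rcid > last_rcid" means "not processed yet".
    """
    fresh = [c for c in changes if last_rcid is None or c["rcid"] > last_rcid]
    fresh.sort(key=lambda c: c["rcid"])
    new_last = fresh[-1]["rcid"] if fresh else last_rcid
    return fresh, new_last


# Five changes sharing one timestamp, polled twice by time only.
batch = [{"rcid": i, "timestamp": "2015-05-04T18:52:36Z"} for i in range(101, 106)]
first, last_seen = select_new_changes(batch, None)
second, last_seen = select_new_changes(batch, last_seen)
# 'second' is empty, so the poller can sleep instead of reprocessing.
```

This does not remove the repeated request itself, only the repeated processing; avoiding the request would still need a server-side way to ask for changes after a given rc_id.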