
have API that shows which entities changed since a specific change instead of time
Closed, Resolved (Public)

Description

Create or adapt an API that shows which entities changed since a specific change, instead of only supporting time for paging. Multiple changes can share the same timestamp, so a timestamp alone does not identify a position in the change stream.

Event Timeline

JanZerebecki raised the priority of this task from to Needs Triage.
JanZerebecki updated the task description.
JanZerebecki changed Security from none to None.
JanZerebecki added subscribers: Aklapper, JanZerebecki.

This might not be needed, as the recentchanges API already offers part of this functionality: https://www.wikidata.org/w/api.php?action=query&list=recentchanges&rcnamespace=0&rctoponly&rclimit=50 (documentation: https://www.wikidata.org/w/api.php?action=help&modules=query%2Brecentchanges).

(Faster access can be gained via MySQL to the recentchanges or wbchanges table directly.)

Personally, I don't think we would want a new API module for this.

JanZerebecki closed this task as Invalid. May 4 2015, 1:00 PM
JanZerebecki claimed this task.

The SPARQL endpoint currently doesn't need anything more.

I think we still need this one (or maybe a different one?). The main remaining problem is that recentchanges is based on time only, not revision. We kind of work around being unable to specify the revision, since trying the same update more than once does not do much harm (we don't do much when we discover we already have the latest revid), but it would be nicer if we had a proper API that lets us specify the revid where we stopped. A timestamp is not a good ID for revisions, since many revisions can happen at the same time.

JanZerebecki renamed this task from "create API that shows which entities changed since time/revision x" to "have API that shows which entities changed since a specific change instead of time". May 4 2015, 6:46 PM
JanZerebecki reopened this task as Open.
JanZerebecki updated the task description.

I think supporting revision for recentchanges would mean two queries, first one to find the rc_id. So how about rc_id instead? ( https://www.mediawiki.org/wiki/Manual:Recentchanges_table )

It now seems that the API already supports that, via rccontinue. Example: https://www.wikidata.org/wiki/Special:ApiSandbox#action=query&list=recentchanges&format=json&rcdir=older&rcprop=timestamp|ids&rclimit=50&rccontinue=20150504185236|214796318&rawcontinue=

The format of it is timestamp|rc_id, but the API is designed for that to be opaque, so I don't know if we may just rely on it.
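For illustration, a request like the sandbox example above could be built as follows. This is only a sketch: the helper names `build_rc_params` and `build_rc_url` are made up here, and composing rccontinue by hand assumes the token keeps its current timestamp|rc_id form, which the API does not promise.

```python
from urllib.parse import urlencode

# Endpoint taken from the links earlier in this thread.
API_URL = "https://www.wikidata.org/w/api.php"


def build_rc_params(timestamp, rc_id, limit=50):
    """Return recentchanges query parameters resuming at timestamp|rc_id.

    Hypothetical helper: the parameters mirror the ApiSandbox example.
    """
    return {
        "action": "query",
        "list": "recentchanges",
        "format": "json",
        "rcdir": "older",
        "rcprop": "timestamp|ids",
        "rclimit": str(limit),
        # Assumption: the opaque continue token is currently "timestamp|rc_id".
        "rccontinue": f"{timestamp}|{rc_id}",
        "rawcontinue": "",
    }


def build_rc_url(timestamp, rc_id, limit=50):
    """Return the full request URL with percent-encoded parameters."""
    return API_URL + "?" + urlencode(build_rc_params(timestamp, rc_id, limit))
```

Calling `build_rc_url("20150504185236", 214796318)` yields a URL that carries the same rccontinue value as the example, with the `|` percent-encoded.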

Hmm, if timestamp|rc_id works then it should be enough for us. I'm not sure if it's safe to use rccontinue while not actually continuing but if it is, then it should work. I'll check.

Smalyshev triaged this task as Normal priority. May 6 2015, 11:42 PM
Smalyshev added a project: Discovery.
Smalyshev claimed this task. May 7 2015, 4:15 AM

I suspect it's safe to use rccontinue to set the rc_id minimum for now, but the point of these opaque continue operations is that they might change. If we're OK with this just failing at some point, then it's fine.

I'm not super clear on why we can't just do what we were doing, though. For the most part we'd just always use rccontinue from the API blindly, but if we restarted the poller we'd just pick up a second or two behind some changes we'd already polled. Getting those changes twice isn't a big deal for us.

The problem is not getting the change twice. The problem is that, since we don't have the ending point, if we ask by time only we can't know whether we have new updates or not. That means that if there are no new updates but the last 5 updates were at the same timestamp, the current code will ask for these 5 updates over and over, instead of sleeping, as it should, until new ones come in.
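The difference can be illustrated with a small in-memory sketch. It does not call the real API; the function names and the sample data are invented for illustration, and it assumes rc_id increases along with the timestamp.

```python
def changes_since_time(changes, since_ts):
    """Time-only cursor: everything at or after the timestamp matches again."""
    return [c for c in changes if c["timestamp"] >= since_ts]


def changes_after_cursor(changes, since_ts, since_rc_id):
    """(timestamp, rc_id) cursor: only changes strictly after the last one seen."""
    return [
        c for c in changes
        if (c["timestamp"], c["rc_id"]) > (since_ts, since_rc_id)
    ]


# Five hypothetical changes that all share one timestamp.
changes = [{"rc_id": 100 + i, "timestamp": "20150504185236"} for i in range(5)]
last_seen = changes[-1]

# Polling by time alone returns the same five changes every time.
repeated = changes_since_time(changes, last_seen["timestamp"])

# Polling by (timestamp, rc_id) returns nothing, so the poller can sleep.
fresh = changes_after_cursor(changes, last_seen["timestamp"], last_seen["rc_id"])
```

With the time-only cursor the poller cannot tell "nothing new" from "five changes to process", whereas the composite cursor gives an empty result until a new change arrives.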

Got it. Makes sense to me.

We could just always store the continue parameters from the last API request as it is intended and only after a full dump load do a few unnecessary but harmless updates.
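A minimal sketch of storing the continue parameters between runs might look like the following. The helper names and the state file are hypothetical, and it assumes the response carries a top-level `continue` object (the non-raw continuation format); with `rawcontinue` set, as in the sandbox example, the response layout differs.

```python
import json
from pathlib import Path


def save_continue(state_path, response):
    """Persist the continue parameters of an API response, if it has any."""
    cont = response.get("continue")
    if cont:
        Path(state_path).write_text(json.dumps(cont))


def load_params(state_path, base_params):
    """Return base_params merged with any stored continue parameters."""
    params = dict(base_params)
    path = Path(state_path)
    if path.exists():
        # Stored values are passed back unchanged, treating them as opaque.
        params.update(json.loads(path.read_text()))
    return params
```

After a full dump load there is no state file yet, so `load_params` returns the base parameters unchanged and the first poll may repeat a few harmless updates.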

Smalyshev closed this task as Resolved. May 7 2015, 10:15 PM

I think storing is optional, since a) dump doesn't have it and b) restarts of the updater are rare. Right now rccontinue works fine (@JanZerebecki thanks again for suggesting it!) so I consider this resolved. We can always improve it later if necessary but I think it's good for now.