
Make sure apps can continue /changes/since where they left off
Closed, ResolvedPublic

Description

Apps must provide a timestamp for GET /changes/since/{timestamp} so that changes starting from that time are shown. This is currently problematic for multiple reasons:

  • if the app wants to continue from the time it last did a similar request, it must rely on its own clock which might be out of sync with the server
  • if the app wants to use the last timestamp returned in the previous request, it cannot do that sanely, because the request returns list changes and list entry changes in separate queues; it must therefore take the older of the two timestamps, which could mean resending many changes from the other queue.
  • either way, the app can miss changes which were inserted in the DB but not committed at the time the app made the previous request.

(See T182706#3933410 for more detail.)

Suggested solution:

  • Add a continue-from timestamp field to every GET /changes/since/{timestamp} and GET /lists/ response, giving the latest timestamp from which it is safe to request a sync. This will be the time at which the server received the request, minus some margin for race conditions / transactions (probably also slave lag?).
    • On continuation, don't return such a timestamp (have the client retain the timestamp it got in the first step). This could be improved in the future, but it's probably not worth it: the server would either have to remember, during the sequence of requests, the point at which the timestamp in the continuation parameter was close enough to real time that an event could be skipped, or apply the same backdating logic to the timestamp inside the continuation parameter.
    • When apps do a full sync (a GET /lists/ request followed by a GET /lists/{id}/entries/ for every list received), they should treat the timestamp returned by GET /lists/ as the last time they have synced the DB. When not receiving any timestamp (on continuation), they should preserve the previous timestamp.
  • Document that clients need to be able to deal with duplicate events when syncing.
  • (optional) Have sync queries use LOCK IN SHARE MODE to make sure there cannot be "submarine transactions" where a change is inserted in the database with a certain timestamp but not visible due to being in an uncommitted transaction, so both the initial request and the followup miss it. This allows reducing the amount of timestamp backdating to 1s (just needs to avoid continuation race conditions). - T187133: Prevent "submarine transactions" in ReadingLists sync
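The suggested client-side flow can be sketched roughly as follows. This is an illustration only: the field names `continue-from`, `continue`, and `changes` are assumptions for the sketch, not the actual API response shape.

```python
def sync(fetch, last_sync_ts):
    """Poll /changes/since, returning all changes plus the next safe
    sync timestamp. `fetch` is any callable that performs one request
    and returns the decoded response dict."""
    changes = []
    next_sync_ts = last_sync_ts
    params = {"since": last_sync_ts}
    while True:
        resp = fetch(params)
        changes.extend(resp["changes"])
        # Only the first response carries continue-from; the client
        # retains it across continuation requests.
        if "continue-from" in resp:
            next_sync_ts = resp["continue-from"]
        if "continue" not in resp:
            break
        params = {"since": last_sync_ts, "continue": resp["continue"]}
    return changes, next_sync_ts
```

On the next poll, the client passes `next_sync_ts` as the `since` value instead of consulting its own clock.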

Event Timeline

First of all, the timestamps actually have second precision, not millisecond like the example suggests (but there was no way to tell that to the Swagger doc generator), so multiple items with the same timestamp shouldn't be unusual. (And once we implement T182052: Batch reading list operations it will be even more frequent.) So either you should be prepared to receive sync events that you have already seen (which shouldn't be too problematic as far as I can see), or we should add a more specific position. (The next field does include the list ID as a tiebreaker, but it is intended to be opaque to the client; and there is no input parameter for the ID.)
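Since timestamps have one-second precision, a client re-requesting from its last-seen timestamp may receive events it has already processed. A minimal idempotent apply step could look like this (the event shape with `type`, `id`, and `updated` keys is an assumption for the sketch):

```python
def apply_changes(local, changes):
    """Apply change events to a local store keyed by (type, id).
    Re-applying an already-seen event is a no-op, so duplicate
    delivery is harmless."""
    for change in changes:
        key = (change["type"], change["id"])
        seen = local.get(key)
        # Skip events no newer than what we already hold.
        if seen is not None and seen["updated"] >= change["updated"]:
            continue
        local[key] = change
    return local
```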

Second, if the clients rely on the timestamp (as opposed to taking it, backdating N seconds and accepting duplicates), we should make sure there cannot be race conditions (where another device has submitted a record with an earlier timestamp but it's still in an uncommitted transaction and so invisible to the SELECT). I think this is the case already but would have to test.

Third, lists and list items are being iterated in parallel; they have different continuation parameters, but only a shared "start from" parameter. If, say, all lists can be returned in the first response but list items take three responses to iterate, and by the end of the third response, we have an older lists.updated and a more recent entries.updated timestamp, then neither of those is really useful - using the older one could mean repeating the last two requests, using the newer one might miss lists created since the first request.

One way to hide all this complexity is to always return a next field and have the client use that for sync, instead of having both a continuation parameter and a timestamp parameter. That way, clients would not have to worry about non-unique timestamps and whatnot; and it seems reasonably easy to implement internally, too.
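One way the opaque next token could work internally (an assumption for illustration, not the actual implementation) is to encode the timestamp together with the row ID as a tiebreaker, so items sharing a second-precision timestamp are never skipped:

```python
import base64
import json

def encode_next(updated, row_id):
    """Pack (timestamp, row id) into an opaque continuation token."""
    payload = json.dumps({"updated": updated, "id": row_id})
    return base64.urlsafe_b64encode(payload.encode()).decode()

def decode_next(token):
    """Server-side inverse: recover the exact resume position."""
    return json.loads(base64.urlsafe_b64decode(token.encode()))
```

The client treats the token as a black box; only the server ever decodes it.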

@Tgr with the timestamp implementation, I would be OK with duplicates. I would re-sync the lists returned instead of assuming it was a deduped list of changes.

The next field would work as well.

Tgr renamed this task from "[Feature Request] Return current timestamp in /changes/since" to "Make sure apps can continue /changes/since where they left off". Jan 30 2018, 7:54 PM
Tgr updated the task description.

I'm not sure of the best approach here. Getting new changes when you have polled for changes before and have a continuation token from that is easy, but what to do when that's not the case? What are the use cases for that? It could be after a full sync (via GET /lists/ and then GET /lists/{id}/entries for each list), in which case the client has a timestamp for the last lists request (or could get a continuation token from there) but does not have anything meaningful for the entries. Are there other use cases for polling changes without a continuation token?

Another thing I haven't thought of is, if the API always returns a next token, how do clients tell if the query can be continued right now? I could add an extra flag but it seems inelegant...

Makes me wonder if we really should have a logical clock instead of last updated timestamps.

So, these are the problematic scenarios I can think of:

1. "Submarine change": update gets missed due to an uncommitted transaction
2. Change gets missed on timestamp conflict (id is a logical clock for inserts but not for updates)
3. Item enumerated twice when changed during continuation
4. Change gets missed due to entry sync happening separately per list
Submarine change (1).png (658×665 px, 50 KB)
Race condition in continuation .png (592×500 px, 43 KB)
Double enumeration when sorting by updated.png (519×500 px, 43 KB)
No continuation token for entry sync.png (742×537 px, 50 KB)

The first would either require backdating every timestamp by a safety margin of 5s (which is the max transaction duration IIRC) and thus sending duplicate events to the client, which is not great; or making the semaphore behavior more aggressive so that any write to a user's list entries locks that user's data for the duration of the transaction (i.e. all or most sync queries should use LOCK IN SHARE MODE); or using some kind of sequence as a logical clock.
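The backdating option amounts to a one-line rule on the server side. A sketch (the 5s margin is the max transaction duration quoted above; the function name is illustrative):

```python
MAX_TRANSACTION_S = 5  # max transaction duration per the discussion above

def safe_sync_timestamp(request_start_epoch):
    """Return the timestamp a client may safely sync from next time.
    Backdating by the margin ensures a "submarine" write, whose
    timestamp predates the request but which was still uncommitted
    when the request ran, cannot be skipped by the next sync."""
    return request_start_epoch - MAX_TRANSACTION_S
```

The cost is that any change committed within the margin window is delivered twice, which is why clients must tolerate duplicates.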

The second can be solved by backdating (it only happens when a change, a subsequent sync, and a subsequent change all happen in the same second) or using a logical clock for continuation.

The third probably should not be fixed (the alternative is to miss changes when doing full sync), just listing here for completeness.

The fourth could be solved by having an endpoint for fetching all entries of all lists in a single (possibly continued) request, maybe? Or just taking the timestamp from the list query and doing a changed-since query with that at the end of the full sync. (That might again result in a bunch of duplication but it's probably not avoidable.)

So I guess the minimum-effort solution is this: all sync-related requests return a timestamp, which is the server time at the start of the request minus the max transaction timeout. The client should be prepared to get duplicated results. When doing a full sync, the client should use the timestamp from the request for the lists (not the list entries) as the "last synced time" of the received data.
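In client terms, the minimum-effort full sync looks roughly like this. The callables and the `continue-from` / `lists` response keys are illustrative assumptions, not the real API:

```python
def full_sync(get_lists, get_entries):
    """Full sync: GET /lists/, then GET /lists/{id}/entries/ per list.
    The timestamp kept for future /changes/since polls comes from the
    *lists* request, already backdated by the server."""
    lists_resp = get_lists()                   # GET /lists/
    last_synced = lists_resp["continue-from"]  # not the entries' timestamps
    entries = {lst["id"]: get_entries(lst["id"])
               for lst in lists_resp["lists"]}
    return {"lists": lists_resp["lists"], "entries": entries}, last_synced
```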

Thanks for putting all this together…

I'm not sure of the best approach here. Getting new changes when you have polled for changes before and have a continuation token from that is easy, but what to do when that's not the case? What are the use cases for that? It could be after a full sync (via GET /lists/ and then GET /lists/{id}/entries for each list), in which case the client has a timestamp for the last lists request (or could get a continuation token from there) but does not have anything meaningful for the entries. Are there other use cases for polling changes without a continuation token?

I see… would a potential solution here be to require clients that "sync" (read: apps) always use the changes API?

The more RESTful endpoints (GETting lists and GETting the items in a list) are mostly for web clients anyway, and it looks like we could just ignore them for clients that store all this information locally?

Another thing I haven't thought of is, if the API always returns a next token, how do clients tell if the query can be continued right now? I could add an extra flag but it seems inelegant...

Do you mean if they try to sync with that token, but there is nothing new to sync, or something else?

@JoeWalsh @Dbrant if we can't do a token and we must have timestamps, does @Tgr's minimum solution seem ok? Does knowing that you are getting duplicates hurt app logic in any way?

So I guess the minimum-effort solution is this: all sync-related requests return a timestamp, which is the server time at the start of the request minus the max transaction timeout. The client should be prepared to get duplicated results. When doing a full sync, the client should use the timestamp from the request for the lists (not the list entries) as the "last synced time" of the received data.

I see… would a potential solution here be to require clients that "sync" (read: apps) always use the changes API?

Yeah, the sync endpoint is basically the lists endpoint plus the entries endpoint (except not filtered to a single list) in parallel, with some small differences (shows deleted items, has a date limit) so modifying that would also be an option.

Do you mean if they try to sync with that token, but there is nothing new to sync, or something else?

Yeah, it's an unnecessary extra request. It's not a serious problem (we could always use an extra flag to say whether there's more or the next field is just there in case the client wants to continue from there some time in the future), I just find it annoying.

Sure, in principle the app can deal with receiving duplicate items in a sync update. The minimum solution sounds fine, as long as the "idle norm" still means receiving an empty set. IOW, the app can receive duplicate items when there's a flurry of updates in the remote database, but in the general case the update will be empty, and won't require the app to perform any local db queries. (Am I understanding correctly?)

Yeah, only updates that the server received in the last few seconds would be duplicated.

Minimum solution with the timestamp seems ok to me.

The fourth could be solved by having an endpoint for fetching all entries of all lists in a single (possibly continued) request, maybe? Or just taking the timestamp from the list query and doing a changed-since query with that at the end of the full sync. (That might again result in a bunch of duplication but it's probably not avoidable.)

^ That kind of endpoint would be useful; perhaps by passing "0" to the since endpoint you would get a paginated list of every list and entry ever created (including deleted).

@Tgr it seems like syncing for apps can be simplified to just using the changes/since endpoint, if we enable this behavior of passing "0" (distant past). The other REST APIs are still needed for Web and other applications, but this should at least solve the issue of getting the first timestamp.

Any issues here?

Change 409731 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/ReadingLists@master] Provide timestamp to continue sync from

https://gerrit.wikimedia.org/r/409731

Change 409731 merged by jenkins-bot:
[mediawiki/extensions/ReadingLists@master] Provide timestamp to continue sync from

https://gerrit.wikimedia.org/r/409731

@Tgr it seems like syncing for apps can be simplified to just using changes since endpoint, if we enable this behavior of passing "0" (distant past). The other REST APIs are still needed for Web and other applications, but this should at least solve the issue of getting the first time stamp.

Filed that as T187135: Allow /changes/since endpoint to take a "since the beginning of time" parameter. I don't think it solves any problem that's not already solved, although it's slightly more convenient to use.