
Missing updates due to recentchanges backfilling
Closed, ResolvedPublic

Description

I uploaded coordinates (P625) for Maldon (Q26272025) an hour and a half ago (diff).

But when I use WDQS to ASK whether it has coordinates (query), the answer is coming back FALSE -- even though WDQS says it was updated 30 seconds ago.

Similarly with ten other sets of coordinates I added at the same time. I can't spot anything wrong with my query, so is there something wrong with the database?
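For reference, the kind of check described above is a SPARQL ASK against the public WDQS endpoint. A minimal sketch (the exact query linked above may differ):

import requests

# Minimal sketch, not the exact query linked above: ask WDQS whether
# Maldon (Q26272025) has a coordinate location (P625) statement.
WDQS = "https://query.wikidata.org/sparql"
query = "ASK { wd:Q26272025 wdt:P625 ?coord }"

resp = requests.get(WDQS, params={"query": query, "format": "json"})
print(resp.json()["boolean"])  # this was returning False despite the saved edit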

Event Timeline

Restricted Application added a subscriber: Aklapper.

Looks like something weird is going on with the recentchanges stream. Looking at the relevant fragment:

{
type: "edit",
ns: 0,
title: "Q29002776",
pageid: 30662598,
revid: 469529345,
old_revid: 469529329,
rcid: 498001207,
timestamp: "2017-03-22T19:14:14Z"
},
{
type: "edit",
ns: 0,
title: "Q26272025",
pageid: 28223090,
revid: 469529353,
old_revid: 468296753,
rcid: 498001280,
timestamp: "2017-03-22T19:14:14Z"
},
{
type: "edit",
ns: 0,
title: "Q29002776",
pageid: 30662598,
revid: 469529346,
old_revid: 469529345,
rcid: 498001208,
timestamp: "2017-03-22T19:14:15Z"
},

We can see that the entry for Q26272025 has rcid 498001280, which means it was inserted much later than Q29002776 entries, even though the timestamp is the same. Looking at this:

{
type: "edit",
ns: 0,
title: "Q29002778",
pageid: 30662600,
revid: 469529418,
old_revid: 469529412,
rcid: 498001279,
timestamp: "2017-03-22T19:14:27Z"
}

Since its rcid (498001280) comes immediately after this entry's (498001279), the Q26272025 row appears to have been inserted a full 13 seconds later than its own timestamp suggests. The WDQS updater currently uses a backoff time of 10 seconds, which explains how it missed the update: by the time it was processing the rcstream for that time period, the record for Q26272025 did not exist yet, and by the time it was backpatched in, the updater was too far ahead to check back over that time window.
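For illustration, here is a rough sketch of the timestamp-window polling described here; the real updater's parameters and windowing differ, and apart from the 10-second backoff mentioned above everything in this sketch is an assumption:

import datetime
import requests

API = "https://www.wikidata.org/w/api.php"
BACKOFF = datetime.timedelta(seconds=10)  # value mentioned above; the rest of this sketch is assumed

def poll_window(last_polled):
    # Fetch recentchanges between last_polled and (now - BACKOFF).
    now = datetime.datetime.now(datetime.timezone.utc)
    window_end = now - BACKOFF
    params = {
        "action": "query",
        "list": "recentchanges",
        "format": "json",
        "rcdir": "newer",
        "rcstart": last_polled.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "rcend": window_end.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "rcprop": "title|ids|timestamp",
        "rclimit": 500,  # per-request cap, which is why the window cannot simply be made very wide
    }
    changes = requests.get(API, params=params).json()["query"]["recentchanges"]
    # A row backfilled later with an older timestamp (like Q26272025 above) lands
    # before the next poll's rcstart and is never seen again.
    return changes, window_end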

The problem here is that the only real way of reading the recentchanges stream for now is by timestamp. And it turns out entries can be backpatched into it with essentially no time limit, while making the window too wide makes it unworkable for the updater, since it can only load a maximum of 500 entries per request.

So we either need to find a way to reliably get updates without missing anything, or switch to some API that is not based on the recentchanges stream.

Ouch. That sounds quite nasty. The only thing I can think of from the user side that was perhaps slightly different about this set of edits is that they were made with QuickStatements while I already had a different QuickStatements run open and going in another window (a big batch of 10,000 corrections to the capitalization of item labels, which took about two and a half hours to run, all of which went through fine).

So, conceivably, could that have delayed some handshake in the writing of the coordinate statements? I don't have anything solid to support the idea; it's just a complete guess.

I suspect QS might be triggering it, since I've had other instances of the same problem with QS edits. However, as I see it this isn't QS's fault; it's the combination of the recentchanges stream's preferred access pattern being fetch-by-timestamp with the complete unreliability of its timeline.

The way to fix it now is to edit the entries again, or to manually update them (I can do that if you give me the list of IDs). But I'll be working on finding a better solution; I thought a 10-second backoff would be enough, but clearly it isn't.

Thanks for your speedy diagnosis. I've gone through and reverted and then un-reverted the ten edits by hand, so they are now fine.

Hope you can get a bit more rigour instilled into the recentchanges logging.

Smalyshev renamed this task from "Some data not uploaded to WDQS?" to "Missing updates due to recentchanges backfilling". Mar 29 2017, 6:15 PM
Smalyshev triaged this task as Medium priority.

Change 347565 had a related patch set uploaded (by Smalyshev):
[operations/puppet@production] Enable "trailing poller" functionality for production.

https://gerrit.wikimedia.org/r/347565

Change 347565 merged by Gehel:
[operations/puppet@production] Enable "trailing poller" functionality for production.

https://gerrit.wikimedia.org/r/347565

The trailing poller should fix it for now; if it's still not enough, please reopen with the newly affected IDs.
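For readers not familiar with it, the trailing-poller idea is roughly this: keep a second poll that lags well behind the live one and re-reads an already-processed window, so entries backfilled into that window are still picked up, deduplicated by rcid. A sketch under those assumptions; the actual implementation in the WDQS updater differs in detail:

import datetime

TRAIL_LAG = datetime.timedelta(minutes=2)  # assumed lag, long enough for late inserts
seen_rcids = set()                         # rcids already forwarded to the updater

def primary_pass(changes):
    # Live poll: forward every change not seen before.
    fresh = [c for c in changes if c["rcid"] not in seen_rcids]
    seen_rcids.update(c["rcid"] for c in fresh)
    return fresh

def trailing_pass(old_window_changes):
    # Same window fetched again TRAIL_LAG later: anything still unseen was
    # backfilled with an out-of-order rcid and would otherwise have been lost.
    late = [c for c in old_window_changes if c["rcid"] not in seen_rcids]
    seen_rcids.update(c["rcid"] for c in late)
    return late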