Page MenuHomePhabricator

Match 700,000 edits to their revision ID
Open, LowPublic

Description

The Hashtags tool has a database of edit data for edits which use a #hashtag in the edit summary. When building the tool I didn't properly consider the difference between a recentchanges ID and a revision ID, and recorded the recentchanges ID.

I now have a database of 700,000 edits, for which I've stored username, page title, edit summary, timestamp, and recentchanges ID. I'd like to update this so that each edit has the infinitely more useful revision ID stored too. Work on this has been completed for edits going forward, but I still have all this data that needs to have revision IDs attached retroactively.

It would be relatively straightforward to do 700,000 individual database queries on the Wiki replicas to match edit metadata and retrieve a revision ID, but I'm not sure if this is excessive usage of the databases (by my estimates this would take approximately 20 hours to run). An additional complication is that for some % of edits in my testing, the timestamp in the revision table appeared to be 1 second off what it was in the recentchanges table, though I haven't explored that fully. If individual queries per-edit isn't feasible, I'm not sure what the best way to go about doing this matching is. Advice appreciated.

Event Timeline

Samwalton9-WMF added a project: Data-Services.

Adding Data-Services in the hope that someone can provide advice :)

What quer(ies) are you planning to use to find the data?

It would be relatively straightforward to do 700,000 individual database queries

It would almost certainly be quicker doing batches of queries... But it all depends on how you're going to find the revisions you're looking for

A quick look suggests you might be missing the page id (can do a join if necessary).. Which would be useful for querying revision with the id and timestamp to get the rev id...

CREATE INDEX /*i*/page_timestamp ON /*_*/revision (rev_page,rev_timestamp);

What quer(ies) are you planning to use to find the data?

I'd been poking around with something simple like

SELECT rev_id FROM revision
WHERE rev_timestamp = "{timestamp}"
AND rev_user_text = "{user}"

I'm very much an SQL novice so there's likely a better way of doing this... In particular, of the 200 edits I tested with this, 77 weren't matched. A spot check implies that the revision table has a different (approx one second) timestamp for that edit. Not sure what the source of that issue is, because the timestamp was stored directly from the recentchanges feed, but see for example https://quarry.wmflabs.org/query/33851, which doesn't find the edit with the logged timestamp ("20180831213343"), instead finding it one second later ("20180831213344").

I wonder if perhaps it would be a more sensible approach to throw this database out entirely and recollect the data from scratch from the revision table.

Might be worth filing a task for the drift if there isn’t one already. I’ve not checked the reason why, but I’m guessing it’s because of the time stamps being generated at he point of insertion to the different tables at slightly different times

Definitely worth checking

It’s upto you. If you don’t care about the specific data, it might be easier to just recollect it. Rather than potentially messing around to match up a few extra rows at the end. Might be less labour overall

Beyond that, it’s doing a substring of the date, and trying to match. Which might get slightly inaccurate at times

> but I'm not sure if this is excessive usage of the databases

Don't worry about resource usage until we worry for you, and when that happens, we will be telling you.

but I’m guessing it’s because of the time stamps being generated at he point of insertion to the different tables at slightly different times

Yeah, in an ideal world, because they are inserted on the same transaction, now() should see the time freezed at the beginning of the transaction (or the first query), but because client-based timestamps are used and strings calculated on client side, those doesn't necessarily match.

I can indeed see 780K mismatched timestamps on enwiki, out of 8 million recent change rows:

SELECT * FROM recentchanges JOIN revision ON rc_this_oldid = rev_id WHERE rc_timestamp <> rev_timestamp;
780576 rows in set (2 min 53.30 sec)
Samwalton9-WMF lowered the priority of this task from Medium to Low.Jul 5 2019, 11:51 AM