Order mediawiki_history dumps by event_timestamp
Closed, Resolved | Public | 3 Estimated Story Points

Description

A user of the dumps reported that events are not always in order, which makes processing harder. Checking with @mforns whether there's a reason for this or whether it's just an oversight.

Event Timeline

mforns added a project: Analytics-Kanban.
mforns moved this task from Next Up to In Progress on the Analytics-Kanban board.

Indeed, events are not strictly in order.
For large and medium wikis, whose dumps are split into monthly or yearly files,
each file contains only events from the specified period.
But within each file, the order is not guaranteed.
See: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/MediawikiHistoryDumper.scala#L203
The repartitioning of events shuffles any prior order that the data might have had.
We should add an order to the repartitioning or a sort operation post-repartitioning,
so that events are strictly ordered by timestamp, even within dump files.
Working on this!
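
To illustrate the point above, here is a minimal plain-Python sketch of why regrouping rows by a key can lose an existing order, and how a per-partition sort restores it. It is not the Spark code in MediawikiHistoryDumper.scala: the `repartition` and `sort_within_partitions` helpers, the event tuples and the `arrival_order` list are all hypothetical, chosen only to mimic blocks of rows arriving at an output file in an arbitrary order.

```python
from collections import defaultdict


def repartition(upstream_blocks, key_fn, arrival_order):
    """Group rows by key, appending blocks in the order they happen to arrive."""
    out = defaultdict(list)
    for block_index in arrival_order:
        for row in upstream_blocks[block_index]:
            out[key_fn(row)].append(row)
    return dict(out)


def sort_within_partitions(partitions, sort_key):
    """Sort every output partition independently (no global sort needed)."""
    return {key: sorted(rows, key=sort_key) for key, rows in partitions.items()}


def is_sorted(rows, sort_key):
    """True if rows are in non-decreasing order of sort_key."""
    return all(sort_key(a) <= sort_key(b) for a, b in zip(rows, rows[1:]))


def year_of(row):
    return row[0][:4]


def timestamp_of(row):
    return row[0]


# Hypothetical events: (timestamp, event_id), globally sorted by timestamp.
events = [
    (f"{year}-{month:02d}-01 00:00:00", f"{year}{month:02d}")
    for year in (2019, 2020)
    for month in range(1, 7)
]

# Four upstream blocks of three consecutive events each.
upstream_blocks = [events[i:i + 3] for i in range(0, len(events), 3)]

# One possible (assumed) arrival order of the blocks at the output files.
shuffled = repartition(upstream_blocks, year_of, arrival_order=[3, 1, 2, 0])
fixed = sort_within_partitions(shuffled, timestamp_of)

for year in sorted(shuffled):
    print(year,
          "sorted before:", is_sorted(shuffled[year], timestamp_of),
          "sorted after:", is_sorted(fixed[year], timestamp_of))
```

Each yearly group still contains only events from its own year after the regrouping, but the rows inside it are out of order until the per-partition sort is applied.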

Change 602343 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery/source@master] Sort mediawiki history dumps by timestamp

https://gerrit.wikimedia.org/r/602343

I tested the change and it works, data looks good.

But most interestingly: ordered data is about 20% smaller after compression!
This makes sense, because ordering helps bz2 group similar fields/records and compress them together,
whereas unordered events have higher entropy overall.

Double win! \o/
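
The compression effect can be tried on synthetic data with a small sketch like the one below. The row layout (timestamp, two constant fields and an increasing id) is an assumption made for illustration and is not the real dump schema, so the size difference it shows should not be expected to match the roughly 20% observed on the actual dumps.

```python
import bz2
import random


def make_rows(n):
    """Build n synthetic TSV rows, one event every 37 seconds, in timestamp order."""
    rows = []
    for i in range(n):
        day, rem = divmod(i * 37, 86400)
        hours, rem = divmod(rem, 3600)
        minutes, seconds = divmod(rem, 60)
        ts = f"2020-01-{day + 1:02d} {hours:02d}:{minutes:02d}:{seconds:02d}"
        # The trailing id grows with time, like a revision id would (assumption).
        rows.append(f"{ts}\trevision\tcreate\t{100000 + i}\n")
    return rows


def compressed_size(rows):
    """Size in bytes of the rows once concatenated and bz2-compressed."""
    return len(bz2.compress("".join(rows).encode("utf-8")))


sorted_rows = make_rows(10000)

shuffled_rows = list(sorted_rows)
random.Random(0).shuffle(shuffled_rows)  # same rows, arbitrary order

sorted_size = compressed_size(sorted_rows)
shuffled_size = compressed_size(shuffled_rows)

print("sorted:  ", sorted_size, "bytes")
print("shuffled:", shuffled_size, "bytes")
```

Both inputs contain exactly the same rows, so any difference in compressed size comes from the row order alone.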

Milimetric moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

Change 602343 merged by jenkins-bot:
[analytics/refinery/source@master] Sort mediawiki history dumps by timestamp

https://gerrit.wikimedia.org/r/602343

Nuria set the point value for this task to 3.

Sorry to bother you, but I am using the dumps and I see the same problem on gnwiki, a really small Wikipedia whose dump is all in one file. The rows are not sorted by timestamp, and the last ones are from 2010. Maybe the problem is fixed for other sizes but not for all-in-one-file dumps. Would you please take a look at it? Thank you.

Thanks @marcmiquel for letting us know! I'm reopening this and will investigate.

JAllemandou claimed this task.
JAllemandou moved this task from Done to In Progress on the Analytics-Kanban board.

You're welcome.

On Tue, 18 Aug 2020, 18:21, JAllemandou <no-reply@phabricator.wikimedia.org> wrote:

JAllemandou claimed this task. View Task
https://phabricator.wikimedia.org/T254233

By the way, I just found that rows are not sorted in the cawiki files for 2011 and 2016 (I've just checked the 2016 one now). That file starts with timestamps from 2016-01, continues with 2016-04, and then goes back to 2016-01. Since this happens with a one-file dump like gnwiki and with one-file-per-year dumps, I guess the bug might also affect one-file-per-month dumps, as is the case for enwiki.
Please let me know when it is fixed, as I rely on it for creating a dashboard.
Thank you very much.
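
For anyone wanting to reproduce this kind of check, a minimal sketch follows. It assumes the dump is tab-separated, that event_timestamp sits in the fourth column (index 3), and that timestamps use a single format that sorts lexicographically; all three are assumptions to verify against the dump documentation. The function name `first_out_of_order` and the sample rows are made up for illustration.

```python
from typing import Iterable, Optional, Tuple


def first_out_of_order(
    lines: Iterable[str], ts_col: int = 3, sep: str = "\t"
) -> Optional[Tuple[int, str, str]]:
    """Return (line_number, previous_ts, current_ts) for the first row whose
    timestamp is smaller than the previous row's, or None if all rows are in order."""
    previous = None
    for line_number, line in enumerate(lines, start=1):
        current = line.rstrip("\n").split(sep)[ts_col]
        if previous is not None and current < previous:
            return (line_number, previous, current)
        previous = current
    return None


# Made-up sample rows: wiki, entity, event type, timestamp.
ordered = [
    "cawiki\trevision\tcreate\t2016-01-03 10:00:00\n",
    "cawiki\trevision\tcreate\t2016-01-09 08:30:00\n",
    "cawiki\tpage\tcreate\t2016-04-02 12:00:00\n",
]
unordered = [ordered[0], ordered[2], ordered[1]]

print(first_out_of_order(ordered))
print(first_out_of_order(unordered))
```

To run it on a real file, the lines can be streamed without decompressing to disk, for example with `bz2.open(path, "rt", encoding="utf-8")` from the standard library, where `path` is whichever dump file was downloaded.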

Change 621511 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Update mediawiki-history dumper to fix sorting bug

https://gerrit.wikimedia.org/r/621511

Change 621702 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Update mediawiki-history-dumps parameters

https://gerrit.wikimedia.org/r/621702

Change 621511 merged by jenkins-bot:
[analytics/refinery/source@master] Update mediawiki-history dumper to fix sorting bug

https://gerrit.wikimedia.org/r/621511

Change 621702 merged by Mforns:
[analytics/refinery@master] Update mediawiki-history-dumps parameters

https://gerrit.wikimedia.org/r/621702