Indeed, events are not strictly ordered.
For large and medium wikis, whose dumps are split into monthly or yearly files,
each file contains only events from the specified period,
but the order within each file is not guaranteed.
The repartitioning of events shuffles any prior order the data might have had.
We should either add an ordering to the repartitioning or sort after it,
so that events are strictly ordered by timestamp, even within dump files.
Working on this!
The [[ https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#sortWithinPartitions-org.apache.spark.sql.Column...- | sortWithinPartitions ]] function might be of help :)
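To make the idea concrete, here is a rough plain-Python illustration of "partition first, then sort inside each partition". It is not the actual job code: in Spark this would presumably be a `repartition(...)` followed by `sortWithinPartitions(...)`, and the event fields and the month-based partition key below are made up for the example.

```python
from collections import defaultdict


def sort_within_partitions(events, partition_key, sort_key):
    """Group events into partitions, then sort each partition on its own."""
    partitions = defaultdict(list)
    for event in events:
        partitions[partition_key(event)].append(event)
    # Only the order inside each partition is fixed; no global sort is done.
    return {key: sorted(rows, key=sort_key) for key, rows in partitions.items()}


# Hypothetical events, deliberately out of order.
events = [
    {"timestamp": "2016-04-02 10:00:00", "page": "B"},
    {"timestamp": "2016-01-15 08:30:00", "page": "A"},
    {"timestamp": "2016-01-03 12:00:00", "page": "C"},
]

# Assumed partitioning: one partition per month ("YYYY-MM" prefix).
by_month = sort_within_partitions(
    events,
    partition_key=lambda e: e["timestamp"][:7],
    sort_key=lambda e: e["timestamp"],
)
```

Since each output file comes from one partition, sorting inside the partition should be enough to get ordered files without paying for a global sort.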
I tested the change and it works; the data looks good.
But most interestingly: ordered data is about 20% smaller after compression!
Which makes sense, because ordering puts similar fields/records next to each other, so bz2 can compress them together,
whereas unordered events have higher entropy overall.
Double win! \o/
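For anyone curious, the effect can be seen on toy data with a small sketch like the one below. The records are synthetic and deliberately simplified (repeated lines per minute, made-up user and page names), so it only illustrates the direction of the effect and does not reproduce the 20% figure from the real dumps.

```python
import bz2
import random


def make_events(minutes=200, per_minute=50):
    """Build synthetic tab-separated event lines, already ordered by timestamp."""
    events = []
    for m in range(minutes):
        ts = f"2016-01-01T{m // 60:02d}:{m % 60:02d}:00"
        line = f"{ts}\tuser_{m % 7}\tpage_{m % 13}\tedit\n"
        events.extend([line] * per_minute)
    return events


def compressed_size(lines):
    """Size in bytes of the bz2-compressed concatenation of the lines."""
    return len(bz2.compress("".join(lines).encode("utf-8")))


ordered = make_events()
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)  # fixed seed so the run is repeatable

ordered_size = compressed_size(ordered)
shuffled_size = compressed_size(shuffled)
print(f"ordered: {ordered_size} bytes, shuffled: {shuffled_size} bytes")
```

Both inputs contain exactly the same lines; only their order differs.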
Sorry to bother, but I am using the dumps and I see the same problem on gnwiki. This is a really small Wikipedia whose dump is all in one file. The rows are not sorted by timestamp, and the last ones are from 2010. Maybe the problem is fixed for the other dump sizes but not for all-in-one-file dumps. Could you please take a look at it? Thank you.
On Tue, 18 Aug 2020, 18:21, JAllemandou <
email@example.com> wrote:
By the way, I just found that in the cawiki files for years 2011 and 2016 (I've just checked this one now) rows are not sorted. The file starts with timestamps from 2016-01, then continues with 2016-04, and then goes back to 2016-01. Since this happens both with a one-file dump like gnwiki and with a one-file-per-year dump, I guess the bug might also affect one-file-per-month dumps, as is the case for enwiki.
Please, let me know when it is fixed, as I rely on it for creating a dashboard.
Thank you very much.
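In case it helps with reproducing this, here is a minimal sketch of a check that could be run against a single dump file. It assumes the file is a bz2-compressed TSV, that the event timestamp is the fourth column (index 3), and that timestamps compare correctly as plain strings; the function name and the default column index are my own assumptions, so adjust them to the actual schema.

```python
import bz2


def first_out_of_order(path, timestamp_col=3):
    """Return (line_number, previous_ts, current_ts) for the first row whose
    timestamp is earlier than the previous row's, or None if the file is ordered.
    """
    previous = None
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) <= timestamp_col:
                continue  # skip rows too short to hold a timestamp
            current = fields[timestamp_col]
            if previous is not None and current < previous:
                return line_number, previous, current
            previous = current
    return None
```

Calling `first_out_of_order("some-dump-file.tsv.bz2")` (placeholder file name) returns `None` for an ordered file, and otherwise points at the first row that breaks the order.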