
Monthly Wikimedia pageviews dumps can't be decompressed
Closed, Duplicate · Public · BUG REPORT

Description

The Wikimedia pageviews complete dumps include not only the daily pageviews but also the monthly pageviews in a bz2 file (for instance, these are the pageviews for June 2021 by human users, spiders and automated users: https://dumps.wikimedia.org/other/pageview_complete/monthly/2021/2021-06/). However, it is not possible to decompress these monthly files, while there is no problem with the daily bz2 files.

For instance, in Ubuntu 21.04, when executing the command bzip2 -dk pageviews-202106-user.bz2, it returns the following error: is not a bzip2 file.

Event Timeline

Wences updated the task description.
ArielGlenn subscribed.

The monthly files are not in bz2 format; they are in Apache Parquet format (https://en.wikipedia.org/wiki/Apache_Parquet). That's always been the format for the monthly files, as far as I can tell. The file name extension is misleading; tagging Analytics to see what they want to do about that.
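
For anyone who wants to confirm this on a downloaded file, here is a minimal sketch that checks the magic bytes instead of trusting the extension. It relies on two format facts: bzip2 streams begin with the bytes "BZh", and Parquet files begin and end with "PAR1". The function name sniff_format is made up for illustration and is not part of any Wikimedia tooling.

```python
import os

BZIP2_MAGIC = b"BZh"     # bzip2 streams start with these three bytes
PARQUET_MAGIC = b"PAR1"  # Parquet files start and end with these four bytes


def sniff_format(path):
    """Return 'bzip2', 'parquet', or 'unknown' based on magic bytes only.

    Hypothetical helper: it does not validate the rest of the file.
    """
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, os.SEEK_END)
        size = f.tell()
        tail = b""
        # A Parquet file needs room for both the leading and trailing magic.
        if size >= 8:
            f.seek(-4, os.SEEK_END)
            tail = f.read(4)
    if head[:3] == BZIP2_MAGIC:
        return "bzip2"
    if head == PARQUET_MAGIC and tail == PARQUET_MAGIC:
        return "parquet"
    return "unknown"


if __name__ == "__main__":
    import sys

    # Example: python sniff.py pageviews-202106-user.bz2
    for name in sys.argv[1:]:
        print(name, sniff_format(name))
```

If the result is "parquet", the file would need a Parquet reader (for example the pyarrow library, assuming it is installed) instead of bzip2 -d.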

MusikAnimal subscribed.

Pageviews-Anomaly is intended for anomalies with the pageviews data itself, such as unusual spikes in traffic. This task describes an issue with dumps generation.