- Go to https://dumps.wikimedia.org/other/pageview_complete/monthly/ and fetch one of the dumps
- Try to decompress it with a command such as bzip2 -d pageviews-202106-user.bz2
What happens?:
The file cannot be decompressed.
$ bzip2 -dc pageviews-202106-user.bz2
bzip2: pageviews-202106-user.bz2 is not a bzip2 file.
What should have happened instead?:
Either publish these dumps as bzip2 archives (daily archives *are* correctly published), or document their actual type and format and possibly change the extension. In this particular case the file name would probably be pageviews-202106-user.parquet, because these files are Parquet files with the schema:
{
  "type": "record",
  "name": "hive_schema",
  "fields": [
    { "name": "line", "type": ["null", "string"], "default": null }
  ]
}
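For anyone who wants to confirm the mismatch without relying on the file extension, here is a minimal sketch that inspects the magic bytes of a downloaded dump. It assumes only the standard markers: bzip2 streams begin with "BZh", and Parquet files begin and end with "PAR1". The helper name sniff_format is hypothetical and used here for illustration only.

```python
import os

BZIP2_MAGIC = b"BZh"     # bzip2 streams start with these three bytes
PARQUET_MAGIC = b"PAR1"  # Parquet files start and end with these four bytes


def sniff_format(path):
    """Guess a file's container format from its magic bytes, ignoring the extension.

    Hypothetical helper for illustration; returns "parquet", "bzip2" or "unknown".
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(4)
        tail = b""
        # Only look for a trailing marker if the file is long enough to hold
        # both the leading and the trailing four bytes.
        if size >= 8:
            f.seek(-4, os.SEEK_END)
            tail = f.read(4)
    if head == PARQUET_MAGIC and tail == PARQUET_MAGIC:
        return "parquet"
    if head.startswith(BZIP2_MAGIC):
        return "bzip2"
    return "unknown"
```

Calling sniff_format("pageviews-202106-user.bz2") on one of the monthly dumps should then show whether the content matches the .bz2 extension or is in fact Parquet, as described above.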
Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc: