
wikistats: monthly pageview dumps are not bz2 files
Closed, Resolved · Public · BUG REPORT

Description

What happens?:

File cannot be uncompressed.

$ bzip2 -dc pageviews-202106-user.bz2 
bzip2: pageviews-202106-user.bz2 is not a bzip2 file.
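
For anyone who wants to check a downloaded file before trying to decompress it, here is a minimal sketch that looks at the leading magic bytes. It relies on two facts: bzip2 streams start with the bytes `BZh`, and Parquet files start with `PAR1`. The function names `sniff_magic` and `sniff_file` are made up for this example and are not part of any tool.

```python
def sniff_magic(head: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    if head.startswith(b"BZh"):
        # bzip2 stream header: "BZ" signature followed by "h" (Huffman coding)
        return "bzip2"
    if head.startswith(b"PAR1"):
        # Parquet files begin (and end) with the 4-byte marker "PAR1"
        return "parquet"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first four bytes of the file at `path` and classify them."""
    with open(path, "rb") as f:
        return sniff_magic(f.read(4))
```

Calling `sniff_file("pageviews-202106-user.bz2")` on an affected file should then report `parquet` rather than `bzip2`, assuming the file really is Parquet as described below.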

What should have happened instead?:

Either publish these dumps as bzip2 archives (the daily archives *are* correctly published), or document their type and format and possibly change the extension. In this particular case it would probably be
pageviews-202106-user.parquet, because these files are Parquet files with the schema:

{
  "type" : "record",
  "name" : "hive_schema",
  "fields" : [ {
    "name" : "line",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc:

Event Timeline

Restricted Application added a subscriber: Aklapper.

BTW: Parquet compression would be significantly more effective if the line were split into its parts, i.e. with separate fields for wiki code, article, pageId, type, count, and hourly.
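
To illustrate the suggested split, here is a rough sketch. It assumes each `line` value holds six space-separated fields in the order listed above; that layout is an assumption and is not confirmed in this task. `PageviewRow` and `parse_line` are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass
class PageviewRow:
    wiki_code: str
    article: str
    page_id: str      # kept as a string, since the ID may not always be numeric
    access_type: str
    count: int
    hourly: str       # encoded per-hour breakdown, left unparsed here


def parse_line(line: str) -> PageviewRow:
    """Split one dump line into typed fields (assumes six space-separated parts)."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    wiki_code, article, page_id, access_type, count, hourly = parts
    return PageviewRow(wiki_code, article, page_id, access_type, int(count), hourly)
```

With the values held in separate columns like this, a columnar format can encode repetitive fields such as the wiki code and type independently of the article titles.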

My apologies for this. The intended format is bz2, not Parquet. This was clearly a mistake of mine when configuring the job; I am looking into options to regenerate or convert the files in the least disruptive way possible.

(Resetting inactive assignee account)

The fix has been found and data regeneration is under way. It will take a few days to complete, so please be patient :)