
hourly pageview dumps can contain empty title
Closed, Declined · Public

Description

Both new and old format pagecount files can contain lines with an empty title.

e.g. in stat1002:/mnt/data/xmldatadumps/public/other/pagecounts-all-sites/2015/2015-01>

zgrep -P "^ab.*28331" pagecounts-20150127-110000.gz
yields
'ab  1 28331' (notice two consecutive spaces)

As this occurs in both old and new files, this is probably a very old bug.
This may break post-processing scripts that always expect four non-empty fields
(it did break the daily/monthly aggregation script).
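For anyone whose post-processing hits the same problem, here is a minimal sketch of a tolerant line parser. It is not the aggregation script mentioned above. It assumes each line has four space-separated fields (project, title, view count, bytes), and the names `PageCount`, `parse_line` and `split_records` are made up for this example.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class PageCount(NamedTuple):
    """One pagecounts record (field names are illustrative assumptions)."""
    project: str
    title: str
    count: int
    bytes_sent: int


def parse_line(line: str) -> Optional[PageCount]:
    """Parse one line; return None if it is malformed or has an empty field."""
    # Split on single spaces only. str.split() without an argument would
    # collapse the two consecutive spaces and hide the empty title.
    fields = line.rstrip("\n").split(" ")
    if len(fields) != 4:
        return None
    project, title, count, bytes_sent = fields
    if not project or not title:
        return None
    try:
        return PageCount(project, title, int(count), int(bytes_sent))
    except ValueError:
        # Non-numeric (or empty) count or byte field.
        return None


def split_records(lines: Iterable[str]) -> Tuple[List[PageCount], List[str]]:
    """Separate parseable records from rejected raw lines."""
    good: List[PageCount] = []
    rejected: List[str] = []
    for line in lines:
        record = parse_line(line)
        if record is None:
            rejected.append(line)
        else:
            good.append(record)
    return good, rejected
```

Reading a file would then be something like `split_records(gzip.open(path, "rt", encoding="utf-8"))`; keeping the rejected lines rather than dropping them silently makes it easier to see how often this happens.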

I'll patch this in my script, so for now I'm good, but a page title can't be empty, so something got lost.

Event Timeline

ezachte raised the priority of this task from to Medium.
ezachte updated the task description.
ezachte subscribed.

@ezachte: Please associate projects when creating tasks. Assuming this is about Analytics (feel free to correct via "Edit task"). Thanks!

We no longer maintain these datasets. Please take a look at the new pageviews dataset: https://dumps.wikimedia.org/other/analytics/ and specifically https://dumps.wikimedia.org/other/pageviews/