
Pagecounts merged archive with incorrect encoding and weird content
Closed, Resolved (Public)

Description

Hi, the January 2015 archive is encoded in ISO 8859-1 rather than UTF-8. It seems to be the only archive with this problem (I downloaded all the archives from January 2015 to December 2018).

It can be downloaded here (file: pagecounts-2015-01-views-ge-5-totals.bz2).
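For anyone who wants to reproduce the encoding observation, here is a minimal sketch in Python. It assumes only that the archive is a bz2-compressed, line-oriented text file; the helper name `first_non_utf8_line` is made up for illustration. It reports the first line that fails to decode as UTF-8. Note that a file containing only ASCII bytes is valid in both encodings, so a `None` result does not prove the file was written as UTF-8.

```python
import bz2


def first_non_utf8_line(path):
    """Return (line_number, raw_bytes) for the first line that is not
    valid UTF-8, or None if every line decodes cleanly.

    Hypothetical helper; assumes a bz2-compressed, newline-delimited file.
    """
    with bz2.open(path, "rb") as f:
        for number, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                return number, raw
    return None
```

Usage would be something like `first_non_utf8_line("pagecounts-2015-01-views-ge-5-totals.bz2")`, run against a local copy of the file.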

Also, the content is odd: the archive contains page counts for articles named -field-empty- or index.html in many projects, but these articles never seem to have existed. This is the first time I've come across anything like this.
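A rough way to quantify these entries is sketched below. It rests on an assumption that is not confirmed in this task: that each line is space-separated with the project code first and the page title second. The function name and the default encoding argument are likewise illustrative.

```python
import bz2
from collections import Counter

# Titles reported in this task as looking bogus.
SUSPICIOUS_TITLES = {"-field-empty-", "index.html"}


def count_suspicious_titles(path, encoding="iso-8859-1"):
    """Count lines whose title field is one of SUSPICIOUS_TITLES.

    Assumption: lines look like "<project> <title> <counts...>",
    separated by single spaces.
    """
    counts = Counter()
    with bz2.open(path, "rt", encoding=encoding) as f:
        for line in f:
            fields = line.rstrip("\n").split(" ")
            if len(fields) >= 2 and fields[1] in SUSPICIOUS_TITLES:
                counts[fields[1]] += 1
    return counts
```

The result is the number of projects (lines) per suspicious title, which gives a sense of how widespread the entries are.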

Can the archive be rebuilt?

Event Timeline

It's the same with the file pagecounts-2015-01-views-ge-5.bz2.

Milimetric moved this task from Incoming to Data Quality on the Analytics board.
Milimetric subscribed.

We don't have the source data any more, so the best we can do is change the encoding. This will be lower priority than our current work.
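Until the archive is re-encoded upstream, a local copy could be converted along these lines. This is only a sketch, not the procedure the Analytics team would use: the function name and parameters are hypothetical, and it assumes the whole file really is ISO 8859-1. If some lines were already UTF-8, they would end up double-encoded.

```python
import bz2


def transcode_bz2(src, dst, source_encoding="iso-8859-1", chunk_chars=1 << 20):
    """Rewrite a bz2 text file from source_encoding to UTF-8.

    Streams in chunks so the archive never has to fit in memory.
    newline="" keeps the original line endings untouched.
    """
    with bz2.open(src, "rt", encoding=source_encoding, newline="") as fin, \
         bz2.open(dst, "wt", encoding="utf-8", newline="") as fout:
        while True:
            chunk = fin.read(chunk_chars)
            if not chunk:
                break
            fout.write(chunk)
```

Because every byte value is a valid ISO 8859-1 character, the read side never raises a decoding error, so the output should be spot-checked by eye rather than trusted blindly.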

Thank you. It seems that none of the archives from before January 2015 are in UTF-8 either.