pagecounts-ez of month 2020-08 is incomplete
Open, High, Public

Description

The file pagecounts-ez/merged/pagecounts-2020-08-views-ge-5 is incomplete: it stops partway through the Italian-language entries. I need the Portuguese-language entries to update the database of my tools monthly.

We can also see that the file is incomplete by comparing file sizes in https://dumps.wikimedia.org/other/pagecounts-ez/merged/ : the files for the months before 2020-08 are above 12GB, while the 2020-08 file is 8GB. The .bz2 files appear to be complete, judging by their size. However, I use the file directly in Toolforge (I don't need to download it), and my script uses random access (not possible in .bz2 files) to find the pt language in the file, so the uncompressed file is the best option in my case.
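To illustrate why random access matters here, below is a minimal Python sketch of a byte-offset binary search over an uncompressed file. It is not the reporter's script. It assumes the lines are sorted in ascending byte order and that each line begins with a project prefix; the function name and the prefix format are hypothetical.

```python
import os


def first_offset_with_prefix(path, prefix):
    """Return the byte offset of the first line starting with `prefix`, or None.

    Assumption: lines in the file are sorted in ascending byte order.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:

        def line_start_at_or_after(pos):
            # Offset of the first line that begins at or after `pos`.
            if pos == 0:
                return 0
            f.seek(pos - 1)
            f.readline()  # skip the rest of the line containing byte pos-1
            return f.tell()

        def line_at(start):
            f.seek(start)
            return f.readline()

        lo, hi = 0, size
        while lo < hi:
            mid = (lo + hi) // 2
            start = line_start_at_or_after(mid)
            # Past the end of the file counts as "not smaller than prefix".
            if start >= size or line_at(start) >= prefix:
                hi = mid
            else:
                lo = mid + 1

        start = line_start_at_or_after(lo)
        if start < size and line_at(start).startswith(prefix):
            return start
        return None
```

A caller would seek to the returned offset and read lines until the prefix stops matching. This needs only a handful of seeks on a multi-gigabyte file, whereas a .bz2 stream has to be decompressed from the beginning.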

Event Timeline

@Danilo: You should use the bz2-compressed version of the file; it is complete (I checked). The availability of the uncompressed version seems to be a bug.

Compressed files are available on the computation machine stat1007:/srv/dumps/pagecounts-ez/projectviews for 2020-08, with the same size as the ones on the dumps website, but there are no uncompressed files there. I assume the uncompressed data is first written to a raw file in the folder, which rsync copies if it happens to run while the file is there; once the generation job is done, the data is compressed and the raw file deleted. Depending on the timing of data generation and copying, this can lead to incomplete data being copied.

Follow-up: dumps::web::fetches::stat_dumps might be changed to copy only the files that we want, so that temporary ones are not transferred to the labstore nodes.

I will make my script use the bz2 file when the uncompressed file is not complete.
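A minimal Python sketch of such a fallback follows. It is not the actual script. It assumes the .bz2 file sits next to the uncompressed one under the same name plus `.bz2`, that a size threshold is an acceptable completeness check (earlier months were above 12GB), and that lines start with a project prefix; both function names are hypothetical.

```python
import bz2
import os


def open_pagecounts(plain_path, min_plain_size):
    """Open the uncompressed file if it looks complete, else the .bz2 copy.

    Assumption: a file smaller than `min_plain_size` bytes is truncated.
    """
    if os.path.exists(plain_path) and os.path.getsize(plain_path) >= min_plain_size:
        return open(plain_path, "rb")
    # Hypothetical naming: the compressed copy is "<plain_path>.bz2".
    return bz2.open(plain_path + ".bz2", "rb")


def lines_with_prefix(stream, prefix):
    """Yield every line of a binary stream that starts with `prefix`."""
    for line in stream:
        if line.startswith(prefix):
            yield line
```

The bz2 path is a sequential scan, so it is slower than seeking in the uncompressed file, but it never reads truncated data.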

But it may also be a good idea for someone to decompress the complete bz2 file in the same folder and replace the incomplete uncompressed one with it. That would fix the problem for last month's file until the bug that caused it is fixed.
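One way to do that replacement without exposing another partial file is sketched below in Python. This is an illustration, not the procedure used on the dumps hosts. The function name and the `.tmp` suffix are hypothetical, and it assumes the temporary file and the destination are on the same filesystem so that the rename is atomic.

```python
import bz2
import os
import shutil


def decompress_atomically(bz2_path, dest_path, chunk_size=1024 * 1024):
    """Decompress `bz2_path` and replace `dest_path` in a single rename."""
    tmp_path = dest_path + ".tmp"  # hypothetical temporary name
    with bz2.open(bz2_path, "rb") as src, open(tmp_path, "wb") as dst:
        # Stream in chunks so the whole file never has to fit in memory.
        shutil.copyfileobj(src, dst, chunk_size)
    # Readers see either the old file or the complete new one, never a mix.
    os.replace(tmp_path, dest_path)
```

Anything copying the folder while this runs could still pick up the `.tmp` file, so the copy job would need to skip that name.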

NOTE: Talking about pagecounts-ez folder below, not other pageview/pagecount folders.

Things to discuss/fix:

I assume the 2 pages reference data generated using 2 different datasets and could therefore exist in different places. However, no such distinction is made for hourly pagecounts (see https://dumps.wikimedia.org/other/pagecounts-ez/merged/). Shall we merge both projectviews and projectcounts into the same folder to mimic pagecounts?

  • rsync only bz2 files in the merged folder - currently some not-yet-compressed files are synced (possibly partially) to the destination.

Example rsync command (with a trailing / on both source and destination, as is explicitly set in puppet):

rsync --dry-run -v -rt --include '/projectviews/' --include '/projectviews/**' --include '/merged/' --include '*/' --include '*.bz2' --exclude '*' /srv/dumps/pagecounts-ez/ /home/joal/test_rsync_pageview/

fdans triaged this task as High priority.
fdans added a project: Analytics-Kanban.
fdans moved this task from Incoming to Datasets on the Analytics board.
Aklapper added a subscriber: fdans.

(Resetting inactive assignee account)