[Bug] Wikidata JSON dumps gets deleted after every new Wikidata dump
Closed, Resolved (Public)

Description

The Wikidata JSON dumps are currently rsynced daily, via cron, from the datasets server to /public/dumps/public/wikidatawiki/entities. However, this conflicts with the rsync job that pushes new dumps to the parent directory: that job treats the directory as a mirror and therefore deletes any files that are not part of the main database dumps.

In effect, the Wikidata JSON dumps are copied to that directory but deleted as soon as new database dumps for wikidatawiki are made available. The JSON dumps are then copied again, only to be deleted about a month later by the main database dumps rsync job. This uses a lot of bandwidth for no gain and blocks T101639.

Looking at the script that runs the rsync job for the main database dumps, it is unlikely to be modified to accommodate JSON dumps in the same directory. I propose that the JSON dumps be pushed to a different directory (such as /public/dumps/wikibase), just like the other miscellaneous files we have.

Event Timeline

Hydriz raised the priority of this task from to Medium.
Hydriz updated the task description. (Show Details)
Hydriz added subscribers: Hydriz, Aklapper.
Lydia_Pintscher renamed this task from Wikidata JSON dumps gets deleted after every new Wikidata dump to [Bug] Wikidata JSON dumps gets deleted after every new Wikidata dump. Aug 17 2015, 3:51 PM
Lydia_Pintscher moved this task from incoming to consider for next sprint on the Wikidata board.
Lydia_Pintscher set Security to None.
JanZerebecki raised the priority of this task from Medium to High. Sep 3 2015, 10:47 AM
Hydriz claimed this task.

The most recent Wikidata dump (October 2, 2015) was copied over successfully without this issue occurring, possibly due to a change in the way dumps are copied over to Labs.

Clearly it hasn't been fixed.

So the solution here is to have them in /public/dumps/wikibase?
/public/dumps/wikidata would probably make more sense?

Putting it in /public/dumps/wikibase made more sense to me as it uses the name of the extension, which I felt would be more accurate in this context. As long as it's not mixed with the usual XML dumps, it's fine.

It seems that what would need to be changed is modules/dataset/files/labs/labs-rsync-cron.sh and specifically the line

do_rsync "wikibase/wikidatawiki/" "public/wikidatawiki/entities/"

as soon as folks agree on where to put the files. And I suppose something needs to be set up in puppet for labs making sure that location exists.
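
For illustration only, the change might look roughly like the sketch below. The do_rsync function here is a hypothetical stand-in that just prints its arguments, since the real helper in labs-rsync-cron.sh is not quoted in this task. DEST_ROOT is assumed from the path in the description, and the new destination is an assumption pending agreement on the directory name.

```shell
#!/bin/sh
# Hypothetical stand-in for do_rsync: the real helper in labs-rsync-cron.sh is
# not shown in this task, so this version only prints what it would copy.
DEST_ROOT="/public/dumps"   # assumed from the path in the task description

do_rsync() {
    # $1: source path on the datasets server, $2: destination below DEST_ROOT
    echo "would rsync $1 -> $DEST_ROOT/$2"
}

# Current call: the destination sits inside the mirrored public/ tree,
# so the main database dumps job removes it again.
do_rsync "wikibase/wikidatawiki/" "public/wikidatawiki/entities/"

# Possible replacement (destination is an assumption, pending agreement):
# a sibling of public/, outside the mirrored tree.
do_rsync "wikibase/wikidatawiki/" "wikibase/wikidatawiki/"
```

The point is only that the second argument moves out of public/, so the mirror job for the main dumps no longer touches it.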

Hydriz claimed this task.
Hydriz moved this task from Incoming to Done on the Datasets-Archiving board.

This was resolved quite a while ago, but this task was not updated.