
wikidata-todo tool copies large dump files from dumps.wikimedia.org to /shared/dumps, eating up NFS space
Open, Needs Triage, Public

Description

Did a quick check following the previous task's comments; it seems something is writing lots and lots of very large files:

root@labstore1004:~# cat tools_large_files_20210203.txt  | sort -n | tail -n 10
[...snip..]
75763044 KB /srv/tools/shared/tools/project/.shared/dumps/20210201.json.gz
89481636 KB /srv/tools/shared/tools/project/.shared/dumps/20210104.json.gz
89831120 KB /srv/tools/shared/tools/project/.shared/dumps/20210118.json.gz
[...snip..]

The massive dump files in /data/project/.shared/dumps/ were placed there by the wikidata-todo tool that @Magnus operates. The tool has an update_dumps.php script which appears to parse the HTML index at https://dumps.wikimedia.org/other/wikidata to find dump files and then downloads them to /shared/dumps.
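
For context, a downloader of this shape typically looks something like the following. This is an illustrative sketch only, not the actual update_dumps.php; the regular expression and target path are assumptions based on the description above.

<?php
// Illustrative sketch -- NOT the real update_dumps.php.
// Assumed behaviour: scrape the index page, find the dated *.json.gz links,
// and copy each one into the tool's shared dumps directory.
$index = file_get_contents('https://dumps.wikimedia.org/other/wikidata/');
preg_match_all('/href="(\d{8}\.json\.gz)"/', $index, $matches);

foreach ($matches[1] as $file) {
    $target = '/data/project/.shared/dumps/' . $file;
    if (!file_exists($target)) {
        // This is the step that duplicates ~75-90 GB per dump onto Toolforge NFS.
        copy('https://dumps.wikimedia.org/other/wikidata/' . $file, $target);
    }
}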

These exact files are already mounted on Toolforge hosts at /public/dumps/public/other/wikidata/ via the Dumps NFS shares.

@Magnus, can your tool be modified to use the existing Wikidata dump files from the Dumps NFS share rather than adding this duplicate 259 GB of content to the Toolforge NFS share?
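
One possible change, sketched below under stated assumptions: the mounted path comes from the comment above, while the symlink approach and everything else here is hypothetical and would need adapting to how the tool actually reads /shared/dumps.

<?php
// Hypothetical alternative: point the tool at the dumps already mounted
// read-only on Toolforge instead of re-downloading them.
const DUMPS_NFS  = '/public/dumps/public/other/wikidata';
const TOOL_DUMPS = '/data/project/.shared/dumps';

// Symlink each mounted dump into the directory the tool already reads from,
// so downstream code needs no changes and no Toolforge NFS space is consumed.
foreach (glob(DUMPS_NFS . '/*.json.gz') as $source) {
    $link = TOOL_DUMPS . '/' . basename($source);
    if (!file_exists($link)) {
        symlink($source, $link);
    }
}

Alternatively, the tool could read directly from the mounted path and skip /shared/dumps entirely; either way the duplicate content could then be deleted.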