My Toolforge tool, `archive-things`, is currently downloading 7 GB of data per day, when it shouldn't be downloading anything. Its purpose is to periodically archive some frequently changing web pages to the Internet Archive, using Save Page Now (`https://web.archive.org/save/$url`). (At least some of this archival is unnecessary, but I haven't had the time to make it more efficient.)
This data is being downloaded to Toolforge because the Internet Archive has been returning 404 errors for all header-only (HEAD) requests to `/save/*` since about Sunday morning. By default, `--spider` appears to request only the page header, which now returns a 404, so I have had to drop the flag.
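To illustrate (the URL below is a placeholder rather than one of the tool's real targets, and this assumes the behaviour has not changed since Sunday):

```
# Header-only request: answered with 404, so wget reports a broken link.
wget --spider 'https://web.archive.org/save/https://example.com'

# Full GET request: succeeds, but the entire response body is transferred.
wget -O /dev/null 'https://web.archive.org/save/https://example.com'
```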
For reference, the current code is `cat /data/project/archive-things/urls.txt | xargs -n 1 -P 7 wget -x --max-redirect=0 --retry-connrefused -t 10 -T 2`, which is submitted with `jsub` daily at 17:15 UTC.
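Broken over multiple lines with the flags annotated (same behaviour as the one-liner above):

```
#   -x                    save each response into a hostname-based directory tree
#   --max-redirect=0      do not follow redirects
#   --retry-connrefused   treat "connection refused" as transient and retry
#   -t 10                 up to 10 tries per URL
#   -T 2                  2-second network timeout
cat /data/project/archive-things/urls.txt \
  | xargs -n 1 -P 7 wget -x --max-redirect=0 --retry-connrefused -t 10 -T 2
```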
Unless there is a flag I have missed that would solve this, the only flags I can use to avoid downloads are `--max-redirect` and `--start-pos`. However, `--start-pos` is not available in Toolforge's version of wget (1.15; the current release is 1.19.5, and `--start-pos` was added in 1.16). This means that every page which doesn't redirect is currently being downloaded in full. Updating `wget` would remedy this, and it would also allow the script to retry archival on some HTTP errors (`--retry-on-http-error`, added in 1.19.1).
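For concreteness, the updated job would look something like this. It is only a sketch: the byte offset and the HTTP code to retry on are placeholder values, and it assumes the Save Page Now endpoint honours Range requests.

```
# Requires wget >= 1.19.1. The offset and the 503 code are assumptions.
cat /data/project/archive-things/urls.txt \
  | xargs -n 1 -P 7 wget -x --max-redirect=0 --retry-connrefused -t 10 -T 2 \
      --start-pos=1000000000 \
      --retry-on-http-error=503
```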