My Toolforge tool, `archive-things`, is currently downloading 7 GB of data per day, when it shouldn't be downloading anything. Its purpose is to periodically archive some frequently changing web pages to the Internet Archive, using Save Page Now (`https://web.archive.org/save/$url`). (At least some of this archival is unnecessary, but I haven't had the time to make it more efficient.)
This data is being downloaded to Toolforge because the Internet Archive has been returning 404 errors for all header-only (HEAD) requests to `/save/*` since about Sunday morning. By default, `--spider` appears to request only the page header, which now returns a 404, so I have had to drop the flag.
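To illustrate (the URL below is a placeholder rather than one of the tool's real targets, and this assumes the behaviour has not changed since Sunday):

```
# Header-only request: answered with 404, so wget reports a broken link.
wget --spider 'https://web.archive.org/save/https://example.com'

# Full GET request: succeeds, but the entire response body is transferred.
wget -O /dev/null 'https://web.archive.org/save/https://example.com'
```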
For reference, the current code is `cat /data/project/archive-things/urls.txt | xargs -n 1 -P 7 wget -x --max-redirect=0 --retry-connrefused -t 10 -T 2`, which is submitted with `jsub` daily at 17:15 UTC.
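Broken over multiple lines with the flags annotated (same behaviour as the one-liner above):

```
#   -x                    save each response into a hostname-based directory tree
#   --max-redirect=0      do not follow redirects
#   --retry-connrefused   treat "connection refused" as transient and retry
#   -t 10                 up to 10 tries per URL
#   -T 2                  2-second network timeout
cat /data/project/archive-things/urls.txt \
  | xargs -n 1 -P 7 wget -x --max-redirect=0 --retry-connrefused -t 10 -T 2
```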
Unless there is a flag I have missed that would solve this, the only flags I can use to avoid downloads are `--max-redirect` and `--start-pos`. However, `--start-pos` is not available in Toolforge's version of wget (1.15; the current release is 1.19.5, and `--start-pos` was added in 1.16). This means that every page which doesn't redirect is currently being downloaded in full. Updating `wget` would remedy this, and it would also allow the script to retry archival on some HTTP errors (`--retry-on-http-error`, added in 1.19.1).
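For concreteness, the updated job would look something like this. It is only a sketch: the byte offset and the HTTP code to retry on are placeholder values, and it assumes the Save Page Now endpoint honours Range requests.

```
# Requires wget >= 1.19.1. The offset and the 503 code are assumptions.
cat /data/project/archive-things/urls.txt \
  | xargs -n 1 -P 7 wget -x --max-redirect=0 --retry-connrefused -t 10 -T 2 \
      --start-pos=1000000000 \
      --retry-on-http-error=503
```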