Please update wget on Toolforge
Closed, Invalid · Public

Description

My Toolforge tool, archive-things, is currently downloading 7 GB of data per day when it shouldn't be downloading anything. Its purpose is to periodically submit some frequently changing web pages to the Internet Archive using Save Page Now (https://web.archive.org/save/$url). (At least some of this archival is unnecessary, but I haven't had the time to make it more efficient.)

This data is being downloaded to Toolforge because the Internet Archive has been returning 404 errors for all header-only (HEAD) requests to /save/* since about Sunday morning. By default, --spider appears to request only the page headers, which now yields a 404, so I have had to turn the flag off.
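One way to confirm that only header-only requests are affected is to compare the status codes for HEAD and GET on the same URL. This is a minimal sketch that assumes curl is available; head_status, get_status and compare_methods are hypothetical helper names, and the GET variant still transfers the body even though it is discarded.

```shell
# HTTP status code for a header-only (HEAD) request to URL $1.
head_status() {
    curl -s -o /dev/null -I -w '%{http_code}' "$1"
}

# HTTP status code for an ordinary GET request to URL $1.
# The body is written to /dev/null, but it is still transferred.
get_status() {
    curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Print both status codes side by side for URL $1.
compare_methods() {
    printf 'HEAD=%s GET=%s\n' "$(head_status "$1")" "$(get_status "$1")"
}

# Example (performs two real requests):
#   compare_methods 'https://web.archive.org/save/https://example.org/'
```

If the two codes differ, the server is treating HEAD requests differently from GET requests.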

For reference, the current main code is a one-line shell script, submitted by cron through jsub daily at 17:15 UTC:

cat /data/project/archive-things/urls.txt | xargs -n 1 -P 7 wget -x --max-redirect=0 --retry-connrefused -t 10 -T 2
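For readers unfamiliar with the flags, the one-liner can be restated as a commented function. This is a sketch rather than the tool's actual script: archive_urls and the WGET override are hypothetical names added for illustration, and the file is read by redirection instead of cat, which should behave the same way.

```shell
# archive_urls FILE: run one wget per URL listed in FILE.
# WGET is a hypothetical override for the wget binary (defaults to "wget").
archive_urls() {
    # xargs -n 1   one URL per wget invocation
    # xargs -P 7   up to seven invocations in parallel
    # -x                   force creation of a directory hierarchy for saved files
    # --max-redirect=0     do not follow redirects
    # --retry-connrefused  treat "connection refused" as a transient error
    # -t 10                up to 10 tries per URL
    # -T 2                 2-second network timeout
    xargs -n 1 -P 7 "${WGET:-wget}" -x --max-redirect=0 \
        --retry-connrefused -t 10 -T 2 < "$1"
}

# Example:
#   archive_urls /data/project/archive-things/urls.txt
```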

Unless there is a flag I haven't figured out how to use which would solve this issue, the only flags I can use to avoid downloads are --max-redirect and --start-pos. However, the --start-pos flag is not included in Toolforge's version of wget (1.15, from January 2014; the current version is 1.19.5 and --start-pos was added in 1.16). This means that all of the pages which don't redirect are currently being downloaded. Updating wget would remedy this issue, and it would also allow the script to retry archival on some HTTP errors (--retry-on-http-error, added in 1.19.1).
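Since the flags available depend on the wget version, a script could add them only where they are supported. The sketch below assumes that the first line of wget --version looks like "GNU Wget 1.19.5 built on linux-gnu." and that GNU sort -V is available. The helper names, the offset and the list of status codes are hypothetical. Whether --start-pos actually suppresses the body depends on how the server handles the resulting range request.

```shell
# Read `wget --version` output on stdin and print the version number.
# Assumed first-line format: "GNU Wget 1.19.5 built on linux-gnu."
parse_wget_version() {
    awk 'NR == 1 { print $3 }'
}

# Succeed if version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# extra_wget_flags VERSION OFFSET CODES
# Print the optional flags that the given wget version supports:
#   --start-pos            (wget >= 1.16)
#   --retry-on-http-error  (wget >= 1.19.1)
extra_wget_flags() {
    flags=""
    if version_ge "$1" 1.16; then
        flags="$flags --start-pos=$2"
    fi
    if version_ge "$1" 1.19.1; then
        flags="$flags --retry-on-http-error=$3"
    fi
    printf '%s\n' "${flags# }"
}

# Example (offset and status code are placeholders):
#   version=$(wget --version | parse_wget_version)
#   extra_wget_flags "$version" 1000000 503
```

With wget 1.15 the function prints an empty line, so the script would fall back to the existing flag set.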

Event Timeline

Restricted Application added a subscriber: Aklapper.
Jc86035 renamed this task from "Please update the version of wget" to "Please update wget on Toolforge". · Oct 10 2018, 11:19 AM

If this isn't going to be done due to the version of Ubuntu, is there another way this could be resolved?

If this isn't going to be done due to the version of Ubuntu, is there another way this could be resolved?

E.g., compile your own wget or get a pre-compiled binary from somewhere.

Not commenting on the actionability of the task, just trying to suggest a better alternative.

Unless there is a flag I haven't figured out how to use which would solve this issue

Have you tried the flag "use curl instead"? XD Personally, I have found curl superior in every way except for certain command-line uses, but I don't know your needs, so I am not sure whether it would work for you.

@jcrespo It looks like curl might work a bit better for this particular issue, actually, so I'll try using it instead.

@jcrespo I didn't realise that because the version of cURL on Toolforge is also four years old, it lacks the --retry-connrefused flag (added in 7.52.0 in December 2016), which I've generally used with wget (in the same manner) to maximise the rate of requests. I've switched back to the previous configuration.
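Where curl predates --retry-connrefused, the retry could be approximated in the shell instead. This is a minimal sketch, assuming that curl's exit code 7 (failed to connect to host) is the condition worth retrying; retry_on_connrefused and RETRY_DELAY are hypothetical names, not part of curl.

```shell
# retry_on_connrefused MAX_TRIES COMMAND [ARGS...]
# Run COMMAND, repeating it while it exits with status 7 (curl's
# "failed to connect to host"), for at most MAX_TRIES attempts.
# Any other exit status is returned immediately.
# RETRY_DELAY (hypothetical) is the pause in seconds between attempts.
retry_on_connrefused() {
    max_tries=$1
    shift
    attempt=1
    while :; do
        if "$@"; then
            status=0
        else
            status=$?
        fi
        if [ "$status" -ne 7 ] || [ "$attempt" -ge "$max_tries" ]; then
            return "$status"
        fi
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-1}"
    done
}

# Example:
#   retry_on_connrefused 10 curl -s -o /dev/null --max-time 2 "$url"
```

This only covers failed connections; unlike wget's -t option, it does not retry on other transient errors.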

I've installed wget 1.19.5 locally and I've figured out how to use it through jsub. If it works I'll mark this as resolved.