Maniphest T206616

Please update wget on Toolforge
Closed, Invalid · Public

Assigned To

None

Authored By

	Jc86035
	Oct 10 2018, 11:03 AM

Description

My Toolforge tool, archive-things, is currently downloading 7 GB of data per day, when it shouldn't be downloading anything. Its purpose is to periodically archive some web pages that frequently change to the Internet Archive, using Save Page Now (https://web.archive.org/save/$url). (At least some of this archival is unnecessary, but I haven't had the time to make it more efficient.)

This data is being downloaded to Toolforge because the Internet Archive has been returning 404 errors for all header-only (HEAD) HTTP requests to /save/* since about Sunday morning. By default, --spider appears to request only the page headers, which now yields a 404, so I have had to turn the flag off.

For reference, the current main code is the one-line shell script cat /data/project/archive-things/urls.txt | xargs -n 1 -P 7 wget -x --max-redirect=0 --retry-connrefused -t 10 -T 2, which is submitted by cron through jsub daily at 17:15 UTC.
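For readers unfamiliar with the flags, here is a dry-run sketch of the same pipeline. It swaps the real download for echo, so it only prints the commands that xargs would launch and never touches the network. The function name dry_run_archive is made up for illustration; the wget flags are the ones quoted above.

```shell
#!/bin/sh
# dry_run_archive FILE: print one wget command line per URL in FILE.
# (Hypothetical helper; it runs `echo wget ...` instead of wget itself.)
#
# xargs flags:
#   -n 1   pass exactly one URL to each command
#   -P 7   run up to 7 commands in parallel
# wget flags:
#   -x                  force creation of a directory tree for saved files
#   --max-redirect=0    do not follow redirects
#   --retry-connrefused treat "connection refused" as a retryable error
#   -t 10               try each URL at most 10 times
#   -T 2                time out network operations after 2 seconds
dry_run_archive() {
    xargs -n 1 -P 7 echo wget -x --max-redirect=0 --retry-connrefused -t 10 -T 2 < "$1"
}
```

Because the commands run in parallel, the printed lines may appear in any order.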

Unless there is a flag I haven't figured out how to use which would solve this issue, the only flags I can use to avoid downloads are --max-redirect and --start-pos. However, the --start-pos flag is not included in Toolforge's version of wget (1.15, from January 2014; the current version is 1.19.5 and --start-pos was added in 1.16). This means that all of the pages which don't redirect are currently being downloaded. Updating wget would remedy this issue, and it would also allow the script to retry archival on some HTTP errors (--retry-on-http-error, added in 1.19.1).
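As a rough illustration of the version dependency described above, the sketch below maps a wget version string to the optional flags mentioned in this task. The helper names version_ge and wget_flags_for are hypothetical, and the comparison assumes GNU sort with -V (version sort) is available. The version thresholds are the ones stated in the description.

```shell
#!/bin/sh
# version_ge A B: succeed if dotted version A is >= B.
# Assumes GNU coreutils `sort -V`.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# wget_flags_for VERSION: print the flags from this task that VERSION supports.
# Thresholds come from the task description:
#   --start-pos            added in 1.16
#   --retry-on-http-error  added in 1.19.1
wget_flags_for() {
    flags=""
    if version_ge "$1" 1.16; then
        flags="$flags --start-pos"
    fi
    if version_ge "$1" 1.19.1; then
        flags="$flags --retry-on-http-error"
    fi
    # Strip the leading space left by the concatenation above.
    printf '%s\n' "${flags# }"
}
```

The installed version could be read with something like `wget --version | head -n 1 | awk '{print $3}'`, assuming the first line has the usual "GNU Wget X.Y ..." form.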

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		bd808	T55704 Packages to be added to toollabs puppet
		Invalid		None	T206616 Please update wget on Toolforge

Event Timeline

Jc86035 created this task. Oct 10 2018, 11:03 AM

Restricted Application added a project: Internet-Archive. Oct 10 2018, 11:03 AM

Restricted Application added a subscriber: Aklapper.

Jc86035 updated the task description. Oct 10 2018, 11:10 AM

Jc86035 updated the task description. Oct 10 2018, 11:12 AM

Jc86035 updated the task description. Oct 10 2018, 11:16 AM

Jc86035 renamed this task from "Please update the version of wget" to "Please update wget on Toolforge". Oct 10 2018, 11:19 AM

Jc86035 removed a parent task: T55704: Packages to be added to toollabs puppet. Oct 10 2018, 12:22 PM

Jc86035 added a parent task: T55704: Packages to be added to toollabs puppet.

If this isn't going to be done due to the version of Ubuntu, is there another way this could be resolved?

In T206616#4654659, @Jc86035 wrote:

If this isn't going to be done due to the version of Ubuntu, is there another way this could be resolved?

E.g., compile your own wget or get a precompiled binary from somewhere.

I'm not commenting on whether the task is actionable, just trying to suggest a better alternative solution.

Unless there is a flag I haven't figured out how to use which would solve this issue

Have you tried the flag "use curl instead"? XD Personally, I have found curl superior in every way except for certain command-line uses, but I don't know your needs, so I am not sure whether it would work for you.

@jcrespo It looks like curl might work a bit better for this particular issue, actually, so I'll try using it instead.

@jcrespo I didn't realise that because the version of cURL on Toolforge is also four years old, it lacks the --retry-connrefused flag (added in 7.52.0 in December 2016), which I've generally used with wget (in the same manner) to maximise the rate of requests. I've switched back to the previous configuration.
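One possible workaround for a curl build without --retry-connrefused is to retry from the shell instead. The sketch below is an assumption, not what was actually used here: retry is a hypothetical helper, RETRY_DELAY is an illustrative variable, and unlike --retry-connrefused it retries on any non-zero exit status, not only on refused connections.

```shell
#!/bin/sh
# retry MAX CMD [ARGS...]: run CMD until it exits 0, at most MAX times.
# Waits RETRY_DELAY seconds (default 1) between attempts.
# Note: this retries on ANY failure, not just "connection refused".
retry() {
    max=$1
    shift
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            return 1    # give up after MAX failed attempts
        fi
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-1}"
    done
    return 0
}
```

A call might look like `retry 10 curl -s -o /dev/null --max-time 2 "$url"`. Keep in mind that curl exits 0 on HTTP error responses unless -f is given, so this only retries transport-level failures as written.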

I've installed wget 1.19.5 locally and I've figured out how to use it through jsub. If it works I'll mark this as resolved.
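The task does not say where the local build lives or how the job script refers to it, so the following is only a sketch under assumptions: pick_wget is a hypothetical helper and $HOME/local/bin/wget is a made-up install location. It prefers a locally installed wget when one exists and falls back to the system binary otherwise.

```shell
#!/bin/sh
# pick_wget [LOCAL_PATH]: print the path of the wget binary to use.
# LOCAL_PATH defaults to $HOME/local/bin/wget (an assumed location,
# not one named anywhere in this task).
pick_wget() {
    local_bin=${1:-"$HOME/local/bin/wget"}
    if [ -x "$local_bin" ]; then
        printf '%s\n' "$local_bin"    # local build exists and is executable
    else
        printf '%s\n' wget            # fall back to the system wget on PATH
    fi
}
```

A job script submitted through jsub could then invoke "$(pick_wget)" wherever it previously called wget, so the same script keeps working if the local build is removed.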

Jc86035 closed this task as Invalid. Oct 15 2018, 3:04 PM
