
archive-things running a large number of parallel grid jobs
Closed, ResolvedPublic

Description

At 2019-01-04 22:09 UTC the archive-things tool is/was running 75 active grid engine jobs:

Job             Total seen   Active   Last seen (exit)
save                     4        4   Currently running
yt-channels-3         1400       34   Currently running
yt-channels-5         8300       30   Currently running
yt-daily               175        2   Currently running
yt-lists               167        5   Currently running

This seems like a lot of resources to be devoting to the purpose of "Automatically and periodically archive things to the Internet Archive Wayback Machine through Save Page Now." Can the concurrency (number of simultaneously active jobs) be scaled back?
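(For reference, a breakdown like the table above can be approximated straight from the grid. This is only a minimal sketch, assuming the tool's grid account is tools.archive-things and that it is run from a Toolforge bastion; it is not necessarily how the table was produced.)

# Count running grid jobs for the tool, grouped by (truncated) job name;
# qstat truncates names to 10 characters, so e.g. "yt-channels-3" shows as "yt-channel".
qstat -u tools.archive-things -s r | awk 'NR > 2 {print $3}' | sort | uniq -c | sort -rn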

Event Timeline

bd808 created this task.Jan 4 2019, 10:26 PM
Restricted Application added a project: Internet-Archive.Jan 4 2019, 10:26 PM
Restricted Application added a subscriber: Aklapper.
Jc86035 added a comment.EditedJan 5 2019, 6:30 AM

Would it be beneficial to use larger job sizes (and thus fewer jobs)? If not, then the number of jobs could only be reduced by lowering the archival frequency, although if it's a load issue I could change the number of concurrent xargs processes.
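(For context, the xargs concurrency being referred to is presumably something along these lines. This is a hedged sketch rather than the tool's actual code; urls.txt, the -P value and the wget flags are all assumptions.)

# Each grid job feeds its URL chunk through xargs; -P caps how many wget
# requests to Save Page Now run at once, so lowering it reduces the
# per-job load without changing the number of jobs.
xargs -P 4 -I{} wget -q -O /dev/null "https://web.archive.org/save/{}" < urls.txt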

yt-channels-5 has seen many more jobs, but only because it runs 25 jobs (~310 URLs each) every half hour, whereas yt-channels-3 runs 25 jobs (~2790 URLs each) every three hours. Pretty much all of the work is just wget HTTP requests to the Internet Archive; nothing is downloaded to Tool Labs.

The jobs are split up like this because the Internet Archive servers refuse too many connections from one IP, so presumably there is some benefit to spreading the load across the grid nodes. (Incidentally, this also helps with the Alexa tasks, since alexa.com also seems to rate-limit per IP, and IA's Save Page Now – https://web.archive.org/save/* – seems to use the same outgoing IP to archive all requests arriving from the same incoming IP.)
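(As a rough illustration of that split – a sketch only; the file names, the archive-chunk.sh helper and the jsub flags are assumptions, not the tool's real scripts – each run presumably fans out roughly like this.)

# Split the URL list into 25 line-based chunks and submit one grid job per
# chunk, so the Save Page Now requests originate from several exec nodes
# instead of a single IP.
split -n l/25 urls.txt chunk_
for f in chunk_*; do
    jsub -quiet -N yt-channels-3 ./archive-chunk.sh "$f"
done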

If it's desirable I could reduce yt-channels-3 to every 12 hours and/or yt-channels-5 to every hour, or even less often. However, when setting this up I assumed that what I was doing was not a significant fraction of the load on Tool Labs.
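(Concretely, that reduction would just be a cron change. A hypothetical example only – the driver script name and path are assumed, though tool homes do live under /data/project/ on Toolforge.)

# before: kick off the yt-channels-3 run (the fan-out sketched above) every three hours
0 */3 * * *   /data/project/archive-things/bin/run-yt-channels-3.sh
# after: every twelve hours instead
0 */12 * * *  /data/project/archive-things/bin/run-yt-channels-3.sh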

However, it also seems that a number of jobs have been stuck (qstat output below; the columns are job ID, priority, job name, user, state, start time, queue instance and slots):

614389 0.30095 yt-channel tools.archiv r     01/03/2019 15:33:58 task@tools-exec-1436.tools.eqi     1        
614391 0.30095 yt-channel tools.archiv r     01/03/2019 15:33:58 task@tools-exec-1409.eqiad.wmf     1        
614394 0.30095 yt-channel tools.archiv r     01/03/2019 15:33:59 task@tools-exec-1412.tools.eqi     1        
614397 0.30095 yt-channel tools.archiv r     01/03/2019 15:34:00 task@tools-exec-1439.tools.eqi     1        
614399 0.30095 yt-channel tools.archiv r     01/03/2019 15:34:01 task@tools-exec-1416.tools.eqi     1        
614508 0.30095 yt-lists   tools.archiv r     01/03/2019 15:38:03 task@tools-exec-1405.eqiad.wmf     1        
616534 0.30092 yt-lists   tools.archiv r     01/03/2019 16:38:03 task@tools-exec-1434.tools.eqi     1        
616564 0.30092 yt-daily   tools.archiv r     01/03/2019 16:39:12 task@tools-exec-1422.tools.eqi     1        
616580 0.30092 yt-daily   tools.archiv r     01/03/2019 16:39:16 task@tools-exec-1407.eqiad.wmf     1        
618978 0.30089 yt-channel tools.archiv r     01/03/2019 17:50:01 task@tools-exec-1420.tools.eqi     1        
619078 0.30089 yt-channel tools.archiv r     01/03/2019 17:50:14 task@tools-exec-1403.eqiad.wmf     1        
620497 0.30088 yt-channel tools.archiv r     01/03/2019 18:28:55 task@tools-exec-1409.eqiad.wmf     1        
620504 0.30088 yt-channel tools.archiv r     01/03/2019 18:28:56 task@tools-exec-1411.tools.eqi     1        
620507 0.30088 yt-channel tools.archiv r     01/03/2019 18:28:57 task@tools-exec-1417.tools.eqi     1        
620514 0.30088 yt-channel tools.archiv r     01/03/2019 18:28:58 task@tools-exec-1410.eqiad.wmf     1        
621085 0.30087 yt-channel tools.archiv r     01/03/2019 18:44:13 task@tools-exec-1430.tools.eqi     1        
621101 0.30087 yt-channel tools.archiv r     01/03/2019 18:44:15 task@tools-exec-1408.eqiad.wmf     1        
663763 0.30036 yt-lists   tools.archiv r     01/04/2019 15:38:02 task@tools-exec-1428.tools.eqi     1        
692872 0.30002 yt-lists   tools.archiv r     01/05/2019 05:38:03 task@tools-exec-1435.tools.eqi     1

Usually most of these jobs should finish in less than half an hour. The issue is probably an unmatched character somewhere in the input; I don't know whether I've fixed it.
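(If the hang is in the per-URL loop, two guards might help. This is a sketch under the assumption that the jobs feed a URL list through xargs as above; none of it is taken from the tool's actual code.)

# -d '\n' stops GNU xargs from treating quotes and backslashes in the input
# as special (an unmatched quote otherwise aborts or mangles the run), and
# the timeouts keep one slow Save Page Now request from holding a job open
# indefinitely.
xargs -d '\n' -P 4 -I{} \
    timeout 300 wget -q -O /dev/null --timeout=60 --tries=2 \
        "https://web.archive.org/save/{}" < urls.txt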

bd808 added a comment.Jan 5 2019, 5:28 PM

> If it's desirable I could reduce yt-channels-3 to every 12 hours and/or yt-channels-5 to every hour, or even less often. However, when setting this up I assumed that what I was doing was not a significant fraction of the load on Tool Labs.

In the last 7 days this tool has accounted for 4% of the total number of jobs run on the grid and is the 3rd most active tool by job count out of the 948 tools using the grid. The actual count of jobs is not overly concerning to me at this point, but consistently ranking as the tool running the most concurrent jobs brought it to my attention. Ultimately we should implement a solution for T67777 (Limit number of jobs users can execute in parallel) that sets reasonable limits in the grid itself.
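(For reference, the kind of grid-side cap T67777 asks for can be expressed as a gridengine resource quota set, editable with qconf -arqs. The following is only an illustrative sketch; the name and the 16-slot limit are made-up values, not the configuration actually deployed.)

{
   name         limit_concurrent_jobs_per_tool
   description  "Cap the number of slots any one tool account can use"
   enabled      TRUE
   limit        users {*} to slots=16
}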

bd808 closed this task as Resolved.Feb 10 2019, 8:56 PM
bd808 claimed this task.

Concurrency looks better right now. Ultimately this will be solved when the tool moves to the new Stretch job grid, where T67777 has applied a cap on the number of concurrent jobs that a single tool can run.

bd808 reassigned this task from bd808 to Jc86035.Feb 10 2019, 8:56 PM