
Linkwatcher spawns many processes without parent
Closed, Resolved · Public

Description

On tools-exec-1202:

init(1)-
...
        |-perl(595)
        |-perl(608)
        |-perl(3123)
        |-perl(3171)
        |-perl(3174)
        |-perl(4337)
        |-perl(4382)
        |-perl(5389)
        |-perl(6593)
        |-perl(8924)
        |-perl(8933)
        |-perl(9390)
        |-perl(9921)
        |-perl(12606)
        |-perl(17345)
        |-perl(19282)
        |-perl(19310)
        |-perl(19343)
        |-perl(19345)
        |-perl(19415)
        |-perl(19471)
        |-perl(19473)
        |-perl(19489)
        |-perl(19548)
        |-perl(19563)
        |-perl(19630)

All of these are various linkwatcher scripts:

valhallasw@tools-exec-1202:~$ ps aux | grep Link
51230      608  0.7  1.0 234148 85092 ?        SN   Jan05  30:53 perl LinkAnalyser.pl LiWa3 15
51230     3171  0.1  0.3 178992 31708 ?        SN   Jan05   9:16 perl LinkReporter.pl LiWa3 2 2
51230     3174  0.1  0.3 178992 31876 ?        SN   Jan05   9:11 perl LinkReporter.pl LiWa3 3 3
51230     4382  0.1  0.3 178992 31604 ?        SN   Jan05   8:32 perl LinkReporter.pl LiWa3 6 1
51230     5389  3.8  1.1 250392 96924 ?        SN   09:00  19:19 perl LinkAnalyser.pl LiWa3 32
51230     6593  1.0  0.6 197032 50696 ?        SN   09:17   4:54 perl LinkAnalyser.pl LiWa3 33
1092     15571  0.0  0.0  33340   948 pts/0    S+   17:22   0:00 grep --color=auto Link
51230    17345  0.5  0.5 183440 41188 ?        SN   11:30   1:54 perl LinkAnalyser.pl LiWa3 34
51230    19282  8.9  1.3 174480 111716 ?       SN   11:39  30:37 perl LinkParser.pl LiWa3 295
51230    19310  8.8  1.3 172324 109484 ?       SN   11:41  30:04 perl LinkParser.pl LiWa3 296
51230    19343  9.0  1.3 172132 108484 ?       SN   11:43  30:34 perl LinkParser.pl LiWa3 297
51230    19345  9.0  1.3 177168 114380 ?       SN   11:43  30:33 perl LinkParser.pl LiWa3 298
51230    19415  9.1  1.4 178900 114704 ?       RN   11:48  30:29 perl LinkParser.pl LiWa3 299
51230    19471  8.8  1.3 170764 107820 ?       SN   11:52  29:13 perl LinkParser.pl LiWa3 300
51230    19473  8.9  1.3 175928 112784 ?       SN   11:52  29:38 perl LinkParser.pl LiWa3 301
51230    19489  9.0  1.3 173916 111056 ?       SN   11:53  29:44 perl LinkParser.pl LiWa3 302
51230    19548  8.6  1.2 168600 105656 ?       SN   11:57  28:17 perl LinkParser.pl LiWa3 303
51230    19563  9.0  1.3 177424 113312 ?       SN   11:58  29:14 perl LinkParser.pl LiWa3 304
51230    19630  8.9  1.3 171756 108752 ?       SN   11:59  28:48 perl LinkParser.pl LiWa3 305
51230    19632  9.0  1.3 171948 108820 ?       SN   11:59  29:07 perl LinkParser.pl LiWa3 306
51230    19855  9.0  1.3 169528 106708 ?       SN   11:59  29:23 perl LinkParser.pl LiWa3 307
51230    20462  8.9  1.3 174068 110140 ?       SN   12:00  28:42 perl LinkParser.pl LiWa3 308
51230    21750  8.8  1.2 168640 105428 ?       SN   12:01  28:22 perl LinkParser.pl LiWa3 309
51230    21768  8.6  1.2 164576 100880 ?       SN   12:03  27:30 perl LinkParser.pl LiWa3 310
51230    21801  8.7  1.2 170160 105996 ?       SN   12:04  28:00 perl LinkParser.pl LiWa3 311
51230    21819  8.7  1.3 172372 108356 ?       SN   12:05  27:41 perl LinkParser.pl LiWa3 312
51230    21822  8.9  1.3 169696 106492 ?       SN   12:05  28:20 perl LinkParser.pl LiWa3 313
51230    21837  8.7  1.3 173524 110732 ?       SN   12:06  27:48 perl LinkParser.pl LiWa3 314
51230    21887  8.9  1.2 169928 105692 ?       SN   12:09  28:01 perl LinkParser.pl LiWa3 315
51230    21902  8.7  1.2 163948 100924 ?       SN   12:10  27:21 perl LinkParser.pl LiWa3 316
51230    21917  8.8  1.2 163268 100404 ?       RN   12:11  27:34 perl LinkParser.pl LiWa3 317
51230    22102  9.0  1.2 163640 100604 ?       SN   12:24  27:06 perl LinkParser.pl LiWa3 318
51230    22104  9.0  1.2 167164 104344 ?       RN   12:24  26:52 perl LinkParser.pl LiWa3 319
51230    24354  9.0  1.2 163740 100720 ?       SN   12:33  26:09 perl LinkParser.pl LiWa3 320
51230    26992  9.2  1.1 153988 90796 ?        SN   13:08  23:35 perl LinkParser.pl LiWa3 321
51230    27068  9.3  1.0 151480 87912 ?        SN   13:13  23:17 perl LinkParser.pl LiWa3 322
51230    29534  9.0  1.0 147752 84796 ?        SN   13:36  20:30 perl LinkParser.pl LiWa3 323
51230    29694  9.1  0.9 144920 81320 ?        SN   13:47  19:38 perl LinkParser.pl LiWa3 324

Because it spawns a large number of processes that are not under SGE's control, LiWa is effectively overloading tools-exec-1202. I have rescheduled the other continuous jobs on that host:

$ qhost -h 'tools-exec-1202' -j
tools-exec-1202.eqiad.wmflabs lx26-amd64      4  5.77    7.8G    4.6G   23.9G     0.0
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
   ----------------------------------------------------------------------------------------------
       314 0.32354 BCBot4     tools.betaco r     12/30/2015 03:59:30 continuous MASTER
      5580 0.80000 vandalstat tools.cluest Rr    12/30/2015 03:54:44 continuous MASTER
     41876 0.35471 ghaher69   tools.dexbot Rr    12/30/2015 03:54:44 continuous MASTER
    165405 0.54393 analytics- tools.morebo Rr    12/30/2015 03:54:44 continuous MASTER
    287009 0.36718 rmiw.w1    tools.yifeib Rr    12/30/2015 03:54:44 continuous MASTER
   1518701 0.43664 foo        tools.pirsqu Rr    12/30/2015 03:54:44 continuous MASTER
   1967777 0.30862 linkwatche tools.linkwa r     01/05/2016 05:25:51 continuous MASTER
   1808912 0.41793 gpy        tools.gpy    r     11/21/2015 20:19:43 task@tools MASTER

$ qmod -rj 314 5580 41876 165405 287009 1518701
Pushed rescheduling of job 314 on host tools-exec-1202.eqiad.wmflabs
Pushed rescheduling of job 5580 on host tools-exec-1202.eqiad.wmflabs
Pushed rescheduling of job 41876 on host tools-exec-1202.eqiad.wmflabs
Pushed rescheduling of job 165405 on host tools-exec-1202.eqiad.wmflabs
Pushed rescheduling of job 287009 on host tools-exec-1202.eqiad.wmflabs
Pushed rescheduling of job 1518701 on host tools-exec-1202.eqiad.wmflabs

valhallasw@tools-bastion-02:~$ qhost -h 'tools-exec-1202' -j
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
tools-exec-1202.eqiad.wmflabs lx26-amd64      4  6.24    7.8G    4.6G   23.9G     0.0
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
   ----------------------------------------------------------------------------------------------
   1967777 0.30862 linkwatche tools.linkwa r     01/05/2016 05:25:51 continuous MASTER
   1808912 0.41793 gpy        tools.gpy    r     11/21/2015 20:19:43 task@tools MASTER

So it's now effectively a linkwatcher-only host. I haven't killed any processes.

@Beetstra, can you make sure this doesn't happen?

Event Timeline

valhallasw raised the priority of this task to Needs Triage.
valhallasw updated the task description.
valhallasw added a project: Toolforge.
valhallasw added subscribers: valhallasw, Beetstra.

Valhallasw, it spawns many subprocesses to be able to keep up with Wikipedia editing. It needs to parse in real time, as anti-spam bots and related work depend on it.

Can we set up a separate exec node for linkwatcher processes?

Let me first stress that we value your work. Anti-vandalism is important, and it's clear that LiWa makes the jobs of editors much easier. At the same time, Tool Labs is a cooperative environment, and tools should take care not to disrupt one another. I've now made sure your tool is not disrupting other tools at the moment, but after a restart it is likely to do so again, and it is your responsibility to make sure it plays nice in the future. The current setup is disruptive and, at the same time, not efficient in its use of the available resources.

There are two reasons the current setup is disruptive. Most importantly, spawning many subprocesses (or threads) that use significant amounts of CPU time is problematic because of the way SGE schedules jobs. Roughly speaking, SGE assumes every job will use at most one CPU fully, and uses this heuristic to make sure not too many jobs get scheduled on the same host. If a job spawns 50 subprocesses that all use lots of CPU, this heuristic fails and the host gets overloaded. This then affects the other jobs running on that host.

The current implementation has a second issue: the processes are not subprocesses but daemonized processes (their parent is not linkwatcher.pl, but init). This means they are no longer under SGE's control, and that they will keep running after the job is stopped or rescheduled.
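As a quick way to see this on an exec host, a sketch (it assumes pstree and pgrep are available there, and that the master script is indeed called linkwatcher.pl):

$ pstree -p "$(pgrep -o -f linkwatcher.pl)"   # workers should appear as children of the master, not of init; -o picks the oldest match
$ ps -o pid,ppid,user,cmd -C perl             # a PPID of 1 means the process has been re-parented to init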

Finally, this method of spawning jobs is inefficient, because it limits you to the CPU power of a single host. If the different tasks are scheduled as separate grid jobs, the jobs are spread out over many hosts, which allows you to use more CPU power in total, and it allows us to provide more execution hosts when necessary. It's currently not yet possible to do this from existing exec hosts, but it should be possible after T123270 is resolved.
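To illustrate what "separate grid jobs" could look like, a hypothetical sketch using the jsub wrapper from the bastion (the job names, memory value and worker numbers are made up; the exact flags should be checked against jsub's own help):

$ jsub -once -continuous -N liwa3-parser-1 -mem 512m perl LinkParser.pl LiWa3 1
$ jsub -once -continuous -N liwa3-parser-2 -mem 512m perl LinkParser.pl LiWa3 2
$ jsub -once -continuous -N liwa3-analyser-1 -mem 512m perl LinkAnalyser.pl LiWa3 1

Each worker is then its own continuous job, and SGE can place each one on whatever host has free capacity.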

Another option, as @Legoktm suggested, is to move LiWa to its own host. This might be the best option if rewriting the bot to make effective use of the grid would take too much time. We would rather not have another one-off exec node, so my suggestion would then be to create a separate Labs project and move it there. That has the extra advantage that it would be possible to run your own local MySQL server, which would offload labsdb1003.

@valhallasw - thank you for the lengthy explanation. This bot has now been running on Labs for a long time (sometimes with long uptimes without problems - it has at least once managed to run for more than 6 months in a row), and has been running smoothly here. The main thing I see from running this system in a multi-bot environment is indeed the interaction with the other bots. When it was privately hosted, sometimes the other bots were 'munching' too much and the bot started lagging - I see that here as well (and obviously, my apologies for that, the opposite also happens). In the early days of Labs it did indeed run on its own instance for some time, both to avoid bringing down other bots and to avoid being brought down by them.

For now, I should note that it seemed to need so many subprocesses (to be able to keep up) because of growing lag due to expensive MySQL queries. I have now sped that part up (still working on it), and will see what happens if the bot runs with fewer of its sub-processes (I have brought the number of parsers down from 30 to 10, and lately the number of analysers down from 10 to 4, and with those it is already just strong enough). (Note added later: it has now been running for a couple of hours on 10 parsers and is keeping up just nicely - I will cut it down further.)

I understand that the system would be more effective running over multiple hosts, but with the way the communication runs between the different parts, that communication would get lost (or would need even more MySQL). Such a change would mean a massive rewrite of the system, which I am willing to work on, but for which I may not have any time (my on-wiki time is closer to 'minutes per day' than 'hours per day').

By the way, there is no problem with the processes staying around after the job is stopped or rescheduled - if it is a clean stop, the bot will kill all its processes itself; otherwise they will die soon after, as they lose communication with the 'master'.

@valhallasw: Is it possible to tell SGE that certain bots need more power than just one CPU? As I see it (using top), the bot overall does indeed use more than 1 CPU, but the total seems to stay below 200% - the processes peak at about 40% CPU use (though never all that high at the same time; at most 2 go over 30% simultaneously), generally running at about 15%, with many only at 0-3%. Also, is it possible to 'throttle' processes to a maximum amount of processor time (so that processes just have to wait until there is 'space')?

I will also have a look at what the more processor-intensive processes use all that power for - it may be the regexing that is 'expensive'. We can also look at how to go from daemonized processes to real subprocesses, but I think it will be more efficient to run all the processes on one instance (that would take care of the rare cases of lost processes that are not under the control of SGE, though I haven't seen that happen in the long time this system has been operating here).

We can consider moving it to a separate project (in that case in combination with coibot - coibot is very dependent on linkwatcher). Note that other users and processes do make use of the databases generated here as well.

@valhallasw: taking the number of parsers down from 10 to 8 resulted in a backlog forming within 10 minutes. Trying 9 ... (the parsers are the processor-intensive processes; the others hardly ever take more than 3% each, and are often at 0).

I know that it isn't ready yet and it probably shouldn't be beta-tested with such a complex tool, but isn't the goal of the Kubernetes setup to provide better isolation/scheduling for such tools? If so, I'd prefer if we could cope with the status quo for a bit and reevaluate later.

AFAIUI, the problem is that the grid doesn't anticipate the load of the tool; as a corollary this should mean that if the tool "expands" its load, all we need to do is reschedule the other jobs on that instance as the grid shouldn't schedule any new jobs there due to the high load. While imperfect, if this only happens every few months or so, that'd be "good enough" for me for a tool that isn't developed from scratch.

@valhallasw - I have added 2 more parsers (total now 12) - the bot is building up a backlog, likely during the American daytime, which it does not munch away at night.

The bot is still eating away at its (old) backlog, which goes slowly. The bot seems to operate fine now with far fewer processes. Still, it uses 200-250% of processor power, which seems to be necessary for a bot doing all this work. As before, we could consider a rewrite to make the sub-processes run independently, or I could split the bot into three smaller bots - but both options require significant rewrites for which I do not have time.

I understand that the system would be more effective running over multiple hosts, but with the way the communication runs between the different parts, that communication would get lost (or would need even more MySQL). Such a change would mean a massive rewrite of the system, which I am willing to work on, but for which I may not have any time (my on-wiki time is closer to 'minutes per day' than 'hours per day').

I see. In that case, I don't think it's reasonable to expect you to rewrite it in such a way that it spreads the load across SGE hosts, and we should look for a different solution.

By the way, there is no problem with the processes staying around after the job is stopped or rescheduled - if it is a clean stop, the bot will kill all its processes itself; otherwise they will die soon after, as they lose communication with the 'master'.

OK, this is good to know. In that case, it should indeed not be an issue.

@valhallasw: Is it possible to tell SGE that certain bots need more power than just one CPU?

In theory, it should be possible, but I can't figure out how. In principle, consumables would allow you to reserve cores, but I don't think our SGE is set up correctly for that.

Also, is it possible to 'throttle' processes to a maximum amount of processor time (so that processes just have to wait until there is 'space')?

Yes, you can use nice or renice to decrease the priority of the processes.
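For example, a sketch only (the PIDs are taken from the listing above and would differ in practice, and tools.linkwatcher is assumed to be the tool account name):

$ renice -n 15 -p 19282 19310 19343                               # lower the priority of a few parser PIDs
$ renice -n 15 -p $(pgrep -u tools.linkwatcher -f LinkParser.pl)  # or all LinkParser processes of the tool account at once

Note that grid jobs already run with a raised nice value (the N in the STAT column of the ps output above), so renice mainly helps to further deprioritise individual workers.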

For now, I think scfc's suggestion

AFAIUI, the problem is that the grid doesn't anticipate the load of the tool; as a corollary this should mean that if the tool "expands" its load, all we need to do is reschedule the other jobs on that instance as the grid shouldn't schedule any new jobs there due to the high load. While imperfect, if this only happens every few months or so, that'd be "good enough" for me for a tool that isn't developed from scratch.

might be the best one until we figure out what the best solution is (changing SGE, moving to Kubernetes, or moving to a separate project). We can reschedule continuous tasks to another host, and if you use at most 2 cores in practice, the few one-off jobs on the host can just finish their work. Could you give us a heads-up (via a reply to this task) when you restart the bot, and (as you seem to be able to change the number of processes it uses while it's running) start it with a low number of processes until we have moved the other jobs away?

@valhallasw: a good solution would be assigning 200-300% of a processor to the whole task. I found http://wiki.crc.nd.edu/wiki/index.php/Submitting_Batch/SGE_jobs - which suggests "-pe mpi-# #" would be the option ... (I'm not a specialist in this).
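For completeness, the generic SGE way to reserve more than one slot is a parallel environment; a hypothetical sketch (it assumes a PE is actually configured on this grid, which, as noted above, may not be the case; the PE name 'smp' and the script name are placeholders):

$ qconf -spl                                                       # list the parallel environments configured on this grid, if any
$ qsub -pe smp 3 -N linkwatcher -q continuous linkwatcher-run.sh   # request 3 slots on a single host for the whole job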

Otherwise, keep the load low on the instance it is running on for now (though it only starts to be a problem if the other tasks are actually all using their whole 100% of processor time).

I'll ping when the bot moves.

@valhallasw - the bot crashed (no clue why; it seems to have trouble with MySQL). I restarted it this morning, and it is now on 1215.

Thanks, I have rescheduled the other jobs.

@valhallasw - I had to move the bot to another instance; it is now on 1205 (if I become linkwatcher, I can't ssh to 1209 - access denied).

linkwatcher was one of the jobs on the hosts affected by T136495: Stale NFS handle breaks puppet on tools-exec-1204, -1205 and -1218; I have resubmitted continuous jobs on the host where it's running now (tools-exec-1205).

@valhallasw can you please resubmit the jobs on tools-exec-1203? Linkwatcher seems to interfere with other scripts running there.

Resubmitted the continuous jobs.

@valhallasw - it crashed, and is now on 1213. Do you mind moving the other tasks (it is building up backlogs again)?

Sorry for the late response -- done!

@valhallasw The bot moved to 1216 yesterday. It is not backlogging, but it may be good to make sure other tasks do not run on this instance.

I have rescheduled the other continuous jobs on the instance. Thanks!

@valhallasw The bot moved again two days ago, and I had to restart it just now - it is now on tools-exec-1417.

@valhallasw, do you mind clearing the instance linkwatcher is on? There are three heavy Python scripts there as well, and LiWa3 is building up a massive backlog. Thanks.

No, because those are tasks that cannot be restarted. I rescheduled linkwatcher yesterday, but due to the massive memory requirement (26GB, while we only have 8GB mem + 25GB swap per host!) it took ages to reschedule, and was rescheduled on the same host in the end.
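As an aside, the memory SGE accounts to a running job can be checked from the bastion with something like the following (the job ID is a placeholder):

$ qstat -j <job-id> | grep -i vmem   # the usage line includes current vmem and maxvmem as seen by SGE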

Hmm. Any idea how long those 3 Python scripts will stay? Linkwatcher will munch away its backlog in time. Until the Wikimedia linklog system comes online, I don't foresee a way of making linkwatcher smaller.

Thanks for trying, anyway!

@valhallasw Do you mind making sure that linkwatcher is the only bot on 1403? I had to start it this morning, as it apparently crashed. Thanks!

@valhallasw Do you mind clearing the instance that linkwatcher is on? It does not have enough resources and is starting to build up a backlog. It is currently on 1438. Thanks!

Another option, as @Legoktm suggested, is to move LiWa to its own host. This might be the best option if rewriting the bot to make effective use of the grid would take too much time. We would rather not have another one-off exec node, so my suggestion would then be to create a separate Labs project and move it there. That has the extra advantage that it would be possible to run your own local MySQL server, which would offload labsdb1003.

Does the bot actually use any grid-engine-specific things, or is it just using jsub/jstart to launch the core process? Forking it into a project of its own seems like a more reasonable thing to do than having an effectively dedicated exec node via evicting competing jobs while also placing load on labs-db (assuming labsdb1003 was the labs-db primary at the time that comment was written).

@bd808 It is not that trivial: the new project would need to run both coibot and linkwatcher, as they both do their share of analysis on the created db.

On the other hand, a project of its own might also offer a bit more power, as linkwatcher is at the edge of what a node can pull, even when it is alone on the exec node.

We have to consider that work is underway to have a linkadditions db in the MediaWiki software as well, at some point in the future. If that can be read 'in real time' (that is, turned into an IRC feed) and used to get proper statistics in near-real-time, then the workload of the bot would be greatly diminished (it would just need to do the statistics and pick up 'bad' patterns).

Mentioned in SAL (#wikimedia-cloud) [2018-06-29T17:11:46Z] <bd808> Rescheduled jobs away from toole-exec-1404 where linkwatcher is currently stealing most of the CPU (T123121)

Can someone please move all other bots away from the instance that runs linkwatcher? @valhallasw?

Can someone please move all other bots away from the instance that runs linkwatcher? @valhallasw?

valhallasw@tools-sgebastion-07:~$ qhost -j -h tools-sgeexec-0918
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
tools-sgeexec-0918.tools.eqiad.wmflabs lx-amd64        4    4    4    4  4.20    7.8G    7.3G   23.9G  779.1M
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
   ----------------------------------------------------------------------------------------------
   1645330 0.56086 linkwatche tools.linkwa Rr    04/10/2019 10:33:50 continuous MASTER
   3271159 0.29946 archivebot tools.thanat r     05/16/2019 02:00:27 task@tools MASTER

The only other job (still) running on that host is a task (rather than a continuous job), which cannot be restarted.

Beetstra claimed this task.

LinkWatcher has moved to an instance (VM) of its own. This is no longer an issue on the shared instances.