Page MenuHomePhabricator

jsub/jstart take 60 s due to /usr/local/bin/log-command-invocation CPU hunger
Closed, DuplicatePublic

Description

At least when I become bub, jstart typically takes a minute. In top, I was surprised to see /usr/local/bin/log-command-invocation consuming a lot of CPU. Maybe this replacement isn't very efficient?

with open('/proc/{gpid}/cmdline'.format(gpid=gpid)) as f:
    return f.read().replace('\0', ' ')

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

On the other hand, right now it takes a whopping 5 minutes and most of it isn't CPU according to time -v:

Command being timed: "jsub -once -continuous -l release=trusty -N mworker3-4 -o /data/project/bub/public_html/BUB/bot/mass_worker.log -e /data/project/bub/public_html/BUB/bot/mass_worker.log -cwd -mem 1000m ./mass_worker.py 3"
User time (seconds): 0.32
System time (seconds): 5.47
Percent of CPU this job got: 1%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:59.95
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 11388
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 12523
Voluntary context switches: 56962
Involuntary context switches: 1173
Swaps: 0
File system inputs: 2272
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Update: today the situation isn't usual, iowait very high.

PRC | sys    0.98s | user   0.68s  | #proc    332 | #trun      2 | #tslpi   354  | #tslpu     1 | #zombie    2 | clones     2  | #exit      ? |
CPU | sys       7% | user      8%  | irq       2% | idle    278% | wait    105%  | steal     0% | guest     0% | curf 2.69GHz  | curscal   ?% |
cpu | sys       3% | user      3%  | irq       2% | idle     16% | cpu002 w 76%  | steal     0% | guest     0% | curf 2.69GHz  | curscal   ?% |
cpu | sys       3% | user      4%  | irq       0% | idle     60% | cpu000 w 32%  | steal     0% | guest     0% | curf 2.69GHz  | curscal   ?% |
cpu | sys       1% | user      0%  | irq       0% | idle     99% | cpu001 w  0%  | steal     0% | guest     0% | curf 2.69GHz  | curscal   ?% |
cpu | sys       1% | user      0%  | irq       0% | idle     99% | cpu003 w  0%  | steal     0% | guest     0% | curf 2.69GHz  | curscal   ?% |
CPL | avg1    5.63 | avg5    5.03  | avg15   4.60 |              |               | csw    23674 | intr   24078 |               | numcpu     4 |
MEM | tot     7.8G | free  724.0M  | cache   2.2G | dirty   5.0M | buff  170.4M  | slab    3.8G |              |               |              |
SWP | tot   488.0M | free  486.4M  |              |              |               |              |              | vmcom   1.8G  | vmlim   7.9G |
DSK |          vda | busy      0%  | read      20 | write      1 | KiB/r     29  | KiB/w      8 | MBr/s   0.06 | MBw/s   0.00  | avio 0.38 ms |
NET | transport    | tcpi   13100  | tcpo   16165 | udpi      43 | udpo      45  | tcpao      5 | tcppo      0 | tcprs      0  | udpip      0 |
NET | network      | ipi    13144  | ipo    13587 | ipfrw      0 | deliv  13144  |              |              | icmpi      0  | icmpo      0 |
NET | eth0    ---- | pcki   13364  | pcko   13632 | si 7722 Kbps | so 7659 Kbps  | erri       0 | erro       0 | drpi       1  | drpo       0 |
NET | ifb0    ---- | pcki   13255  | pcko   13254 | si 7484 Kbps | so 7484 Kbps  | erri       0 | erro       0 | drpi       1  | drpo       0 |

This is a more typical timing (around 40 seconds total), now that valhallasw fixed the iowait:

Your job 4932246 ("mworker3-4") has been submitted
        Command being timed: "jsub -once -continuous -l release=trusty -N mworker3-4 -o /data/project/bub/public_html/BUB/bot/mass_worker.log -e /data/project/bub/public_html/BUB/bot/mass_worker.log -cwd -mem 1000m ./mass_worker.py 3"
        User time (seconds): 0.28
        System time (seconds): 5.01
        Percent of CPU this job got: 14%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:37.39
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 11384
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 12512
        Voluntary context switches: 54192
        Involuntary context switches: 1352
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
chasemp triaged this task as Medium priority.Apr 4 2016, 1:37 PM
chasemp added subscribers: yuvipanda, chasemp.

@yuvipanda, any thoughts on making this async to the actual submission process?

I'm unable to reproduce this now. I'm happy to make it async (probably!), but I am not fully sure it is the cause of the slowdown, rather than just a symptom.

I'm now near 100 seconds wait, during which the machine had a significant iowait.

$ /usr/bin/time -v jsub -once -continuous -l release=trusty -N ...
Command exited with non-zero status 1
        Command being timed: "jsub -once -continuous -l release=trusty -N mworker3-2 -o /data/project/bub/public_html/BUB/bot/mass_worker.log -e /data/project/bub/public_html/BUB/bot/mass_worker.log -cwd -mem 1000m ./mass_worker.py 3"
        User time (seconds): 0.26
        System time (seconds): 5.84
        Percent of CPU this job got: 6%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:38.81
        Average shared text size (kbytes): 0                                                                                                           Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 11428
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 13953
        Voluntary context switches: 99833
        Involuntary context switches: 2171
        Swaps: 0
        File system inputs: 0
        File system outputs: 8
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 1