Page MenuHomePhabricator

pybal stops logging
Closed, ResolvedPublic

Description

It seems that pybal stops logging after a while. For example, at the time of this writing:

lvs1007.eqiad.wmnet: -rw-r--r-- 1 root root 0 Mar  8 06:25 /var/log/pybal.log
lvs1008.eqiad.wmnet: -rw-r--r-- 1 root root 0 Mar  8 06:25 /var/log/pybal.log
lvs1011.eqiad.wmnet: -rw-r--r-- 1 root root 0 Mar  8 06:25 /var/log/pybal.log
lvs1010.eqiad.wmnet: -rw-r--r-- 1 root root 0 Mar  8 06:25 /var/log/pybal.log
lvs1004.wikimedia.org: -rw-r----- 1 root adm 0 Mar 10 06:25 /var/log/pybal.log
lvs1005.wikimedia.org: -rw-r----- 1 root adm 0 Mar 13 06:25 /var/log/pybal.log
lvs1002.wikimedia.org: -rw-r----- 1 root adm 0 Mar  8 06:25 /var/log/pybal.log
lvs1001.wikimedia.org: -rw-r----- 1 root adm 0 Mar 10 06:25 /var/log/pybal.log
lvs1003.wikimedia.org: -rw-r----- 1 root adm 1366881 Mar 14 11:33 /var/log/pybal.log
lvs1009.eqiad.wmnet: -rw-r--r-- 1 root root 257997 Mar 14 11:30 /var/log/pybal.log
lvs1012.eqiad.wmnet: -rw-r--r-- 1 root root 272986 Mar 14 11:33 /var/log/pybal.log
lvs1006.wikimedia.org: -rw-r----- 1 root adm 1380554 Mar 14 11:33 /var/log/pybal.log

While trying to figure out what's happening, I've found that pybal's logrotate script is using reload rsyslog instead of service rsyslog rotate:

postrotate
    reload rsyslog >/dev/null 2>&1 || true
endscript

Although that's certainly wrong, it doesn't seem to be the cause of pybal's silence, which needs to be investigated further.

Event Timeline

ema created this task.Mar 14 2017, 11:47 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 14 2017, 11:47 AM
ema triaged this task as Medium priority.Mar 14 2017, 11:47 AM

Change 342608 had a related patch set uploaded (by Ema):
[operations/puppet] pybal: use 'rsyslog rotate' in logrotate configuration

https://gerrit.wikimedia.org/r/342608

ema moved this task from Triage to LoadBalancer on the Traffic board.Mar 14 2017, 11:55 AM
ema added a comment.Mar 14 2017, 3:40 PM

On machines where pybal doesn't log anything, the pybal process is indeed not doing any meaningful writes to standard out/err:

$ sudo strace -e trace=write -e write=1,2 -p50964 2>&1 | ts
Mar 14 15:38:45 Process 50964 attached
Mar 14 15:39:37 write(11, "x", 1)                       = 1
Mar 14 15:39:42 write(11, "x", 1)                       = 1

Mentioned in SAL (#wikimedia-operations) [2017-03-14T19:14:06Z] <ema> restarting pybal on lvs1010-11 T160405

Change 342608 merged by Ema:
[operations/puppet] logrotate: use 'rsyslog rotate' instead of 'reload rsyslog'

https://gerrit.wikimedia.org/r/342608

ema closed this task as Resolved.Mar 15 2017, 10:14 AM
ema claimed this task.

In the past we've always had a bunch of hosts constantly depooled because of various issues (eg: T148891) thus we got used to seeing frequent logs in pybal.log (host down but depooled, etc).

As it turns out, pybal hasn't logged anything for a while because it had nothing to report. I've tried briefly depooling a varnish-fe and the event was logged properly.