pybal stops logging
Closed, ResolvedPublic

Description

It seems that pybal stops logging after a while. For example, at the time of this writing:

lvs1007.eqiad.wmnet: -rw-r--r-- 1 root root 0 Mar  8 06:25 /var/log/pybal.log
lvs1008.eqiad.wmnet: -rw-r--r-- 1 root root 0 Mar  8 06:25 /var/log/pybal.log
lvs1011.eqiad.wmnet: -rw-r--r-- 1 root root 0 Mar  8 06:25 /var/log/pybal.log
lvs1010.eqiad.wmnet: -rw-r--r-- 1 root root 0 Mar  8 06:25 /var/log/pybal.log
lvs1004.wikimedia.org: -rw-r----- 1 root adm 0 Mar 10 06:25 /var/log/pybal.log
lvs1005.wikimedia.org: -rw-r----- 1 root adm 0 Mar 13 06:25 /var/log/pybal.log
lvs1002.wikimedia.org: -rw-r----- 1 root adm 0 Mar  8 06:25 /var/log/pybal.log
lvs1001.wikimedia.org: -rw-r----- 1 root adm 0 Mar 10 06:25 /var/log/pybal.log
lvs1003.wikimedia.org: -rw-r----- 1 root adm 1366881 Mar 14 11:33 /var/log/pybal.log
lvs1009.eqiad.wmnet: -rw-r--r-- 1 root root 257997 Mar 14 11:30 /var/log/pybal.log
lvs1012.eqiad.wmnet: -rw-r--r-- 1 root root 272986 Mar 14 11:33 /var/log/pybal.log
lvs1006.wikimedia.org: -rw-r----- 1 root adm 1380554 Mar 14 11:33 /var/log/pybal.log

While trying to figure out what's happening, I've found that pybal's logrotate script is using reload rsyslog instead of service rsyslog rotate:

postrotate
    reload rsyslog >/dev/null 2>&1 || true
endscript

Although that's certainly wrong, it doesn't seem to be the cause of pybal's silence, which needs to be investigated further.

ema created this task.Tue, Mar 14, 11:47 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptTue, Mar 14, 11:47 AM
ema triaged this task as "Normal" priority.Tue, Mar 14, 11:47 AM

Change 342608 had a related patch set uploaded (by Ema):
[operations/puppet] pybal: use 'rsyslog rotate' in logrotate configuration

https://gerrit.wikimedia.org/r/342608

ema moved this task from Triage to LoadBalancer on the Traffic board.Tue, Mar 14, 11:55 AM
ema added a comment.Tue, Mar 14, 3:40 PM

On machines where pybal doesn't log anything, the pybal process is indeed not doing any meaningful writes to standard out/err:

$ sudo strace -e trace=write -e write=1,2 -p50964 2>&1 | ts
Mar 14 15:38:45 Process 50964 attached
Mar 14 15:39:37 write(11, "x", 1)                       = 1
Mar 14 15:39:42 write(11, "x", 1)                       = 1

Mentioned in SAL (#wikimedia-operations) [2017-03-14T19:14:06Z] <ema> restarting pybal on lvs1010-11 T160405

Change 342608 merged by Ema:
[operations/puppet] logrotate: use 'rsyslog rotate' instead of 'reload rsyslog'

https://gerrit.wikimedia.org/r/342608

ema closed this task as "Resolved".Wed, Mar 15, 10:14 AM
ema claimed this task.

In the past we've always had a bunch of hosts constantly depooled because of various issues (eg: T148891) thus we got used to seeing frequent logs in pybal.log (host down but depooled, etc).

As it turns out, pybal hasn't logged anything for a while because it had nothing to report. I've tried briefly depooling a varnish-fe and the event was logged properly.