Page MenuHomePhabricator

check_trafficserver_log_fifo: false positives when changing log format
Closed, ResolvedPublic

Description

Changing the log format of ATS, for instance by modifying profile::trafficserver::tls::log_formats triggers a trafficserver reload to apply the changes. When this happens, check_trafficserver_log_fifo thinks that ATS stopped writing to the pipe due to a change in lsof output, although writing and reading still keeps on happening without problems.

Valid output example:

root@cp3050:~# lsof -E -u trafficserver -c fifo-log-demux -a /srv/trafficserver/tls/var/log/tls.pipe 
COMMAND     PID          USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
fifo-log- 27095 trafficserver    5r  FIFO    9,0      0t0 3539000 /srv/trafficserver/tls/var/log/tls.pipe 27111,[TS_MAIN],164w

Output causing false positives, notice that the very last field includes a *:

root@cp3050:~# lsof -E -u trafficserver -c fifo-log-demux -a /srv/trafficserver/tls/var/log/analytics.pipe 
COMMAND     PID          USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
fifo-log- 27096 trafficserver    5r  FIFO    9,0      0t0 3539001 /srv/trafficserver/tls/var/log/analytics.pipe 27111,[TS_MAIN],*524w

We should find a more robust way to verify that named pipe logging works as expected.

Details

Related Gerrit Patches:
operations/puppet : productionATS: increase check_trafficserver_log_fifo timeout
operations/puppet : productionATS: check if fifo-log-demux is logging
operations/puppet : productionconftool::scripts: add ispooled

Event Timeline

ema created this task.Thu, Mar 19, 9:57 AM
Restricted Application added a project: Operations. · View Herald TranscriptThu, Mar 19, 9:57 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
ema triaged this task as Medium priority.Thu, Mar 19, 9:58 AM
ema moved this task from Triage to Caching on the Traffic board.

Mentioned in SAL (#wikimedia-operations) [2020-03-19T10:01:02Z] <ema> cp: rolling ats-tls-restart to apply log format changes T248067 T237993

During the weekend we had issues with ats-tls and the number of open file descriptors, see T248736. Due to this, check_trafficserver_log_fifo started timing out. @CDanis noticed the problem and increased the timeout (thanks!), but this shows once more that the philosophy behind the check doesn't really work.

@Vgutierrez and I discussed this and came to the conclusion that a viable approach would be to instead have a check that connects to the fifo-log-demux unix socket and tries to read some logs. If no log is read for a certain amount of time, we consider logging to be broken. The strategy would not work on depooled nodes, given that they produce no logs. To this end, we are going to introduce a script (which seems to be generally useful) to find out whether a given service (ats-tls, varnish-fe, ats-be, ...) is pooled on the host the script is running from. Such information can be then used in the icinga check to decide whether or not to report an issue.

Change 584613 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] conftool::scripts: add pooled

https://gerrit.wikimedia.org/r/584613

Change 584613 merged by Ema:
[operations/puppet@production] conftool::scripts: add ispooled

https://gerrit.wikimedia.org/r/584613

Change 585208 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: check if fifo-log-demux is logging

https://gerrit.wikimedia.org/r/585208

Change 585208 merged by Ema:
[operations/puppet@production] ATS: check if fifo-log-demux is logging

https://gerrit.wikimedia.org/r/585208

ema closed this task as Resolved.Wed, Apr 1, 1:56 PM
ema claimed this task.

Change 585256 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: increase check_trafficserver_log_fifo timeout

https://gerrit.wikimedia.org/r/585256

Change 585256 merged by Ema:
[operations/puppet@production] ATS: increase check_trafficserver_log_fifo timeout

https://gerrit.wikimedia.org/r/585256