Today, after merging a log format change, Icinga started alerting on all ats-tls hosts:
PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3062 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
We confirmed that, despite the alert, atslog-tls worked fine and emitted logs according to the updated logging format.
/usr/local/lib/nagios/plugins/check_trafficserver_log_fifo boils down to running lsof -E -u trafficserver -c fifo-log-demux -a /srv/trafficserver/tls/var/log/tls.pipe and making sure that the ATS side of the pipe matches TS_MAIN\\],[0-9]+w$ when it comes to the NAME field. On a happy host, the output of lsof is:
root@cp3054:~# lsof -E -u trafficserver -c fifo-log-demux -a /srv/trafficserver/tls/var/log/tls.pipe
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
fifo-log- 182554 trafficserver 5r FIFO 9,0 0t0 9306167 /srv/trafficserver/tls/var/log/tls.pipe 182648,[TS_MAIN],164w
Indeed ATS' file descriptor 164 is alive and well:
root@cp3054:~# ls -l /proc/182648/fd/164
l-wx------ 1 trafficserver trafficserver 64 Nov 27 15:20 /proc/182648/fd/164 -> /srv/trafficserver/tls/var/log/tls.pipe
On a troublesome host instead:
root@cp3050:~# lsof -E -u trafficserver -c fifo-log-demux -a /srv/trafficserver/tls/var/log/tls.pipe
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
fifo-log- 60217 trafficserver 5r FIFO 9,0 0t0 14417975 /srv/trafficserver/tls/var/log/tls.pipe 60305,[TS_MAIN],*241w
The * before the FD number indicates that it has been removed, and indeed FD 241 is not used for tls.pipe anymore:
root@cp3050:~# ls -l /proc/60305/fd/241
lrwx------ 1 trafficserver trafficserver 64 Dec 17 07:57 /proc/60305/fd/241 -> socket:[4170691185]
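To illustrate why the check passes on cp3054 and fails on cp3050, here is a minimal Python sketch of the matching step. It is not the actual plugin code: the function name ats_writes_to_pipe is hypothetical, and the pattern is written with a single backslash on the assumption that the doubled backslash quoted above is only an escaping layer. The sample rows are the lsof data lines shown above.

```python
import re

# Pattern from the check as quoted in this report (single-backslash form,
# which is an assumption about how the plugin escapes it).
ATS_WRITER = re.compile(r"TS_MAIN\],[0-9]+w$")


def ats_writes_to_pipe(lsof_output: str) -> bool:
    """Return True if any lsof line ends with a TS_MAIN writer endpoint
    whose FD number directly follows the comma (no '*' marker)."""
    for line in lsof_output.splitlines():
        if ATS_WRITER.search(line.rstrip()):
            return True
    return False


# Data rows copied from the lsof output on the two hosts.
HAPPY = (
    "fifo-log- 182554 trafficserver 5r FIFO 9,0 0t0 9306167 "
    "/srv/trafficserver/tls/var/log/tls.pipe 182648,[TS_MAIN],164w"
)
BROKEN = (
    "fifo-log- 60217 trafficserver 5r FIFO 9,0 0t0 14417975 "
    "/srv/trafficserver/tls/var/log/tls.pipe 60305,[TS_MAIN],*241w"
)

print(ats_writes_to_pipe(HAPPY))   # True
print(ats_writes_to_pipe(BROKEN))  # False
```

The * in front of 241 sits between the comma and the digits, so [0-9]+ cannot match there and the check reports CRITICAL even though logs are still flowing.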
I suspect that reloading ATS would break logging at this point. This looks similar to https://github.com/apache/trafficserver/issues/4635, though it is not the same issue.