Page MenuHomePhabricator

Write side of ats-tls named pipe deleted upon logging config change reload
Open, MediumPublic

Description

Today after merging a log format change icinga started alerting on all ats-tls hosts:

PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3062 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server

We confirmed that, despite the alert, atslog-tls worked fine and emitted logs according to the updated logging format.

/usr/local/lib/nagios/plugins/check_trafficserver_log_fifo boils down to running lsof -E -u trafficserver -c fifo-log-demux -a /srv/trafficserver/tls/var/log/tls.pipe and making sure that the ATS side of the pipe matches TS_MAIN\\],[0-9]+w$ when it comes to the NAME field. On a happy host, the output of lsof is:

root@cp3054:~# lsof -E -u trafficserver -c fifo-log-demux -a /srv/trafficserver/tls/var/log/tls.pipe
COMMAND      PID          USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
fifo-log- 182554 trafficserver    5r  FIFO    9,0      0t0 9306167 /srv/trafficserver/tls/var/log/tls.pipe 182648,[TS_MAIN],164w

Indeed ATS' file descriptor 164 is alive and well:

root@cp3054:~# ls -l /proc/182648/fd/164
l-wx------ 1 trafficserver trafficserver 64 Nov 27 15:20 /proc/182648/fd/164 -> /srv/trafficserver/tls/var/log/tls.pipe

On a troublesome host instead:

root@cp3050:~# lsof -E -u trafficserver -c fifo-log-demux -a /srv/trafficserver/tls/var/log/tls.pipe
COMMAND     PID          USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
fifo-log- 60217 trafficserver    5r  FIFO    9,0      0t0 14417975 /srv/trafficserver/tls/var/log/tls.pipe 60305,[TS_MAIN],*241w

The * before the FD number indicates that it has been removed, and indeed FD 241 is not used for tls.pipe anymore:

root@cp3050:~# ls -l /proc/60305/fd/241
lrwx------ 1 trafficserver trafficserver 64 Dec 17 07:57 /proc/60305/fd/241 -> socket:[4170691185]

I suspect that reloading ats would break logging at this point. This seems similar to https://github.com/apache/trafficserver/issues/4635, but different.

Event Timeline

ema created this task.Dec 17 2019, 2:04 PM
Restricted Application added a project: Operations. · View Herald TranscriptDec 17 2019, 2:04 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
ema triaged this task as Medium priority.Dec 17 2019, 2:04 PM
ema added a comment.EditedDec 17 2019, 2:12 PM

I suspect that reloading ats would break logging at this point.

That's not the case, reload works perfectly fine. Changes to the log format are reflected correctly on the emitted logs. This however is what lsof has to say now after a few reloads:

root@cp3050:~# lsof -E -u trafficserver -c fifo-log-demux -a /srv/trafficserver/tls/var/log/tls.pipe
COMMAND     PID          USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
fifo-log- 60217 trafficserver    5r  FIFO    9,0      0t0 14417975 /srv/trafficserver/tls/var/log/tls.pipe 60305,[TS_MAIN],5415w 60305,[TS_MAIN],*776w

Mentioned in SAL (#wikimedia-operations) [2019-12-17T14:15:58Z] <ema> cp: rolling ats-tls-restart to clear issues caused by T240950

ema moved this task from Triage to TLS on the Traffic board.Dec 19 2019, 2:17 PM

Mentioned in SAL (#wikimedia-operations) [2020-02-10T15:03:23Z] <vgutierrez> rolling restart of ats-tls - T240950