Page MenuHomePhabricator

Write side of ats-tls named pipe deleted upon logging config change reload
Open, MediumPublic

Description

Today after merging a log format change icinga started alerting on all ats-tls hosts:

PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3062 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - TS_MAIN not writing to pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server

We confirmed that, despite the alert, atslog-tls worked fine and emitted logs according to the updated logging format.

/usr/local/lib/nagios/plugins/check_trafficserver_log_fifo boils down to running lsof -E -u trafficserver -c fifo-log-demux -a /srv/trafficserver/tls/var/log/tls.pipe and making sure that the ATS side of the pipe matches TS_MAIN\\],[0-9]+w$ when it comes to the NAME field. On a happy host, the output of lsof is:

root@cp3054:~# lsof -E -u trafficserver -c fifo-log-demux -a /srv/trafficserver/tls/var/log/tls.pipe
COMMAND      PID          USER   FD   TYPE DEVICE SIZE/OFF    NODE NAME
fifo-log- 182554 trafficserver    5r  FIFO    9,0      0t0 9306167 /srv/trafficserver/tls/var/log/tls.pipe 182648,[TS_MAIN],164w

Indeed ATS' file descriptor 164 is alive and well:

root@cp3054:~# ls -l /proc/182648/fd/164
l-wx------ 1 trafficserver trafficserver 64 Nov 27 15:20 /proc/182648/fd/164 -> /srv/trafficserver/tls/var/log/tls.pipe

On a troublesome host instead:

root@cp3050:~# lsof -E -u trafficserver -c fifo-log-demux -a /srv/trafficserver/tls/var/log/tls.pipe
COMMAND     PID          USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
fifo-log- 60217 trafficserver    5r  FIFO    9,0      0t0 14417975 /srv/trafficserver/tls/var/log/tls.pipe 60305,[TS_MAIN],*241w

The * before the FD number indicates that it has been removed, and indeed FD 241 is not used for tls.pipe anymore:

root@cp3050:~# ls -l /proc/60305/fd/241
lrwx------ 1 trafficserver trafficserver 64 Dec 17 07:57 /proc/60305/fd/241 -> socket:[4170691185]

I suspect that reloading ats would break logging at this point. This seems similar to https://github.com/apache/trafficserver/issues/4635, but different.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
ema triaged this task as Medium priority.Dec 17 2019, 2:04 PM

I suspect that reloading ats would break logging at this point.

That's not the case, reload works perfectly fine. Changes to the log format are reflected correctly on the emitted logs. This however is what lsof has to say now after a few reloads:

root@cp3050:~# lsof -E -u trafficserver -c fifo-log-demux -a /srv/trafficserver/tls/var/log/tls.pipe
COMMAND     PID          USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
fifo-log- 60217 trafficserver    5r  FIFO    9,0      0t0 14417975 /srv/trafficserver/tls/var/log/tls.pipe 60305,[TS_MAIN],5415w 60305,[TS_MAIN],*776w

Mentioned in SAL (#wikimedia-operations) [2019-12-17T14:15:58Z] <ema> cp: rolling ats-tls-restart to clear issues caused by T240950

Mentioned in SAL (#wikimedia-operations) [2020-02-10T15:03:23Z] <vgutierrez> rolling restart of ats-tls - T240950

BBlack added a subscriber: BBlack.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox . Thank you!