
Create metrics/monitoring of fifo-log-demux
Open, Stalled, MediumPublic

Description

It seems like fifo-log-demux is losing some lines when attached to trafficserver: ATS is spewing lines like:

Sep 07 17:58:16 cp4052 trafficserver[1962]: [Sep  7 17:58:16.062] [LOG_FLUSH] ERROR: The following message was suppressed 987 times.
Sep 07 17:58:16 cp4052 trafficserver[1962]: [Sep  7 17:58:16.062] [LOG_FLUSH] ERROR: Failed to write log to /var/log/trafficserver/notpurge.pipe: [tried 869, wrote 0, Resource temporarily unavailable]
Sep 07 17:59:17 cp4052 trafficserver[1962]: [Sep  7 17:59:17.033] [LOG_FLUSH] ERROR: The following message was suppressed 1157 times.
Sep 07 17:59:17 cp4052 trafficserver[1962]: [Sep  7 17:59:17.033] [LOG_FLUSH] ERROR: Failed to write log to /var/log/trafficserver/notpurge.pipe: [tried 806, wrote 0, Resource temporarily unavailable]
Sep 07 18:00:18 cp4052 trafficserver[1962]: [Sep  7 18:00:18.431] [LOG_FLUSH] ERROR: The following message was suppressed 1030 times.
Sep 07 18:00:18 cp4052 trafficserver[1962]: [Sep  7 18:00:18.431] [LOG_FLUSH] ERROR: Failed to write log to /var/log/trafficserver/notpurge.pipe: [tried 827, wrote 0, Resource temporarily unavailable]

fifo-log-demux failing to consume ATS logs means that mtail is losing some requests, which is bad for SLO tracking and other stats as well. We should add metrics, monitoring, and alerting to fifo-log-demux.

Considering it's a tiny Go application, embedding a Prometheus exporter and exporting a metric reporting the number of log lines read should be enough. (Prometheus Go client library: https://github.com/prometheus/client_golang/)
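A minimal stdlib-only sketch of the idea (the real patch would presumably use client_golang; the metric name fifo_log_demux_lines_read_total is hypothetical):

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// linesRead counts log lines read from the named pipe. The demux read
// loop would call linesRead.Add(1) for every line it consumes.
var linesRead atomic.Uint64

// metricsText renders the counter in the Prometheus text exposition format.
func metricsText() string {
	return fmt.Sprintf("# HELP fifo_log_demux_lines_read_total Log lines read from the fifo.\n"+
		"# TYPE fifo_log_demux_lines_read_total counter\n"+
		"fifo_log_demux_lines_read_total %d\n", linesRead.Load())
}

// serveMetrics is the /metrics handler; wire it up with
// http.HandleFunc("/metrics", serveMetrics) and go http.ListenAndServe(":9090", nil).
func serveMetrics(w http.ResponseWriter, r *http.Request) {
	fmt.Fprint(w, metricsText())
}

func main() {
	linesRead.Add(42) // stand-in for the real read loop
	fmt.Print(metricsText())
}
```

With client_golang this collapses to a single prometheus.NewCounter plus promhttp.Handler(), but the exposition format above is what Prometheus actually scrapes.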

Note: https://gerrit.wikimedia.org/r/c/operations/debs/trafficserver/+/458195 was introduced to reduce the spam but the patch was removed for the 9.2.x upgrade (T339134). So while it's more visible in 9.2.x it's still an issue with our current 9.1.x deployment.

Event Timeline

BCornwall triaged this task as Medium priority. Sep 8 2023, 6:55 PM
BCornwall moved this task from Backlog to Scheduled incidental work on the Traffic board.

Change 967293 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/software/fifo-log-demux@master] Add Prometheus metrics for fifo-log-demux

https://gerrit.wikimedia.org/r/967293

BCornwall changed the task status from Open to Stalled. Oct 31 2023, 10:31 PM

Stalled until the patch can be tested, which currently requires a non-broken WMCS traffic stack.

Change 967293 abandoned by BCornwall:

[operations/software/fifo-log-demux@master] Add Prometheus metrics for fifo-log-demux

Reason:

Migrated to gitlab

https://gerrit.wikimedia.org/r/967293

I went off on a different tangent with this problem and tried to figure out why ATS is failing writes to the notpurge log pipe in the first place. After some hours of digging (I'll spare you the endless details of temporary test configs, strace outputs of various daemons, etc.), these are the basic issues I see:

  1. The basic data flow is: ATS -> named_pipe -> fifo_log_demux -> named_unix_socket -> fifo_log_tailer -> anon_pipe -> mtail. Every single stage of this pipeline has some kind of buffering in kernel and/or application land.
  2. fifo-log-demux, as its own code commentary describes, is synchronous in nature: it's just a blocking loop over reads and writes, so if a socket reader on its output side is slow, its own reads from the named pipe slow down as well.
  3. The default kernel pipe buffer for the named pipe is 64k on our Linux hosts, which is very small relative to the size and frequency of our log messages, making it easy for routine bursts to temporarily max out the pipe buffer.
  4. mtail doesn't seem well-suited to reading high-speed data from a socket or pipe. Its design is centered on disk files, and it polls for new data at a default interval of 250ms. We can decrease the polling interval at the cost of more syscall churn, but there doesn't seem to be a parameter to make it use a true kernel event readiness mechanism like select or epoll. At the current default of 250ms, it reads from the pipe rather slowly at times!

My recommendations for how to remediate this:

  1. In the ATS logging.yaml config for the notpurge.pipe output named pipe, add the parameter pipe_buffer_size: 1048576. This is 16x larger than the 64k default and, I think, the maximum we can configure here without a non-default ulimit and/or a kernel sysctl change.
  2. In the mtail arguments deployed to /lib/systemd/system/atsmtail@backend.service, add the mtail CLI parameter -poll_interval 10ms, which is 25x quicker than the default 250ms.
  3. Amend fifo-log-demux to use an async event loop, and perhaps a larger (and configurable) application-layer buffer, to help smooth things over further in edge cases.
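For (3), one possible shape (a sketch, not the actual fifo-log-demux code) is to decouple the fifo read loop from socket writers with a buffered channel per client, dropping (and counting) lines for any client whose buffer is full, so a slow reader can never back-pressure reads from the fifo:

```go
package main

import "fmt"

// client buffers lines for one socket reader. A slow reader fills its
// channel and starts losing lines instead of stalling the fifo read loop.
type client struct {
	lines   chan string
	dropped int
}

func newClient(depth int) *client {
	return &client{lines: make(chan string, depth)}
}

// offer enqueues a line without ever blocking the caller; it returns
// false if the line had to be dropped. A writer goroutine per client
// would drain c.lines to the unix socket.
func (c *client) offer(line string) bool {
	select {
	case c.lines <- line:
		return true
	default:
		c.dropped++
		return false
	}
}

func main() {
	c := newClient(2)
	for i := 0; i < 5; i++ {
		c.offer(fmt.Sprintf("line %d", i)) // no consumer: only 2 fit
	}
	fmt.Println(len(c.lines), c.dropped) // 2 3
}
```

The dropped counter would also be a natural thing to export via the Prometheus metrics discussed above.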

I've tried (1) and (2) in isolation, and each reduced the error counts on its own, but neither was enough to eliminate the error messages entirely. With both together, at least on cp4052 as a test host (with real pooled traffic), the errors went away completely over the ~10+ minutes I was testing.

Follow-up: I ran a 3-minute test of the same pair of parameter changes on cp3066 as a higher-traffic case. No write failures were detected via strace (we don't have the error log outputs to go by in 9.1 builds). mtail CPU usage at the 10ms polling interval was significantly higher than it was in ulsfo, but still seems within reason overall and isn't saturating anything.