It seems like fifo-log-demux is losing some log lines when attached to trafficserver. ATS is logging errors like:
Sep 07 17:58:16 cp4052 trafficserver[1962]: [Sep 7 17:58:16.062] [LOG_FLUSH] ERROR: The following message was suppressed 987 times.
Sep 07 17:58:16 cp4052 trafficserver[1962]: [Sep 7 17:58:16.062] [LOG_FLUSH] ERROR: Failed to write log to /var/log/trafficserver/notpurge.pipe: [tried 869, wrote 0, Resource temporarily unavailable]
Sep 07 17:59:17 cp4052 trafficserver[1962]: [Sep 7 17:59:17.033] [LOG_FLUSH] ERROR: The following message was suppressed 1157 times.
Sep 07 17:59:17 cp4052 trafficserver[1962]: [Sep 7 17:59:17.033] [LOG_FLUSH] ERROR: Failed to write log to /var/log/trafficserver/notpurge.pipe: [tried 806, wrote 0, Resource temporarily unavailable]
Sep 07 18:00:18 cp4052 trafficserver[1962]: [Sep 7 18:00:18.431] [LOG_FLUSH] ERROR: The following message was suppressed 1030 times.
Sep 07 18:00:18 cp4052 trafficserver[1962]: [Sep 7 18:00:18.431] [LOG_FLUSH] ERROR: Failed to write log to /var/log/trafficserver/notpurge.pipe: [tried 827, wrote 0, Resource temporarily unavailable]
When fifo-log-demux fails to consume ATS logs, mtail misses some requests, which is bad for SLO tracking and for other stats derived from those logs. We should add metrics to fifo-log-demux so that we can monitor and alert on this.
Since it's a tiny Go application, embedding a Prometheus exporter that just exports a metric reporting the number of log lines read should be enough. (Prometheus Go library: https://github.com/prometheus/client_golang/)
Note: https://gerrit.wikimedia.org/r/c/operations/debs/trafficserver/+/458195 was introduced to reduce the log spam, but the patch was removed for the 9.2.x upgrade (T339134). So while the problem is more visible in 9.2.x, it's still an issue with our current 9.1.x deployment.