We are currently using mtail to deliver basic ncredir metrics. Explore using benthos instead of mtail to simplify our setup and avoid the mtail performance issues we are experiencing on bookworm.
Description
Details
Event Timeline
Change #1021485 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] ncredir,benthos: Provide benthos support on ncredir
Change #1021502 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] profile::benthos: Don't require kafka config
Testing benthos on ncredir2001 shows some concerning results (TL;DR it looks like benthos drops some messages and metrics aren't as accurate as expected).
nginx is configured to send log messages to benthos using its syslog logging feature: access_log syslog:server=127.0.0.1:1221 ncredir_syslog
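For context, the nginx side is a one-line change per server block. A minimal sketch follows; the exact `ncredir_syslog` format string lives in the puppet module and is an assumption here, but the metric labels shown below imply it carries at least the method, scheme, and status:

```nginx
# Hypothetical sketch of the ncredir_syslog log format -- the real definition
# is in operations/puppet. Field choice inferred from the metric labels.
log_format ncredir_syslog '$remote_addr $request_method $scheme $status $host "$request_uri"';

# Ship every access-log line to the local benthos syslog input over UDP.
access_log syslog:server=127.0.0.1:1221 ncredir_syslog;
```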
bast2003 acts as an HTTP client, using hey to generate requests with the following command:
./hey_linux_amd64 -disable-redirects -disable-keepalive -n=10000 -H "Host: en.wikipedia.com" https://10.192.0.131/
I ran this command 4 times, generating a grand total of 40k HTTP/2 requests between bast2003 and ncredir2001. tcpdump running on ncredir2001 shows 40k packets sent by nginx to port 1221:
vgutierrez@ncredir2001:~$ sudo tcpdump -v -i lo port 1221 -w syslog-benthos.pcap
tcpdump: listening on lo, link-type EN10MB (Ethernet), snapshot length 262144 bytes
40000 packets captured
benthos shows the following metrics:
vgutierrez@ncredir2001:/var/log/nginx$ curl http://127.0.0.1:4153/metrics -s |egrep "_count|requests_total" |grep -v "#"
buffer_latency_ns_count{label="",path="root.buffer"} 264
input_latency_ns_count{label="syslog",path="root.input"} 39987
ncredir_requests_total{label="requests_metric",method="GET",path="root.pipeline.processors.2",scheme="https",status="301"} 39987
output_latency_ns_count{label="",path="root.output"} 264
processor_latency_ns_count{label="",path="root.pipeline.processors.0"} 264
processor_latency_ns_count{label="parse_ncredir_log_format",path="root.pipeline.processors.1"} 264
processor_latency_ns_count{label="syslog_format",path="root.input.processors.0"} 39987
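Cross-checking these counters against what nginx sent (4 runs of 10k requests, i.e. 40000 syslog datagrams confirmed by tcpdump) quantifies the loss:

```shell
# Compare datagrams sent by nginx with messages counted by the benthos
# syslog input; the difference is the number of lost log lines.
sent=40000       # packets captured by tcpdump on lo:1221
received=39987   # input_latency_ns_count{label="syslog",path="root.input"}
echo "missing messages: $((sent - received))"
```

13 messages out of 40000 is a small but nonzero loss, which matches the "metrics aren't as accurate as expected" observation above.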
Benthos configuration is available here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1021485/10/modules/profile/files/benthos/instances/ncredir.yaml
@fgiunchedi / @Fabfur any suggestions on how to mitigate this?
Running another test, this time with 3x10k requests, it looks like the culprit is the socket_server UDP input, which drops packets:
processor_latency_ns_count{label="syslog_format",path="root.input.processors.0"} 29963
vgutierrez@ncredir2001:/var/log/nginx$ cat /proc/net/udp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
1330: 0100007F:04C5 00000000:0000 07 00000000:00000000 00:00000000 00000000 18837        0 64597138 2 0000000000000000 37
30000 - 37 = 29963, so the 37 datagrams dropped by the kernel account exactly for the messages missing from benthos.
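The arithmetic above can be reproduced directly from the /proc/net/udp line (port 1221 is hex 04C5 in the local_address column; the last column is the socket's drop counter):

```shell
# Extract the "drops" counter (last field) from the /proc/net/udp line for
# the benthos socket quoted above, and confirm it accounts for the deficit.
line='1330: 0100007F:04C5 00000000:0000 07 00000000:00000000 00:00000000 00000000 18837 0 64597138 2 0000000000000000 37'
drops=$(echo "$line" | awk '{print $NF}')
echo "drops=$drops accounted=$((30000 - drops))"
```

Since the loss happens in the kernel socket buffer before benthos ever reads the datagram, the usual mitigations are raising the receive buffer sizing (net.core.rmem_default / net.core.rmem_max) or moving the nginx-to-benthos transport off UDP.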
Change #1021502 merged by Vgutierrez:
[operations/puppet@production] profile::benthos: Don't require kafka config
Change #1021485 merged by Vgutierrez:
[operations/puppet@production] ncredir,benthos: Provide benthos support on ncredir
Change #1023428 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] hiera: Enable benthos on ncredir@ulsfo
Change #1023428 merged by Vgutierrez:
[operations/puppet@production] hiera: Enable benthos on ncredir@ulsfo
Change #1023430 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] fifo_log_demux: Create fifo iff ensure = present
Change #1023430 merged by Vgutierrez:
[operations/puppet@production] fifo_log_demux: Create fifo iff ensure = present
Change #1023844 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):
[operations/puppet@production] hiera: Enable benthos on ncredir@eqsin
Change #1023844 merged by Vgutierrez:
[operations/puppet@production] hiera: Enable benthos on ncredir@eqsin