replace mtail with benthos on ncredir instances
Open, Needs Triage, Public

Description

We are currently using mtail to deliver basic ncredir metrics. Explore using benthos instead of mtail to simplify our setup and to avoid certain performance issues that we are experiencing on bookworm.

Event Timeline

Change #1021485 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] ncredir,benthos: Provide benthos support on ncredir

https://gerrit.wikimedia.org/r/1021485

Change #1021502 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] profile::benthos: Don't require kafka config

https://gerrit.wikimedia.org/r/1021502

Testing benthos on ncredir2001 shows some concerning results (TL;DR: it looks like benthos drops some messages, so the metrics aren't as accurate as expected).

nginx is configured to send log messages to benthos using its syslog logging feature: access_log syslog:server=127.0.0.1:1221 ncredir_syslog

bast2003 acts as the HTTP client, using hey to generate requests with the following command:

./hey_linux_amd64 -disable-redirects -disable-keepalive -n=10000 -H "Host: en.wikipedia.com" https://10.192.0.131/

I ran this command 4 times, generating a grand total of 40k HTTP/2 requests between bast2003 and ncredir2001. tcpdump running on ncredir2001 shows 40k packets sent by nginx to port 1221:

vgutierrez@ncredir2001:~$ sudo tcpdump -v -i lo port 1221 -w syslog-benthos.pcap
tcpdump: listening on lo, link-type EN10MB (Ethernet), snapshot length 262144 bytes
40000 packets captured

benthos shows the following metrics:

vgutierrez@ncredir2001:/var/log/nginx$ curl http://127.0.0.1:4153/metrics -s |egrep "_count|requests_total" |grep -v "#"
buffer_latency_ns_count{label="",path="root.buffer"} 264
input_latency_ns_count{label="syslog",path="root.input"} 39987
ncredir_requests_total{label="requests_metric",method="GET",path="root.pipeline.processors.2",scheme="https",status="301"} 39987
output_latency_ns_count{label="",path="root.output"} 264
processor_latency_ns_count{label="",path="root.pipeline.processors.0"} 264
processor_latency_ns_count{label="parse_ncredir_log_format",path="root.pipeline.processors.1"} 264
processor_latency_ns_count{label="syslog_format",path="root.input.processors.0"} 39987
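To make the gap explicit, here is a minimal Python sketch that parses the Prometheus-style lines above and compares a counter against the number of requests sent. The helper names (`parse_metrics`, `shortfall`) are hypothetical and only for illustration, and the parser handles just the simple `name{labels} value` shape seen in this output, not the full exposition format.

```python
import re

# Matches lines such as: name{label="x",path="y"} 123
METRIC_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Two of the lines pasted above, copied verbatim.
SAMPLE = '''
input_latency_ns_count{label="syslog",path="root.input"} 39987
processor_latency_ns_count{label="syslog_format",path="root.input.processors.0"} 39987
'''


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = METRIC_RE.match(line)
        if match is None:
            continue  # not in the simple shape this sketch understands
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def shortfall(samples, name, path, expected):
    """Return expected minus the counter identified by metric name and path label."""
    for sample_name, labels, value in samples:
        if sample_name == name and labels.get("path") == path:
            return expected - int(value)
    raise KeyError(f"{name} with path={path} not found")


if __name__ == "__main__":
    missing = shortfall(parse_metrics(SAMPLE), "input_latency_ns_count", "root.input", 40000)
    print(f"messages not seen by the input: {missing}")
```

With the numbers above, 40000 sent minus 39987 counted leaves 13 messages that never reached the input.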

Benthos configuration is available here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1021485/10/modules/profile/files/benthos/instances/ncredir.yaml

@fgiunchedi / @Fabfur any suggestions on how to mitigate this?

Running another test, this time with 3x10k requests, it looks like the culprit is the socket_server UDP input, which drops packets:

processor_latency_ns_count{label="syslog_format",path="root.input.processors.0"} 29963

vgutierrez@ncredir2001:/var/log/nginx$ cat /proc/net/udp
   sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops            
 1330: 0100007F:04C5 00000000:0000 07 00000000:00000000 00:00000000 00000000 18837        0 64597138 2 0000000000000000 37

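30000 requests sent - 37 datagrams dropped at the socket = 29963, which matches the syslog_format processor count above.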
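The same check can be scripted. Below is a minimal Python sketch that reads the `drops` column of `/proc/net/udp` for a given local port; in that file the port is hex-encoded, so 1221 shows up as `04C5`. The function name `udp_drops` is hypothetical, and the sketch assumes `drops` is the last column, as in the output pasted above.

```python
# The /proc/net/udp content pasted above (header plus the benthos socket).
SAMPLE = """\
   sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
 1330: 0100007F:04C5 00000000:0000 07 00000000:00000000 00:00000000 00000000 18837        0 64597138 2 0000000000000000 37
"""


def udp_drops(proc_net_udp_text, port):
    """Sum the drops column for every UDP socket bound to the given local port."""
    wanted = f"{port:04X}"  # ports are stored as 4 hex digits, e.g. 1221 -> 04C5
    total = 0
    for line in proc_net_udp_text.splitlines():
        fields = line.split()
        # Socket rows start with a slot number such as "1330:"; the header does not.
        if len(fields) < 13 or not fields[0].endswith(":"):
            continue
        local_port = fields[1].rsplit(":", 1)[1]
        if local_port.upper() == wanted:
            total += int(fields[-1])  # assumption: drops is the last column
    return total


if __name__ == "__main__":
    sent = 30000
    drops = udp_drops(SAMPLE, 1221)
    print(f"sent={sent} drops={drops} expected_in_benthos={sent - drops}")
```

On a live host the text would come from `open("/proc/net/udp").read()` instead of `SAMPLE`. Note that the counter is cumulative for the lifetime of the socket, so it should be sampled before and after a test run.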

Change #1021502 merged by Vgutierrez:

[operations/puppet@production] profile::benthos: Don't require kafka config

https://gerrit.wikimedia.org/r/1021502

Change #1021485 merged by Vgutierrez:

[operations/puppet@production] ncredir,benthos: Provide benthos support on ncredir

https://gerrit.wikimedia.org/r/1021485

Change #1023428 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Enable benthos on ncredir@ulsfo

https://gerrit.wikimedia.org/r/1023428

Change #1023428 merged by Vgutierrez:

[operations/puppet@production] hiera: Enable benthos on ncredir@ulsfo

https://gerrit.wikimedia.org/r/1023428

Change #1023430 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] fifo_log_demux: Create fifo iff ensure = present

https://gerrit.wikimedia.org/r/1023430

Change #1023430 merged by Vgutierrez:

[operations/puppet@production] fifo_log_demux: Create fifo iff ensure = present

https://gerrit.wikimedia.org/r/1023430

Change #1023844 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Enable benthos on ncredir@eqsin

https://gerrit.wikimedia.org/r/1023844

Change #1023844 merged by Vgutierrez:

[operations/puppet@production] hiera: Enable benthos on ncredir@eqsin

https://gerrit.wikimedia.org/r/1023844