Page MenuHomePhabricator

benthos mw-accesslog-metrics kafka lag and interpolation errors
Open, Needs TriagePublic

Description

Since last week there's constant kafka lag observed for benthos@mw_accesslog_metrics.service on centrallog1002:

https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus%2Fops&orgId=1&var-topic=All&var-consumer_group=benthos-mw-accesslog-metrics&from=1717299153548&to=1718033612971

2024-06-10-175925_2071x1316_scrot.png (1×2 px, 234 KB)

(note the consumer group uses - and the systemd unit name uses _)

There's also a stream of constant errors in the journal (which might or might not be related, I'm not sure)

Jun 05 22:04:29 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_by_endpoint_duration path=root.pipeline.processors.2
Jun 05 22:04:29 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_by_endpoint_duration path=root.pipeline.processors.2
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=response_size_bytes path=root.pipeline.processors.0
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_duration path=root.pipeline.processors.1
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=response_size_bytes path=root.pipeline.processors.0
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_by_endpoint_duration path=root.pipeline.processors.2
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=response_size_bytes path=root.pipeline.processors.0
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_duration path=root.pipeline.processors.1
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_duration path=root.pipeline.processors.1
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_by_endpoint_duration path=root.pipeline.processors.2

@kamila would you mind taking a look at both of these ? thank you!

Event Timeline

I believe the errors are unrelated (they are due to T340935 and we've had bad messages before and they didn't cause the problem).

The benthos instance does not appear resource-starved based on eyeballing htop and its dashboards, and the webrequest-live instance running next to it processes 10x the amount of messages with no issue.

I have not been able to find any changes correlated with when this started happening.

Change #1050367 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] benthos/mw_accesslog_metrics: more batching

https://gerrit.wikimedia.org/r/1050367

Change #1050373 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] benthos/mw_accesslog_metrics: test for T367076

https://gerrit.wikimedia.org/r/1050373

Change #1050373 merged by Kamila Součková:

[operations/puppet@production] benthos/mw_accesslog_metrics: test for T367076

https://gerrit.wikimedia.org/r/1050373

Change #1050367 merged by Kamila Součková:

[operations/puppet@production] benthos/mw_accesslog_metrics: more batching

https://gerrit.wikimedia.org/r/1050367