
benthos mw-accesslog-metrics kafka lag and interpolation errors
Closed, Resolved · Public

Description

Since last week there has been constant Kafka consumer lag for benthos@mw_accesslog_metrics.service on centrallog1002:

https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus%2Fops&orgId=1&var-topic=All&var-consumer_group=benthos-mw-accesslog-metrics&from=1717299153548&to=1718033612971

2024-06-10-175925_2071x1316_scrot.png (screenshot of the consumer lag dashboard, 234 KB)

(note that the consumer group name uses hyphens while the systemd unit name uses underscores)

There is also a constant stream of errors in the journal (which may or may not be related):

Jun 05 22:04:29 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_by_endpoint_duration path=root.pipeline.processors.2
Jun 05 22:04:29 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_by_endpoint_duration path=root.pipeline.processors.2
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=response_size_bytes path=root.pipeline.processors.0
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_duration path=root.pipeline.processors.1
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=response_size_bytes path=root.pipeline.processors.0
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_by_endpoint_duration path=root.pipeline.processors.2
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=response_size_bytes path=root.pipeline.processors.0
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_duration path=root.pipeline.processors.1
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_duration path=root.pipeline.processors.1
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_by_endpoint_duration path=root.pipeline.processors.2
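For reference, processors 0–2 in the pipeline are the metric processors behind the response_size_bytes, requests_duration and requests_by_endpoint_duration labels. A minimal sketch of the kind of metric processor that produces this error — the interpolated field name here is a placeholder, not the actual production config:

pipeline:
  processors:
    # Illustrative metric processor. If json("latency") resolves to null
    # (e.g. for a malformed message), the value interpolation cannot yield a
    # number and Benthos logs "expected number value, got null".
    - metric:
        type: timing
        name: requests_duration
        value: '${! json("latency") }'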

@kamila would you mind taking a look at both of these? Thank you!

Event Timeline

I believe the errors are unrelated: they are due to T340935, and we have had bad messages before without them causing this kind of problem.

The benthos instance does not appear resource-starved based on eyeballing htop and its dashboards, and the webrequest-live instance running next to it processes 10x the amount of messages with no issue.

I have not been able to find any changes correlated with when this started happening.

Change #1050367 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] benthos/mw_accesslog_metrics: more batching

https://gerrit.wikimedia.org/r/1050367

Change #1050373 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] benthos/mw_accesslog_metrics: test for T367076

https://gerrit.wikimedia.org/r/1050373

Change #1050373 merged by Kamila Součková:

[operations/puppet@production] benthos/mw_accesslog_metrics: test for T367076

https://gerrit.wikimedia.org/r/1050373

Change #1050367 merged by Kamila Součková:

[operations/puppet@production] benthos/mw_accesslog_metrics: more batching

https://gerrit.wikimedia.org/r/1050367

Change #1051415 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] benthos/mw_accesslog_metrics: Add buffer

https://gerrit.wikimedia.org/r/1051415

Change #1051415 merged by Kamila Součková:

[operations/puppet@production] benthos/mw_accesslog_metrics: Add buffer

https://gerrit.wikimedia.org/r/1051415

Raine claimed this task.

Increasing the batch size slightly improved the situation, very slowly clearing the backlog, which suggests this was caused by some flavour of performance issue:

Screenshot from 2024-07-03 16-26-45.png (73 KB)
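For reference, the batching change (Change #1050367) amounts to reading and acknowledging messages from Kafka in larger batches rather than nearly one at a time, cutting per-message round trips. A rough sketch of such a config — the broker, topic and numbers are illustrative placeholders, not the deployed values:

input:
  kafka:
    addresses: [ "<kafka-broker>:9092" ]   # placeholder
    topics: [ "<topic>" ]                  # placeholder
    consumer_group: benthos-mw-accesslog-metrics
    batching:
      count: 1000   # flush after this many messages...
      period: 1s    # ...or after this long, whichever comes first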

Because I did not see any direct resource starvation on the host, my hunch was that this was something about the message delivery.

Based on this hunch, I enabled a buffer, which allows ACKs to be sent earlier in the processing pipeline at the cost of possibly losing messages (and thus undercounting in the MW access log metrics).
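The buffer change (Change #1051415) is roughly of this shape — the limit is illustrative, not the deployed value. With a memory buffer in place, messages are acknowledged to Kafka as soon as they are written to the buffer, decoupling the consumer from the slower metric processors; anything still buffered when the process stops is lost.

buffer:
  memory:
    # Messages are ACKed upstream once they land in the buffer;
    # whatever is still buffered at shutdown or crash is dropped.
    limit: 524288000   # illustrative cap, in bytes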

This made a huge difference in throughput and in clearing the backlog, while still not losing messages (aside from a few at restart):

Screenshot from 2024-07-03 16-33-33.png (351×942 px, 23 KB)
Screenshot from 2024-07-03 16-34-11.png (349×945 px, 24 KB)
Screenshot from 2024-07-03 16-46-19.png (67 KB)

With this in place, the current throughput gives us plenty of margin, so we shouldn't make Kafka unhappy again. I also don't expect to actually run into losing messages anytime soon, given that we can clear this backlog at ~3x the usual message rate.

FTR, I have reverted the buffer patch, as it shouldn't be necessary now that we have more partitions thanks to T369256, and I cannot rule it out as a possible cause for some of the "benthos is wedged" weirdness observed last week. That means we no longer know how much reserve capacity we have. In case of trouble, revert^2 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055399.