
benthos mw-accesslog-metrics kafka lag and interpolation errors
Closed, Resolved · Public

Description

Since last week there has been constant Kafka consumer lag for benthos@mw_accesslog_metrics.service on centrallog1002:

https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus%2Fops&orgId=1&var-topic=All&var-consumer_group=benthos-mw-accesslog-metrics&from=1717299153548&to=1718033612971

2024-06-10-175925_2071x1316_scrot.png (screenshot of the consumer lag dashboard, 234 KB)

(note that the consumer group name uses hyphens while the systemd unit name uses underscores)

There is also a constant stream of errors in the journal (which may or may not be related):

Jun 05 22:04:29 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_by_endpoint_duration path=root.pipeline.processors.2
Jun 05 22:04:29 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_by_endpoint_duration path=root.pipeline.processors.2
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=response_size_bytes path=root.pipeline.processors.0
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_duration path=root.pipeline.processors.1
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=response_size_bytes path=root.pipeline.processors.0
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_by_endpoint_duration path=root.pipeline.processors.2
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=response_size_bytes path=root.pipeline.processors.0
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_duration path=root.pipeline.processors.1
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_duration path=root.pipeline.processors.1
Jun 05 22:04:34 centrallog1002 benthos@mw_accesslog_metrics[926]: level=error msg="Value interpolation error: expected number value, got null" @service=benthos label=requests_by_endpoint_duration path=root.pipeline.processors.2
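For reference, processors 0–2 in the pipeline are the metric processors behind the response_size_bytes, requests_duration and requests_by_endpoint_duration labels. A minimal sketch of the kind of metric processor that produces this error — the interpolated field name here is a placeholder, not the actual production config:

pipeline:
  processors:
    # Illustrative metric processor. If json("latency") resolves to null
    # (e.g. for a malformed message), the value interpolation cannot yield a
    # number and Benthos logs "expected number value, got null".
    - metric:
        type: timing
        name: requests_duration
        value: '${! json("latency") }'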

@kamila would you mind taking a look at both of these? Thank you!

Event Timeline

I believe the errors are unrelated: they are due to T340935, and we have had bad messages before without them causing this kind of problem.

The benthos instance does not appear resource-starved based on eyeballing htop and its dashboards, and the webrequest-live instance running next to it processes 10x the amount of messages with no issue.

I have not been able to find any changes correlated with when this started happening.

Change #1050367 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] benthos/mw_accesslog_metrics: more batching

https://gerrit.wikimedia.org/r/1050367

Change #1050373 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] benthos/mw_accesslog_metrics: test for T367076

https://gerrit.wikimedia.org/r/1050373

Change #1050373 merged by Kamila Součková:

[operations/puppet@production] benthos/mw_accesslog_metrics: test for T367076

https://gerrit.wikimedia.org/r/1050373

Change #1050367 merged by Kamila Součková:

[operations/puppet@production] benthos/mw_accesslog_metrics: more batching

https://gerrit.wikimedia.org/r/1050367

Change #1051415 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] benthos/mw_accesslog_metrics: Add buffer

https://gerrit.wikimedia.org/r/1051415

Change #1051415 merged by Kamila Součková:

[operations/puppet@production] benthos/mw_accesslog_metrics: Add buffer

https://gerrit.wikimedia.org/r/1051415

Raine claimed this task.

Increasing the batch size slightly improved the situation, very slowly clearing the backlog, which suggests this was caused by some flavour of performance issue:

Screenshot from 2024-07-03 16-26-45.png (73 KB)
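For reference, the batching change (Change #1050367) amounts to reading and acknowledging messages from Kafka in larger batches rather than nearly one at a time, cutting per-message round trips. A rough sketch of such a config — the broker, topic and numbers are illustrative placeholders, not the deployed values:

input:
  kafka:
    addresses: [ "<kafka-broker>:9092" ]   # placeholder
    topics: [ "<topic>" ]                  # placeholder
    consumer_group: benthos-mw-accesslog-metrics
    batching:
      count: 1000   # flush after this many messages...
      period: 1s    # ...or after this long, whichever comes first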

Because I did not see any direct resource starvation on the host, my hunch was that this was something about the message delivery.

Based on this hunch, I enabled a buffer, which allows ACKs to be sent earlier in the processing pipeline at the cost of possibly losing messages (and thus undercounting in the MW access log metrics).
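The buffer change (Change #1051415) is roughly of this shape — the limit is illustrative, not the deployed value. With a memory buffer in place, messages are acknowledged to Kafka as soon as they are written to the buffer, decoupling the consumer from the slower metric processors; anything still buffered when the process stops is lost.

buffer:
  memory:
    # Messages are ACKed upstream once they land in the buffer;
    # whatever is still buffered at shutdown or crash is dropped.
    limit: 524288000   # illustrative cap, in bytes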

This made a huge difference in throughput and in clearing the backlog, while still not losing messages (aside from a few at restart):

Screenshot from 2024-07-03 16-33-33.png (351×942 px, 23 KB)
Screenshot from 2024-07-03 16-34-11.png (349×945 px, 24 KB)
Screenshot from 2024-07-03 16-46-19.png (67 KB)

With this in place, the current throughput gives us plenty of margin, so we shouldn't make Kafka unhappy again. I also don't expect to actually run into losing messages anytime soon, given that we can clear this backlog at ~3x the usual message rate.

FTR, I have reverted the buffer patch, as it shouldn't be necessary now that we have more partitions thanks to T369256, and I cannot rule it out as a possible cause for some of the "benthos is wedged" weirdness observed last week. That means we no longer know how much reserve capacity we have. In case of trouble, revert^2 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055399.