
MW-on-k8s traffic logs fewer errors than expected (increase in jsonTruncated)
Closed, Resolved (Public)

Description

Follow-up to T288851: Make logging work for mediawiki in k8s.

I noticed that since the increase in MW-on-k8s traffic, this dashboard has gone from generally having zero jsonTruncated entries to consistently having a trickle of them.

This makes message aggregation less useful by default, and it also means a lot of noise from non-error messages ends up here (since we don't know the channel or severity of truncated messages, we display them here out of caution).

https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors

Expanding the entries there, they do indeed all appear to come from Kubernetes pods. The messages themselves are not problematic. The ones I saw were mostly informational diagnostic messages from the MediaWiki-Rdbms component (about TransactionProfiler violations, formerly known as "DBPerformance"). I believe these are disproportionately implicated because they are the most commonly hit diagnostic messages that include a stack trace.

Looking a bit closer, I wondered why we only see these get truncated from Kubernetes. Copying a few of them and checking their length, I found that most end up as invalid JSON truncated at about 8000 characters.

Switching context to the older and rarer truncated messages from bare-metal appservers, those typically report an incomplete JSON string of about 32K characters.

I'm guessing that as part of T288851, the intake pipeline for mw-on-k8s didn't carry over some of the settings or thresholds somewhere along the way!

I further guess that this does not just affect DBPerformance warnings, but also (more importantly) fatal errors. That is problematic for deployment confidence, because Scap and Prometheus alerts rely on Logstash reporting the exception count, yet for truncated messages the host and channel would not be set, as those reside in the truncated part of the JSON blob.

Event Timeline

After a quick check, it appears we are setting

$MaxMessageSize 64k

on MediaWiki bare-metal hosts but not on Kubernetes.
Patch incoming.
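
For context, rsyslog's built-in default for maxMessageSize is 8k, which would line up with the ~8000-character truncation seen from the pods. A minimal sketch of the directive involved, assuming a plain rsyslog.conf (the actual fix below goes through the mediawiki Helm chart and Puppet rather than a hand-edited config file):

# Legacy-format directive, matching what the bare-metal hosts set
# (rsyslog's default is 8k, hence the ~8000-character cutoff on k8s):
$MaxMessageSize 64k

# Equivalent modern RainerScript form:
global(maxMessageSize="64k")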

Clement_Goubert changed the task status from Open to In Progress.Aug 3 2023, 10:44 AM
Clement_Goubert claimed this task.
Clement_Goubert triaged this task as High priority.
Clement_Goubert moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.

Change 945561 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Raise syslog max_message_size

https://gerrit.wikimedia.org/r/945561

Change 945561 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Raise syslog max_message_size

https://gerrit.wikimedia.org/r/945561

Change 945584 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mediawiki::logging: k8s syslog max message size

https://gerrit.wikimedia.org/r/945584

Change 945584 merged by Clément Goubert:

[operations/puppet@production] mediawiki::logging: k8s syslog max message size

https://gerrit.wikimedia.org/r/945584

Deployed. I'll check Logstash periodically to see if this was enough to fix the majority of cases.

Change 945589 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mediawiki::logging: MaxMessageSize unit is lowercase

https://gerrit.wikimedia.org/r/945589
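
Per the patch title, the earlier change apparently wrote the size unit in uppercase, which rsyslog does not interpret as intended; a minimal before/after sketch of the correction, assuming the same legacy-format directive as above:

# Before (unit not accepted as intended):
$MaxMessageSize 64K
# After:
$MaxMessageSize 64k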

The configuration option hasn't been picked up by the rsyslog containers in the pods, because it's a ConfigMap change. I'll need to bounce all deployments when I have a deployment-clear window.

I will update once done.

Change 945589 merged by Clément Goubert:

[operations/puppet@production] mediawiki::logging: MaxMessageSize unit is lowercase

https://gerrit.wikimedia.org/r/945589

The patch has been live for a few hours, and jsonTruncated messages from mw-on-k8s are now at the same baseline as bare metal.
Resolving; feel free to reopen if there are still issues.