
MW-on-k8s traffic logs fewer errors than expected (increase in jsonTruncated)
Closed, Resolved (Public)

Description

Follow-up to T288851: Make logging work for mediawiki in k8s.

I noticed that since the increase in MW-on-k8s traffic, this dashboard has gone from generally having zero jsonTruncated entries to consistently having a trickle of them.

This makes message aggregation less useful by default, and it also means a lot of noise from non-error messages ends up here (since we don't know the channel or severity of truncated messages, we display them here out of caution).

https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors

Expanding the entries there, they do indeed all appear to come from Kubernetes pods. The messages themselves are not problematic. The ones I saw were mostly informational diagnostic messages from the MediaWiki-Rdbms component (about TransactionProfiler violations, formerly known as "DBPerformance"). I believe these are disproportionately implicated because they are the most commonly hit diagnostic messages that include a stack trace.

Looking a bit closer, I wondered why we only see these get truncated from Kubernetes. Copying a few of them and checking their length, I found that most end up as invalid JSON truncated at about 8000 characters.

Switching context to the older and rarer truncated messages from bare-metal appservers, those typically report an incomplete JSON string of about 32K characters.

I'm guessing that as part of T288851, the intake pipeline for mw-on-k8s didn't carry over some of the settings or thresholds somewhere along the way!

I further guess that this does not just affect DBPerformance warnings, but also (more importantly) fatal errors. That is problematic for deployment confidence, because Scap and Prometheus alerts rely on Logstash reporting the exception count, yet for truncated messages the host and channel would not be set, as those reside in the truncated part of the JSON blob.

Event Timeline

After a quick check, it appears we are setting

$MaxMessageSize 64k

on MediaWiki bare-metal hosts but not on Kubernetes.
Patch incoming.
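
For context, rsyslog's built-in default for maxMessageSize is 8k, which would line up with the ~8000-character truncation seen from the pods. A minimal sketch of the directive involved, assuming a plain rsyslog.conf (the actual fix below goes through the mediawiki Helm chart and Puppet rather than a hand-edited config file):

# Legacy-format directive, matching what the bare-metal hosts set
# (rsyslog's default is 8k, hence the ~8000-character cutoff on k8s):
$MaxMessageSize 64k

# Equivalent modern RainerScript form:
global(maxMessageSize="64k")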

Clement_Goubert changed the task status from Open to In Progress.Aug 3 2023, 10:44 AM
Clement_Goubert claimed this task.
Clement_Goubert triaged this task as High priority.
Clement_Goubert moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.

Change 945561 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Raise syslog max_message_size

https://gerrit.wikimedia.org/r/945561

Change 945561 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Raise syslog max_message_size

https://gerrit.wikimedia.org/r/945561

Change 945584 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mediawiki::logging: k8s syslog max message size

https://gerrit.wikimedia.org/r/945584

Change 945584 merged by Clément Goubert:

[operations/puppet@production] mediawiki::logging: k8s syslog max message size

https://gerrit.wikimedia.org/r/945584

Deployed. I'll check Logstash periodically to see if this was enough to fix the majority of cases.

Change 945589 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mediawiki::logging: MaxMessageSize unit is lowercase

https://gerrit.wikimedia.org/r/945589
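
Per the patch title, the earlier change apparently wrote the size unit in uppercase, which rsyslog does not interpret as intended; a minimal before/after sketch of the correction, assuming the same legacy-format directive as above:

# Before (unit not accepted as intended):
$MaxMessageSize 64K
# After:
$MaxMessageSize 64k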

The configuration option hasn't been picked up by the rsyslog containers in the pods, because it's a ConfigMap change. I'll need to bounce all deployments when I have a deployment-clear window.

I will update once done.

Change 945589 merged by Clément Goubert:

[operations/puppet@production] mediawiki::logging: MaxMessageSize unit is lowercase

https://gerrit.wikimedia.org/r/945589

The patch has been live for a few hours, and jsonTruncated messages from mw-on-k8s are now at the same baseline as bare metal.
Resolving; feel free to reopen if there are still issues.