Page MenuHomePhabricator

logspam-watch doesn’t handle normalized exceptions well
Closed, ResolvedPublic

Description

DiscussionTools currently has a somewhat frequent exception (T343859) that recently had some details added:

-					throw new LogicException( "Database can't find our row and won't let us insert it" );
+					throw new NormalizedException(
+						"Database can't find our row and won't let us insert it on page {page} revision {revision}",
+						[
+							'page' => $rev->getPageId(),
+							'revision' => $rev->getId(),
+						]
+					);

On the face of it, this is a textbook use of NormalizedException. However, the result in logspam-watch looks like this:

image.png (570×1 px, 256 KB)

It looks like logspam-watch is using the unnormalized message instead of the normalized message, thus negating the benefit of message normalization: there’s a wall of similar but different messages with low hit counts, rather than a single entry with an aggregate hit count.

This might be limited to this particular exception (though I don’t know how or why) – notice that the DBTransactionSizeError in the screenshot shows a normalized message. (That class is also implemented using the NormalizedException library, but using INormalizedException + NormalizedExceptionTrait instead of the NormalizedException class…)

Event Timeline

(Not sure which tags logspam-watch belongs to, I grabbed some relevant-seeming ones from older tasks.)

The message seems to be normalized correctly in Logstash, at least:

image.png (136×911 px, 17 KB)

logspam-watch works by reading the /srv/mw-log/{exception,error}.log files which only have the final error message (no template). The reason some other messages are consolidated is that the script has some rules to handle them: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/role/files/logging/logspam.pl#131

Feel free to add additional patterns!

But the DBTransactionSizeError is properly normalized in /srv/mw-log/exception.log:

[Exception Wikimedia\Rdbms\DBTransactionSizeError] (/srv/mediawiki/php-1.41.0-wmf.26/includes/libs/rdbms/loadbalancer/LoadBalancer.php:1427) Transaction spent {time}s in writes, exceeding the 3s limit

I wonder if the different MediaWiki entry points log errors to that file differently? The “Transaction spent {time}s” messages seem to come from /w/index.php while the “Database can't find our row” ones seem to come from /rpc/RunSingleJob.php according to their stack traces.

I see what you mean. Interesting.

(At this point I agree it’s not a problem in logspam-watch, but I feel like I don’t know enough yet to retitle the task. I tried looking for more normalized exceptions in the log file but didn’t find any other examples that could support or disprove my entry point hypothesis.)

Change 959802 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] logspam.pl: Consolidate another database-related message

https://gerrit.wikimedia.org/r/959802

Change 959802 merged by Jbond:

[operations/puppet@production] logspam.pl: Consolidate another database-related message

https://gerrit.wikimedia.org/r/959802

thcipriani assigned this task to dancy.
thcipriani subscribed.

We've dealt with this particular log message. Closing this one out now.

IMHO the underlying issue (that unnormalized messages ended up in the log file) should be fixed, but it’s difficult to investigate.