Page MenuHomePhabricator

Reduce verbosity of DBReplication logs from non-debug requests
Closed, ResolvedPublicPRODUCTION ERROR

Description

T160832 - It took me not much more than 20 minutes from a server issue until that was mitigated on mediawiki level.

No problems AFAICT with the load balancer- it did its job womderfully.

However, during those 20 minutes, 5 million logs were registered, mostly at the DBReplication channel, complaining about db1094. The replication check is crazy- the architecture should be to check at most once per second (much better if it was less than that, speciall after failure. I understand the difficulty of the application server architecture to coordinate that, but I think there could be ways to mitigate that, like some kind of short-term caching or it being shared (or increase its TTL if that is already in place).

This is not the only issue- DBReplication generates an error if the lag is greater than 1 second- even if the replication check has a >1 second of error in its calculation. The threshold should be on something like 2 second notice, 5 second warning, 15 seconds error (or the amount configured on mediawiki).

Related Objects

Event Timeline

+1, as soon as one DB is slightly delayed (~10s) thousands of warnings are logged.

To confirm, which of these message(s) in the DBReplication channel is this task primarily about?

Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server {host} is not replicating?
Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode
Wikimedia\Rdbms\LoadBalancer::doWait: Timed out waiting on {host} pos {pos}

Triaging as Meta/Low-Impact because the error message is indicating a real issue outside MediaWiki, and the issue is actionable. But we'd like to let MediaWiki check/report the issue less frequently.

From a quick look, it seems to me that the Wikimedia-RDBMS library already uses caching for the result of the replication-lag checks. However, the checking and reporting of that result happens in a layer on top of that. Which means there is a fair amount of echo-ing where those that use the memorised value, still report it as their own finding.

To some extent that might be intentional, in that the error might be useful to correlate cascading problems in the sam request, using the request ID to correlate them.

But, I am not entirely sure whether that is needed for this error. If it is, then we could still change the severity level based on which of the two scenarios it is under, e.g. level=INFO vs level=WARNING.

"Using cached lag value for {db_server} due to active transaction" was logged 3,356 times in the last hour. I wonder what is the value of something that logs (with constant rate) every second, and the performance impact it may have on both requests by servers and the logging infrastructure.

Krinkle added a subscriber: aaron.

@aaron We've tuned a few of these recently, but the link from Jaime still shows 5K entries in the last hour which suggests perhaps this one could still be improved. Assuming so, I'm moving this to the small backlog to pick up at some point.

Krinkle renamed this task from DBReplication logs are too verbose to Reduce verbosity of DBReplication logs from non-debug requests.Mar 3 2019, 4:13 PM

Change 494152 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Avoid cached lag logging spam from changes list pages

https://gerrit.wikimedia.org/r/494152

Change 494152 merged by jenkins-bot:
[mediawiki/core@master] Avoid cached lag logging spam from changes list pages

https://gerrit.wikimedia.org/r/494152

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:10 PM