Reduce verbosity of DBReplication logs from non-debug requests
Closed, Resolved · Public · PRODUCTION ERROR

Assigned To

Authored By

	jcrespo
	Mar 18 2017, 6:47 PM

Description

T160832 - It took me little more than 20 minutes from the server issue until it was mitigated at the MediaWiki level.

No problems AFAICT with the load balancer; it did its job wonderfully.

However, during those 20 minutes, 5 million logs were registered, mostly on the DBReplication channel, complaining about db1094. The replication check is crazy: the architecture should check at most once per second (much better if it were less often than that, especially after a failure). I understand that the application server architecture makes this difficult to coordinate, but I think there could be ways to mitigate it, such as some kind of short-term caching, sharing the result between servers, or increasing its TTL if caching is already in place.
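As a rough illustration of the "at most once per second" idea, here is a minimal Python sketch of a per-process log throttle. It is not MediaWiki code: the class name `ThrottledLogger`, its parameters, and the message text are all hypothetical, and it only limits logging within one process. Coordinating across application servers would need a shared cache, which is not shown.

```python
import time


class ThrottledLogger:
    """Hypothetical helper: emit at most one message per key per interval."""

    def __init__(self, interval=1.0, clock=time.monotonic):
        self.interval = interval      # minimum seconds between emitted messages per key
        self.clock = clock            # injectable clock, so the behaviour can be tested
        self._last = {}               # key -> time the last message was emitted
        self._suppressed = {}         # key -> messages dropped since the last emit
        self.emitted = []             # stand-in for a real log sink

    def log(self, key, message):
        """Record the message unless one was already emitted for this key recently."""
        now = self.clock()
        last = self._last.get(key)
        if last is not None and now - last < self.interval:
            # Too soon: count the message instead of writing it out.
            self._suppressed[key] = self._suppressed.get(key, 0) + 1
            return False
        dropped = self._suppressed.pop(key, 0)
        if dropped:
            message = f"{message} ({dropped} similar messages suppressed)"
        self._last[key] = now
        self.emitted.append(message)
        return True
```

Keying on the host name (for example `db1094`) would keep one lagging replica from hiding messages about another, while the suppressed count preserves a sense of volume.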

This is not the only issue: DBReplication generates an error if the lag is greater than 1 second, even if the replication check itself has more than 1 second of error in its calculation. The thresholds should be something like notice at 2 seconds, warning at 5 seconds, and error at 15 seconds (or the amount configured in MediaWiki).
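The tiered thresholds suggested above could be sketched as follows. This is an illustrative Python snippet, not the MediaWiki implementation: the function name `lag_severity` and the `THRESHOLDS` table are hypothetical, and since Python's `logging` module has no "notice" level, `logging.INFO` stands in for it here.

```python
import logging

# (minimum lag in seconds, level), checked from most to least severe.
# The values are the ones proposed in the task description.
THRESHOLDS = [
    (15.0, logging.ERROR),
    (5.0, logging.WARNING),
    (2.0, logging.INFO),  # stand-in for "notice"
]


def lag_severity(lag_seconds):
    """Return the log level for a given lag, or None if it is below all thresholds."""
    for minimum, level in THRESHOLDS:
        if lag_seconds >= minimum:
            return level
    return None
```

With this table, a lag of 1.5 seconds produces no log entry at all, which covers the case where the measurement error alone exceeds 1 second.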

Details

	Subject	Repo	Branch	Lines +/-
	Avoid cached lag logging spam from changes list pages	mediawiki/core	master	+3 -8


Related Objects

Mentioned Here: T160832: db1094 crash

Event Timeline

jcrespo created this task. Mar 18 2017, 6:47 PM

Restricted Application added a subscriber: Aklapper. Mar 18 2017, 6:47 PM

Umherirrender edited projects, added MediaWiki-Debug-Logger; removed MediaWiki-Logevents. Mar 20 2017, 3:35 PM

Maybe even Wikimedia-production-error rather than MediaWiki-General?

Volans subscribed. Mar 21 2017, 6:59 PM

+1, as soon as one DB is slightly delayed (~10s) thousands of warnings are logged.

Krinkle edited projects, added MediaWiki-libs-Rdbms, Wikimedia-production-error, Performance-Team; removed MediaWiki-Debug-Logger, Performance Issue. Sep 14 2018, 5:41 PM

Krinkle moved this task from Untriaged to Rdbms library on the MediaWiki-libs-Rdbms board.

Krinkle moved this task from Untriaged to Mar 2021 on the Wikimedia-production-error board.

To confirm, which of these message(s) in the DBReplication channel is this task primarily about?

Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server {host} is not replicating?
Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode
Wikimedia\Rdbms\LoadBalancer::doWait: Timed out waiting on {host} pos {pos}

Triaging as Meta/Low-Impact because the error message is indicating a real issue outside MediaWiki, and the issue is actionable. But we'd like to let MediaWiki check/report the issue less frequently.

From a quick look, it seems to me that the Wikimedia-RDBMS library already caches the result of the replication-lag checks. However, the checking and reporting of that result happens in a layer on top of that. This means there is a fair amount of echoing, where callers that use the memoized value still report it as their own finding.

To some extent that might be intentional, in that the error might be useful for correlating cascading problems in the same request, using the request ID.

But, I am not entirely sure whether that is needed for this error. If it is, then we could still change the severity level based on which of the two scenarios it is under, e.g. level=INFO vs level=WARNING.
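To make the two scenarios concrete, here is a small Python sketch in which the lag lookup reports whether its value came from the cache, so the reporting layer can pick a lower severity for echoed values. Everything in it is hypothetical (`CachedLagCheck`, `probe`, `report_level`, the 1-second TTL); it is not how the Rdbms library is structured, only an illustration of the idea.

```python
import logging
import time


class CachedLagCheck:
    """Hypothetical lag lookup that remembers results for a short TTL."""

    def __init__(self, probe, ttl=1.0, clock=time.monotonic):
        self.probe = probe    # callable(host) -> lag in seconds (the expensive check)
        self.ttl = ttl        # how long a measured value is reused, in seconds
        self.clock = clock    # injectable clock, so the behaviour can be tested
        self._cache = {}      # host -> (lag, time measured)

    def get_lag(self, host):
        """Return (lag, from_cache) for the given host."""
        now = self.clock()
        hit = self._cache.get(host)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0], True
        lag = self.probe(host)
        self._cache[host] = (lag, now)
        return lag, False


def report_level(from_cache):
    """Echoed (cached) findings are logged at INFO, fresh ones at WARNING."""
    return logging.INFO if from_cache else logging.WARNING
```

Only the caller that actually performed the check would then log at WARNING; the others still leave a trace for correlation by request ID, but at a level that can be filtered out in production.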

Krinkle moved this task from Inbox, needs triage to To-do: Goals, prioritized next 4 Quarters on the Performance-Team board. Sep 17 2018, 8:11 PM

"Using cached lag value for {db_server} due to active transaction" was logged 3,356 times in the last hour. I wonder what the value is of something that logs at a constant rate of roughly once every second, and what performance impact it may have on both the requests being served and the logging infrastructure.

@aaron We've tuned a few of these recently, but the link from Jaime still shows 5K entries in the last hour, which suggests this one could still be improved. Assuming so, I'm moving this to the small backlog to pick up at some point.

Krinkle renamed this task from "DBReplication logs are too verbose" to "Reduce verbosity of DBReplication logs from non-debug requests". Mar 3 2019, 4:13 PM

Change 494152 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Avoid cached lag logging spam from changes list pages

https://gerrit.wikimedia.org/r/494152

gerritbot added a project: Patch-For-Review. Mar 4 2019, 2:33 AM

Change 494152 merged by jenkins-bot:
[mediawiki/core@master] Avoid cached lag logging spam from changes list pages

https://gerrit.wikimedia.org/r/494152

ReleaseTaggerBot added a project: MW-1.33-notes (1.33.0-wmf.21; 2019-03-12). Mar 5 2019, 4:01 PM

Krinkle closed this task as Resolved. Mar 12 2019, 9:08 PM

Krinkle assigned this task to aaron.

Krinkle removed projects: MW-1.33-notes (1.33.0-wmf.21; 2019-03-12), Patch-For-Review.

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:10 PM

Krinkle moved this task from Mar 2021 to Untriaged on the Wikimedia-production-error board. Feb 10 2021, 7:25 PM
