Page MenuHomePhabricator

Incredible amount of logs from Wikimedia\Rdbms\LoadBalancer::runPrimaryTransactionIdleCallbacks
Open, In Progress, HighPublic

Description

During connectivity issues in production, these logs are generated at incredible volumes.

June 22 2024 22:00 UTC to June 23 2024 05:30 UTC for message:"found writes pending"

CountNormalized message
18,375,432Wikimedia\Rdbms\LoadBalancer::runPrimaryTransactionIdleCallbacks: found writes pending ().
11,027,272Wikimedia\Rdbms\LoadBalancer::runPrimaryTransactionIdleCallbacks: found writes pending (MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock).
5,145,993Wikimedia\Rdbms\LoadBalancer::runPrimaryTransactionIdleCallbacks: found writes pending (CategoryMembershipChangeJob::run).
2,841,780Wikimedia\Rdbms\LoadBalancer::runPrimaryTransactionIdleCallbacks: found writes pending (MediaWiki\Linter\Database::setForPage).
2,167,911Wikimedia\Rdbms\LoadBalancer::runPrimaryTransactionIdleCallbacks: found writes pending (MediaWiki\Deferred\LinksUpdate\LinksUpdate::updateLinksTimestamp, MediaWiki\Deferred\LinksUpdate\LinksUpdate::acquirePageLock, MediaWiki\Deferred\LinksUpdate\LinksUpdate
......

Event Timeline

Change #1049228 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Reduce log severity of "found writes pending"

https://gerrit.wikimedia.org/r/1049228

aaron changed the task status from Open to In Progress.Jun 24 2024, 5:54 PM
aaron moved this task from Incoming (Needs Triage) to In Progress on the MW-Interfaces-Team board.

Change #1049261 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.43.0-wmf.10] rdbms: Reduce log severity of "found writes pending"

https://gerrit.wikimedia.org/r/1049261

Change #1049261 merged by jenkins-bot:

[mediawiki/core@wmf/1.43.0-wmf.10] rdbms: Reduce log severity of "found writes pending"

https://gerrit.wikimedia.org/r/1049261

Change #1049228 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Reduce log severity of "found writes pending"

https://gerrit.wikimedia.org/r/1049228

Change #1049909 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.43.0-wmf.11] rdbms: Reduce log severity of "found writes pending"

https://gerrit.wikimedia.org/r/1049909

Change #1049909 merged by jenkins-bot:

[mediawiki/core@wmf/1.43.0-wmf.11] rdbms: Reduce log severity of "found writes pending"

https://gerrit.wikimedia.org/r/1049909

I've been trying to debug and demystify what caused this. Here are some preliminary notes:

FTR, after https://gerrit.wikimedia.org/r/c/operations/puppet/+/1048863 we dropped 99% of "found writes pending" logs randomly as they appeared from Kafka. I have deleted no ingested logs.

According to the Kafka topic, we got around 560,778 logs per second (src). This doesn't specifically select just "found writes pending" logs, but the udp_localhost-warning topic doesn't get a lot of traffic.

Given that 99% of the logs were dropped, it mean each job triggered 1000 "writes pending found" log. The only explanation is that it fell into an infinite loop when databases got lagged or slow. That part is not fixed yet and should be investigated further.

Change #1051229 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: cleanup transaction write query tracking and reporting

https://gerrit.wikimedia.org/r/1051229

Change #1064410 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: only mention write queries in "writes pending" warnings

https://gerrit.wikimedia.org/r/1064410

Change #1064410 merged by jenkins-bot:

[mediawiki/core@master] rdbms: only mention write queries in "writes pending" warnings

https://gerrit.wikimedia.org/r/1064410

@Ladsgroup -- do you need anything else from the MWI team on this? Or are you taking it from here? If the later, can we remove our team tag?

@Ladsgroup -- do you need anything else from the MWI team on this? Or are you taking it from here? If the later, can we remove our team tag?

We fixed the issue by reducing the log severity level but it was very likely that it ended up in an infinite loop until it got killed. And that was not fully investigated. So far we haven't seen it again, so maybe let's close and then reopen if it happens again.

That seems reasonable. @aaron -- any concerns with closing for now?

That seems reasonable. @aaron -- any concerns with closing for now?

I suppose we could try enabling the logging again and see if it re-occurs.

Change #1078474 had a related patch set uploaded (by Paladox; author: Amir Sarabadani):

[mediawiki/core@REL1_42] rdbms: Reduce log severity of "found writes pending"

https://gerrit.wikimedia.org/r/1078474

Change #1078475 had a related patch set uploaded (by Paladox; author: Aaron Schulz):

[mediawiki/core@REL1_42] rdbms: only mention write queries in "writes pending" warnings

https://gerrit.wikimedia.org/r/1078475

Change #1078474 merged by jenkins-bot:

[mediawiki/core@REL1_42] rdbms: Reduce log severity of "found writes pending"

https://gerrit.wikimedia.org/r/1078474

Change #1078475 merged by jenkins-bot:

[mediawiki/core@REL1_42] rdbms: only mention write queries in "writes pending" warnings

https://gerrit.wikimedia.org/r/1078475

That seems reasonable. @aaron -- any concerns with closing for now?

I suppose we could try enabling the logging again and see if it re-occurs.

That brought down all wikis multiple times, I don't think it's safe to revert it and see if all wikis go down. You can do stat or something else instead.

aaron lowered the priority of this task from High to Medium.Nov 7 2024, 4:48 PM
aaron moved this task from In Progress to Backlog (Triaged and Ready) on the MW-Interfaces-Team board.
daniel raised the priority of this task from Medium to High.Apr 4 2025, 7:27 AM

Change #1137040 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] rdbms: add StatsLib counter for danging writes during post-transaction callbacks

https://gerrit.wikimedia.org/r/1137040

Change #1137040 merged by jenkins-bot:

[mediawiki/core@master] rdbms: add StatsLib counter for danging writes during post-transaction callbacks

https://gerrit.wikimedia.org/r/1137040