Page MenuHomePhabricator

Wikimedia\Rdbms\LoadBalancer::{closure}: found writes pending
Closed, ResolvedPublic

Description

High increase of Wikimedia\Rdbms\LoadBalancer::{closure}: found writes pending errors. Probably related to train deployment: starting on 28 March, but getting higher on the 2 Apr . They seem to come mostly from the new job queue (although that could be a coincidence). Could also be transaction handling- difficult to say at this point.

https://logstash.wikimedia.org/goto/8263398d71a6ca491e21cecee900e425

Event Timeline

jcrespo created this task.Apr 3 2018, 7:40 AM
Restricted Application added a project: Analytics. · View Herald TranscriptApr 3 2018, 7:40 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
jcrespo updated the task description. (Show Details)Apr 3 2018, 11:23 AM
Pchelolo edited projects, added Services (watching); removed Services.EditedApr 3 2018, 1:55 PM
Pchelolo added a subscriber: Pchelolo.

Looking at the logs there're 2 sources of these:

  1. First one with url /rpc/RunSingleJob.php is from the new queue, and these started on Apr 02 after 21:00 UTC.

At 20:59 UTC that day there's a log entry

20:59 twentyafterfour: MediaWiki Train: rolling back to 1.31.0-wmf.26 refs T183966, T190960

And subsequent deploys back to wmf.27, so the correlation here is completely clear.

  1. One coming from the old queue mostly from the LocalRenameUserJob - these accounts for roughly 20% of the messages and there's a big increase in these correlates with a rollback as well.

So I believe this one is related to T190960 and rolling out of wmf.27, however afterward rMWceb7d61ee7ef3edc6705abd41ec86b3afcd9c491 was reverted and the train was rolled forward, the rate of the logs increased dramatically, which suggest the the rever didn't make it better, but only worse.

Also, this first ever started happening with a lower rate on Mar 28 19:07 which correlates exactly with:

twentyafterfour@tin: Synchronized php: group1 wikis to 1.31.0-wmf.27 (duration: 01m 17s)

Sorry to add Services, but you would be able, as you just did, to do a better triage than I did.

I think it's time to roll-back and abandon 1.31.0-wmf.27. I really don't see how 1.31.0-wmf.28 will be any better though.

Edited https://phabricator.wikimedia.org/T191282#4100576 to be more precise on the timings of the events.

Nuria moved this task from Incoming to Deprioritized on the Analytics board.Apr 5 2018, 4:45 PM
Nuria moved this task from Deprioritized to Radar on the Analytics board.
Tgr triaged this task as High priority.May 1 2018, 1:39 PM
Tgr added a subscriber: Tgr.

"High" at the point of filing this task was 100/hour. Now it's a something like 400,000/hour. This seems to have started with the previous train deployment.

This might have affected the time spent in MediaWiki::preOutputCommit, see T193417: ReadingLists performance degradation.

Tgr added a comment.May 1 2018, 1:42 PM

Almost all seem to come from the job queue, unfortunately I don't think the job name is recorded.

Almost all seem to come from the job queue, unfortunately, I don't think the job name is recorded.

Ye, that's not great. We include the job name into messages produced by the job queue itself, but we don't for messages produced downstream the code.

Tgr added a comment.May 1 2018, 3:54 PM

(T142313: Add global logging context is the task for adding context information to logs.)

Change 430093 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: improve log warnings in runMasterPostTrxCallbacks()

https://gerrit.wikimedia.org/r/430093

Change 430117 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.32.0-wmf.1] rdbms: improve log warnings in runMasterPostTrxCallbacks()

https://gerrit.wikimedia.org/r/430117

Change 430093 merged by jenkins-bot:
[mediawiki/core@master] rdbms: improve log warnings in runMasterPostTrxCallbacks()

https://gerrit.wikimedia.org/r/430093

Change 430117 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.1] rdbms: improve log warnings in runMasterPostTrxCallbacks()

https://gerrit.wikimedia.org/r/430117

Change 430273 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Use AutoCommitUpdate in LinksUpdate::doUpdate

https://gerrit.wikimedia.org/r/430273

Most of the warnings (I suppose the logging was changed to understand which component relates to) seem to be related to wikibase client jobs right now:

https://logstash.wikimedia.org/app/kibana#/dashboard/DBQuery

Change 430273 merged by jenkins-bot:
[mediawiki/core@master] Use AutoCommitUpdate in LinksUpdate::doUpdate

https://gerrit.wikimedia.org/r/430273

Joe added a subscriber: Joe.May 7 2018, 4:45 PM

Most if not all of these seem to have gone since yesterday's train (need more time to check if absolutly all are gone or just the bulk of it).

jcrespo rescinded a token.May 11 2018, 12:13 PM
jcrespo awarded a token.

@jcrespo: should we close this then? Have you been able to verify that all are gone?

jcrespo lowered the priority of this task from High to Normal.May 15 2018, 4:29 PM

@mmodell Most of the errors are gone, but some are still happening. I think this is no longer high, but if it has an easy fix, those should be looked at by the code owners:

https://logstash.wikimedia.org/goto/dd624348a6d23ce0776f4cc66a0eac7e

Restricted Application added a project: Analytics. · View Herald TranscriptJul 11 2018, 3:20 AM
Krinkle moved this task from Meta to Untriaged on the WMF-JobQueue board.Jul 12 2018, 10:54 PM
Krinkle closed this task as Resolved.Sep 5 2018, 3:06 AM
Krinkle claimed this task.
Krinkle added a subscriber: Krinkle.

There are indeed a few hits still, but this has become an tracking task for a category of issues, which doesn't seem useful.

The found writes pending error is a symptom of an underlying problem. The only way to make the error go away permanently is to remove MediaWiki's ability to detect the issue.

Many specific problems in core and in various extensions that are detected by this warning have been mentioned here and fixed. I'll close this for now in favour of opening new tasks for remaining/future instances of this problem.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:09 PM