commitMasterChanges()
Closed, DuplicatePublicPRODUCTION ERROR
Actions

Assigned To

None

Authored By

	hashar
	Jul 18 2019, 3:23 PM

Description

Error

Request URL:
Request ID: INSERT_ID

message

[{exception_id}] {exception_url}   ErrorException from line 0 of : PHP Warning: Destructor threw an object exception: exception 'Wikimedia\Rdbms\DBTransactionError' with message 'Wikimedia\Rdbms\LBFactory::shutdown: transaction round 'MWCallableUpdate::do

trace

PHP Warning: Destructor threw an object exception: exception 'Wikimedia\Rdbms\DBTransactionError' with message 'Wikimedia\Rdbms\LBFactory::shutdown: transaction round 'MWCallableUpdate::doUpdate' still running' in /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php:274
Stack trace:
#0 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php(190): Wikimedia\Rdbms\LBFactory->commitMasterChanges()
#1 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php(163): Wikimedia\Rdbms\LBFactory->shutdown()
#2 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php(766): Wikimedia\Rdbms\LBFactory->destroy()
#3 (): Wikimedia\Rdbms\LBFactory->__destruct()
#4 {main}

Impact

No idea!?

Wikimedia\Rdbms\LBFactory->__destruct()
Wikimedia\Rdbms\LBFactory->commitMasterChanges()

Blocks somehow or take more than 200 seconds.

The request is killed.

Notes

Largely noticeable since we have promoted all wikis to 1.34.0-wmf.14 at 14:37 UTC

Screenshot 2019-07-18 at 19.19.21.png (752×2 px, 176 KB)

Details

	Subject	Repo	Branch	Lines +/-
	Revert "all wikis to 1.34.0-wmf.14"	operations/mediawiki-config	master	+299 -299

Customize query in gerrit

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Resolved	Release	• LarsWirzenius	T220739 1.34.0-wmf.14 deployment blockers
		Duplicate	PRODUCTION ERROR	None	T228436 web request timeout after 200 seconds due to Wikimedia\Rdbms\LBFactory->__destruct() > Wikimedia\Rdbms\LBFactory->commitMasterChanges()

Event Timeline

hashar created this task.Jul 18 2019, 3:23 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 18 2019, 3:23 PM

hashar triaged this task as Unbreak Now! priority.Jul 18 2019, 3:24 PM

Restricted Application added a subscriber: Liuxinyu970226. · View Herald TranscriptJul 18 2019, 3:24 PM

hashar added a parent task: T220739: 1.34.0-wmf.14 deployment blockers.Jul 18 2019, 3:24 PM

hashar added a subscriber: • LarsWirzenius.

I was also reported high "lag" between edits and them showing into recentchanges on at least cswiki, but databases had virtually no lag at the moment. This could be related- transaction taking too long to commit.

jijiki updated the task description. (Show Details)Jul 18 2019, 3:28 PM

That happens mostly on enwiki. Over one hour 4k occurrences out of 4400 total matches.

I can see an abnormal number of transactions on the enwiki master in Sleep state with 195+ seconds of connection time. This is not normal. The errors may be the watchdog killing connections to prevent a larger issue.

I saw a few:

MediaWiki\Storage\PageEditStash::getAndWaitForStashValue X.X.X.X */ GET_LOCK('0ada9c

But that is a guess, maybe that is normal (no confirmation that is the root cause).

Krinkle added projects: MediaWiki-libs-Rdbms, Performance-Team.Jul 18 2019, 3:32 PM

Krinkle moved this task from Untriaged to Jul2019/1.34.wmf.13+ on the Wikimedia-production-error board.

https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1067&var-port=9104&from=1563453267456&to=1563464067457&panelId=37&fullscreen

Mentioned in SAL (#wikimedia-operations) [2019-07-18T15:40:39Z] <hashar@deploy1001> rebuilt and synchronized wikiversions files: Revert group2 wikis to 1.34.0-wmf.13 # T228436 T220739

Stashbot mentioned this in T220739: 1.34.0-wmf.14 deployment blockers.Jul 18 2019, 3:40 PM

Change 524252 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/mediawiki-config@master] Revert "all wikis to 1.34.0-wmf.14"

https://gerrit.wikimedia.org/r/524252

Change 524252 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "all wikis to 1.34.0-wmf.14"

https://gerrit.wikimedia.org/r/524252

See immediate effect: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1067&var-port=9104&from=1563449050608&to=1563468250610&panelId=37&fullscreen

In T228436#5346211, @jcrespo wrote:

I was also reported high "lag" between edits and them showing into recentchanges on at least cswiki, but databases had virtually no lag at the moment. This could be related- transaction taking too long to commit.

Indeed the rollback got rid of the spam of errors. To be investigated now is what kind of db transactions are taking a while in Wikimedia\Rdbms\LBFactory->commitMasterChanges() :-\ There might be informations in logstash log if MediaWiki logs anything for databases.

Maintenance_bot removed a project: Patch-For-Review.Jul 18 2019, 4:10 PM

Alternate stack trace, possibly the same issue:

trace (load.php)

PHP Warning: Destructor threw an object exception: exception 'Wikimedia\Rdbms\DBUnexpectedError' with message 'DB connection was already closed' in /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/database/Database.php:966
Stack trace:
#0 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/database/Database.php(1189): Wikimedia\Rdbms\Database->assertHasConnectionHandle()
#1 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/database/Database.php(1158): Wikimedia\Rdbms\Database->executeQuery()
#2 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/database/Database.php(4282): Wikimedia\Rdbms\Database->query()
#3 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/database/Database.php(687): Wikimedia\Rdbms\Database->ping()
#4 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/database/Database.php(677): Wikimedia\Rdbms\Database->pingAndCalculateLastTrxApplyTime()
#5 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1568): Wikimedia\Rdbms\Database->pendingWriteQueryDuration()
#6 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/loadbalancer/LoadBalancer.php(2016): Closure$Wikimedia\Rdbms\LoadBalancer::approveMasterChanges()
#7 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1584): Wikimedia\Rdbms\LoadBalancer->forEachOpenMasterConnection()
#8 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php(230): Wikimedia\Rdbms\LoadBalancer->approveMasterChanges()
#9 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(368): Closure$Wikimedia\Rdbms\LBFactory::forEachLBCallMethod()
#10 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php(233): Wikimedia\Rdbms\LBFactoryMulti->forEachLB()
#11 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php(287): Wikimedia\Rdbms\LBFactory->forEachLBCallMethod()
#12 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php(190): Wikimedia\Rdbms\LBFactory->commitMasterChanges()
#13 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php(163): Wikimedia\Rdbms\LBFactory->shutdown()
#14 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php(766): Wikimedia\Rdbms\LBFactory->destroy()
#15 (): Wikimedia\Rdbms\LBFactory->__destruct()
#16 {main}

Krinkle updated the task description. (Show Details)Jul 18 2019, 6:19 PM

When will the patch be available?

@Agusbou2015 The live software has been reverted away from this issue. We are now working to fix it before re-upgrading the wikis. Is there a problem you are still experiencing as result of this issue? If so, please contribute by describing it. If not, please avoid interruption that doesn't help us closer to the solution.

Krinkle moved this task from Untriaged to Rdbms library on the MediaWiki-libs-Rdbms board.Jul 18 2019, 7:50 PM

The timeout aspect seems strange. The huge "idle" time increase at https://grafana.wikimedia.org/d/000000273/mysql sounds like the PageEditStash::parseAndCache() has an infinite timeout instead of 0 seconds (bug, it should be 0 as in non-blocking) and the parsing may have been slowed down for some reason, making more threads wait on the lock. Maybe the concurrent nutcracker issues were also affecting mcrouter (since the same hosts are used). Could also be something adding memcached write load: https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&from=1563458818482&to=1563464680644 looks a little unusual, though not unlike the result of key version changes that happen from release to release (including the slow return to normal set() rate).

I don't see anything in the MW DB logs or hourly flamegraphs worthy of note.

The timeouts correspond with the redis problems:

PHP Fatal Error: entire web request took longer than 200 seconds and timed out
#0 [internal function]: Redis->sockReadLine()
#1 [internal function]: Redis->sockReadData(NULL)
#2 [internal function]: Redis->processBooleanResponse()
#3 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/redis/RedisConnRef.php(131): Redis->watch(string)
#4 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/redis/RedisConnRef.php(117): RedisConnRef->tryCall(string, array)
#5 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/objectcache/RedisBagOStuff.php(292): RedisConnRef->__call(string, array)
#6 /srv/mediawiki/php-1.34.0-wmf.14/extensions/AbuseFilter/includes/AbuseFilter.php(768): RedisBagOStuff->incr(string)
#7 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/MWCallableUpdate.php(38): Closure$AbuseFilter::recordProfilingResult()
#8 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/DeferredUpdates.php(372): MWCallableUpdate->doUpdate()
#9 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/DeferredUpdates.php(279): DeferredUpdates::attemptUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#10 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/DeferredUpdates.php(232): DeferredUpdates::run(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti, Monolog\Logger, BufferingStatsdDataFactory, string)
#11 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/DeferredUpdates.php(148): DeferredUpdates::handleUpdateQueue(array, string, integer)
#12 /srv/mediawiki/php-1.34.0-wmf.14/includes/MediaWiki.php(926): DeferredUpdates::doUpdates(string)
#13 /srv/mediawiki/php-1.34.0-wmf.14/includes/MediaWiki.php(750): MediaWiki->restInPeace(string, boolean)
#14 [internal function]: Closure$MediaWiki::doPostOutputShutdown()
#15 {main}

aaron closed this task as a duplicate of T225103: LBFactory destructor causes unexpected exception at shutdown.Jul 18 2019, 8:15 PM

Restricted Application removed a subscriber: Liuxinyu970226. · View Herald TranscriptJul 18 2019, 8:15 PM

The redis bug is at T228303

The resolution is unclear to me. We have filled this task due to the requests timeout that occurred immediately after promoting all wikis to 1.34.0-wmf.14 but:

the task got marked as a duplicate of T225103 which seems to be just about unexpected exception which does not seem to relate to the

We had some redis bug via T228303

But I don't see how they relate to Wikimedia\Rdbms\LBFactory->commitMasterChanges() taking too long and timing out after 200 seconds. Then it is not like I know anything about MediaWiki internals nowadays :-]

In short, it is unclear to me whether this task is actually solved and whether we can continue with the train.

The LBFactory errors themselves are not new but are normally very rare. That issue itself is tracked at the older T225103 and is rare, not new, and not train blocking.

@hashar We believe that the new timeouts and increased frequency LBFactory shutdown-errors happened as side-effect from the Redis issue (T228303), which is now solved.

Whether the Redis issue really was the root cause, we can only find out by re-trying the group2 promotion and watching carefully.

Thank you @Krinkle :]

Also, the timeout exceptions themselves where redis, not LBFactory. The later seemed to just have errors related to the improper shutdown.

• mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:05 PM

hashar unsubscribed.Oct 7 2019, 12:32 PM

	F29801372: Screenshot 2019-07-18 at 19.19.21.png
	Jul 18 2019, 6:19 PM

web request timeout after 200 seconds due to Wikimedia\Rdbms\LBFactory->__destruct() > Wikimedia\Rdbms\LBFactory->commitMasterChanges()Closed, DuplicatePublicPRODUCTION ERRORActions