Page MenuHomePhabricator

web request timeout after 200 seconds due to Wikimedia\Rdbms\LBFactory->__destruct() > Wikimedia\Rdbms\LBFactory->commitMasterChanges()
Closed, DuplicatePublic

Description

Error

Request URL:
Request ID: INSERT_ID

message
[{exception_id}] {exception_url}   ErrorException from line 0 of : PHP Warning: Destructor threw an object exception: exception 'Wikimedia\Rdbms\DBTransactionError' with message 'Wikimedia\Rdbms\LBFactory::shutdown: transaction round 'MWCallableUpdate::do
trace
PHP Warning: Destructor threw an object exception: exception 'Wikimedia\Rdbms\DBTransactionError' with message 'Wikimedia\Rdbms\LBFactory::shutdown: transaction round 'MWCallableUpdate::doUpdate' still running' in /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php:274
Stack trace:
#0 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php(190): Wikimedia\Rdbms\LBFactory->commitMasterChanges()
#1 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php(163): Wikimedia\Rdbms\LBFactory->shutdown()
#2 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php(766): Wikimedia\Rdbms\LBFactory->destroy()
#3 (): Wikimedia\Rdbms\LBFactory->__destruct()
#4 {main}

Impact

No idea!?

Wikimedia\Rdbms\LBFactory->__destruct()
Wikimedia\Rdbms\LBFactory->commitMasterChanges()

Blocks somehow or take more than 200 seconds.

The request is killed.

Notes

Largely noticeable since we have promoted all wikis to 1.34.0-wmf.14 at 14:37 UTC

Event Timeline

hashar created this task.Jul 18 2019, 3:23 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 18 2019, 3:23 PM
hashar triaged this task as Unbreak Now! priority.Jul 18 2019, 3:24 PM
Restricted Application added a subscriber: Liuxinyu970226. · View Herald TranscriptJul 18 2019, 3:24 PM

I was also reported high "lag" between edits and them showing into recentchanges on at least cswiki, but databases had virtually no lag at the moment. This could be related- transaction taking too long to commit.

jijiki updated the task description. (Show Details)Jul 18 2019, 3:28 PM

That happens mostly on enwiki. Over one hour 4k occurrences out of 4400 total matches.

I can see an abnormal number of transactions on the enwiki master in Sleep state with 195+ seconds of connection time. This is not normal. The errors may be the watchdog killing connections to prevent a larger issue.

I saw a few:

MediaWiki\Storage\PageEditStash::getAndWaitForStashValue X.X.X.X */ GET_LOCK('0ada9c

But that is a guess, maybe that is normal (no confirmation that is the root cause).

Mentioned in SAL (#wikimedia-operations) [2019-07-18T15:40:39Z] <hashar@deploy1001> rebuilt and synchronized wikiversions files: Revert group2 wikis to 1.34.0-wmf.13 # T228436 T220739

Change 524252 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/mediawiki-config@master] Revert "all wikis to 1.34.0-wmf.14"

https://gerrit.wikimedia.org/r/524252

Change 524252 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "all wikis to 1.34.0-wmf.14"

https://gerrit.wikimedia.org/r/524252

I was also reported high "lag" between edits and them showing into recentchanges on at least cswiki, but databases had virtually no lag at the moment. This could be related- transaction taking too long to commit.

See also T228343: Huge delay for RecentChanges in test.wikidata

Indeed the rollback got rid of the spam of errors. To be investigated now is what kind of db transactions are taking a while in Wikimedia\Rdbms\LBFactory->commitMasterChanges() :-\ There might be informations in logstash log if MediaWiki logs anything for databases.

Krinkle added a subscriber: Krinkle.EditedJul 18 2019, 6:18 PM


Alternate stack trace, possibly the same issue:

trace (load.php)
PHP Warning: Destructor threw an object exception: exception 'Wikimedia\Rdbms\DBUnexpectedError' with message 'DB connection was already closed' in /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/database/Database.php:966
Stack trace:
#0 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/database/Database.php(1189): Wikimedia\Rdbms\Database->assertHasConnectionHandle()
#1 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/database/Database.php(1158): Wikimedia\Rdbms\Database->executeQuery()
#2 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/database/Database.php(4282): Wikimedia\Rdbms\Database->query()
#3 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/database/Database.php(687): Wikimedia\Rdbms\Database->ping()
#4 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/database/Database.php(677): Wikimedia\Rdbms\Database->pingAndCalculateLastTrxApplyTime()
#5 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1568): Wikimedia\Rdbms\Database->pendingWriteQueryDuration()
#6 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/loadbalancer/LoadBalancer.php(2016): Closure$Wikimedia\Rdbms\LoadBalancer::approveMasterChanges()
#7 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1584): Wikimedia\Rdbms\LoadBalancer->forEachOpenMasterConnection()
#8 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php(230): Wikimedia\Rdbms\LoadBalancer->approveMasterChanges()
#9 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(368): Closure$Wikimedia\Rdbms\LBFactory::forEachLBCallMethod()
#10 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php(233): Wikimedia\Rdbms\LBFactoryMulti->forEachLB()
#11 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php(287): Wikimedia\Rdbms\LBFactory->forEachLBCallMethod()
#12 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php(190): Wikimedia\Rdbms\LBFactory->commitMasterChanges()
#13 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php(163): Wikimedia\Rdbms\LBFactory->shutdown()
#14 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/rdbms/lbfactory/LBFactory.php(766): Wikimedia\Rdbms\LBFactory->destroy()
#15 (): Wikimedia\Rdbms\LBFactory->__destruct()
#16 {main}
Krinkle updated the task description. (Show Details)Jul 18 2019, 6:19 PM

When will the patch be available?

@Agusbou2015 The live software has been reverted away from this issue. We are now working to fix it before re-upgrading the wikis. Is there a problem you are still experiencing as result of this issue? If so, please contribute by describing it. If not, please avoid interruption that doesn't help us closer to the solution.

aaron added a subscriber: aaron.Jul 18 2019, 8:09 PM

The timeout aspect seems strange. The huge "idle" time increase at https://grafana.wikimedia.org/d/000000273/mysql sounds like the PageEditStash::parseAndCache() has an infinite timeout instead of 0 seconds (bug, it should be 0 as in non-blocking) and the parsing may have been slowed down for some reason, making more threads wait on the lock. Maybe the concurrent nutcracker issues were also affecting mcrouter (since the same hosts are used). Could also be something adding memcached write load: https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&from=1563458818482&to=1563464680644 looks a little unusual, though not unlike the result of key version changes that happen from release to release (including the slow return to normal set() rate).

I don't see anything in the MW DB logs or hourly flamegraphs worthy of note.

aaron added a comment.Jul 18 2019, 8:13 PM

The timeouts correspond with the redis problems:

PHP Fatal Error: entire web request took longer than 200 seconds and timed out
#0 [internal function]: Redis->sockReadLine()
#1 [internal function]: Redis->sockReadData(NULL)
#2 [internal function]: Redis->processBooleanResponse()
#3 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/redis/RedisConnRef.php(131): Redis->watch(string)
#4 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/redis/RedisConnRef.php(117): RedisConnRef->tryCall(string, array)
#5 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/objectcache/RedisBagOStuff.php(292): RedisConnRef->__call(string, array)
#6 /srv/mediawiki/php-1.34.0-wmf.14/extensions/AbuseFilter/includes/AbuseFilter.php(768): RedisBagOStuff->incr(string)
#7 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/MWCallableUpdate.php(38): Closure$AbuseFilter::recordProfilingResult()
#8 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/DeferredUpdates.php(372): MWCallableUpdate->doUpdate()
#9 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/DeferredUpdates.php(279): DeferredUpdates::attemptUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#10 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/DeferredUpdates.php(232): DeferredUpdates::run(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti, Monolog\Logger, BufferingStatsdDataFactory, string)
#11 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/DeferredUpdates.php(148): DeferredUpdates::handleUpdateQueue(array, string, integer)
#12 /srv/mediawiki/php-1.34.0-wmf.14/includes/MediaWiki.php(926): DeferredUpdates::doUpdates(string)
#13 /srv/mediawiki/php-1.34.0-wmf.14/includes/MediaWiki.php(750): MediaWiki->restInPeace(string, boolean)
#14 [internal function]: Closure$MediaWiki::doPostOutputShutdown()
#15 {main}
Restricted Application removed a subscriber: Liuxinyu970226. · View Herald TranscriptJul 18 2019, 8:15 PM
aaron added a comment.Jul 18 2019, 8:16 PM

The redis bug is at T228303

The resolution is unclear to me. We have filled this task due to the requests timeout that occurred immediately after promoting all wikis to 1.34.0-wmf.14 but:

  • the task got marked as a duplicate of T225103 which seems to be just about unexpected exception which does not seem to relate to the
  • We had some redis bug via T228303

But I don't see how they relate to Wikimedia\Rdbms\LBFactory->commitMasterChanges() taking too long and timing out after 200 seconds. Then it is not like I know anything about MediaWiki internals nowadays :-]

In short, it is unclear to me whether this task is actually solved and whether we can continue with the train.

Krinkle added a comment.EditedJul 19 2019, 2:08 PM

The LBFactory errors themselves are not new but are normally very rare. That issue itself is tracked at the older T225103 and is rare, not new, and not train blocking.

@hashar We believe that the new timeouts and increased frequency LBFactory shutdown-errors happened as side-effect from the Redis issue (T228303), which is now solved.

Whether the Redis issue really was the root cause, we can only find out by re-trying the group2 promotion and watching carefully.

aaron added a comment.Jul 19 2019, 7:38 PM

Also, the timeout exceptions themselves where redis, not LBFactory. The later seemed to just have errors related to the improper shutdown.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:05 PM
hashar removed a subscriber: hashar.Mon, Oct 7, 12:32 PM