Page MenuHomePhabricator

PHP-timed out requests also emit LoadBalancer::destruct error "you can't run this command now: COMMIT"
Closed, ResolvedPublic

Description

There is a low, but significant number of mediawiki errors on query execution with the above error:

Wikimedia\Rdbms\Database::close	10.192.32.168	2014	Commands out of sync; you can't run this command now (10.192.32.168)	COMMIT

e.g.:

{
  "_index": "logstash-2018.09.14",
  "_type": "mediawiki",
  "_id": "AWXX_3jKo1GdFyIETwgn",
  "_version": 1,
  "_score": null,
  "_source": {
    "server": "en.wikipedia.org",
    "db_server": "10.192.32.168",
    "wiki": "enwiki",
    "channel": "DBQuery",
    "type": "mediawiki",
    "error": "Commands out of sync; you can't run this command now (10.192.32.168)",
    "http_method": "GET",
    "@version": 1,
    "host": "mw2274",
    "shard": "s1",
    "sql1line": "COMMIT",
    "fname": "Wikimedia\\Rdbms\\Database::close",
    "errno": 2014,
    "unique_id": "W5ulhArAMGAAALdEtv8AAAHQ",
    "method": "Wikimedia\\Rdbms\\Database::makeQueryException",
    "level": "ERROR",
    "ip": "10.192.16.136",
    "mwversion": "1.32.0-wmf.20",
    "message": "Wikimedia\\Rdbms\\Database::close\t10.192.32.168\t2014\tCommands out of sync; you can't run this command now (10.192.32.168)\tCOMMIT",
    "normalized_message": "{fname}\t{db_server}\t{errno}\t{error}\t{sql1line}",
    "url": "/w/index.php?oldid=848654409&redirect=no",
    "tags": [
      "syslog",
      "es",
      "es"
    ],
    "reqId": "W5ulhArAMGAAALdEtv8AAAHQ",
    "referrer": null,
    "@timestamp": "2018-09-14T12:12:48.000Z",
    "db_name": "centralauth",
    "db_user": "wikiuser"
  },
  "fields": {
    "@timestamp": [
      1536927168000
    ]
  },
  "sort": [
    1536927168000
  ]
}

My guess is that a COMMIT is tried in a non-started transaction (or it is, but connection is lost) See below T204346#4584159 for a proper analysis.

The commands seem to happen (at least) on the following url:

https://en.wikipedia.org/w/index.php?oldid=854160404

And indeed that doesn't seem to load for me.

First occurrence seemed to happen on 2018-09-11T12:27:17, before the dc switch.

Event Timeline

jcrespo created this task.Sep 14 2018, 12:42 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 14 2018, 12:42 PM

It looks like we have a backtrace for the same error in the "error" channel (link):

PHP Warning: Destructor threw an object exception: exception 'Wikimedia\Rdbms\DBQueryError' with message 'A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? 
Query: COMMIT
Function: Wikimedia\Rdbms\Database::close
Error: 2014 Commands out of sync; you can't run this command now (10.192.32.168)
' in /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php:1458
Stack trace:
#0 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(1428): Wikimedia\Rdbms\Database->makeQueryException()
#1 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(1198): Wikimedia\Rdbms\Database->reportQueryError()
#2 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(3893): Wikimedia\Rdbms\Database->query()
#3 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(3864): Wikimedia\Rdbms\Database->doCommit()
#4 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(952): Wikimedia\Rdbms\Database->commit()
#5 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1263): Wikimedia\Rdbms\Database->close()
#6 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1762): Closure$Wikimedia\Rdbms\LoadBalancer::closeAll()
#7 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1264): Wikimedia\Rdbms\LoadBalancer->forEachOpenConnection()
#8 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1254): Wikimedia\Rdbms\LoadBalancer->closeAll()
#9 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1980): Wikimedia\Rdbms\LoadBalancer->disable()
#10 (): Wikimedia\Rdbms\LoadBalancer->__destruct()
#11 {main}

My guess is that a COMMIT is tried in a non-started transaction (or it is, but connection is lost).

When I test things from the command line or sql.php in mwmaint1001, MariaDB doesn't seem to raise any error if a COMMIT is issued outside of a transaction.

https://dev.mysql.com/doc/refman/8.0/en/commands-out-of-sync.html seems to indicate that error 2014 occurs when you do something like trying to execute a new command before fetching all the results of a previous command.

I think it was caused by this error:

[W5ulhArAMGAAALdEtv8AAAHQ] /w/index.php?oldid=848654409&redirect=no   PHP Fatal Error: entire web request took longer than 60 seconds and timed out
#0 [internal function]: mysqli->store_result()
#1 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/DatabaseMysqli.php(46): mysqli->query(string)
#2 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(1232): Wikimedia\Rdbms\DatabaseMysqli->doQuery(string)
#3 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(1155): Wikimedia\Rdbms\Database->doProfiledQuery(string, string, boolean, string)
#4 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(1655): Wikimedia\Rdbms\Database->query(string, string)
#5 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/DBConnRef.php(49): Wikimedia\Rdbms\Database->select(array, array, array, string)
#6 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/DBConnRef.php(257): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#7 /srv/mediawiki/php-1.32.0-wmf.20/extensions/CentralAuth/includes/CentralAuthUser.php(399): Wikimedia\Rdbms\DBConnRef->select(array, array, array, string)
#8 /srv/mediawiki/php-1.32.0-wmf.20/extensions/CentralAuth/includes/CentralAuthUser.php(509): CentralAuthUser->loadGroups()
#9 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/objectcache/WANObjectCache.php(1128): Closure$CentralAuthUser::loadFromCache(boolean, integer, array, NULL)
#10 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/objectcache/WANObjectCache.php(1277): Closure$WANObjectCache::getWithSetCallback(boolean, integer, array, NULL)
#11 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/objectcache/WANObjectCache.php(1134): WANObjectCache->doGetWithSetCallback(string, integer, Closure$WANObjectCache::getWithSetCallback;283, array, NULL)
#12 /srv/mediawiki/php-1.32.0-wmf.20/extensions/CentralAuth/includes/CentralAuthUser.php(519): WANObjectCache->getWithSetCallback(string, integer, Closure$CentralAuthUser::loadFromCache;684, array)
#13 /srv/mediawiki/php-1.32.0-wmf.20/extensions/CentralAuth/includes/CentralAuthUser.php(374): CentralAuthUser->loadFromCache()
#14 /srv/mediawiki/php-1.32.0-wmf.20/extensions/CentralAuth/includes/CentralAuthUser.php(549): CentralAuthUser->loadState()
#15 /srv/mediawiki/php-1.32.0-wmf.20/extensions/CentralAuth/includes/CentralAuthIdLookup.php(84): CentralAuthUser->getId()
#16 /srv/mediawiki/php-1.32.0-wmf.20/extensions/CentralAuth/includes/CentralAuthIdLookup.php(90): CentralAuthIdLookup->isAttached(User)
#17 /srv/mediawiki/php-1.32.0-wmf.20/extensions/GlobalPreferences/includes/GlobalPreferencesFactory.php(462): CentralAuthIdLookup->centralIdFromLocalUser(User, integer)
#18 /srv/mediawiki/php-1.32.0-wmf.20/extensions/GlobalPreferences/includes/GlobalPreferencesFactory.php(447): GlobalPreferences\GlobalPreferencesFactory->getUserID()
#19 /srv/mediawiki/php-1.32.0-wmf.20/extensions/GlobalPreferences/includes/Hooks.php(43): GlobalPreferences\GlobalPreferencesFactory->isUserGlobalized()
#20 /srv/mediawiki/php-1.32.0-wmf.20/includes/Hooks.php(174): GlobalPreferences\Hooks::onUserLoadOptions(User, array)
#21 /srv/mediawiki/php-1.32.0-wmf.20/includes/Hooks.php(202): Hooks::callHook(string, array, array, NULL)
#22 /srv/mediawiki/php-1.32.0-wmf.20/includes/user/User.php(5529): Hooks::run(string, array)
#23 /srv/mediawiki/php-1.32.0-wmf.20/includes/user/User.php(536): User->loadOptions()
#24 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/objectcache/WANObjectCache.php(1128): Closure$User::loadFromCache(boolean, integer, array, NULL)
#25 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/objectcache/WANObjectCache.php(1277): Closure$WANObjectCache::getWithSetCallback(boolean, integer, array, NULL)
#26 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/objectcache/WANObjectCache.php(1134): WANObjectCache->doGetWithSetCallback(string, integer, Closure$WANObjectCache::getWithSetCallback;283, array, NULL)
#27 /srv/mediawiki/php-1.32.0-wmf.20/includes/user/User.php(559): WANObjectCache->getWithSetCallback(string, integer, Closure$User::loadFromCache;305, array)
#28 /srv/mediawiki/php-1.32.0-wmf.20/includes/user/User.php(479): User->loadFromCache()
#29 /srv/mediawiki/php-1.32.0-wmf.20/includes/user/User.php(403): User->loadFromId(integer)
#30 /srv/mediawiki/php-1.32.0-wmf.20/extensions/GlobalUserPage/includes/GlobalUserPage.php(159): User->load(integer)
#31 /srv/mediawiki/php-1.32.0-wmf.20/extensions/GlobalUserPage/includes/Hooks.php(57): MediaWiki\GlobalUserPage\GlobalUserPage::shouldDisplayGlobalPage(Title)
#32 /srv/mediawiki/php-1.32.0-wmf.20/includes/Hooks.php(174): MediaWiki\GlobalUserPage\Hooks::onTitleIsAlwaysKnown(Title, NULL)
#33 /srv/mediawiki/php-1.32.0-wmf.20/includes/Hooks.php(202): Hooks::callHook(string, array, array, NULL)
#34 /srv/mediawiki/php-1.32.0-wmf.20/includes/Title.php(4708): Hooks::run(string, array)
#35 /srv/mediawiki/php-1.32.0-wmf.20/includes/parser/Parser.php(2448): Title->isAlwaysKnown()
#36 /srv/mediawiki/php-1.32.0-wmf.20/includes/parser/Parser.php(2145): Parser->replaceInternalLinks2(string)
#37 /srv/mediawiki/php-1.32.0-wmf.20/includes/parser/Parser.php(1377): Parser->replaceInternalLinks(string)
#38 /srv/mediawiki/php-1.32.0-wmf.20/includes/parser/Parser.php(476): Parser->internalParse(string)
#39 /srv/mediawiki/php-1.32.0-wmf.20/includes/content/WikitextContent.php(341): Parser->parse(string, Title, ParserOptions, boolean, boolean, integer)
#40 /srv/mediawiki/php-1.32.0-wmf.20/includes/content/AbstractContent.php(517): WikitextContent->fillParserOutput(Title, integer, ParserOptions, boolean, ParserOutput)
#41 /srv/mediawiki/php-1.32.0-wmf.20/includes/poolcounter/PoolWorkArticleView.php(145): AbstractContent->getParserOutput(Title, integer, ParserOptions)
#42 /srv/mediawiki/php-1.32.0-wmf.20/includes/poolcounter/PoolCounterWork.php(123): PoolWorkArticleView->doWork()
#43 /srv/mediawiki/php-1.32.0-wmf.20/includes/page/Article.php(617): PoolCounterWork->execute()
#44 /srv/mediawiki/php-1.32.0-wmf.20/includes/actions/ViewAction.php(68): Article->view()
#45 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(501): ViewAction->show()
#46 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(294): MediaWiki->performAction(Article, Title)
#47 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(868): MediaWiki->performRequest()
#48 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(525): MediaWiki->main()
#49 /srv/mediawiki/php-1.32.0-wmf.20/index.php(42): MediaWiki->run()
#50 /srv/mediawiki/w/index.php(3): include(string)
#51 {main}

The web request timeout hit in the middle of mysqli fetching the results to buffer them, before it could properly finish fetching the results. Then later when MediaWiki tried to clean up open DB handles it ran into the resulting out-of-sync.

First occurrence seemed to happen on 2018-09-11T12:27:17, before the dc switch.

I see one earlier occurrence on 2018-09-10 02:09:30, again coinciding with a web request timeout triggering during mysqli->store_result(). The web request timeouts probably started due to rOMWC2214644e36d5: Set PHP time limit.

There was also one on instance of the error on 2018-07-10 09:02:57, but that looks like it's due to WikiExporter actually using streaming mode rather than the web request timeout interrupting mysqli internals.

Thanks, I will update the summary to remove my guess and point to this better evaluation. CCing @tstarling beacause it could be an intended consequence of the time limit and maybe those things can be captured/handled better (not sure)?

jcrespo updated the task description. (Show Details)Sep 14 2018, 3:19 PM
jcrespo updated the task description. (Show Details)

@Anomie I found the same error in the logs for the request ID matching T204291, but in that case I choose to make the task about the cause of the resource exhaustion (eg. time or memory).

But do you think this task should be about trying to prevent the side-effect of the COMMIT query? I haven't yet determined its impact, but if it can affect database state (e.g. it doesn't go away together with the transaction), then we should indeed make sure to also avoid this side-effect because there will always be new causes of resource exhaustion to warrant not wanting to tolerate corruption.

On the other hand, if it's only noise, then we may want to consider it as just part of the exhaustion error and accept that it will cascade into other errors from the same request. If it can only happen during a fatal, then we may also want to consider lowering the severity so that we don't log them at all in prod, or to only log them if an exception happened.

Krinkle renamed this task from Commands out of sync; you can't run this command now COMMIT to Evaluate whether "Commands out of sync; you can't run this command now COMMIT.Sep 14 2018, 5:38 PM
Krinkle renamed this task from Evaluate whether "Commands out of sync; you can't run this command now COMMIT to Evaluate whether LoadBalancer::destruct error "you can't run this command now: COMMIT" can/should be avoided.
Krinkle moved this task from Untriaged to Rdbms library on the Wikimedia-Rdbms board.
Krinkle edited projects, added Performance-Team; removed MediaWiki-General.

@Anomie Do you have time to look in to this? With @aaron on vacation, we're short the person who would normally track down an issue like this.

Not that it changes a lot, but indeed this doesn't happen only on COMMIT, also on ROLLBACK: Wikimedia\Rdbms\Database::close 10.192.32.7 2014 Commands out of sync; you can't run this command now (10.192.32.7) ROLLBACK

Krinkle renamed this task from Evaluate whether LoadBalancer::destruct error "you can't run this command now: COMMIT" can/should be avoided to Requests that timeout php execution emit LoadBalancer::destruct error "you can't run this command now: COMMIT".Sep 20 2018, 9:32 PM
Krinkle renamed this task from Requests that timeout php execution emit LoadBalancer::destruct error "you can't run this command now: COMMIT" to PHP-timed out requests also emit LoadBalancer::destruct error "you can't run this command now: COMMIT".
Anomie added a comment.Oct 1 2018, 5:03 PM

It looks like this only occurs on HHVM. With PHP 7 it seems that destructors aren't called when set_time_limit() causes a timeout error. Due to T192166: Drop HHVM support from MediaWiki it may not be worth spending a whole lot of time on this.

I'm having trouble reproducing this reliably to see about making a fix for it.

It looks like this only occurs on HHVM. [Not] With PHP 7

I would support not spending time on this if this is true.

@Anomie I'm inclined to agree if the only impact is some messages in a log (although a way to turn them off or lower the severity would be useful).

However, I'd like to me sure we document (at least within Phab, if no where else) what the impact of this on the database and the user request. For example, in a single-database world, I'm not concerned about aborting of requests or other random crashes because as far as I know, our transaction model protects us. Either we stop before the transaction is committed and it's as if nothing happened, or it aborts after. There isn't much middle ground. In theory that's all safe.

What I'm concerned about is that we rely on the application logic to orchestrate transactions over multiple connections and databases. For example when one fails, we propagate that error to rollback/cancel other transactions as well.

Can you confirm that under the timeout scenario where these destructors happen, that we still safely rollback everything and the database state is still guaranteed to be normal?

In addition to confirming this is safe and secure on HHVM, it sounds like this may actually be worse on PHP 7, in which case it might be a blocker for T176370.

Anomie added a comment.Oct 1 2018, 6:18 PM

For PHP7, since destructors aren't being run, it should wind up dropping all the MySQL connections when PHP exits causing the DB to automatically roll back any open transactions. I don't think anything in our environment will prevent that. I suppose there is a potential race if the timeout occurs in the middle of the code doing one of those commits.

For HHVM, the fact that it's trying to do a COMMIT from the destructor is a bit worrying since it could succeed in some other cases where there should be a rollback. It looks like Database::close() will commit for explicit transactions, but not for atomic sections and not for DBO_TRX. I can't think of any way to test for the timeout to have Database::close() avoid it, since HHVM doesn't seem to pass the timeout error to the error handler. Why does Database::close() ever try to commit anyway?

In what way do you think it may be worse on PHP 7?

For PHP7, since destructors aren't being run, [..]
In what way do you think it may be worse on PHP 7?

I believe you know this code better then me. But, I suspect our current logic might cause problems on PHP 7 because, as you say, PHP 7 does't invoke destructors upon a request process being aborted by timeout.

If that's the case, then, given we currently do have application logic in our destructors, that seems like a potential problem – given they wouldn't get invoked anymore.

From your last comment, I gather that our current destructors might actually not be that essential. Which means that either we don't need them, or we need to improve them (e.g. cross-commit/cross-rollback), or for PHP 7 remove them and find a way that's compatible with PHP 7.

Anomie added a comment.Oct 1 2018, 7:39 PM

Note PHP 5 didn't execute destructors on timeout either. Although we may not have set timeouts back when we still used Zend PHP 5.

Anomie added a comment.Oct 1 2018, 7:57 PM

I vaguely remember there was some issue with HHVM originally not running destructors when the process ended, but I can't find details from searches. Maybe @tstarling remembers? On the other hand, that may have only applied to successful process-endings rather than fatal errors.

Sorry if this is unrelated, but we do have indeed issues at least at the moment (HHVM) with webrequests being killed/aborted/failed but leaving mysql queries running on the db servers: T149421#4069592 T160984#3158340

Anomie added a comment.Oct 1 2018, 8:26 PM

That seems mostly unrelated to me. Once the query that was left running ends, MySQL should finally notice that PHP went away and roll back the open transaction.

The issue here would prevent MediaWiki-layer logic from trying to kill the query to resolve that task, if that's even possible from PHP code. Although for PHP7 the timeout being discussed here applies to PHP CPU usage rather than wall clock time anyway so it'd be unlikely to trigger while a DB query was running.

aaron added a comment.Oct 2 2018, 9:05 PM

Why does Database::close() ever try to commit anyway?

Legacy behavior that's super old. Old patterns like begin()...write...close() worked. I'd rather it rollback in such cases and any caller be updated.

After looking into it a bit more while writing T206147: Database::close() shouldn't commit transactions, there are two ways Database::close() might try to commit: if an automatic transaction had no writes, or if a manual transaction is open. The latter logs a message (warning level to channel DBQuery), and I don't see that message logged at all in the past 60 days.

Krinkle closed this task as Resolved.Nov 5 2018, 10:48 PM
Krinkle claimed this task.

Per T206147, MediaWiki no longer tries to commit a pending transaction from Database::close/../LoadBalancer->__destruct.

It does still rollback to release read snapshots, but for those errors are ignored, so even in an out-of-sync scenario, it won't cause a DB exception to be thrown and won't show up in the logs in the manner described by this task.

Krinkle reassigned this task from Krinkle to aaron.Nov 5 2018, 10:48 PM
Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).
aaron added a comment.Nov 5 2018, 11:15 PM

Fixed in bf30fcb71427d673f7c83a067b3241040d3470b6. Rollback is used instead and uses $ignoreErrors so as not to trigger the exception in reportQueryError().

Note that DBQuery log entries will still be logged (I see them in logstash), since merely calling makeQueryException triggers logging even if is just for storing the trx state error.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:08 PM