Page MenuHomePhabricator

Exception thrown for failure to save settings appears ~ 1000 times/day
Closed, ResolvedPublic

Description

From T143367#4432497:

it seems like the exception logs are filled with CAS update failed on user_touched error messages, with about ~1000/day being emitted, which currently accounts to ~30-40% of all exception log lines.

This error is specifically thrown by User::saveSettings() https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/master/includes/user/User.php#4211

Almost all exceptions came from calls to the api.php endpoint; a naive view of the code seems to suggest that the exception is raised when no row is affected by the request.

In all errors, the message reported that the read was performed on the master database, so I doubt that lag might be the cause.

I think it's reasonable to try to manage the error and do some better logging maybe (e.g. reporting both what is the timestamp on the database and what is the one we're trying to write). I also doubt we have 1000 race conditions per day.

In any case, this should probably not be an unhandled exception.

Here is an example of the stack trace:

{
    "backtrace": [
        {
            "args": [
                "Wikimedia\\Rdbms\\DatabaseMysqli",
                "string"
            ],
            "file": "/srv/mediawiki/php-1.32.0-wmf.16/includes/libs/rdbms/database/Database.php",
            "function": "Closure$User::saveSettings",
            "line": 3731
        },
        {
            "args": [
                "string",
                "Closure$User::saveSettings;309"
            ],
            "class": "Wikimedia\\Rdbms\\Database",
            "file": "/srv/mediawiki/php-1.32.0-wmf.16/includes/user/User.php",
            "function": "doAtomicSection",
            "line": 4222,
            "type": "->"
        },
        {
            "args": [],
            "class": "User",
            "file": "/srv/mediawiki/php-1.32.0-wmf.16/extensions/BetaFeatures/includes/BetaFeaturesHooks.php",
            "function": "saveSettings",
            "line": 284,
            "type": "->"
        },
        {
            "args": [],
            "file": "/srv/mediawiki/php-1.32.0-wmf.16/includes/deferred/MWCallableUpdate.php",
            "function": "Closure$BetaFeaturesHooks::getPreferences",
            "line": 34
        },
        {
            "args": [],
            "class": "MWCallableUpdate",
            "file": "/srv/mediawiki/php-1.32.0-wmf.16/includes/deferred/DeferredUpdates.php",
            "function": "doUpdate",
            "line": 268,
            "type": "->"
        },
        {
            "args": [
                "MWCallableUpdate",
                "Wikimedia\\Rdbms\\LBFactoryMulti",
                "string",
                "integer"
            ],
            "class": "DeferredUpdates",
            "file": "/srv/mediawiki/php-1.32.0-wmf.16/includes/deferred/DeferredUpdates.php",
            "function": "runUpdate",
            "line": 214,
            "type": "::"
        },
        {
            "args": [
                "array",
                "string",
                "integer"
            ],
            "class": "DeferredUpdates",
            "file": "/srv/mediawiki/php-1.32.0-wmf.16/includes/deferred/DeferredUpdates.php",
            "function": "execute",
            "line": 134,
            "type": "::"
        },
        {
            "args": [
                "string"
            ],
            "class": "DeferredUpdates",
            "file": "/srv/mediawiki/php-1.32.0-wmf.16/includes/MediaWiki.php",
            "function": "doUpdates",
            "line": 913,
            "type": "::"
        },
        {
            "args": [
                "string",
                "boolean"
            ],
            "class": "MediaWiki",
            "file": "/srv/mediawiki/php-1.32.0-wmf.16/includes/MediaWiki.php",
            "function": "restInPeace",
            "line": 733,
            "type": "->"
        },
        {
            "args": [],
            "function": "Closure$MediaWiki::doPostOutputShutdown"
        }
    ],
    "caught_by": "mwe_handler",
    "code": 0,
    "file": "/srv/mediawiki/php-1.32.0-wmf.16/includes/user/User.php",
    "id": "...",
    "line": 4211,
    "message": "CAS update failed on user_touched for user ID 'XXXXXX' (read from master); the version of the user to be saved is older than the current version.",
    "type": "MWException",
    "url": "/w/api.php"
}

the corresponding (redacted for privacy) line in api.log

2018-08-17 16:15:04 [...] mw1314 enwiki 1.32.0-wmf.16 api INFO: API POST USER_NAME USER_IP  T=415ms action=options format=json change=wllimit%3D200 token=[redacted] formatversion=2

Event Timeline

Joe created this task.Aug 17 2018, 4:24 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 17 2018, 4:24 PM

Here's that same error but as reported to the exception log, which is a bit easier to read (no json).

exception.message: 
CAS update failed on user_touched for user ID '[redacted]' (read from master); the version of the user to be saved is older than the current version.

exception_id: W3b0iApAEMMAAHJ8VmYAAABV
exception_url: /w/api.php
host: mw1314 
http_method: POST

exception.trace:
#0 /srv/mediawiki/php-1.32.0-wmf.16/includes/libs/rdbms/database/Database.php(3731): Closure$User::saveSettings(Wikimedia\Rdbms\DatabaseMysqli, string)
#1 /srv/mediawiki/php-1.32.0-wmf.16/includes/user/User.php(4222): Wikimedia\Rdbms\Database->doAtomicSection(string, Closure$User::saveSettings;309)
#2 /srv/mediawiki/php-1.32.0-wmf.16/extensions/BetaFeatures/includes/BetaFeaturesHooks.php(284): User->saveSettings()
#3 /srv/mediawiki/php-1.32.0-wmf.16/includes/deferred/MWCallableUpdate.php(34): Closure$BetaFeaturesHooks::getPreferences()
#4 /srv/mediawiki/php-1.32.0-wmf.16/includes/deferred/DeferredUpdates.php(268): MWCallableUpdate->doUpdate()
#5 /srv/mediawiki/php-1.32.0-wmf.16/includes/deferred/DeferredUpdates.php(214): DeferredUpdates::runUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti, string, integer)
#6 /srv/mediawiki/php-1.32.0-wmf.16/includes/deferred/DeferredUpdates.php(134): DeferredUpdates::execute(array, string, integer)
#7 /srv/mediawiki/php-1.32.0-wmf.16/includes/MediaWiki.php(913): DeferredUpdates::doUpdates(string)
#8 /srv/mediawiki/php-1.32.0-wmf.16/includes/MediaWiki.php(733): MediaWiki->restInPeace(string, boolean)
#9 [internal function]: Closure$MediaWiki::doPostOutputShutdown()
#10 {main}
		if ( $saveUser ) {
			$cache = ObjectCache::getLocalClusterInstance();
			$key = $cache->makeKey( __CLASS__, 'prefs-update', $user->getId() );
			// T95839: If concurrent requests pile on (e.g. multiple tabs), only let one
			// thread bother doing these updates. This avoids pointless error log spam.
			if ( $cache->lock( $key, 0, $cache::TTL_MINUTE ) ) {
				// Save the preferences to the DB post-send
				DeferredUpdates::addCallableUpdate( function () use ( $user, $cache, $key ) {
					$user->saveSettings();
					$cache->unlock( $key );
				} );
			}
		}

Some code is already there to avoid such log spam
But it seems this issue is now from the api, I would say there should be no database query from hook 'GetPreferences' to avoid this.

My explanation:
The api module to save options is calling hook 'GetPreferences' indirectly by getting all valid preferences.
When the api module now do the requested updates that conflicts with the auto enroll update in BetaFeatures hook. On index.php this does not happen, because show of Special:Preferences would already trigger the update.

Possible other way is to save the option need fixing in an array and apply the fix on a fresh user object in the deferred update. But it is not possible to reuse the user object in that hook with its settings.

Change 453545 had a related patch set uploaded (by Umherirrender; owner: Umherirrender):
[mediawiki/extensions/BetaFeatures@master] Refactor auto enroll feature to avoid CAS update exceptions

https://gerrit.wikimedia.org/r/453545

Umherirrender triaged this task as Normal priority.

Change 453545 merged by jenkins-bot:
[mediawiki/extensions/BetaFeatures@master] Refactor auto enroll feature to avoid CAS update exceptions

https://gerrit.wikimedia.org/r/453545

This is still happening frequently

CAS update failed on user_touched for user ID '[redacted]' (read from master); the version of the user to be saved is older than the current version.
#0 /srv/mediawiki/php-1.32.0-wmf.19/includes/libs/rdbms/database/Database.php(3746): Closure$User::saveSettings(Wikimedia\Rdbms\DatabaseMysqli, string)
#1 /srv/mediawiki/php-1.32.0-wmf.19/includes/user/User.php(4226): Wikimedia\Rdbms\Database->doAtomicSection(string, Closure$User::saveSettings;3501)
#2 /srv/mediawiki/php-1.32.0-wmf.19/extensions/BetaFeatures/includes/BetaFeaturesHooks.php(289): User->saveSettings()
#3 /srv/mediawiki/php-1.32.0-wmf.19/includes/deferred/MWCallableUpdate.php(34): Closure$BetaFeaturesHooks::getPreferences()
#4 /srv/mediawiki/php-1.32.0-wmf.19/includes/deferred/DeferredUpdates.php(268): MWCallableUpdate->doUpdate()
#5 /srv/mediawiki/php-1.32.0-wmf.19/includes/deferred/DeferredUpdates.php(214): DeferredUpdates::runUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti, string, integer)
#6 /srv/mediawiki/php-1.32.0-wmf.19/includes/deferred/DeferredUpdates.php(134): DeferredUpdates::execute(array, string, integer)
#7 /srv/mediawiki/php-1.32.0-wmf.19/includes/MediaWiki.php(914): DeferredUpdates::doUpdates(string)
#8 /srv/mediawiki/php-1.32.0-wmf.19/includes/MediaWiki.php(734): MediaWiki->restInPeace(string, boolean)
#9 [internal function]: Closure$MediaWiki::doPostOutputShutdown()
#10 {main}
greg added a subscriber: greg.Sep 10 2018, 4:36 PM

This is still happening frequently

@Umherirrender thoughts?

Change 459819 had a related patch set uploaded (by Umherirrender; owner: Umherirrender):
[mediawiki/extensions/BetaFeatures@master] Avoid concurrent post-send database save on auto enroll

https://gerrit.wikimedia.org/r/459819

It is not possible to move the User::load( User::READ_LATEST ); and User::saveSettings() near together.

The only idea I have for this is to readd the lock with the ObjectCache, I have uploaded a patch set for it. When that does not help, I am out.

Change 460068 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] user: Allow "CAS update failed" exceptions to be normalised

https://gerrit.wikimedia.org/r/460068

Change 459819 merged by jenkins-bot:
[mediawiki/extensions/BetaFeatures@master] Avoid concurrent post-send database save on auto enroll

https://gerrit.wikimedia.org/r/459819

Change 460068 merged by jenkins-bot:
[mediawiki/core@master] user: Allow "CAS update failed" exceptions to be normalised

https://gerrit.wikimedia.org/r/460068

Change 461811 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@wmf/1.32.0-wmf.22] user: Allow "CAS update failed" exceptions to be normalised

https://gerrit.wikimedia.org/r/461811

Change 461811 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.22] user: Allow "CAS update failed" exceptions to be normalised

https://gerrit.wikimedia.org/r/461811

Mentioned in SAL (#wikimedia-operations) [2018-09-23T04:32:56Z] <krinkle@deploy1001> Synchronized php-1.32.0-wmf.22/includes/user/User.php: T202149 - Ic0c25f66f23f (duration: 00m 53s)

Are there failures in the logs for the last two weeks?

aaron added a comment.Oct 13 2018, 1:30 AM

I still see 100-200 per 3 hour interval.

Change 467014 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/extensions/BetaFeatures@master] Use READ_EXCLUSIVE in getPreferences() hook

https://gerrit.wikimedia.org/r/467014

Change 467014 merged by jenkins-bot:
[mediawiki/extensions/BetaFeatures@master] Use READ_EXCLUSIVE in getPreferences() hook

https://gerrit.wikimedia.org/r/467014

Change 470088 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/extensions/BetaFeatures@master] Use getInstanceForUpdate() in preferences update hook

https://gerrit.wikimedia.org/r/470088

Change 470088 merged by jenkins-bot:
[mediawiki/extensions/BetaFeatures@master] Use getInstanceForUpdate() in preferences update hook

https://gerrit.wikimedia.org/r/470088

Better with Ifb9209aa9?

With all the changes surrounding this I'd rather not backport out-of-bound, so we'll have to see this week how it goes.

Please have a look in the logs if this is still an issue or it is resolved. Thanks

aaron added a comment.Mar 17 2019, 5:32 AM

Seems to rarely occur (usually either 0 times per 30 minutes or 1 time per 30 minutes).

Change 497093 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Use READ_EXCLUSIVE in SpecialConfirmEmail::attemptConfirm

https://gerrit.wikimedia.org/r/497093

Change 497093 merged by jenkins-bot:
[mediawiki/core@master] Use READ_EXCLUSIVE in SpecialConfirmEmail::attemptConfirm

https://gerrit.wikimedia.org/r/497093

Krinkle closed this task as Resolved.Apr 6 2019, 7:23 PM

Closing as it is no longer 1000s of times per day. The improvements we can make at the database layer and the User class layer have all been made.

What's left now is individual problems with callers in specific extensions that are explicitly loading a user object at one point, and saving it after changes could've been made in the interim, without there being a lock over it. Those are the bugs this message is meant to identify.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:09 PM