Page MenuHomePhabricator

MassMessage not working for delivery to Suwikiquote or Mlwikiquote: "CAS update failed on user_touched"
Closed, ResolvedPublicPRODUCTION ERROR

Description

An expected delivery to their village pumps was not made. I tried to reproduce at a sandbox, and confirmed.

To reproduce:

  1. {{target | user = Quiddity (WMF)/sandbox | site = su.wikiquote.org }}
  2. {{target | user = Quiddity (WMF)/sandbox | site = ml.wikiquote.org }}
  • Nothing arrives.

Impact

Message delivery is failing due to a system user conflict.

Event Timeline

Hmm.

2017-07-18 22:55:57 [WW6R2wpAME4AAGJWDeEAAABN] mw1299 mlwikiquote 1.30.0-wmf.9 exception ERROR: [WW6R2wpAME4AAGJWDeEAAABN] [no req]   MWException from line 4057 of /srv/mediawiki/php-1.30.0-wmf.9/includes/user/User.php: CAS update failed on user_touched for user ID '2110' (read from replica); the version of the user to be saved is older than the current version. {"exception_id":"WW6R2wpAME4AAGJWDeEAAABN","caught_by":"mwe_handler"} 
[Exception MWException] (/srv/mediawiki/php-1.30.0-wmf.9/includes/user/User.php:4057) CAS update failed on user_touched for user ID '2110' (read from replica); the version of the user to be saved is older than the current version.
  #0 /srv/mediawiki/php-1.30.0-wmf.9/includes/user/User.php(737): User->saveSettings()
  #1 /srv/mediawiki/php-1.30.0-wmf.9/extensions/MassMessage/includes/MassMessage.php(45): User::newSystemUser(string, array)
  #2 /srv/mediawiki/php-1.30.0-wmf.9/extensions/MassMessage/includes/job/MassMessageJob.php(64): MassMessage::getMessengerUser()
  #3 /srv/mediawiki/php-1.30.0-wmf.9/extensions/MassMessage/includes/job/MassMessageJob.php(289): MassMessageJob->getUser()
  #4 /srv/mediawiki/php-1.30.0-wmf.9/extensions/MassMessage/includes/job/MassMessageJob.php(208): MassMessageJob->makeAPIRequest(array)
  #5 /srv/mediawiki/php-1.30.0-wmf.9/extensions/MassMessage/includes/job/MassMessageJob.php(186): MassMessageJob->editPage()
  #6 /srv/mediawiki/php-1.30.0-wmf.9/extensions/MassMessage/includes/job/MassMessageJob.php(41): MassMessageJob->sendMessage()
  #7 /srv/mediawiki/php-1.30.0-wmf.9/includes/jobqueue/JobRunner.php(293): MassMessageJob->run()
  #8 /srv/mediawiki/php-1.30.0-wmf.9/includes/jobqueue/JobRunner.php(193): JobRunner->executeJob(MassMessageJob, Wikimedia\Rdbms\LBFactoryMulti, BufferingStatsdDataFactory, integer)
  #9 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run(array)
  #10 {main}
Aklapper renamed this task from MassMessage not working for delivery to Suwikiquote or Mlwikiquote to MassMessage not working for delivery to Suwikiquote or Mlwikiquote: "CAS update failed on user_touched".Jul 19 2017, 9:45 AM

Also see T124414

Nikerabbit added a subscriber: Nikerabbit.

Also happening elsewhere, example from aawiki:

 exception.message	   	CAS update failed on user_touched. The version of the user to be saved is older than the current version.
t exception.trace	   	#0 /srv/mediawiki/php-1.35.0-wmf.37/includes/libs/rdbms/database/Database.php(4264): User->{closure}(Wikimedia\Rdbms\DatabaseMysqli, string)
#1 /srv/mediawiki/php-1.35.0-wmf.37/includes/libs/rdbms/database/DBConnRef.php(68): Wikimedia\Rdbms\Database->doAtomicSection(string, Closure)
#2 /srv/mediawiki/php-1.35.0-wmf.37/includes/libs/rdbms/database/DBConnRef.php(641): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#3 /srv/mediawiki/php-1.35.0-wmf.37/includes/user/User.php(3523): Wikimedia\Rdbms\DBConnRef->doAtomicSection(string, Closure)
#4 /srv/mediawiki/php-1.35.0-wmf.37/includes/user/User.php(844): User->saveSettings()
#5 /srv/mediawiki/php-1.35.0-wmf.37/extensions/MassMessage/includes/MassMessage.php(44): User::newSystemUser(string, array)
#6 /srv/mediawiki/php-1.35.0-wmf.37/extensions/MassMessage/includes/Job/MassMessageJob.php(93): MediaWiki\MassMessage\MassMessage::getMessengerUser()
#7 /srv/mediawiki/php-1.35.0-wmf.37/extensions/MassMessage/includes/Job/MassMessageJob.php(368): MediaWiki\MassMessage\Job\MassMessageJob->getUser()
#8 /srv/mediawiki/php-1.35.0-wmf.37/extensions/MassMessage/includes/Job/MassMessageJob.php(260): MediaWiki\MassMessage\Job\MassMessageJob->makeAPIRequest(array)
#9 [internal function]: MediaWiki\MassMessage\Job\MassMessageJob->editPage(string)
#10 /srv/mediawiki/php-1.35.0-wmf.37/extensions/MassMessage/includes/Job/MassMessageJob.php(240): call_user_func(array, string)
#11 /srv/mediawiki/php-1.35.0-wmf.37/extensions/MassMessage/includes/Job/MassMessageJob.php(69): MediaWiki\MassMessage\Job\MassMessageJob->sendMessage()
#12 /srv/mediawiki/php-1.35.0-wmf.37/extensions/EventBus/includes/JobExecutor.php(79): MediaWiki\MassMessage\Job\MassMessageJob->run()
#13 /srv/mediawiki/rpc/RunSingleJob.php(76): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#14 {main}

The trace is similar to T124414: MassMessage Fatal error: CAS update failed on user_touched but that has been closed. This is still open, so hijacking it as both traces have MassMessage::getMessengerUser in them.

JJMC89 changed the subtype of this task from "Task" to "Production Error".Jul 23 2020, 3:45 AM

At least on aawiki, MediaWiki message delivery is not a system user: it has a real user token. (But also an empty email and password, which suggests it was a system user at some point... not sure how that came to be.)
That triggers the "steal" branch of User::newSystemUser, which tries to change the token and save the user, which results in a CAS error. Either the job is conflicting with itself, like in T124414 (but that seems unlikely: why would it run multiple instances for aawiki? and more importantly, at least one of the user updates would succeed, and we'd see a valid token) or maybe the logic in User::newSystemUser conflicts with something done in one of the revokeAccessForUser handlers (but that also doesn't seem to be the case: the only things that would run are providerChangeAuthenticationData in LocalPasswordPrimaryAuthenticationProvider and TemporaryPasswordPrimaryAuthenticationProvider and those don't change user_touched).

Running User::newSystemUser manually worked fine on aawiki.

Krinkle assigned this task to Tgr.
Krinkle added a subscriber: Krinkle.
This comment was removed by Krinkle.

Logstash query "CAS update failed on user_touched" AND exception.trace:"MassMessage" over the past 30 days still finds recent ocurrences of this unexplained corruption. The more recent one is from a few hours ago:

  • timestamp: 2020-09-21T21:29:12
  • reqId: b40c328e-4e5e-49b2-9eae-6a307de214cb
  • wiki: mlwikiquote
CAS update failed on user_touched. The version of the user to be saved is older than the current version.
#0 /srv/mediawiki/php-1.36.0-wmf.9/includes/libs/rdbms/database/Database.php(4461): User->{closure}(Wikimedia\Rdbms\DatabaseMysqli, string)
#1 /srv/mediawiki/php-1.36.0-wmf.9/includes/libs/rdbms/database/DBConnRef.php(68): Wikimedia\Rdbms\Database->doAtomicSection(string, Closure)
#2 /srv/mediawiki/php-1.36.0-wmf.9/includes/libs/rdbms/database/DBConnRef.php(641): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#3 /srv/mediawiki/php-1.36.0-wmf.9/includes/user/User.php(3398): Wikimedia\Rdbms\DBConnRef->doAtomicSection(string, Closure)
#4 /srv/mediawiki/php-1.36.0-wmf.9/includes/user/User.php(850): User->saveSettings()
#5 /srv/mediawiki/php-1.36.0-wmf.9/extensions/MassMessage/includes/MassMessage.php(44): User::newSystemUser(string, array)
#6 /srv/mediawiki/php-1.36.0-wmf.9/extensions/MassMessage/includes/Job/MassMessageJob.php(93): MediaWiki\MassMessage\MassMessage::getMessengerUser()
#7 /srv/mediawiki/php-1.36.0-wmf.9/extensions/MassMessage/includes/Job/MassMessageJob.php(368): MediaWiki\MassMessage\Job\MassMessageJob->getUser()
#8 /srv/mediawiki/php-1.36.0-wmf.9/extensions/MassMessage/includes/Job/MassMessageJob.php(260): MediaWiki\MassMessage\Job\MassMessageJob->makeAPIRequest(array)
#9 [internal function]: MediaWiki\MassMessage\Job\MassMessageJob->editPage(string)
#10 /srv/mediawiki/php-1.36.0-wmf.9/extensions/MassMessage/includes/Job/MassMessageJob.php(240): call_user_func(array, string)
#11 /srv/mediawiki/php-1.36.0-wmf.9/extensions/MassMessage/includes/Job/MassMessageJob.php(69): MediaWiki\MassMessage\Job\MassMessageJob->sendMessage()
#12 /srv/mediawiki/php-1.36.0-wmf.9/extensions/EventBus/includes/JobExecutor.php(79): MediaWiki\MassMessage\Job\MassMessageJob->run()
#13 /srv/mediawiki/rpc/RunSingleJob.php(76): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#14 {main}
Krinkle triaged this task as High priority.Sep 22 2020, 3:18 PM
BPirkle added a subscriber: BPirkle.

Still seeing on wikiquote and wikibooks as of Oct 7, 2020.

Still seeing this on wikiquote & wikibooks as of Jun, 21, 2021.

Change 715797 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/extensions/MassMessage@master] Reuse system user object for MassMessageJob::makeAPIRequest

https://gerrit.wikimedia.org/r/715797

If User::newSystemUser is called a second time within a request/job run it is possible that the second query is reading from a replica, while the master contains the already "stealed" user data, resulting in a CAS exception on another "steal" and rollback of the whole transaction. The next transaction would do the same. The single run from Tgr on aawiki as written in Jul 24 2020, 11:52 AM could break that loop, because that would be committed without the replica read.

But I am not sure, if that is really the reason. But calling the same function twice is overkill, the patch set is reusing the user object and passing it to the function needing the user object.

The replica read is there since https://gerrit.wikimedia.org/r/c/mediawiki/core/+/353827, deployed with MediaWiki 1.30/wmf.4 on 6 June 2017, task exists since 19 July 2017.

There could be a relationship between the core change and this failure (only 6 weeks with around 55 mass messages send: https://meta.wikimedia.org/w/index.php?title=Special:Log&dir=prev&offset=20170606000000&type=massmessage&limit=60)

Change 715797 merged by jenkins-bot:

[mediawiki/extensions/MassMessage@master] Reuse system user object for MassMessageJob::makeAPIRequest

https://gerrit.wikimedia.org/r/715797

Good catch! That should be fixed in core, though, not here. If steal is true and isSystemUser is false, the data should be reloaded from the primary, just like in the non-existence case.

Change 716022 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/core@master] Reload from primary when steal in User::newSystemUser

https://gerrit.wikimedia.org/r/716022

Change 716022 abandoned by Umherirrender:

[mediawiki/core@master] Reload from primary when steal in User::newSystemUser

Reason:

May not the best way to fix this

https://gerrit.wikimedia.org/r/716022