Error
MediaWiki version: 1.36.0-wmf.35
User not found by actor ID: [id]
Impact
Unknown.
Actual IDs can be found in logstash.
Notes
Looks like 11 of these since deploy of wmf.35 (T274939) to group2.
MediaWiki version: 1.36.0-wmf.35
User not found by actor ID: [id]
Unknown.
Actual IDs can be found in logstash.
Looks like 11 of these since deploy of wmf.35 (T274939) to group2.
from /srv/mediawiki/php-1.36.0-wmf.35/includes/changes/RecentChange.php(1255) #0 /srv/mediawiki/php-1.36.0-wmf.35/includes/changes/RecentChange.php(1051): RecentChange->getUserIdentityFromAnyId(NULL, string, integer) #1 /srv/mediawiki/php-1.36.0-wmf.35/includes/changes/RecentChange.php(155): RecentChange->loadFromRow(stdClass) #2 /srv/mediawiki/php-1.36.0-wmf.35/includes/specials/SpecialRecentChanges.php(473): RecentChange::newFromRow(stdClass) #3 /srv/mediawiki/php-1.36.0-wmf.35/includes/specialpage/ChangesListSpecialPage.php(1681): SpecialRecentChanges->outputChangesList(Wikimedia\Rdbms\ResultWrapper, FormOptions) #4 /srv/mediawiki/php-1.36.0-wmf.35/includes/specialpage/ChangesListSpecialPage.php(681): ChangesListSpecialPage->webOutput(Wikimedia\Rdbms\ResultWrapper, FormOptions) #5 /srv/mediawiki/php-1.36.0-wmf.35/includes/specials/SpecialRecentChanges.php(203): ChangesListSpecialPage->execute(string) #6 /srv/mediawiki/php-1.36.0-wmf.35/includes/specialpage/SpecialPage.php(646): SpecialRecentChanges->execute(string) #7 /srv/mediawiki/php-1.36.0-wmf.35/includes/specialpage/SpecialPageFactory.php(1375): SpecialPage->run(string) #8 /srv/mediawiki/php-1.36.0-wmf.35/includes/specialpage/SpecialPageFactory.php(1433): MediaWiki\SpecialPage\SpecialPageFactory->executePath(Title, RequestContext, boolean, MediaWiki\Linker\LinkRenderer) #9 /srv/mediawiki/php-1.36.0-wmf.35/includes/parser/Parser.php(3139): MediaWiki\SpecialPage\SpecialPageFactory->capturePath(Title, RequestContext, MediaWiki\Linker\LinkRenderer) #10 /srv/mediawiki/php-1.36.0-wmf.35/includes/parser/PPFrame_Hash.php(263): Parser->braceSubstitution(array, PPFrame_Hash) #11 /srv/mediawiki/php-1.36.0-wmf.35/includes/parser/Parser.php(2880): PPFrame_Hash->expand(PPNode_Hash_Tree, integer) #12 /srv/mediawiki/php-1.36.0-wmf.35/includes/parser/Parser.php(1560): Parser->replaceVariables(string) #13 /srv/mediawiki/php-1.36.0-wmf.35/includes/parser/Parser.php(653): Parser->internalParse(string) #14 /srv/mediawiki/php-1.36.0-wmf.35/includes/content/WikitextContent.php(375): Parser->parse(string, Title, ParserOptions, boolean, boolean, integer) #15 /srv/mediawiki/php-1.36.0-wmf.35/includes/content/AbstractContent.php(591): WikitextContent->fillParserOutput(Title, integer, ParserOptions, boolean, ParserOutput) #16 /srv/mediawiki/php-1.36.0-wmf.35/includes/Revision/RenderedRevision.php(266): AbstractContent->getParserOutput(Title, integer, ParserOptions, boolean) #17 /srv/mediawiki/php-1.36.0-wmf.35/includes/Revision/RenderedRevision.php(235): MediaWiki\Revision\RenderedRevision->getSlotParserOutputUncached(WikitextContent, boolean) #18 /srv/mediawiki/php-1.36.0-wmf.35/includes/Revision/RevisionRenderer.php(215): MediaWiki\Revision\RenderedRevision->getSlotParserOutput(string, array) #19 /srv/mediawiki/php-1.36.0-wmf.35/includes/Revision/RevisionRenderer.php(152): MediaWiki\Revision\RevisionRenderer->combineSlotOutput(MediaWiki\Revision\RenderedRevision, array) #20 [internal function]: MediaWiki\Revision\RevisionRenderer->MediaWiki\Revision\{closure}(MediaWiki\Revision\RenderedRevision, array) #21 /srv/mediawiki/php-1.36.0-wmf.35/includes/Revision/RenderedRevision.php(197): call_user_func(Closure, MediaWiki\Revision\RenderedRevision, array) #22 /srv/mediawiki/php-1.36.0-wmf.35/includes/poolcounter/PoolWorkArticleView.php(137): MediaWiki\Revision\RenderedRevision->getRevisionParserOutput() #23 /srv/mediawiki/php-1.36.0-wmf.35/includes/poolcounter/PoolCounterWork.php(162): PoolWorkArticleView->doWork() #24 /srv/mediawiki/php-1.36.0-wmf.35/includes/poolcounter/PoolCounterWork.php(173): PoolCounterWork->execute(boolean) #25 /srv/mediawiki/php-1.36.0-wmf.35/includes/page/ParserOutputAccess.php(260): PoolCounterWork->execute() #26 /srv/mediawiki/php-1.36.0-wmf.35/includes/page/WikiPage.php(1314): MediaWiki\Page\ParserOutputAccess->getParserOutput(WikiPage, ParserOptions, MediaWiki\Revision\RevisionStoreCacheRecord, integer) #27 /srv/mediawiki/php-1.36.0-wmf.35/includes/api/ApiParse.php(83): WikiPage->getParserOutput(ParserOptions, NULL, boolean) #28 /srv/mediawiki/php-1.36.0-wmf.35/includes/poolcounter/PoolCounterWorkViaCallback.php(74): ApiParse::{closure}() #29 /srv/mediawiki/php-1.36.0-wmf.35/includes/poolcounter/PoolCounterWork.php(162): PoolCounterWorkViaCallback->doWork() #30 /srv/mediawiki/php-1.36.0-wmf.35/includes/api/ApiParse.php(90): PoolCounterWork->execute() #31 /srv/mediawiki/php-1.36.0-wmf.35/includes/api/ApiParse.php(694): ApiParse->getPageParserOutput(WikiPage, NULL, ParserOptions, boolean) #32 /srv/mediawiki/php-1.36.0-wmf.35/includes/api/ApiParse.php(218): ApiParse->getParsedContent(WikiPage, ParserOptions, boolean, NULL, NULL, boolean) #33 /srv/mediawiki/php-1.36.0-wmf.35/includes/api/ApiMain.php(1646): ApiParse->execute() #34 /srv/mediawiki/php-1.36.0-wmf.35/includes/api/ApiMain.php(616): ApiMain->executeAction() #35 /srv/mediawiki/php-1.36.0-wmf.35/includes/api/ApiMain.php(587): ApiMain->executeActionWithErrorHandling() #36 /srv/mediawiki/php-1.36.0-wmf.35/api.php(90): ApiMain->execute() #37 /srv/mediawiki/php-1.36.0-wmf.35/api.php(45): wfApiMain() #38 /srv/mediawiki/w/api.php(3): require(string) #39 {main}
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Resolved | Release | hashar | T274940 1.36.0-wmf.36 deployment blockers | ||
Resolved | PRODUCTION ERROR | • Pchelolo | T277795 User not found by actor ID: [id] |
Interesting. Definitely a fallout of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/669946/16/includes/changes/RecentChange.php#1226
I've looked over a few of real actor IDs in logstash and checked that rows exist in the database with these IDs. Also interesting that it seems to be happening on during a parse if a SpecialRecentChanges is transcluded into the page.
Could this be a race condition? If an (anon?) user for whom we don't have an actor ID yet performs an edit, the parsing happens right after we assign the actor ID and save the RC entry - and we might fail to look up the actor while trying to render the RC entry belonging to the edit itself.
That would be in the same request, and normally, the cache in ActorStore would prevent the race condition. But if we are looping over RecentChanges, we would be looking up many actors, and the cache is small, so the entry for the new actor might have been evicted.
This seems like a rare edge case, but I can see how it would happen when transcluding RecentChanges...
If the above is the reason, I can see two solutions:
It seems like it happens for recent changes for anons who just made their first contribution. Before the patch referenced above we were using User::newFromAnyId which would just instantiate a new User no matter whether it's found in the DB or not, and then upon ->load it will just show the user defaults, while now we hard-crash if the actor is not found in the actor table.
try harder to look up actors by falling back to master of the ID isn't found. Since an actor with a known ID should always exist, this should be rarely needed, so we could always try it when an ID isn't found. We do something similar in RevisionStore.
In User we've been checking LoadBalancer::hasOrMadeRecentMasterChanges and if that's the case we were reading from master. Actor store can do the same. We don't have that method on the DB that's passed, but maybe we can just look on the injected load balancer - it's for the same domain as the injected DB. (we need to expand the use of checkDatabaseDomain in ActorStore..
Now up to 43 of these since deploy. I'm inclined to think this is probably a train blocker for .35 unless we have a patch shortly.
cc: @dancy
Change 673363 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@master] ActorStore::getActorById - fall back to master.
Change 673115 had a related patch set uploaded (by Brennen Bearnes; owner: Ppchelko):
[mediawiki/core@wmf/1.36.0-wmf.35] ActorStore::getActorById - fall back to master.
Change 673363 merged by jenkins-bot:
[mediawiki/core@master] ActorStore::getActorById - fall back to master.
Change 673115 merged by jenkins-bot:
[mediawiki/core@wmf/1.36.0-wmf.35] ActorStore::getActorById - fall back to master.
Mentioned in SAL (#wikimedia-operations) [2021-03-18T22:53:04Z] <brennen@deploy1002> Synchronized php-1.36.0-wmf.35/includes/specials/SpecialContributions.php: Backport: [[gerrit:673115|ActorStore::getActorById - fall back to master. (T277795)]] (duration: 01m 07s)
RevisionStore has a fairly elaborate and known external use of it, because rev IDs can be publicly found in API responses and RCFeeds in a very quick way and sometimes a client can submit a follow-up query (without a write session or chronology protector) that hits a different replica that doesn't have it yet. That's still something we'd like to fix long-term since it's a bit of a load risk imho as it stands, but it's an old one so we don't care as much.
For ActorStore, it seems indeed like it can only be a logic bug, state bug, or an own-inserted row that we can get a hold of without a master query (or limited to after we have done writes already, and thus not query master on GET). The current patch inserts it without either of the proposed solutions (insertion cache, and hasWrites-check-for-master). Would it make sense to do one or both of those next, and then try reverting the hotfix?
Mentioned in SAL (#wikimedia-operations) [2021-03-18T23:06:44Z] <brennen> train status: 1.36.0-wmf.35 (T274939) stable on all wikis after deploy of hotfix for T277795
Mentioned in SAL (#wikimedia-operations) [2021-03-18T23:35:42Z] <brennen@deploy1002> Synchronized php-1.36.0-wmf.35/includes/user/ActorStore.php: Backport: [[gerrit:673115|ActorStore::getActorById - fall back to master. (T277795)]] (duration: 00m 58s)
I may have spoken too soon. 2 more of these since 23:05 UTC.
Disregard; synced wrong file. Now corrected. I'll watch logs for a bit, but I think this is a good indication I should back away from the keyboard for the day.
Mentioned in SAL (#wikimedia-operations) [2021-03-18T23:38:51Z] <brennen@deploy1002> Synchronized php-1.36.0-wmf.35/includes/user/ActorStore.php: Backport: [[gerrit:673115|ActorStore::getActorById - fall back to master. (T277795)]] (duration: 00m 57s)
We could - if you think that would be valuable from a performance perspective, please file a ticket and put it on the Platform Team Workboards (MW Expedition) workboard (in this case - generally, please use the general Platform Engineering inbox). But I'm not convinced it's worthwhile:
So I think we stand to gain little by optimizing this. Am I overlooking anything?
RevisionStore has a fairly elaborate and known external use of it, because rev IDs can be publicly found in API responses and RCFeeds in a very quick way and sometimes a client can submit a follow-up query (without a write session or chronology protector) that hits a different replica that doesn't have it yet. That's still something we'd like to fix long-term since it's a bit of a load risk imho as it stands, but it's an old one so we don't care as much.
For actor store, we don't expose the actor ID externally. But the same issue applies for looking up actors by name, which we will do routinely for the current actor. If we fail to find the actor ID for a name because we are hitting a stale replica, we'll just assume there is no actor ID yet; if we need one, we'll try to insert it, the insert will fail, and we'll read it from master. Effectively, that's also falling back to master, but via an attempted insert...
We could avoid that by making lookups by name also fall back to master, but that's expensive, since it would happen very frequently for names we really don't have an ID for yet. As opposed to lookups by ID, which should never fail, unless the replica is stale.
Still seeing User not found by actor ID: messages at a lot rate (< 8 / hour). Most are from jobrunner.discovery.wmnet with en.wikipedia.org coming in second.
Can someone knowledgeable tell me if these errors have any ill effects other than on logs?
Depends on where it happens. Generally, it leads to whatever is currently happening to fail. It will probably make a job fail, or cause Special:Watchlist to error out, etc.
I am looking at one of the instances, actor ID: 25670078 on dewiki. Three failures from the jobrunner, via RefreshLinksJob, within a couple of seconds. The stack trace indicates that this is another instance of RecentChanges being included as a template on a page.
The timestamp of the last error is 2021-03-19T16:09:31 (CET), the timestamp of the edit in the database is 2021-03-19T15:09:27 (UTC). Adjusting for time zone, the error happened three times within four seconds after the edit. This was the first edit recorded for the user. The user-account was auto-created from a global account in 2018, but there had not been any activity since, so I assume the actor ID was freshly assigned when the edit happened.
So, it's a race condition caused by replication lag. But why isn't it fixed by falling back to master?
Is it possible that the job runner has a "repeatable read" snapshot on master that is outdated? Can we somehow flush any such snapshots between jobs? @aaron, do you know?
Change 673534 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Try even harder if we can't find an actor by ID.
Change 674084 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@master] Permanently cache actors we've just created in this request
After @brennen deployed PEtr's fix, I see exactly one occurrence that wasn't triggered via runSingleJob.php: fNYcTHgBGiM4niWIGqTc is an action=parse API request, which apparently tried to parse a page that transcludes Special:RecentChanges. I can see how this would fail the actor of a very recent edit on a replica, but I don't really see how it can fail to find it on the master database. But I don't see how this could happen during a runSingleJob.php query either...
It is curious to note that all failures happen when trying to transclude Special:RecentChanges. Never when just looking at Special:RecentChanges. This has to tell us something, but I'm stumped.
Looking at when these errors happen, I think my patch that changes the lock mode on the query will not help. It would only help to avoid a repeatable-read snapshot. I don't see how or why the queries in question would have such a stale snapshot.
Change 674105 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@master] Do not crash if RecentChange performer not found.
Change 674131 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@master] ActorStore: clean cache if transaction aborted.
This may have the same root cause as T224949: A transaction that inserted into the actor table was rolled back, but the actor ID produced in that transaction is still used and recorded into the database. When a subsequent requests try to find the row in the actor table, they fail - but only briefly, until a different actor is inserted into the actor table, with the same ID.
The fundamental issue is that any auto-increment ID from an insert is only valid within the same transaction. It must not be remembered or cached until the transaction has been committed, or appropriate steps must be taken to clean the cache when the transaction is rolled back. This implies in particular that DeferredUpdates and Jobs that rely on the outcome of a database update must not be enqueued until the transaction has been committed.
Change 674131 merged by jenkins-bot:
[mediawiki/core@master] ActorStore: clean cache if transaction aborted.
Status update: we expect https://gerrit.wikimedia.org/r/674131 to completely mitigate the issue once deployed.
There's some followups we would like to make maybe, but they're not required to unblock the train.
Shouldn't https://gerrit.wikimedia.org/r/674131 be cherry picked and deployed for 1.36.0-wmf.35?
It is at least included in 1.36.0-wmf.36 which I am rolling this week.
I went over a few instances of this.
The data in the database is consistent between revision, revision_actor_temp, actor and recent changes tables - they all reference each other correctly. And yet, 3 seconds after it's been written to the database, we crash not being able to fetch the actor even though we fallback to master now.
15:17 Pchelolo: Krinkle: what do you think about this theory: RefreshLinks is a loooong running job, so we establish a master connection when we begin. At some point, we read RecentChanges from a 'recentchanges' group replica - it's well caught up. Then we try to read 'actor' form a replica from another group - that one is not caught up, we fail. The we fallback to master connection, but it was opened looong time ago, and we read repeatable read s 15:17 Pchelolo: napshot - not find anything as well 15:17 Pchelolo: and thus we crash entirely. 15:18 Krinkle: Pchelolo: is ActorStore in master same as this weeks' wmf branch? 15:18 Pchelolo: same 15:18 Krinkle: k 15:20 Krinkle: Pchelolo: yeah, that seems plausible
In this case we actually have all the data in the row since we've joined it with actor when fetching RecentChange, so we can just instantiate UserIdentityValue directly.
Change 674975 had a related patch set uploaded (by Ppchelko; author: Ppchelko):
[mediawiki/core@master] RecentChange: directly build the user identity if we have the data
Change 674836 had a related patch set uploaded (by Ppchelko; author: Ppchelko):
[mediawiki/core@wmf/1.36.0-wmf.36] RecentChange: directly build the user identity if we have the data
Change 674975 merged by jenkins-bot:
[mediawiki/core@master] RecentChange: directly build the user identity if we have the data
@Pchelolo awesome! I am deploying this fix today given 1.36.0-wmf.36 is only on testwikis for now and this way we are settled for a train deployment on monday. Thank you!
Change 674836 merged by jenkins-bot:
[mediawiki/core@wmf/1.36.0-wmf.36] RecentChange: directly build the user identity if we have the data
Mentioned in SAL (#wikimedia-operations) [2021-03-26T17:44:53Z] <hashar@deploy1002> Synchronized php-1.36.0-wmf.36/includes/changes/RecentChange.php: RecentChange: directly build the user identity if we have the data - T277795 (duration: 01m 06s)
Change 674105 abandoned by Ppchelko:
[mediawiki/core@master] Do not crash if RecentChange performer not found.
Reason:
not needed anymore