Description

All of the last 15 global renames got stuck on metawiki (or were never even scheduled there; there are no relevant logs, neither errors nor the rename start message). Running fixStuckGlobalRename.php on meta works fine. The last successful non-CLI rename on meta was at 2018-04-26T15:24:07; the next one should have been around 18:00 but did not get triggered. The only possibly relevant SAL event in that time frame is 428970: Enable all jobs for test, test2, testwikidata and mediawiki. mediawikiwiki directly precedes metawiki, and rename jobs run in a chain (a successful job schedules the job for the next wiki), so maybe that breaks with Kafka somehow ([[https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/master/includes/LocalRenameJob/LocalRenameJob.php#L154|job scheduling from LocalRenameJob::scheduleNextWiki]] fails). See https://wikitech.wikimedia.org/wiki/Stuck_global_renames for the workaround.

Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | Pchelolo | T157088 [EPIC] Develop a JobQueue backend based on EventBus |
| Resolved | | Pchelolo | T190327 FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus |
| Resolved | PRODUCTION ERROR | mobrovac | T193254 Global renames get stuck at metawiki |
Event Timeline
Removing assignee to let people know that, with this reopening, no one is working on this new instance yet. Ping @Jdforrester-WMF wrt @Nirmos's question above.
Note:
All name changes are turned off until this problem is fixed, so we are talking about more than 100 pending requests across all wikis!
I only see one stuck global rename right now. It is true, however, that requests can queue up in numbers on the wikis and in the rename queue. It's the weekend, so I'm not sure who's 'on duty' today who could have a look at this.
Probably not; there is no reason that would affect meta only. Also, I see no database error. The logs just stop when reaching meta.
@mobrovac do you know if LocalRenameUserJob jobs on meta (and only there) could somehow be affected by the Redis-Kafka migration? I'm probably grasping at straws here, but not sure where else to look.
It seems the last successful non-CLI rename on meta was at 2018-04-26T15:24:07 (let's hope those are UTC timestamps, IIRC Kibana is a bit confused about time), the next one should have been around 18:00 but did not get triggered. The only possibly relevant SAL event in that time frame is 428970: Enable all jobs for test, test2, testwikidata and mediawiki. mediawikiwiki directly precedes metawiki, and rename jobs run in a chain (a successful job schedules the job for the next wiki) so maybe that breaks with Kafka somehow.
> @mobrovac do you know if LocalRenameUserJob jobs on meta (and only there) could somehow be affected by the Redis-Kafka migration? I'm probably grasping at straws here, but not sure where else to look.
This T192405 might be related, but it was fixed quite a long time ago. I do not see any logs about jobs related to user renaming actually failing, but I'll keep looking
There doesn't seem to be anything wrong with the transport mechanism: the jobs that got executed on the EventBus side all completed successfully.
There are 14 global renames stuck, all at meta.wikimedia.
There is only one stuck job now and it completed for mediawikiwiki. It is currently stuck for meta with the message "Sending approval email to User:Ajh98/<email-hidden>"
@mobrovac: the reason there is only one stuck currently is likely that we have stopped doing renames until this is fixed, since having users locked out of their accounts isn't ideal.
The others have been closed with the maintenance script (which does not use the job queue). They all got stuck on meta (and all completed fine for mediawikiwiki). My hypothesis is that [[https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/master/includes/LocalRenameJob/LocalRenameJob.php#L154|job scheduling from LocalRenameJob::scheduleNextWiki]] gets stuck (it uses the wrong job queue scheduling service or something).
The problem started within hours of Kafka being enabled on mediawikiwiki, and it affects the wiki that's after mediawikiwiki alphabetically (which is the order global renames go), so it seems pretty likely there is a connection. (Except Husseinzadeh02/Hüseynzadə which also got stuck on the next wiki, minwiki, and the job did not finish properly on mediawikiwiki either. No idea what's up with that one.)
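For reference, the chained scheduling that this hypothesis refers to follows roughly the pattern below. This is a simplified sketch, not the actual LocalRenameJob::scheduleNextWiki() implementation (see the linked source); the function name, wiki-list handling and parameters are placeholders.

```php
<?php
// Simplified sketch of chained cross-wiki scheduling (assumes a MediaWiki
// runtime of that era, where JobQueueGroup::singleton() takes a wiki ID).
function scheduleNextWikiSketch( array $remainingWikis, string $jobType, array $params ) {
	if ( $remainingWikis === [] ) {
		return; // the rename has completed on every wiki
	}
	// Wikis are processed in alphabetical order, so mediawikiwiki
	// directly precedes metawiki here.
	$nextWiki = array_shift( $remainingWikis );

	$job = new JobSpecification( $jobType, $params, [], Title::newMainPage() );

	// Push the same job type onto the *next* wiki's queue. This is the
	// step suspected of misrouting when the job type is only partially
	// switched to Kafka.
	JobQueueGroup::singleton( $nextWiki )->push( $job );
}
```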
> It is currently stuck for meta with the message "Sending approval email to User:Ajh98/<email-hidden>"
That's not a job (you can see from the URL). The special page that's scheduling the jobs is on meta so it probably comes from the web request where a steward/renamer submitted the confirmation form.
As far as I can tell (not very far) the jobs don't get stuck on meta, they never even start.
That is because global renamers and stewards always halt all of their operations as soon as we realize any request is stuck somewhere, until the situation is fully resolved. There are at least 80 requests (GlobalRenameQueue + English Wikipedia CHUS) waiting on this bug.
Anyway, fixed Ajh98/Nqtema with the script.
Is there a per-jobtype debug logging mode? If so, we should probably flip that on and test with the next rename request.
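I'm not aware of a per-job-type toggle, but MediaWiki's channel-based logging can be routed per channel. A minimal sketch, assuming the rename-related messages go to the CentralAuthRename log channel (the channel name and log path here are assumptions):

```php
<?php
// LocalSettings.php-style sketch: $wgDebugLogGroups routes a named log
// channel to its own file, so rename progress can be inspected separately.
// 'CentralAuthRename' and the file path are assumed, not verified config.
$wgDebugLogGroups['CentralAuthRename'] = '/var/log/mediawiki/centralauth-rename.log';
```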
> The problem started within hours of Kafka being enabled on mediawikiwiki, and it affects the wiki that's after mediawikiwiki alphabetically (which is the order global renames go), so it seems pretty likely there is a connection. (Except Husseinzadeh02/Hüseynzadə which also got stuck on the next wiki, minwiki, and the job did not finish properly on mediawikiwiki either. No idea what's up with that one.)
I believe this is the case. When switching the job to Kafka, it was done only for the test wikis and mediawikiwiki. So now, when the job is executed for mediawikiwiki, it enqueues the next job in the context of mediawikiwiki, and since that wiki has switched to Kafka, it only enqueues it into the Kafka queue; but since metawiki was not switched to Kafka, the job gets ignored by change-prop, so we get stuck.
We could either revert switching the LocalUserRenameJob right now (that will obviously fix the immediate problem, but we will still need to switch it at some point) or we can just push through and switch the job for all the wikis.
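To make the partial switch concrete: during the rollout, the switched wikis and everyone else effectively end up with different $wgJobTypeConf entries for the same job type. A hypothetical sketch, not the actual wmf-config/jobqueue.php contents (the wiki list and backend class spec are assumptions based on the rollout described above):

```php
<?php
// Hypothetical sketch of a partial per-wiki switch (not the real
// wmf-config/jobqueue.php). On the switched wikis the job type is routed
// to the EventBus/Kafka backend; everywhere else it falls back to the
// default (Redis) backend. A metawiki job produced to Kafka from
// mediawikiwiki's context is then ignored by change-prop, because
// metawiki is not in the switched set.
$kafkaJobWikis = [ 'testwiki', 'test2wiki', 'testwikidatawiki', 'mediawikiwiki' ];

if ( in_array( $wgDBname, $kafkaJobWikis, true ) ) {
	$wgJobTypeConf['LocalRenameUserJob'] = [ 'class' => 'JobQueueEventBus' ];
}
// Otherwise LocalRenameUserJob keeps using $wgJobTypeConf['default'].
```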
The actual bug is in mediawiki/core where JobQueueGroup [picks up the current wiki's wgJobTypeConf instead of the target's one](https://github.com/wikimedia/mediawiki/blob/bfbc44648dd690e5abf89277522bea973744f6d8/includes/jobqueue/JobQueueGroup.php#L108).
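Roughly, the backend for a job type is resolved from the local wiki's globals rather than the target wiki's. A simplified sketch of that lookup (not the exact JobQueueGroup::get() code; the function name is a placeholder):

```php
<?php
// Simplified sketch of the lookup described above (not the exact core
// code): regardless of which wiki the queue group targets, the backend
// spec is taken from the local wiki's $wgJobTypeConf.
function resolveQueueConfSketch( string $targetWiki, string $type ): array {
	global $wgJobTypeConf;

	$conf = [ 'wiki' => $targetWiki, 'type' => $type ];
	// The lookup below reads the *local* globals, so when mediawikiwiki
	// schedules LocalRenameUserJob for metawiki, it still gets
	// mediawikiwiki's (Kafka/EventBus) backend, not metawiki's.
	$conf += $wgJobTypeConf[$type] ?? $wgJobTypeConf['default'];

	return $conf; // in core this is then passed to JobQueue::factory()
}
```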
> We could either revert switching the LocalUserRenameJob right now (that will obviously fix the immediate problem, but we will still need to switch it at some point) or we can just push through and switch the job for all the wikis.
Given that this is a global job, and we already know it can be executed for mediawikiwiki, I would be in favour of switching it on for all wikis.
Change 429835 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Switch LocalRenameUserJob for all wikis.
Change 429836 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Switch LocalRenameUserJob to kafka.
Change 429835 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Switch LocalRenameUserJob for all wikis.
Change 429836 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch LocalRenameUserJob to kafka.
Mentioned in SAL (#wikimedia-operations) [2018-04-30T16:49:39Z] <ppchelko@tin> Started deploy [cpjobqueue/deploy@01630f2]: Switch LocalRenameUserJob for all wikis. T193254
Mentioned in SAL (#wikimedia-operations) [2018-04-30T16:50:28Z] <ppchelko@tin> Finished deploy [cpjobqueue/deploy@01630f2]: Switch LocalRenameUserJob for all wikis. T193254 (duration: 00m 49s)
Mentioned in SAL (#wikimedia-operations) [2018-04-30T16:50:32Z] <mobrovac@tin> Synchronized wmf-config/jobqueue.php: Switch LocalRenameUserJob to EventBus for all wikis - T193254 T190327 (duration: 00m 59s)
We have switched the LocalRenameUserJob for all wikis to EventBus, so we don't anticipate any problems going forward (given that we already know the jobs work for mediawikiwiki). Resolving; feel free to reopen if the problem persists.
@mobrovac
I think it is still the same!
See here: the process completed quickly on all projects, but it is still in (Queued) status at Meta wiki!
As you know, we can't call this resolved until we are sure, because there are many pending requests; if we tell all global renamers that the issue is solved, there will be a large number of rename processes in the log!
> As you know, we can't call this resolved until we are sure, because there are many pending requests; if we tell all global renamers that the issue is solved, there will be a large number of rename processes in the log!
The one you were referring to was requested before the issue was fixed. Re-enqueueing the message made it go away. Should all be good now.
Do we need to migrate CentralAuthRename too? If so, can it be done? Thanks.
Eventually everything will be migrated. Are you seeing problems with that job right now?
That's a log channel, not a job queue. Other potentially affected jobs are LocalUserMergeJob (not sure if Wikimedia wikis still allow merges) and LocalPageMoveJob (I think that's triggered differently, not quite sure though).
Other instances of cross-wiki job scheduling that are yielded by a quick ack 'JobQueueGroup::singleton\( ': Cognate/LocalJobSubmitJob, MassMessage/MassMessageSubmitJob, GlobalUsage/GlobalUsageCachePurgeJob, GlobalUserPage/LocalJobSubmitJob, SecurePoll/PopulateVoterListJob. There are probably more (e.g. global blocking works that way IIRC). So we should probably get the core bug fixed.
> Other instances of cross-wiki job scheduling that are yielded by a quick ack 'JobQueueGroup::singleton\( ': Cognate/LocalJobSubmitJob, MassMessage/MassMessageSubmitJob, GlobalUsage/GlobalUsageCachePurgeJob, GlobalUserPage/LocalJobSubmitJob, SecurePoll/PopulateVoterListJob.
This problem only occurs when the job is partially switched, which we do for the gradual rollout of new jobs, but ironically, in this case, being cautious actually breaks things.
Most of the jobs from the list are already switched for everything; the two missing ones are MassMessage/MassMessageSubmitJob and SecurePoll/PopulateVoterListJob. We should either exclude them from the transition until the core bug is fixed, or switch them for all wikis right away.
I filed T193471: JobQueueGroup's singletons using the wrong wgJobTypeConf for this.
I think we should go ahead and switch these two for all wikis at this point, given that we might be losing their executions without even knowing it.
Not likely; Beta is on a completely different system, so we'll have to take a look at that one separately.
Thanks!
> I think we should go ahead and switch these two for all wikis at this point, given that we might be losing their executions without even knowing it.
Note there might be other affected jobs too, e.g. DeferredUpdates can happen cross-wiki, I haven't tried to follow those calls.