
Global renames get stuck at metawiki
Closed, ResolvedPublic

Description

All of the last 15 global renames got stuck on metawiki (or were never even scheduled there; there are no relevant logs, neither errors nor the rename start message). Running fixStuckGlobalRename.php on meta works fine. The last successful non-CLI rename on meta was at 2018-04-26T15:24:07; the next one should have been around 18:00 but was never triggered. The only possibly relevant SAL event in that time frame is 428970: "Enable all jobs for test, test2, testwikidata and mediawiki". mediawikiwiki directly precedes metawiki, and rename jobs run in a chain (a successful job schedules the job for the next wiki), so maybe that breaks with Kafka somehow (job scheduling from LocalRenameJob::scheduleNextWiki fails). See https://wikitech.wikimedia.org/wiki/Stuck_global_renames for the workaround.
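The chain behaviour described above can be sketched as follows. This is a hypothetical Python model, not the real CentralAuth PHP code; the function and parameter names are illustrative, and the real logic lives in LocalRenameJob::scheduleNextWiki.

```python
# Hypothetical sketch of chained per-wiki rename jobs: the job for each
# wiki, on success, schedules the job for the next wiki in order. If the
# next wiki's queue has no consumer, that job never runs and the chain
# stalls there.

def run_rename_chain(wikis, consumer_listens):
    """Return the wikis whose rename job actually executed."""
    done = []
    next_wiki = wikis[0] if wikis else None
    while next_wiki is not None:
        done.append(next_wiki)  # job executed on this wiki
        idx = wikis.index(next_wiki)
        if idx + 1 < len(wikis):
            candidate = wikis[idx + 1]
            # Enqueueing "succeeds", but if nothing consumes that wiki's
            # queue the job never executes, stalling the chain.
            next_wiki = candidate if consumer_listens(candidate) else None
        else:
            next_wiki = None
    return done

# mediawikiwiki directly precedes metawiki; with no consumer for
# metawiki's queue, the chain stops right after mediawikiwiki:
processed = run_rename_chain(
    ["mediawikiwiki", "metawiki", "minwiki"],
    lambda wiki: wiki != "metawiki",
)
```

This matches the observed symptom: the logs simply stop after mediawikiwiki, with no error, because the stall happens on the consumer side rather than at enqueue time.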

Details

Related Gerrit Patches:
operations/mediawiki-config (master): Switch LocalRenameUserJob to kafka.
mediawiki/services/change-propagation/jobqueue-deploy (master): Switch LocalRenameUserJob for all wikis.

Event Timeline

MarcoAurelio removed Tgr as the assignee of this task.Apr 28 2018, 12:10 PM
MarcoAurelio added a subscriber: Jdforrester-WMF.

Removing assignee to let people know that, on this reopening, no one is working on this new instance yet. Ping @Jdforrester-WMF regarding @Nirmos's question above.

MarcoAurelio rescinded a token.
MarcoAurelio awarded a token.
Seba98 added a subscriber: Seba98.Apr 28 2018, 5:28 PM

Note:
All renames are on hold until this problem is fixed, so we are talking about more than 100 pending requests across all wikis!

I only see one stuck global rename right now. It is true, however, that requests can queue up in numbers on the wikis and in the rename queue. It's the weekend, so I'm not sure who's 'on duty' today who could have a look at this.

Tgr added a comment.Apr 29 2018, 9:30 AM

Probably not, there is no reason that would affect meta only. Also I see no database error. The logs just stop when reaching meta.

Any idea why that may be happening? Issues with the meta job queue? Thanks.

Nqtema: stuck at meta again. @Tgr

Please do not forget to fix this.

Tgr added a subscriber: mobrovac.Apr 29 2018, 9:56 AM

@mobrovac do you know if LocalRenameUserJob jobs on meta (and only there) could somehow be affected by the Redis-Kafka migration? I'm probably grasping at straws here, but not sure where else to look.

Tgr added a comment.Apr 29 2018, 10:05 AM

It seems the last successful non-CLI rename on meta was at 2018-04-26T15:24:07 (let's hope those are UTC timestamps, IIRC Kibana is a bit confused about time), the next one should have been around 18:00 but did not get triggered. The only possibly relevant SAL event in that time frame is 428970: Enable all jobs for test, test2, testwikidata and mediawiki. mediawikiwiki directly precedes metawiki, and rename jobs run in a chain (a successful job schedules the job for the next wiki) so maybe that breaks with Kafka somehow.

revi added a subscriber: revi.Apr 30 2018, 7:54 AM

@mobrovac do you know if LocalRenameUserJob jobs on meta (and only there) could somehow be affected by the Redis-Kafka migration? I'm probably grasping at straws here, but not sure where else to look.

T192405 might be related, but it was fixed quite a while ago. I do not see any logs about user-rename jobs actually failing, but I'll keep looking.

Nqtema: stuck at meta again. @Tgr

@Pchelolo can anyone at least fix this case?

Thanks in advance.

There doesn't seem to be anything wrong with the transport mechanism: the jobs that got executed on the EventBus side all completed successfully.

There are 14 global renames stuck, all at meta.wikimedia.

There is only one stuck job now and it completed for mediawikiwiki. It is currently stuck for meta with the message "Sending approval email to User:Ajh98/<email-hidden>"

mobrovac lowered the priority of this task from Unbreak Now! to High.Apr 30 2018, 3:28 PM

@mobrovac: the reason there is only one stuck currently is likely that we have stopped doing renames until this is fixed, because having users locked out of their accounts isn't ideal.

Tgr added a comment.Apr 30 2018, 3:58 PM

There is only one stuck job now and it completed for mediawikiwiki.

The others have been closed with the maintenance script (which does not use the job queue). They all got stuck on meta (and all completed fine for mediawikiwiki). My hypothesis is that job scheduling from LocalRenameJob::scheduleNextWiki gets stuck (it uses the wrong job queue scheduling service or something).

The problem started within hours of Kafka being enabled on mediawikiwiki, and it affects the wiki that's after mediawikiwiki alphabetically (which is the order global renames go), so it seems pretty likely there is a connection. (Except Husseinzadeh02/Hüseynzadə which also got stuck on the next wiki, minwiki, and the job did not finish properly on mediawikiwiki either. No idea what's up with that one.)

It is currently stuck for meta with the message "Sending approval email to User:Ajh98/<email-hidden>"

That's not a job (you can see from the URL). The special page that's scheduling the jobs is on meta so it probably comes from the web request where a steward/renamer submitted the confirmation form.

As far as I can tell (not very far) the jobs don't get stuck on meta, they never even start.

revi added a comment.EditedApr 30 2018, 4:00 PM

There is only one stuck job now and it completed for mediawikiwiki. It is currently stuck for meta with the message "Sending approval email to User:Ajh98/<email-hidden>"

That is because global renamers and stewards always halt all of their operations as soon as we realize any request is stuck somewhere, until the situation is fully resolved. There are at least 80 requests (GlobalRenameQueue + English Wikipedia CHUS) waiting on this bug.

Tgr added a comment.Apr 30 2018, 4:02 PM

Anyway, fixed Ajh98/Nqtema with the script.

Is there a per-jobtype debug logging mode? If so, we should probably flip that on and test with the next rename request.

Anyway, fixed Ajh98/Nqtema with the script.
Is there a per-jobtype debug logging mode? If so, we should probably flip that on and test with the next rename request.

I'll run a new test, one minute.


I tried here, and it's also stuck on Meta (or, as @Tgr said, it "never even started").

Tgr renamed this task from Please unblock stuck global renames at Meta-Wiki to Global renames get stuck at metawiki.Apr 30 2018, 4:18 PM
Tgr updated the task description. (Show Details)
Tgr edited projects, added Event-Platform; removed WMF-JobQueue.
Restricted Application added a project: Analytics. · View Herald TranscriptApr 30 2018, 4:20 PM

The problem started within hours of Kafka being enabled on mediawikiwiki, and it affects the wiki that's after mediawikiwiki alphabetically (which is the order global renames go), so it seems pretty likely there is a connection. (Except Husseinzadeh02/Hüseynzadə which also got stuck on the next wiki, minwiki, and the job did not finish properly on mediawikiwiki either. No idea what's up with that one.)

I believe this is the case. When switching the job to Kafka, it was done only for the test wikis and mediawikiwiki. So now, when the job is executed for mediawikiwiki, it enqueues the next job in the context of mediawikiwiki, and since that wiki was switched to Kafka, it only enqueues it into the Kafka queue. But since metawiki was not switched to Kafka, the job gets ignored by change-prop, so we get stuck.

We could either revert switching the LocalUserRenameJob right now (that will obviously fix the immediate problem, but we will still need to switch it at some point) or we can just push through and switch the job for all the wikis.

I believe this is the case. When switching the job to Kafka, it was done only for the test wikis and mediawikiwiki. So now, when the job is executed for mediawikiwiki, it enqueues the next job in the context of mediawikiwiki, and since that wiki was switched to Kafka, it only enqueues it into the Kafka queue. But since metawiki was not switched to Kafka, the job gets ignored by change-prop, so we get stuck.

The actual bug is in mediawiki/core where JobQueueGroup picks up the current wiki's wgJobTypeConf instead of the target's one.

We could either revert switching the LocalUserRenameJob right now (that will obviously fix the immediate problem, but we will still need to switch it at some point) or we can just push through and switch the job for all the wikis.

Given that this is a global job, and we already know it can be executed for mediawikiwiki, I would be in favour of switching it on for all wikis.
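The core bug identified above can be modelled in a few lines. This is an illustrative Python sketch, not MediaWiki code: the dict layout and function names are assumptions, while the wiki names, the job type, and the backend split come from the thread.

```python
# Model of the bug: when a job is enqueued *for another wiki*, the
# queue backend is chosen from the current wiki's job-type config
# rather than the target wiki's.

JOB_TYPE_CONF = {
    "mediawikiwiki": {"LocalRenameUserJob": "kafka"},  # switched
    "metawiki": {"LocalRenameUserJob": "redis"},       # not yet switched
}

def enqueue_buggy(current_wiki, target_wiki, job_type):
    # Bug: the *current* wiki's config decides the backend.
    backend = JOB_TYPE_CONF[current_wiki][job_type]
    return (target_wiki, backend)

def enqueue_fixed(current_wiki, target_wiki, job_type):
    # Intended behaviour: the *target* wiki's config decides.
    backend = JOB_TYPE_CONF[target_wiki][job_type]
    return (target_wiki, backend)

# The buggy path drops metawiki's job onto Kafka, where change-prop
# was not consuming LocalRenameUserJob for metawiki at the time:
buggy = enqueue_buggy("mediawikiwiki", "metawiki", "LocalRenameUserJob")
fixed = enqueue_fixed("mediawikiwiki", "metawiki", "LocalRenameUserJob")
```

This also shows why switching the job for all wikis resolves the symptom even without the core fix: once both configs agree on the backend, the buggy and fixed lookups return the same result.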

Change 429835 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Switch LocalRenameUserJob for all wikis.

https://gerrit.wikimedia.org/r/429835

Change 429836 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Switch LocalRenameUserJob to kafka.

https://gerrit.wikimedia.org/r/429836

fdans raised the priority of this task from High to Needs Triage.Apr 30 2018, 4:32 PM
fdans moved this task from Incoming to Radar on the Analytics board.
Milimetric triaged this task as High priority.Apr 30 2018, 4:34 PM
Milimetric added a subscriber: Milimetric.

sorry - reverting accidental change of priority

Change 429835 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Switch LocalRenameUserJob for all wikis.

https://gerrit.wikimedia.org/r/429835

Change 429836 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch LocalRenameUserJob to kafka.

https://gerrit.wikimedia.org/r/429836

Mentioned in SAL (#wikimedia-operations) [2018-04-30T16:49:39Z] <ppchelko@tin> Started deploy [cpjobqueue/deploy@01630f2]: Switch LocalRenameUserJob for all wikis. T193254

Mentioned in SAL (#wikimedia-operations) [2018-04-30T16:50:28Z] <ppchelko@tin> Finished deploy [cpjobqueue/deploy@01630f2]: Switch LocalRenameUserJob for all wikis. T193254 (duration: 00m 49s)

Mentioned in SAL (#wikimedia-operations) [2018-04-30T16:50:32Z] <mobrovac@tin> Synchronized wmf-config/jobqueue.php: Switch LocalRenameUserJob to EventBus for all wikis - T193254 T190327 (duration: 00m 59s)

mobrovac closed this task as Resolved.Apr 30 2018, 4:57 PM
mobrovac claimed this task.
mobrovac edited projects, added Services (done); removed Patch-For-Review, Services (doing).

We have switched the LocalRenameUserJob to EventBus for all wikis, so we don't anticipate any problems going forward (given that we already know the jobs work for mediawikiwiki). Resolving; feel free to reopen if the problem persists.

@mobrovac
I think it's still the same!
See here: the process completed quickly on all projects, but it is still in "Queued" status at Meta-Wiki!

As you know, we can't say it's resolved until we're sure. There are many pending requests, so if we tell all global renamers that the issue is solved, there will be a large number of rename processes in the log!

alanajjar reopened this task as Open.Apr 30 2018, 6:32 PM
alanajjar closed this task as Resolved.Apr 30 2018, 6:37 PM

Thanks a lot all

As you know, we can't say it's resolved until we're sure. There are many pending requests, so if we tell all global renamers that the issue is solved, there will be a large number of rename processes in the log!

The one you were referring to was requested before the issue was fixed. Re-enqueueing the message made it go away. Should all be good now.

Yes @Pchelolo, I noticed that now. Thanks again.

Do we need to migrate CentralAuthRename too? If so, can it be done? Thanks.

Do we need to migrate CentralAuthRename too? If so, can it be done? Thanks.

Eventually everything will be migrated. Are you seeing problems with that job right now?

Tgr added a comment.Apr 30 2018, 9:26 PM

That's a log channel, not a job queue. Other potentially affected jobs are LocalUserMergeJob (not sure if Wikimedia wikis still allow merges) and LocalPageMoveJob (I think that's triggered differently, not quite sure though).

Tgr added a comment.Apr 30 2018, 9:27 PM

LocalPageMoveJob (I think that's triggered differently, not quite sure though).

Yes it is. So LocalUserMergeJob is the only one that might be affected.

We are not performing any user account merges, either globally or locally. Regards.

Tgr added a comment.Apr 30 2018, 9:36 PM

Other instances of cross-wiki job scheduling turned up by a quick ack 'JobQueueGroup::singleton\( ': Cognate/LocalJobSubmitJob, MassMessage/MassMessageSubmitJob, GlobalUsage/GlobalUsageCachePurgeJob, GlobalUserPage/LocalJobSubmitJob, SecurePoll/PopulateVoterListJob. There are probably more (e.g. global blocking works that way, IIRC). So we should probably get the core bug fixed.

Other instances of cross-wiki job scheduling that are yielded by a quick ack 'JobQueueGroup::singleton\( ': Cognate/LocalJobSubmitJob, MassMessage/MassMessageSubmitJob, GlobalUsage/GlobalUsageCachePurgeJob, GlobalUserPage/LocalJobSubmitJob, SecurePoll/PopulateVoterListJob.

This problem only occurs when the job is partially switched, which we do for the gradual rollout of new jobs, but ironically, in this case, being cautious actually breaks things.

Most of the jobs on the list are already switched everywhere; the two missing ones are MassMessage/MassMessageSubmitJob and SecurePoll/PopulateVoterListJob. We should either exclude them from the transition until the core bug is fixed, or switch them right away for all wikis.
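A quick way to think about the risky set is to audit for job types that are switched on some wikis but not all. This is a hedged Python sketch: the job names come from the thread, but the per-job switch state and the data layout are illustrative assumptions, not the real deployment config.

```python
# Audit sketch for "partially switched" cross-wiki job types, the
# dangerous state described above: a job switched on the scheduling
# wiki but not on the target wiki can strand follow-up jobs in a
# queue nothing consumes.

switch_state = {
    "LocalRenameUserJob": {"all"},                        # fully switched
    "MassMessageSubmitJob": {"testwiki", "mediawikiwiki"},  # partial
    "PopulateVoterListJob": {"testwiki"},                   # partial
}

def partially_switched(state):
    """Return job types not yet switched for all wikis, sorted."""
    return sorted(job for job, wikis in state.items() if "all" not in wikis)

risky = partially_switched(switch_state)
```

Until the core bug is fixed, each job type in the `risky` set should either be held back entirely or switched everywhere in one step, matching the two options above.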

Is this related to T192604 anyhow? Regards.

@Pchelolo @mobrovac @Tgr Would it be helpful if we created a tracking task for stuck renames (like T169440)?

So we should probably get the core bug fixed.

I filed T193471: JobQueueGroup's singletons using the wrong wgJobTypeConf for this.

Most of the jobs from the list are already switched for everything, the 2 missing ones are MassMessage/MassMessageSubmitJob and SecurePoll/PopulateVoterListJob. We should either exclude them from the transition until the core bug is fixed, or switch them right away for all the wikis.

I think we should go ahead and switch these two for all wikis at this point, given that we might be losing their executions without even knowing it.

Is this related to T192604 anyhow? Regards.

Not likely; Beta is on a completely different system. We'll have to take a look at that one separately.

Tgr added a comment.May 1 2018, 10:11 AM

Thanks!

I think we should go ahead and switch these two for all wikis at this point, given that we might be losing their executions without even knowing it.

Note there might be other affected jobs too, e.g. DeferredUpdates can happen cross-wiki; I haven't tried to follow those calls.

Vvjjkkii renamed this task from Global renames get stuck at metawiki to 52daaaaaaa.Jul 1 2018, 1:13 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed mobrovac as the assignee of this task.
Vvjjkkii updated the task description. (Show Details)
1997kB renamed this task from 52daaaaaaa to Global renames get stuck at metawiki.Jul 1 2018, 2:41 AM
1997kB closed this task as Resolved.
1997kB assigned this task to mobrovac.
1997kB updated the task description. (Show Details)
1997kB added subscribers: GerritBot, MarcoAurelio, Aklapper.
1997kB edited subscribers, added: gerritbot; removed: GerritBot.
MarcoAurelio moved this task from Backlog to Closed on the GlobalRename board.Sep 1 2018, 12:15 PM
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:09 PM