Description

All of the last 15 global renames got stuck on metawiki (or were never even scheduled there; there are no relevant logs, neither errors nor the rename start message). Running fixStuckGlobalRename.php on meta works fine. The last successful non-CLI rename on meta was at 2018-04-26T15:24:07; the next one should have been around 18:00 but did not get triggered. The only possibly relevant SAL event in that time frame is 428970: Enable all jobs for test, test2, testwikidata and mediawiki. mediawikiwiki directly precedes metawiki, and rename jobs run in a chain (a successful job schedules the job for the next wiki), so maybe that breaks with Kafka somehow ([[https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/master/includes/LocalRenameJob/LocalRenameJob.php#L154|job scheduling from LocalRenameJob::scheduleNextWiki]] fails). See https://wikitech.wikimedia.org/wiki/Stuck_global_renames for the workaround.

Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | Pchelolo | T157088 [EPIC] Develop a JobQueue backend based on EventBus |
| Resolved | | Pchelolo | T190327 FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus |
| Resolved | PRODUCTION ERROR | mobrovac | T193254 Global renames get stuck at metawiki |
Event Timeline
Removing assignee to let people know that, with this reopening, no one is working on this new instance yet. Ping @Jdforrester-WMF wrt @Nirmos's question above.
Note:
All name changes are turned off until this problem is fixed, so we are talking about more than 100 pending requests across all wikis!
I only see one stuck global rename right now. It is true, however, that requests can queue up in numbers on the wikis and in the rename queue. It's the weekend, so I'm not sure who's 'on duty' today who could have a look at this.
Probably not; there is no reason that would affect meta only. Also, I see no database error. The logs just stop when reaching meta.
@mobrovac do you know if LocalRenameUserJob jobs on meta (and only there) could somehow be affected by the Redis-Kafka migration? I'm probably grasping at straws here, but not sure where else to look.
It seems the last successful non-CLI rename on meta was at 2018-04-26T15:24:07 (let's hope those are UTC timestamps, IIRC Kibana is a bit confused about time), the next one should have been around 18:00 but did not get triggered. The only possibly relevant SAL event in that time frame is 428970: Enable all jobs for test, test2, testwikidata and mediawiki. mediawikiwiki directly precedes metawiki, and rename jobs run in a chain (a successful job schedules the job for the next wiki) so maybe that breaks with Kafka somehow.
> @mobrovac do you know if LocalRenameUserJob jobs on meta (and only there) could somehow be affected by the Redis-Kafka migration? I'm probably grasping at straws here, but not sure where else to look.
This T192405 might be related, but it was fixed quite a long time ago. I do not see any logs about jobs related to user renaming actually failing, but I'll keep looking
There doesn't seem to be anything wrong with the transport mechanism: the jobs that got executed on the EventBus side all completed successfully.
There are 14 global renames stuck, all at meta.wikimedia.
There is only one stuck job now and it completed for mediawikiwiki. It is currently stuck for meta with the message "Sending approval email to User:Ajh98/<email-hidden>"
@mobrovac: the reason there is only one stuck currently is likely that we have stopped doing renames until this is fixed, since having users locked out of their accounts isn't ideal.
The others have been closed with the maintenance script (which does not use the job queue). They all got stuck on meta (and all completed fine for mediawikiwiki). My hypothesis is that [[https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/master/includes/LocalRenameJob/LocalRenameJob.php#L154|job scheduling from LocalRenameJob::scheduleNextWiki]] gets stuck (it uses the wrong job queue scheduling service or something).
The problem started within hours of Kafka being enabled on mediawikiwiki, and it affects the wiki that's after mediawikiwiki alphabetically (which is the order global renames go), so it seems pretty likely there is a connection. (Except Husseinzadeh02/Hüseynzadə which also got stuck on the next wiki, minwiki, and the job did not finish properly on mediawikiwiki either. No idea what's up with that one.)
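For reference, the chained scheduling that this hypothesis refers to follows roughly the pattern below. This is a simplified sketch, not the actual LocalRenameJob::scheduleNextWiki() implementation (see the linked source); the function name, wiki-list handling and parameters are placeholders.

```php
<?php
// Simplified sketch of chained cross-wiki scheduling (assumes a MediaWiki
// runtime of that era, where JobQueueGroup::singleton() takes a wiki ID).
function scheduleNextWikiSketch( array $remainingWikis, string $jobType, array $params ) {
	if ( $remainingWikis === [] ) {
		return; // the rename has completed on every wiki
	}
	// Wikis are processed in alphabetical order, so mediawikiwiki
	// directly precedes metawiki here.
	$nextWiki = array_shift( $remainingWikis );

	$job = new JobSpecification( $jobType, $params, [], Title::newMainPage() );

	// Push the same job type onto the *next* wiki's queue. This is the
	// step suspected of misrouting when the job type is only partially
	// switched to Kafka.
	JobQueueGroup::singleton( $nextWiki )->push( $job );
}
```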
> It is currently stuck for meta with the message "Sending approval email to User:Ajh98/<email-hidden>"
That's not a job (you can see from the URL). The special page that's scheduling the jobs is on meta so it probably comes from the web request where a steward/renamer submitted the confirmation form.
As far as I can tell (not very far) the jobs don't get stuck on meta, they never even start.
That is because global renamers and stewards always halt all of their operations as soon as we realize any request is stuck somewhere, until the situation is fully resolved. There are at least 80 requests (GlobalRenameQueue + English Wikipedia CHUS) waiting on this bug.
Anyway, fixed Ajh98/Nqtema with the script.
Is there a per-jobtype debug logging mode? If so, we should probably flip that on and test with the next rename request.
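I'm not aware of a per-job-type toggle, but MediaWiki's channel-based logging can be routed per channel. A minimal sketch, assuming the rename-related messages go to the CentralAuthRename log channel (the channel name and log path here are assumptions):

```php
<?php
// LocalSettings.php-style sketch: $wgDebugLogGroups routes a named log
// channel to its own file, so rename progress can be inspected separately.
// 'CentralAuthRename' and the file path are assumed, not verified config.
$wgDebugLogGroups['CentralAuthRename'] = '/var/log/mediawiki/centralauth-rename.log';
```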
> The problem started within hours of Kafka being enabled on mediawikiwiki, and it affects the wiki that's after mediawikiwiki alphabetically (which is the order global renames go), so it seems pretty likely there is a connection. (Except Husseinzadeh02/Hüseynzadə which also got stuck on the next wiki, minwiki, and the job did not finish properly on mediawikiwiki either. No idea what's up with that one.)
I believe this is the case. When switching the job to Kafka, it was done only for the test wikis and mediawikiwiki. So now, when the job is executed for mediawikiwiki, it enqueues the next job in the context of mediawikiwiki, and since that wiki has switched to Kafka, it only enqueues it into the Kafka queue; but since metawiki was not switched to Kafka, the job gets ignored by change-prop, so we get stuck.
We could either revert switching the LocalUserRenameJob right now (that will obviously fix the immediate problem, but we will still need to switch it at some point) or we can just push through and switch the job for all the wikis.
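To make the partial switch concrete: during the rollout, the switched wikis and everyone else effectively end up with different $wgJobTypeConf entries for the same job type. A hypothetical sketch, not the actual wmf-config/jobqueue.php contents (the wiki list and backend class spec are assumptions based on the rollout described above):

```php
<?php
// Hypothetical sketch of a partial per-wiki switch (not the real
// wmf-config/jobqueue.php). On the switched wikis the job type is routed
// to the EventBus/Kafka backend; everywhere else it falls back to the
// default (Redis) backend. A metawiki job produced to Kafka from
// mediawikiwiki's context is then ignored by change-prop, because
// metawiki is not in the switched set.
$kafkaJobWikis = [ 'testwiki', 'test2wiki', 'testwikidatawiki', 'mediawikiwiki' ];

if ( in_array( $wgDBname, $kafkaJobWikis, true ) ) {
	$wgJobTypeConf['LocalRenameUserJob'] = [ 'class' => 'JobQueueEventBus' ];
}
// Otherwise LocalRenameUserJob keeps using $wgJobTypeConf['default'].
```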
The actual bug is in mediawiki/core where JobQueueGroup [picks up the current wiki's wgJobTypeConf instead of the target's one](https://github.com/wikimedia/mediawiki/blob/bfbc44648dd690e5abf89277522bea973744f6d8/includes/jobqueue/JobQueueGroup.php#L108).
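Roughly, the backend for a job type is resolved from the local wiki's globals rather than the target wiki's. A simplified sketch of that lookup (not the exact JobQueueGroup::get() code; the function name is a placeholder):

```php
<?php
// Simplified sketch of the lookup described above (not the exact core
// code): regardless of which wiki the queue group targets, the backend
// spec is taken from the local wiki's $wgJobTypeConf.
function resolveQueueConfSketch( string $targetWiki, string $type ): array {
	global $wgJobTypeConf;

	$conf = [ 'wiki' => $targetWiki, 'type' => $type ];
	// The lookup below reads the *local* globals, so when mediawikiwiki
	// schedules LocalRenameUserJob for metawiki, it still gets
	// mediawikiwiki's (Kafka/EventBus) backend, not metawiki's.
	$conf += $wgJobTypeConf[$type] ?? $wgJobTypeConf['default'];

	return $conf; // in core this is then passed to JobQueue::factory()
}
```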
> We could either revert switching the LocalUserRenameJob right now (that will obviously fix the immediate problem, but we will still need to switch it at some point) or we can just push through and switch the job for all the wikis.
Given that this is a global job, and we already know it can be executed for mediawikiwiki, I would be in favour of switching it on for all wikis.
Change 429835 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Switch LocalRenameUserJob for all wikis.
Change 429836 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Switch LocalRenameUserJob to kafka.
Change 429835 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Switch LocalRenameUserJob for all wikis.
Change 429836 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch LocalRenameUserJob to kafka.
Mentioned in SAL (#wikimedia-operations) [2018-04-30T16:49:39Z] <ppchelko@tin> Started deploy [cpjobqueue/deploy@01630f2]: Switch LocalRenameUserJob for all wikis. T193254
Mentioned in SAL (#wikimedia-operations) [2018-04-30T16:50:28Z] <ppchelko@tin> Finished deploy [cpjobqueue/deploy@01630f2]: Switch LocalRenameUserJob for all wikis. T193254 (duration: 00m 49s)
Mentioned in SAL (#wikimedia-operations) [2018-04-30T16:50:32Z] <mobrovac@tin> Synchronized wmf-config/jobqueue.php: Switch LocalRenameUserJob to EventBus for all wikis - T193254 T190327 (duration: 00m 59s)
We have switched the LocalRenameUserJob for all wikis to EventBus, so we don't anticipate any problems going forward (given that we already know the jobs work for mediawikiwiki). Resolving; feel free to reopen if the problem persists.
@mobrovac
I think it is still the same!
See here: the process completed quickly on all projects, but it is still in (Queued) status at Meta wiki!
As you know, we can't call this resolved until we are sure, because there are many pending requests; if we tell all global renamers that the issue is solved, there will be a large number of rename processes in the log!
> As you know, we can't call this resolved until we are sure, because there are many pending requests; if we tell all global renamers that the issue is solved, there will be a large number of rename processes in the log!
The one you were referring to was requested before the issue was fixed. Re-enqueueing the message made it go away. Should all be good now.
Do we need to migrate CentralAuthRename too? If so, can it be done? Thanks.
Eventually everything will be migrated. Are you seeing problems with that job right now?
That's a log channel, not a job queue. Other potentially affected jobs are LocalUserMergeJob (not sure if Wikimedia wikis still allow merges) and LocalPageMoveJob (I think that's triggered differently, not quite sure though).
Other instances of cross-wiki job scheduling that are yielded by a quick ack 'JobQueueGroup::singleton\( ': Cognate/LocalJobSubmitJob, MassMessage/MassMessageSubmitJob, GlobalUsage/GlobalUsageCachePurgeJob, GlobalUserPage/LocalJobSubmitJob, SecurePoll/PopulateVoterListJob. There are probably more (e.g. global blocking works that way IIRC). So we should probably get the core bug fixed.
> Other instances of cross-wiki job scheduling that are yielded by a quick ack 'JobQueueGroup::singleton\( ': Cognate/LocalJobSubmitJob, MassMessage/MassMessageSubmitJob, GlobalUsage/GlobalUsageCachePurgeJob, GlobalUserPage/LocalJobSubmitJob, SecurePoll/PopulateVoterListJob.
This problem only occurs when the job is partially switched, which we do for the gradual rollout of new jobs, but ironically, in this case, being cautious actually breaks things.
Most of the jobs from the list are already switched for everything; the two missing ones are MassMessage/MassMessageSubmitJob and SecurePoll/PopulateVoterListJob. We should either exclude them from the transition until the core bug is fixed, or switch them for all wikis right away.
I filed T193471: JobQueueGroup's singletons using the wrong wgJobTypeConf for this.
I think we should go ahead and switch these two for all wikis at this point, given that we might be losing their executions without even knowing it.
Not likely; Beta is on a completely different system, so we'll have to take a look at that one separately.
Thanks!
> I think we should go ahead and switch these two for all wikis at this point, given that we might be losing their executions without even knowing it.
Note there might be other affected jobs too, e.g. DeferredUpdates can happen cross-wiki, I haven't tried to follow those calls.