GlobalRename gets stuck sometimes
Closed, Resolved, Public

Assigned To
Authored By
Tgr
Jun 16 2016, 4:01 PM
Referenced Files
F4261660: 11.jpg
Jul 11 2016, 12:12 PM
Tokens
"The World Burns" token, awarded by Cyberpower678, Liuxinyu970226, MarcoAurelio, Poyekhali, BethNaught, Vituzzu, and Linedwell.

Description

See for example https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Ratte which is stuck on several wikis. (Nuevo Paso might be another example.)
Most global renames seem to work though.
I found no relevant log entries in logstash. (There are a bunch of unresolvable Hausratte@<wiki> errors, which are probably just a consequence of CentralAuth attempting an account migration some time after the rename errors and failing on the non-renamed wikis.)

Event Timeline

HELLO? Is anyone working on this? It's pretty annoying that no one seems to care to quickly fix this... I apologize if I'm mistaken, but it certainly looks this way.

Nobody is currently assigned to solve this task.

@Aklapper Could you try to find someone who knows how the global rename code works, so we can devise a script to manually complete these stuck renames?

@Legoktm, @Tgr, @Anomie: Do any of you know who could look into this (and whether this is the same as T135656)? Thanks in advance!

Last I heard, @Tgr and @Legoktm talked about this at Wikimania and Lego had a plan of some sort.

This and T135656 are probably referring to the same issue, at least at the moment. This one is so vague and the other has had several iterations of similar problems under its banner which makes it hard to be more specific.

Can the stuck renames at least be resolved? Those users aren't able to log in until their renames are completed. We've issued warnings to the global renamers and stewards to avoid further renames until this is fixed. Thank you.

As Anomie noted earlier, what probably happens is that (when a user has accounts on many wikis) lots of rename jobs are started at the same time, each job tries to reset the central token, some of them fail, and leave the user in some state where automatically reattempting the rename does not work. (At a guess, there is a CAS error when invalidateSessionsForUser is called, which causes the CentralAuthUser to not be saved, which causes the user save at the very end of RenameUserSQL::rename to not be a no-op for CentralAuthHooks::onUserSaveSettings, and there is a lock wait timeout at that point; by then the user is fully renamed, so the next rename attempt will not find it.) Since the rename is marked as "in process", MediaWiki refuses to log the user in.

We discussed this at the hackathon and the easiest fix is to make the jobs run sequentially, instead of in parallel: instead of scheduling all the jobs at start, just have one rename job schedule the next one. That will make renames slower for users with many accounts, but as I understand that's not considered a big problem.
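
The chaining idea can be sketched roughly as follows. This is a toy model in Python, not the CentralAuth code: `rename_on_wiki`, `run_sequential_rename`, and the wiki list are hypothetical names made up for illustration. Only the first job is scheduled up front, and each finished job enqueues its successor, so there is never more than one pending rename job per user.

```python
from collections import deque


def rename_on_wiki(wiki, old, new, log):
    # Hypothetical stand-in for the per-wiki rename work.
    log.append(f"{wiki}: {old} -> {new}")


def run_sequential_rename(wikis, old, new):
    """Run one rename job at a time; each job schedules its successor."""
    log = []
    queue = deque()
    remaining = list(wikis)
    if remaining:
        queue.append(remaining.pop(0))  # only the first job is scheduled up front
    max_pending = 0
    while queue:
        max_pending = max(max_pending, len(queue))
        wiki = queue.popleft()
        rename_on_wiki(wiki, old, new, log)
        if remaining:
            queue.append(remaining.pop(0))  # the finished job schedules the next one
    return log, max_pending


log, max_pending = run_sequential_rename(
    ["loginwiki", "metawiki", "jawiki"], "Hausratte", "Ratte"
)
```

The trade-off is the one described above: the total time grows with the number of wikis, but two jobs for the same user can no longer compete over the central token.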

(The more complex alternative would be to ensure that the user is not logged in on the wiki without global token resets, e.g. by breaking the local session and then having CentralAuthSessionProvider check some sort of blacklist.)

Change 297537 had a related patch set uploaded (by Gergő Tisza):
Make LocalRename jobs run sequentially

https://gerrit.wikimedia.org/r/297537

Five users are stuck:

mysql:wikiadmin@db1079 [centralauth]> select ru_oldname, ru_newname, count(*) from renameuser_status group by ru_oldname, ru_newname;
+------------------------------------+------------------+----------+
| ru_oldname                         | ru_newname       | count(*) |
+------------------------------------+------------------+----------+
| Acee8                              | Nuevo Paso       |      103 |
| Hausratte                          | Ratte            |       58 |
| Markouzki                          | Quijx            |        1 |
| Михаил Марчук                      | Tot Samyj Niekto |       13 |
| बिप्लब आनन्द                           | Biplab Anand     |      258 |
+------------------------------------+------------------+----------+

Markouzki was actually moved but the status got stuck; fixed that. The other four are moved globally and on some wikis but not on many others. Not sure how to deal with that; is that what forceRenameUsers.php is for?
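
For anyone who wants to see what the query above reports, here is a small self-contained sketch using Python's sqlite3 with an in-memory table. The `ru_wiki` column and the sample rows are assumptions for illustration only, not the production schema or real data; the point is that each remaining row is one wiki where the rename has not finished, so the count per old/new name pair is the number of wikis still pending.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal shape of the status table: one row per user and pending wiki.
conn.execute(
    "CREATE TABLE renameuser_status (ru_oldname TEXT, ru_newname TEXT, ru_wiki TEXT)"
)
# Illustrative rows only; the wiki names are made up for this sketch.
rows = [
    ("Hausratte", "Ratte", "dewiki"),
    ("Hausratte", "Ratte", "enwiki"),
    ("Markouzki", "Quijx", "metawiki"),
]
conn.executemany("INSERT INTO renameuser_status VALUES (?, ?, ?)", rows)

# Same grouping as the query above, with an explicit ORDER BY for stable output.
stuck = conn.execute(
    "SELECT ru_oldname, ru_newname, COUNT(*) FROM renameuser_status "
    "GROUP BY ru_oldname, ru_newname ORDER BY ru_oldname"
).fetchall()
```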

The script uses the users_to_rename table, according to the commit message of 42c451c.
It's a table used during SUL migration, see T73924.

I understand there is a patch for review. Will this patch fix the renames getting stuck problem, or simply get the current renames unstuck?

As far as I can see, https://gerrit.wikimedia.org/r/297537 fixes the rename problem but not the already blocked accounts. After the patch is merged, can we start renaming users again?

I would very much love to. I dedicate a few hours to review the entire queue.

Change 297537 merged by jenkins-bot:
Make LocalRename jobs run sequentially

https://gerrit.wikimedia.org/r/297537

Cool with it being merged, when will it be deployed?

It'll be in .10 unless backported

Got a deployment schedule handy? :p

Should be everywhere by the 14th.

Starting to be deployed on the 12th

(Y)

Change 297697 had a related patch set uploaded (by Legoktm):
Make LocalRename jobs run sequentially

https://gerrit.wikimedia.org/r/297697

Change 297698 had a related patch set uploaded (by Legoktm):
Make LocalRename jobs run sequentially

https://gerrit.wikimedia.org/r/297698

Change 297697 merged by jenkins-bot:
Make LocalRename jobs run sequentially

https://gerrit.wikimedia.org/r/297697

Change 297698 merged by jenkins-bot:
Make LocalRename jobs run sequentially

https://gerrit.wikimedia.org/r/297698

Mentioned in SAL [2016-07-07T00:03:29Z] <legoktm@tin> Synchronized php-1.28.0-wmf.8/extensions/CentralAuth/: Make LocalRename jobs run sequentially - T137973 (duration: 00m 34s)

Mentioned in SAL [2016-07-07T00:05:17Z] <legoktm@tin> Synchronized php-1.28.0-wmf.9/extensions/CentralAuth/: Make LocalRename jobs run sequentially - T137973 (duration: 00m 30s)

Mentioned in SAL [2016-07-07T00:06:50Z] <legoktm@tin> Synchronized php-1.28.0-wmf.8/extensions/CentralAuth/: Make LocalRename jobs run sequentially - T137973 (for real this time) (duration: 00m 30s)

Legoktm assigned this task to Tgr.

Email sent to the global-renamers list:

Hi,

Tgr and Anomie (send cookies and thanks their way!) worked on a patch to fix global rename by making the jobs for each wiki run one at a time instead of in parallel. Some testing on renames today showed that the fix is working and there haven't been any issues yet.

Global rename is now significantly slower - you'll notice that one wiki goes at a time and will be processed in order.

For now, please only start one rename at a time. Keep an eye on the overall Special:GlobalRenameProgress and make sure there aren't more than 10 other renames currently running. I know there is a backlog of rename requests, but let's not clear the entire queue at once. :-)

-- Kunal

It is happening again. All of these nine cases got stuck.

Ami Ruse → Jack Dobson
Benjaminekman → Benjekman
Elayamir → TheGodfather85
Fra150190 → Superpes15
Hausratte → Ratte
MahdiEynian → MikeEcho
SamLikesPlanes → Substellar
Sarybe → Tommy377
Yareth Sarmiento → Khloe S Castro

I tried to understand what is wrong. For example, in the case of MikeEcho, it finished loginwiki but never got to start mediawikiwiki. I saved the logs for this case in logstash.

Finally I am able to log in now. Thanks @Tgr and @Legoktm

Yes, but now there are other poor users stuck in limbo. Looks like the problem has gotten worse.

Yes, definitely. So before we proceed with any new rename, we have to check the progress first.

Looks like the serializing of the jobs isn't quite working. For example,

runJobs.log
2016-07-07 04:27:53 [V33Z4QpAEKsAABC3WScAAABY] mw1167 jawiki 1.28.0-wmf.8 runJobs DEBUG: LocalRenameUserJob Global_rename_job from=MahdiEynian to=MikeEcho renamer=Ladsgroup movepages=1 suppressredirects= promotetoglobal= reason=per [[m:Special:GlobalRenameQueue/request/25140|request]] session={...} force= requestId=V33Z4QpAEKsAABC3WScAAABY (uuid=beaf29220639406c8b2bb2bb16bbada5,timestamp=1467865670,QueuePartition=rdb3-6380) STARTING
2016-07-07 04:27:54 [V33Z4QpAEKsAABC3WScAAABY] mw1165 loginwiki 1.28.0-wmf.9 runJobs DEBUG: LocalRenameUserJob Global_rename_job from=MahdiEynian to=MikeEcho renamer=Ladsgroup movepages=1 suppressredirects= promotetoglobal= reason=per [[m:Special:GlobalRenameQueue/request/25140|request]] session={...} force= requestId=V33Z4QpAEKsAABC3WScAAABY (uuid=dabadab2fcfd408687ee498edfdde3ef,timestamp=1467865673,QueuePartition=rdb1-6380) STARTING
2016-07-07 04:27:54 [V33Z4QpAEKsAABC3WScAAABY] mw1167 jawiki 1.28.0-wmf.8 runJobs INFO: LocalRenameUserJob Global_rename_job from=MahdiEynian to=MikeEcho renamer=Ladsgroup movepages=1 suppressredirects= promotetoglobal= reason=per [[m:Special:GlobalRenameQueue/request/25140|request]] session={...} force= requestId=V33Z4QpAEKsAABC3WScAAABY (uuid=beaf29220639406c8b2bb2bb16bbada5,timestamp=1467865670,QueuePartition=rdb3-6380) t=240 good
2016-07-07 04:27:54 [V33Z4QpAEKsAABC3WScAAABY] mw1167 jawiki 1.28.0-wmf.8 runJobs DEBUG: LocalRenameUserJob Global_rename_job from=MahdiEynian to=MikeEcho renamer=Ladsgroup movepages=1 suppressredirects= promotetoglobal= reason=per [[m:Special:GlobalRenameQueue/request/25140|request]] session={...} force= requestId=V33Z4QpAEKsAABC3WScAAABY (uuid=19ea2f7808d74ba0a05b01353cc39fd8,timestamp=1467865674,QueuePartition=rdb1-6381) STARTING
2016-07-07 04:27:54 [V33Z4QpAEKsAABC3WScAAABY] mw1167 jawiki 1.28.0-wmf.8 runJobs INFO: LocalRenameUserJob Global_rename_job from=MahdiEynian to=MikeEcho renamer=Ladsgroup movepages=1 suppressredirects= promotetoglobal= reason=per [[m:Special:GlobalRenameQueue/request/25140|request]] session={...} force= requestId=V33Z4QpAEKsAABC3WScAAABY (uuid=19ea2f7808d74ba0a05b01353cc39fd8,timestamp=1467865674,QueuePartition=rdb1-6381) t=15 good
2016-07-07 04:27:54 [V33Z4QpAEKsAABC3WScAAABY] mw1165 loginwiki 1.28.0-wmf.9 runJobs INFO: LocalRenameUserJob Global_rename_job from=MahdiEynian to=MikeEcho renamer=Ladsgroup movepages=1 suppressredirects= promotetoglobal= reason=per [[m:Special:GlobalRenameQueue/request/25140|request]] session={...} force= requestId=V33Z4QpAEKsAABC3WScAAABY (uuid=dabadab2fcfd408687ee498edfdde3ef,timestamp=1467865673,QueuePartition=rdb1-6380) t=132 good

It looks like what might be happening is this:

  1. jawiki job auto-starts a DB transaction thanks to DBO_DEFAULT/DBO_TRX.
  2. jawiki rename finishes.
  3. jawiki job schedules the next wiki, loginwiki.
  4. loginwiki job is started.
  5. loginwiki rename finishes. I note it logged a CAS error, probably because the jawiki job didn't commit its DB writes yet.
  6. loginwiki job schedules the next wiki. Since the jawiki job hasn't committed its transaction yet, it doesn't see that jawiki is marked as "done" so it schedules for jawiki.
  7. First jawiki job commits its transaction.
  8. Second jawiki job runs, finds that jawiki has already been done, and bails out.
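
The sequence above can be replayed with a toy model. This is only a sketch under stated assumptions, not the CentralAuth code: `StatusStore` and the job names are hypothetical, each job sees committed rows plus its own uncommitted writes, and a job that finds its wiki already done is assumed to bail out without scheduling a successor.

```python
class StatusStore:
    """Toy status table where each job has its own uncommitted transaction."""

    def __init__(self, wikis):
        self.committed = {w: "queued" for w in wikis}
        self.pending = {}  # job name -> {wiki: status}, not yet committed

    def write(self, job, wiki, status):
        self.pending.setdefault(job, {})[wiki] = status

    def commit(self, job):
        self.committed.update(self.pending.pop(job, {}))

    def next_wiki(self, job):
        # A job sees committed rows plus its own uncommitted writes only.
        view = {**self.committed, **self.pending.get(job, {})}
        for wiki, status in view.items():
            if status != "done":
                return wiki
        return None


store = StatusStore(["jawiki", "loginwiki", "mediawikiwiki"])
scheduled = []

# Steps 1-3: the jawiki job marks itself done inside an open transaction
# and schedules the next wiki.
store.write("job-jawiki", "jawiki", "done")
scheduled.append(store.next_wiki("job-jawiki"))

# Steps 4-6: the loginwiki job runs before that transaction commits,
# still sees jawiki as not done, and schedules jawiki again.
store.write("job-loginwiki", "loginwiki", "done")
scheduled.append(store.next_wiki("job-loginwiki"))

# Step 7: the transactions commit.
store.commit("job-jawiki")
store.commit("job-loginwiki")

# Step 8: the second jawiki job finds jawiki already done and bails out
# (assumed here: without scheduling a successor), so the chain ends.
stuck = [w for w, s in store.committed.items() if s != "done"]
```

In this model the chain ends with mediawikiwiki never scheduled, which resembles the MikeEcho case where loginwiki finished but mediawikiwiki never started.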

Change 297817 had a related patch set uploaded (by Anomie):
Fix job serializing (and status display on Special:GlobalRenameProgress)

https://gerrit.wikimedia.org/r/297817

Change 297817 merged by jenkins-bot:
Fix job serializing (and status display on Special:GlobalRenameProgress)

https://gerrit.wikimedia.org/r/297817

Change 297941 had a related patch set uploaded (by Legoktm):
Fix job serializing (and status display on Special:GlobalRenameProgress)

https://gerrit.wikimedia.org/r/297941

I have a problem once again :) More than 256 of my local accounts are not attached :)
https://commons.wikimedia.org/w/index.php?title=Special%3ACentralAuth&target=Biplab+Anand

Use Special:MergeAccount.

Yes, I did, but merging failed.

What is the error message?

Mentioned in SAL [2016-07-08T18:52:01Z] <anomie> Attempting to resubmit LocalRenameUserJobs for T137973

Oh... I didn't pay close enough attention, the backport was submitted as https://gerrit.wikimedia.org/r/#/c/297941/ but not actually merged and backported, so things might still fail as they did yesterday.

Change 297941 merged by jenkins-bot:
Fix job serializing (and status display on Special:GlobalRenameProgress)

https://gerrit.wikimedia.org/r/297941

Mentioned in SAL [2016-07-08T22:02:49Z] <legoktm@tin> Synchronized php-1.28.0-wmf.9/extensions/CentralAuth/: Fix job serializing (and status display on Special:GlobalRenameProgress) - T137973 (duration: 00m 32s)

Here you can see @Pokefan95

11.jpg (998×995 px, 190 KB)

Can we start renaming again? Recent renames seem to be going through, but I'm not getting any "official" confirmation that the problem is fixed for sure.

Update from IRC:

legoktm set the topic: (...) | Status: <10 concurrent renames plz

Which means not more than 10 renames at https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress

Apart from that, we did a lot of renames today (~300) and all were successful. (Except one: 25356 is broken, but that is likely unrelated to this bug.)

Did you type the password that you are using for your Wikimedia account? If yes, was it successful? If not, then try using password reset for all sites that failed.

@Pokefan95 wrote:
If not, then try using password reset for all sites that failed.

This won't work, the CA seems broken. A tech has to look into it. This is not a standard problem, you won't be able to help him.

Change 299887 had a related patch set uploaded (by Gergő Tisza):
Make LocalRename jobs run sequentially

https://gerrit.wikimedia.org/r/299887

Change 299887 merged by jenkins-bot:
Make LocalRename jobs run sequentially

https://gerrit.wikimedia.org/r/299887

Change 299899 had a related patch set uploaded (by Gergő Tisza):
Fix job serializing (and status display on Special:GlobalRenameProgress)

https://gerrit.wikimedia.org/r/299899

Change 299899 merged by jenkins-bot:
Fix job serializing (and status display on Special:GlobalRenameProgress)

https://gerrit.wikimedia.org/r/299899

To manually fix a blocked rename, one can run:

mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php

This has to be run for each of the wikis.

Thanks for the tip @hashar. I guess people can ping me (or lots of other people with access to terbium) to do it in case this happens again.

May I suggest creating a script that doesn't need to be run separately on each wiki where the rename gets stuck? I think it might be a bit tedious for you all to have to run the script 10 times if the account becomes stuck on ten sites.
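
Until such a script exists, a small wrapper could at least generate the per-wiki command lines. The following Python sketch only prints the commands and does not run them. The `--wiki` option and the positional old/new names are assumptions about how the maintenance script is invoked, so check its `--help` output before using anything like this; `build_commands` and the example wiki list are hypothetical.

```python
import shlex

SCRIPT = "extensions/CentralAuth/maintenance/fixStuckGlobalRename.php"


def build_commands(wikis, old_name, new_name):
    """Return one mwscript command line per stuck wiki."""
    commands = []
    for wiki in wikis:
        # Assumption: the script accepts --wiki plus the old and new names as
        # positional arguments; verify against the script's --help first.
        argv = ["mwscript", SCRIPT, f"--wiki={wiki}", old_name, new_name]
        commands.append(shlex.join(argv))  # quotes names containing spaces
    return commands


for line in build_commands(["jawiki", "mediawikiwiki"], "MahdiEynian", "MikeEcho"):
    print(line)
```

Printing instead of executing keeps a human in the loop, which seems sensible for something that touches user accounts.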