Unblock stuck global rename of multiple users
Closed, Resolved, Public

Description

All are automatic Account-Vanishing.

  • Hugo 1964 → Vanished user 48ef63447df9847490d3b95672d70d52 (view progress)
  • Marc pertrie → Vanished user d5ae8440bee1b4d6a87c3db28ce4abca (view progress)
  • Mistmorn → Vanished user 466a020c463f9ad3042f9015496d70a2 (view progress)
  • SMayer2006 → Vanished user 7efbc8977ac323ab26bc0e38c1ae4d4c (view progress)
  • Vuskaioos → Vanished user 0012c98bfb844675ea15549c41b8818b (view progress)
  • Zerdeshtroj21 → Vanished user d2ed26f2d879b3a5bd87e18fed465b82 (view progress)
  • Μυστικος → Vanished user 5482e0d57966b21e4523eebd9a17965d (view progress)

Event Timeline

JJMC89 renamed this task from Unblock stuck global rename of 6 users to Unblock stuck global rename of 7 users. Aug 7 2024, 8:10 PM
JJMC89 moved this task from Backlog to WMF Prod on the Wikimedia-maintenance-script-run board.

There are now 16 requests; it looks like all the automatic vanish requests are getting stuck.

50 requests now. @Seddon: Please look into this.

Seddon renamed this task from Unblock stuck global rename of 7 users to Unblock stuck global rename of multiple users. Aug 10 2024, 12:25 PM

Just noting that we are looking into this, along with another ticket. The cause is most likely the changes that went out as part of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1056208, which included fixes for T370841. We have retested the changes in local dev environments and no issues show up there, so this is a bug that only appears in production. It might be a job queue issue. Alternatively, given that this went out on the train this week and the global rename fixes that went through seem to be working, it is possible that the discordance between job queue versions created the issue.

@Tgr is going to attempt to clear out some of the backlog and see if the stuck rename script might help here. However, with Wikimania, a WMF holiday, and a weekend, a fix won't be in place until the beginning of the week. Will follow up when we know more.

Fixed the older stuck renames using a script generated with

tgr@mwmaint1002:~$ sql centralauth -- --silent --raw > fix.sh <<EOF
select CONCAT( "mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=", ru_wiki, " --ignorestatus --logwiki=metawiki '",  ru_oldname, "' '", ru_newname,  "'\nsleep 120") from renameuser_status join renameuser_queue on ru_oldname = rq_name and ru_newname = rq_newname where ru_status = 'inprogress' and rq_completed_ts < '20240810000000';
EOF

(The sleep is probably not needed, but it has been a long time since we tried to mass-fix many stuck renames simultaneously, so it is there just in case.)
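For readers unfamiliar with the query above, here is a minimal sketch of what each row of the generated fix.sh should look like. It mirrors the CONCAT() expression with a plain shell function. The function name `emit_fix` and the wiki value `enwiki` are hypothetical and used only for illustration; the username pair is the first one from the task description.

```shell
#!/bin/sh
# Sketch only: reproduces the shape of one fix.sh entry produced by the
# CONCAT() in the query above. emit_fix is a hypothetical helper, not part
# of CentralAuth or the maintenance host tooling.
emit_fix() {
  local wiki="$1" old="$2" new="$3"
  # One mwscript invocation followed by a pause, exactly as the query builds it.
  printf "mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=%s --ignorestatus --logwiki=metawiki '%s' '%s'\nsleep 120\n" "$wiki" "$old" "$new"
}

# "enwiki" is an assumed example value; the real value comes from ru_wiki.
emit_fix enwiki 'Hugo 1964' 'Vanished user 48ef63447df9847490d3b95672d70d52'
```

Each stuck row therefore yields two lines in fix.sh: the fixStuckGlobalRename.php call and a `sleep 120`. Note that names are wrapped in plain single quotes, so a username that itself contains a single quote would break the generated command; neither the query nor this sketch handles that case.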

New ones are still coming in, so something in the vanish script will have to be fixed before this can be fully cleaned up.

> it is possible that the discordance between job queue versions created the issue.

I think we can exclude that given the timing: version mismatch problems can only happen between Tuesday and Thursday, and the last stuck request (AdityaDayal79) was made at 8 AM UTC today.

It looks like the order of operations for the autovanishing is off in some way.

Skipping duplicate rename from {oldName} to {newName}

{
  "_index": "logstash-mediawiki-1-7.0.0-1-2024.08.11",
  "_id": "Ol3IQ5EBDD1VxFBSiD-B",
  "_version": 1,
  "_score": null,
  "_source": {
    "kubernetes": {
      "host": "kubernetes1045.eqiad.wmnet",
      "pod_name": "mw-jobrunner.eqiad.main-56fcb5fc44-zkcjd",
      "labels": {
        "deployment": "mw-jobrunner",
        "release": "main"
      },
      "namespace_name": "mw-jobrunner"
    },
    "level": "INFO",
    "type": "mediawiki",
    "referrer": null,
    "wiki": "commonswiki",
    "newName": "Vanished user 9b150274bb0473dd0f71b592c0e8c0e2",
    "timestamp": "2024-08-11T23:31:40+00:00",
    "@version": 1,
    "shard": "s4",
    "facility": "user",
    "logsource": "mw-jobrunner.eqiad.main-56fcb5fc44-zkcjd",
    "message": "Skipping duplicate rename from 69Odoyle69 to Vanished user 9b150274bb0473dd0f71b592c0e8c0e2",
    "mwversion": "1.43.0-wmf.17",
    "server": "mw-jobrunner.discovery.wmnet",
    "phpversion": "7.4.33",
    "host": "mw-jobrunner.eqiad.main-56fcb5fc44-zkcjd",
    "reqId": "86fbc264-022b-4ff4-b315-628a2d763d32",
    "component": "GlobalRename",
    "servergroup": "kube-mw-jobrunner",
    "oldName": "69Odoyle69",
    "tags": [
      "input-kafka-rsyslog-udp-localhost",
      "rsyslog-udp-localhost",
      "kafka",
      "es",
      "es"
    ],
    "channel": "CentralAuth",
    "normalized_message": "Skipping duplicate rename from {oldName} to {newName}",
    "@timestamp": "2024-08-11T23:31:40.380Z",
    "http_method": "POST",
    "status": "inprogress",
    "program": "mediawiki",
    "monolog_level": 200,
    "severity": "info",
    "url": "/rpc/RunSingleJob.php"
  },
  "fields": {
    "@timestamp": [
      "2024-08-11T23:31:40.380Z"
    ]
  },
  "highlight": {
    "channel.keyword": [
      "@opensearch-dashboards-highlighted-field@CentralAuth@/opensearch-dashboards-highlighted-field@"
    ]
  },
  "sort": [
    1723419100380
  ]
}

Change #1062769 had a related patch set uploaded (by Amdrel; author: Amdrel):

[mediawiki/extensions/CentralAuth@master] Save the request before starting the automatic vanish job

https://gerrit.wikimedia.org/r/1062769

Change #1062769 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@master] Save the request before starting the automatic vanish job

https://gerrit.wikimedia.org/r/1062769

The fix was deployed to the beta cluster, where I could reproduce the issue, and the patch seems to fix the problem.

We think the issue was caused by the rename function expecting a string when it was in fact receiving a message that was being added directly to the array.

This caused issues further downstream and likely resulted in the job failing.

I'm going to backport to production tomorrow. @Tgr, once that is done, could we run a script later in the day or on Friday to clean up the backlog that has built up through the week?

Change #1062996 had a related patch set uploaded (by Seddon; author: Amdrel):

[mediawiki/extensions/CentralAuth@wmf/1.43.0-wmf.18] Save the request before starting the automatic vanish job

https://gerrit.wikimedia.org/r/1062996

Change #1062996 merged by jenkins-bot:

[mediawiki/extensions/CentralAuth@wmf/1.43.0-wmf.18] Save the request before starting the automatic vanish job

https://gerrit.wikimedia.org/r/1062996

Mentioned in SAL (#wikimedia-operations) [2024-08-15T13:15:58Z] <logmsgbot> lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1062996|Save the request before starting the automatic vanish job (T372006)]]

Mentioned in SAL (#wikimedia-operations) [2024-08-15T13:40:31Z] <logmsgbot> lucaswerkmeister-wmde@deploy1003 seddon, lucaswerkmeister-wmde: Backport for [[gerrit:1062996|Save the request before starting the automatic vanish job (T372006)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-08-15T13:50:54Z] <logmsgbot> lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1062996|Save the request before starting the automatic vanish job (T372006)]] (duration: 34m 44s)

Mentioned in SAL (#wikimedia-operations) [2024-08-15T18:54:47Z] <tgr|away> running global rename cleanup script per T372006#10055573