Page MenuHomePhabricator

gerrit-replica seems out of date breaking extension distributor
Closed, ResolvedPublicBUG REPORT

Description

The ConfirmAccount master branch seems quite out of date on gerrit-replica.wikimedia.org. This is breaking tools like extension distributor. I can only assume other extensions are also out of date.

if we do git log origin/master from a clone of https://gerrit-replica.wikimedia.org/r/mediawiki/extensions/ConfirmAccount we get

commit e0953d7cea7cb14be52ae3aca76a3105bbfa95d8 (HEAD -> master, origin/master, origin/HEAD)
Author: Translation updater bot <l10n-bot@translatewiki.net>
Date:   Fri Apr 14 09:26:08 2023 +0200

    Localisation updates from https://translatewiki.net.
    
    Change-Id: I90f8ffe6bb8e74b51b68c6dd30d70f6731579ebd

commit 9a3686de3330dd63fe58817d974a60284ca7ca28
Author: Translation updater bot <l10n-bot@translatewiki.net>
Date:   Mon Apr 10 09:20:38 2023 +0200

    Localisation updates from https://translatewiki.net.
    
    Change-Id: I8d272ba54bcbad7da771ce9ff3376fa38493cd2f

Compare this to gerrit.wikimedia.org, which has 2 additional commits from apr 25 and may 1:

commit 06490a58bd4b8d996ba1ef7f87dd64bd5ccfc83c (HEAD -> master, origin/master, origin/HEAD)
Author: Translation updater bot <l10n-bot@translatewiki.net>
Date:   Mon May 1 09:57:26 2023 +0200

    Localisation updates from https://translatewiki.net.
    
    Change-Id: I05333972389d42710160698ec6f74965dfa2f54d

commit 242418c7f4bf2d9408f971e4a3173247b236405a
Author: Translation updater bot <l10n-bot@translatewiki.net>
Date:   Tue Apr 25 08:47:44 2023 +0200

    Localisation updates from https://translatewiki.net.
    
    Change-Id: Ic63caeb9bed3fa22582e8b0968b978b884e76970

commit e0953d7cea7cb14be52ae3aca76a3105bbfa95d8
Author: Translation updater bot <l10n-bot@translatewiki.net>
Date:   Fri Apr 14 09:26:08 2023 +0200

    Localisation updates from https://translatewiki.net.
    
    Change-Id: I90f8ffe6bb8e74b51b68c6dd30d70f6731579ebd

Originally reported at https://www.mediawiki.org/wiki/Topic:Xhc1dadw1r9twn44

Same issue witnessed with mediawiki/core (originally reported on T335995):

I cloned https://gerrit-replica.wikimedia.org/r/mediawiki/core and noticed that the latest commit on its master branch is rMWb11c3dcc1374b459c40d793944feb0226ce6050c, not rMW8e9a308d3c531ff25c0b4fd17e38265dad6f5864.

You can see this in rMW, which mirrors the replica. Notably, rMW8e9a308d3c531ff25c0b4fd17e38265dad6f5864 is known to rMW, suggesting that replication broke sometime after 3:46 PM.

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2023-05-02T07:22:26Z] <bawolff> Extension distributor having outdated links due to T335730

hashar subscribed.

Browsing GitHub https://github.com/wikimedia/mediawiki-extensions-ConfirmAccount gives me https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ConfirmAccount/+/913855

For the Gerrit replica (which is on gerrit2002.wikimedia.org) from the replication log on the primary I see it is in status NOT_ATTEMPTED. No idea why, I will dig ;)

[2023-05-01 07:57:33,471] Push to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/mediawiki/extensions/ConfirmAccount.git references:
RemoteRefUpdate{
  refSpec=refs/changes/55/913855/1:refs/changes/55/913855/1
  status=NOT_ATTEMPTED
  id=(null)..AnyObjectId[06490a58bd4b8d996ba1ef7f87dd64bd5ccfc83c]
  force=yes
  delete=no
  ffwd=no
},
RemoteRefUpdate{
  refSpec=refs/changes/55/913855/meta:refs/changes/55/913855/meta
  status=NOT_ATTEMPTED
  id=(null)..AnyObjectId[8d46ebffe6ff94845201e13298164a6a6aa03a1a]
  force=yes
  delete=no
  ffwd=no
}
[CONTEXT
  PLUGIN="replication"
  RECEIVE_ID="mediawiki/extensions/ConfirmAccount-1682927848386-9afcdb8a"
  project="mediawiki/extensions/ConfirmAccount"
  pushOneId="e8208860"
  request="GIT_RECEIVE"
]

I have triggered the replication manually:

$ ssh -p 29418 gerrit.wikimedia.org replication start mediawiki/extensions/ConfirmAccount --now --wait
Replicate mediawiki/extensions/ConfirmAccount ref ..all.. to github.com, Succeeded! (OK)
Replicate mediawiki/extensions/ConfirmAccount ref ..all.. to gerrit2002.wikimedia.org, Succeeded! (OK)
Replication of mediawiki/extensions/ConfirmAccount ref ..all.. completed to 2 nodes, 
----------------------------------------------
Replication completed successfully!

And I can now see change 913855 on the replica:

$ git ls-remote https://gerrit-replica.wikimedia.org/r/mediawiki/extensions/ConfirmAccount|grep refs/changes/55/913855
06490a58bd4b8d996ba1ef7f87dd64bd5ccfc83c	refs/changes/55/913855/1
4f5ef541f317f9886af7eeed9a34a5897526f263	refs/changes/55/913855/meta

Then I suspect various repositories to be affected by the same issue.

After yesterday change got merged, I do get a successful replication message:

[2023-05-01 08:01:18,705] Replication to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/mediawiki/extensions/ConfirmAccount.git completed in 1488ms, 5003ms delay, 0 retries [CONTEXT PLUGIN="replication" pushOneId="39a2b4ab" request="SSH" ]

I am afraid that one will remain a mystery. Please reopen if you spot another affected repo and I will dig further.

Is T335995 similar? I can't tell if here the repository suddenly travelled back in time like what happened to mediawiki/core in T335995.

T335995 is the same issue, as to why, that is a good question.

Gerrit has been up and running since March 29th

● gerrit.service - Gerrit code review tool
   Loaded: loaded (/lib/systemd/system/gerrit.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2023-03-29 06:46:31 UTC; 1 months 6 days ago

April 4th had the last deployment https://gerrit.wikimedia.org/r/c/operations/software/gerrit/+/905708, which causes all plugins to be reloaded and might have broken the replication plugin. Then we had replications happening after that so I don't think the reload caused any trouble.

Looking at the work queue:

$ ssh -p 29418 hashar@gerrit.wikimedia.org gerrit show-queue --wide --by-queue
...

Queue: ReplicateTo-replica_codfw
9ba5ca13              18:23:40.155      [f23988e4] push gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/mediawiki/core.git [refs/changes/60/911960/3 refs/changes/60/911960/meta]
------------------------------------------------------------------------------
  1 tasks, 4 worker threads

Queue: replication_auto-reload-config
04939d4f 18:25:07.553 Apr-04 19:59      com.googlesource.gerrit.plugins.replication.AutoReloadRunnable@727fae4c
------------------------------------------------------------------------------
  1 tasks, 1 worker threads

The AutoReloadRunnable task matches the deployment of https://gerrit.wikimedia.org/r/c/operations/software/gerrit/+/905708 which causes the replication plugin to be reloaded. Apparently that does not work well and might be the cause of the breakage somehow.

The replication entry was for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/911960 and stuck around for a few minutes (causes pushes are regrouped). Patchset 3 which got created a few minutes ago does show up on the replica:

$ git ls-remote https://gerrit-replica.wikimedia.org/r/mediawiki/core refs/changes/60/911960/*
af1d70117733ed6a38cda782420d7b9911013c71	refs/changes/60/911960/2
56b74086b84b50bd68a38816fa9965ef6b584f30	refs/changes/60/911960/3
932aebb6a3913e46e5936618f005b14d0fe02130	refs/changes/60/911960/meta

And the replica has:

$ sudo -u gerrit2 git -C /srv/gerrit/git/mediawiki/core.git/ log -n1
commit 9cc1e890a25578b5dfa878e3b2d5b1454900e746 (HEAD -> master)
Merge: 14bce951e5b 823d2429337
Author: jenkins-bot <jenkins-bot@gerrit.wikimedia.org>
Date:   Thu May 4 18:08:41 2023 +0000

    Merge "installer: Support SSL in the installer"
hashar@gerrit2002:~$ 

The HEAD is correct:

$ sudo -u gerrit2 git -C /srv/gerrit/git/mediawiki/core.git/ rev-parse HEAD
c2a620c74f40ce37d0a7386cf76cbbdbcbd5110f

Doing a full clone from the replica gives the same:

$ git clone https://gerrit-replica.wikimedia.org/r/mediawiki/core
$ cd core
$ git rev-parse HEAD
c2a620c74f40ce37d0a7386cf76cbbdbcbd5110f

So it is no more happening? :-(

Sorry about this, 2 gerrit hosts were fighting over the replication.

So, the new host gerrit1003 (gerrit-new.wikimedia.org) was trying to replicate to the gerrit-replica. That is because gerrit config has it hardcoded and a remote of gerrit2002 in it.

We just realized this a couple minutes ago before seeing this task right now.

The gerrit-replica remote was removed from config on gerrit1003 and service was restarted and puppet is disabled. So this should not happen anymore.

If we start replication manually from 1001 one more time it should be fixed.

temp fix: stopped puppet on gerrit1003, edited /etc/gerrit/replication.config, set replicateOnStartup = false in [gerrit] section and entirely remove the 2 remote sections (github and gerrit2002), restarted gerrit service, confirmed no more attempts in replication log

upcoming fix: make it configurable in puppet to disable it, based on .. ? (we don't have anymore the definition of a single "active" host in Hiera)

If we start replication manually from 1001 one more time it should be fixed.

Just ran ssh -p 29418 gerrit.wikimedia.org replication start --all and the queue is working through a lot of stuff. 2568 tasks, 2 worker threads

Change 915830 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: disable replication from gerrit1003

https://gerrit.wikimedia.org/r/915830

Change 915830 merged by Dzahn:

[operations/puppet@production] gerrit: disable replication from gerrit1003

https://gerrit.wikimedia.org/r/915830

replication from gerrit1003 is stopped and puppet is running, we ran fresh replication from gerrit1001. This shouldn't happen again.

Soon we will switch to 1003 as the source, after a fresh rsync of course, so there won't be "travel back in time".

Sorry about this, 2 gerrit hosts were fighting over the replication.

So, the new host gerrit1003 (gerrit-new.wikimedia.org) was trying to replicate to the gerrit-replica. That is because gerrit config has it hardcoded and a remote of gerrit2002 in it.
...

I have seen the message yesterday and as a result did no further investigation. It definitely explains the issue, I guess gerrit1003 is in a stall process from a snapshot. The new host should have been added as a replica first then the primary can be switched over to it. I will follow up on T326368.

Thanks for the fix!

I have seen the message yesterday and as a result did no further investigation. It definitely explains the issue, I guess gerrit1003 is in a stall process from a snapshot. The new host should have been added as a replica first then the primary can be switched over to it. I will follow up on T326368.

An rsync of all data was performed from 1001 to 1003 but as a one-time action, to prepare for future migration. So in that sense, yes a snapshot in time. Just copying everything over (and fixing permissions) should be equivalent to letting it replicate afaict. It "just worked" except the part where the rsync wasn't recent. In other words, if the rsync would have run automatically every hour probably nobody would have noticed anything.

edit: I now did see your comment about "obsolete left over files that never got garbage collected by Puppet" so that's something to consider. ack!

Change 918589 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: enable replication from gerrit1003, disable from gerrit1001

https://gerrit.wikimedia.org/r/918589

Change 918589 merged by Dzahn:

[operations/puppet@production] gerrit: enable replication from gerrit1003, disable from gerrit1001

https://gerrit.wikimedia.org/r/918589