Investigate out of date refs following gerrit switchover
Closed, ResolvedPublic

Description

Following gerrit switch from gerrit2002 to gerrit1003 we noticed missing changes in the gerrit interface.

Initial investigation indicates this is the same problem as T236114.

Fix ongoing

We have a copy of the Zuul mergers repos on contint1002 and contint2002 each under /srv/zuul/git-20250430.

Event Timeline

thcipriani triaged this task as Unbreak Now! priority.Apr 30 2025, 5:12 PM

The current hypothesis is that during the DNS change both hosts were considered to be primary and unexpected replication took place for approximately an hour and 20 minutes (at most). The result was gerrit2002 deleting changes made on gerrit1003.

https://sal.toolforge.org/production?p=0&q=authdns-update&d=

There seems to be a gap between when authdns-update was started and when it completed. Note the timestamps: 15:15 (start) and 15:29 (end). I am not sure of the reason for this, but a) it does not normally take this long, and b) the changes are not considered "live" until authdns-update finishes running on all hosts. So if some other process or cookbook that depended on these DNS changes was started in between, that might have been a problem. Something to keep in mind as we debug this, if it fits in with the current investigation.
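To make the timing concern concrete, here is a minimal Python sketch that checks whether some other action started inside the authdns-update window. The 15:15 and 15:29 times come from the SAL entries mentioned above; the calendar date, the helper name, and any event time passed to it are assumptions for illustration only.

```python
from datetime import datetime

# authdns-update window from the SAL entries.
# The date (2025-04-30) is an assumption; only the times are quoted above.
DNS_UPDATE_START = datetime(2025, 4, 30, 15, 15)
DNS_UPDATE_END = datetime(2025, 4, 30, 15, 29)


def started_during_dns_update(event_time: datetime) -> bool:
    """Return True if event_time falls inside the authdns-update window.

    Hypothetical helper: anything started in this window may have seen
    DNS records that were not yet live on all hosts.
    """
    return DNS_UPDATE_START <= event_time < DNS_UPDATE_END


window = DNS_UPDATE_END - DNS_UPDATE_START
print(window)  # 0:14:00
```

Any cookbook step or manual action whose start time makes `started_during_dns_update` return True would be worth a closer look.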

Replication from gerrit2002 -> gerrit1003 has affected two repos, outside of draft comments: SmashPig and operations/mediawiki-config. We brought Gerrit back with those repos marked Read Only to allow a partial restore of service.

The affected change in SmashPig was abandoned, which leaves operations/mediawiki-config as the remaining one.

Note that SmashPig is still set to read only.

Change #1140241 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/mediawiki-config@refs/meta/config] Allow force push to reconstruct repo

https://gerrit.wikimedia.org/r/1140241

Change #1140241 merged by Hashar:

[operations/mediawiki-config@refs/meta/config] Allow force push to reconstruct repo

https://gerrit.wikimedia.org/r/1140241

Change #1140242 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/mediawiki-config@refs/meta/config] Revert "Allow force push to reconstruct repo"

https://gerrit.wikimedia.org/r/1140242

Change #1140242 merged by Hashar:

[operations/mediawiki-config@refs/meta/config] Revert "Allow force push to reconstruct repo"

https://gerrit.wikimedia.org/r/1140242

LSobanski lowered the priority of this task from Unbreak Now! to High.May 1 2025, 12:11 PM

Dropping to High as the service is operational.

Puppet on wikireplica hosts is failing when pulling from operations/mediawiki-config

/Stage[main]/Profile::Wmcs::Db::Scriptconfig/Git::Clone[operations/mediawiki-config]/Exec[git_pull_operations/mediawiki-config]/returns

'/usr/bin/git pull --recurse-submodules --quiet' returned 128 instead of one of [0]

'hint: You have divergent branches and need to specify how to reconcile them.'
mvernon@db1155:/usr/local/lib/mediawiki-config$ sudo git status
On branch master
Your branch and 'origin/master' have diverged,
and have 1 and 2 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)

nothing to commit, working tree clean

The commit on master (and not origin/master) is:

commit d145baae7de90dce6835e578730d61889e4805f1 (HEAD -> master)
Author: Antoine Musso <hashar@free.fr>
Date:   Wed Apr 30 16:43:20 2025 +0000

    group1 to 1.44.0-wmf.27
    
    Bug: T386222
    Change-Id: Ifd99378b067082f0a3dad2844ad046bf26008fd2

Which is from https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1140206

Looking at puppetboard, it looks like the following are currently broken because of this:

  • clouddb1015
  • clouddb1017
  • clouddb1018
  • clouddb1019
  • clouddb1020
  • db1154
  • db1155
  • db2186
  • db2187

Change #1140506 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: enable bacula backups on gerrit2002

https://gerrit.wikimedia.org/r/1140506

The simplest fix here is git reset --hard origin/master on those hosts.

Locally, you have two changes with the same parent commit. Both of these changes came from the gerrit server due to this incident.

There was no correct state we could restore to on the gerrit host, so we picked the first change as the correct change on the gerrit server. If you reset to origin/master then you will reset your local copy of this repo to what we have on the server.

Longer explanation:

  • We had two gerrit primaries for a short period of time, gerrit1003 and gerrit2002, each trying to replicate to the other
  • gerrit.wikimedia.org pointed to gerrit1003, the intended primary
  • Two changes merged to operations/mediawiki-config during the time we had two primaries
  • After the first change merged on gerrit1003, gerrit2002's replication overwrote that change
  • Then another change merged.
  • If you pulled after the second change was merged, you may end up in this state.
  • I believe having a git config that tells git how to reconcile this should make the pull invisible (depending on your config, you may just see "(forced-update)"). Or you could manually fix this issue with git reset --hard origin/master. I think the latter is best, since this problem should never happen; this was an exceptional situation.
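As a rough illustration of why git status reports "1 and 2 different commits", here is a small Python model of a diverged history. It does not call git, and the commit names and graph shape are hypothetical; the task does not show the actual commit graph, only the counts.

```python
def ancestors(commit, parents):
    """Return the set containing commit and every commit reachable from it."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def ahead_behind(local, remote, parents):
    """Count commits only on the local side and only on the remote side."""
    local_set = ancestors(local, parents)
    remote_set = ancestors(remote, parents)
    return len(local_set - remote_set), len(remote_set - local_set)


# Hypothetical history: "base" is the last commit both sides agree on.
parents = {
    "base": [],
    "local_only": ["base"],    # pulled by a host, later gone from the server
    "server_a": ["base"],      # what the server ended up with instead
    "server_b": ["server_a"],
}

print(ahead_behind("local_only", "server_b", parents))  # (1, 2)

# A hard reset simply moves the local branch to the server's tip,
# discarding the commit that exists only locally.
local_after_reset = "server_b"
print(ahead_behind(local_after_reset, "server_b", parents))  # (0, 0)
```

In this model the hard reset is safe only because the local-only commit is not wanted; that matches the situation described here, where the server state was chosen as authoritative.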

Change #1140507 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: enable backups on gerrit2003

https://gerrit.wikimedia.org/r/1140507

@thcipriani is this still a blocker or are we good for group1/all wiki promotion today?

We should be good. We reset the deployment host and deploys should be OK. What is deployed is the current state of operations/mediawiki-config. Note that @hashar made a patch to promote to group1. That patch looks merged, but it is NOT merged. We are on group0 and that is the current state of operations/mediawiki-config, too.

Thank you Daniel. This fixed the issue.
I've fixed all sanitarium hosts and the 8 clouddb hosts + an-redacteddb1001

Change #1141862 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: probe DNS on both hosts before doing stuff

https://gerrit.wikimedia.org/r/1141862

Change #1142793 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: safeguards against corruptions and mishaps

https://gerrit.wikimedia.org/r/1142793

Change #1143102 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: grepping for misconfigurations

https://gerrit.wikimedia.org/r/1143102

Change #1144565 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: git backup tree consistency checker

https://gerrit.wikimedia.org/r/1144565

Change #1140506 merged by Arnaudb:

[operations/puppet@production] gerrit: enable bacula backups on gerrit2002

https://gerrit.wikimedia.org/r/1140506

Change #1145208 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: lock and preflight checks

https://gerrit.wikimedia.org/r/1145208

Change #1140507 merged by Dzahn:

[operations/puppet@production] gerrit: enable backups on gerrit2003

https://gerrit.wikimedia.org/r/1140507

19:48 <+icinga-wm> PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (gerrit2003), No backups: 1 (gerrit2003), Fresh: 143 jobs 
                   https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
19:59 < mutante> ^ I just merged a change to add that host to backups earlier.. so it's not a surprise to me there isn't one yet. I guess it's a known minor issue it alerts on that.

Change #1159395 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/cookbooks@master] gerrit: read-only plugin orchestration in failover

https://gerrit.wikimedia.org/r/1159395

Change #1141862 merged by jenkins-bot:

[operations/cookbooks@master] gerrit: probe DNS on both hosts before doing stuff

https://gerrit.wikimedia.org/r/1141862

Change #1142793 merged by jenkins-bot:

[operations/cookbooks@master] gerrit: rsync --checksum local backup safety net

https://gerrit.wikimedia.org/r/1142793

Change #1143102 merged by jenkins-bot:

[operations/cookbooks@master] gerrit: grepping for misconfigurations

https://gerrit.wikimedia.org/r/1143102

Change #1144565 merged by jenkins-bot:

[operations/cookbooks@master] gerrit: git backup tree consistency checker

https://gerrit.wikimedia.org/r/1144565

Change #1145208 merged by jenkins-bot:

[operations/cookbooks@master] gerrit: lock, preflight checks, hieradata lookups, verbosity

https://gerrit.wikimedia.org/r/1145208

Change #1159395 merged by jenkins-bot:

[operations/cookbooks@master] gerrit: read-only plugin orchestration in failover

https://gerrit.wikimedia.org/r/1159395

LSobanski claimed this task.

Follow-up work is happening in T387833: Gerrit switchover process. Resolving.