
gerrit1003 service implementation task
Closed, Resolved, Public


At the request of serviceops, all racking tasks for new hardware also have a sub-task for tracking the serviceops implementation.

Once parent task T326366 shows resolved, this can proceed via the service ops team.

topic branch with related patches:

1. schedule and announce downtime
2. on gerrit1001: shortly before the scheduled downtime:
3. on gerrit1001, as root, in a screen: rsync -avp --delete --bwlimit=100m /var/lib/gerrit2/review_site/ rsync://
4. on gerrit1001, as root, in a screen: rsync -avp --delete --bwlimit=100m /srv/gerrit/ rsync://
5. on gerrit1003: rsync -avp /srv/gerrit/plugins/lfs/ /srv/gerrit/data/lfs/
6. on gerrit1003: chown -R gerrit2:gerrit2 /var/lib/gerrit2
7. on gerrit1003: chown -R gerrit2:gerrit2 /srv/gerrit
8. scheduled downtime begins / IRC announcement
9. on cumin1001: sudo cookbook sre.hosts.downtime -r 'maintenance' -D 30
10. on cumin1001: sudo cookbook sre.hosts.downtime -r 'maintenance' -H 1
11. on - manually schedule downtime for the checks connected to virtual server "". The cookbook does not find this virtual host.
12. on gerrit1003: disable puppet; stop gerrit? (sudo disable-puppet 'gerrit maintenance'; systemctl stop gerrit)
13. merge DNS change that removes gerrit-new and switches IP of - in web UI of gerrit(-old)
14. run authdns-update on, see the diff but do NOT commit yet
15. on gerrit1001: disable puppet; stop gerrit! (sudo disable-puppet 'gerrit maintenance'; systemctl stop gerrit)
16. on gerrit1001, as root, in a screen: rsync -avp --delete --bwlimit=100m /var/lib/gerrit2/review_site/ rsync://
17. on gerrit1001, as root, in a screen: rsync -avp --delete --bwlimit=100m /srv/gerrit/ rsync://
18. on gerrit1003: rsync -avp /srv/gerrit/plugins/lfs/ /srv/gerrit/data/lfs/
19. on gerrit1003: chown -R gerrit2:gerrit2 /var/lib/gerrit2
20. on gerrit1003: chown -R gerrit2:gerrit2 /srv/gerrit
21. on gerrit1003: start gerrit
22. say "yes" to authdns-update and actually merge DNS change that removes gerrit-new and switches IP of
23. wait 5 minutes
24. ..test https ( in browser)
25. ..test ssh (e.g. ssh -p 29418)
26. announce downtime is over
27. ensure gerrit1001 has puppet disabled and/or services are masked
28. grace period (how long?)
29. decom old host ->
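
The two-pass structure of this plan (a bulk sync while gerrit1001 is still serving, then a short final sync once the service is stopped) can be sketched as a dry-run script that only prints the commands in order. This is a minimal sketch, not the procedure that was actually run: DEST_REVIEW_SITE and DEST_SRV are hypothetical placeholders because the rsync targets are not given in the plan, the function names are made up for illustration, and "systemctl start gerrit" is an assumption for the "start gerrit" step.

```shell
# Dry-run sketch: prints the commands of the plan in order, executes nothing.
# The two destinations are hypothetical placeholders; the plan does not state them.
DEST_REVIEW_SITE="${DEST_REVIEW_SITE:-rsync://DEST.invalid/review_site/}"
DEST_SRV="${DEST_SRV:-rsync://DEST.invalid/srv_gerrit/}"

# Print the rsync command for one source tree, using the flags from the plan.
sync_cmd() {
    printf 'rsync -avp --delete --bwlimit=100m %s %s\n' "$1" "$2"
}

# One sync pass: copy both trees, copy LFS data to its new path, fix ownership.
print_sync_pass() {
    echo "gerrit1001# $(sync_cmd /var/lib/gerrit2/review_site/ "$DEST_REVIEW_SITE")"
    echo "gerrit1001# $(sync_cmd /srv/gerrit/ "$DEST_SRV")"
    echo "gerrit1003# rsync -avp /srv/gerrit/plugins/lfs/ /srv/gerrit/data/lfs/"
    echo "gerrit1003# chown -R gerrit2:gerrit2 /var/lib/gerrit2"
    echo "gerrit1003# chown -R gerrit2:gerrit2 /srv/gerrit"
}

print_plan() {
    echo "## pass 1: before the downtime, gerrit still running on gerrit1001"
    print_sync_pass
    echo "## downtime: stop gerrit on both hosts"
    echo "gerrit1003# systemctl stop gerrit"
    echo "gerrit1001# systemctl stop gerrit"
    echo "## pass 2: final sync, rsync skips files unchanged since pass 1"
    print_sync_pass
    echo "gerrit1003# systemctl start gerrit"
}

print_plan
```

Running the second pass only after gerrit is stopped on gerrit1001 is what keeps the copy consistent; the first pass exists purely to shorten the downtime window.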


Repo                      Branch      Lines +/-
integration/config        master      +1 -28
integration/config        master      +28 -1
integration/config        master      +6 -6
operations/puppet         production  +1 -1
operations/puppet         production  +1 -1
operations/puppet         production  +2 -0
operations/puppet         production  +25 -2
operations/puppet         production  +1 -0
operations/puppet         production  +4 -1
operations/puppet         production  +1 -0
operations/puppet         production  +4 -0
operations/homer/public   master      +2 -0
operations/puppet         production  +2 -2
operations/dns            master      +8 -8
operations/dns            master      +8 -8
operations/dns            master      +2 -2
operations/puppet         production  +1 -0
operations/puppet         production  +11 -3
operations/puppet         production  +1 -6
operations/puppet         production  +4 -0
operations/puppet         production  +2 -0
operations/puppet         production  +2 -2
operations/puppet         production  +1 -0
operations/puppet         production  +10 -10
operations/puppet         production  +6 -0
operations/puppet         production  +14 -15


Event Timeline


Change 909792 merged by Cwhite:

[operations/puppet@production] logstash: replace gerrit1001 with gerrit1003 in tests

Change 909796 merged by Dzahn:

[operations/puppet@production] gerrit: add host-based Hiera keys for gerrit1003

Change 910049 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add gerrit prod role to gerrit1003

Change 910064 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add gerrit1003 to rsync dest hosts when using prod role

Change 910064 merged by Dzahn:

[operations/puppet@production] gerrit: add gerrit1003 to rsync dest hosts when using prod role

Change 909791 merged by Dzahn:

[operations/puppet@production] replace gerrit1001 with gerrit1003 as ping target for blackbox smoke

Change 909790 merged by Dzahn:

[operations/puppet@production] acme_chief/gerrit certs: add gerrit1003 to hosts and gerrit-new to SNI

Change 909795 merged by Dzahn:

[operations/puppet@production] cloudgw: allow VMs to speak to new gerrit server (gerrit1003)

Change 910049 merged by Dzahn:

[operations/puppet@production] site: add gerrit prod role to gerrit1003

Mentioned in SAL (#wikimedia-operations) [2023-04-25T21:06:55Z] <mutante> adding production gerrit role to new machine gerrit1003 - monitoring downtimed - but it has a service IP that is going to be added by this and cant be downtimed ? (Bug: T326368)

Mentioned in SAL (#wikimedia-operations) [2023-04-25T21:14:07Z] <mutante> gerrit1003 - manually replacing deploy2002 with deploy1002 in /srv/deployment/gerrit/gerrit-cache/.config to fix initial scap deployment T257317 T326368

Mentioned in SAL (#wikimedia-operations) [2023-04-25T21:17:02Z] <mutante> gerrit1003 - mv /srv/gerrit/plugins/lfs /srv/gerrit/data/ T333143 T326368

Mentioned in SAL (#wikimedia-operations) [2023-04-25T21:19:20Z] <mutante> gerrit1003 - chown -R gerrit2:gerrit2 /srv/gerrit T333143 T326368

Mentioned in SAL (#wikimedia-releng) [2023-04-25T21:23:23Z] <mutante> gerrit1003 - sudo -u gerrit2 /usr/bin/scap deploy-local --repo gerrit/gerrit -D log_json:False (manually it works, but that's the same command that puppet is supposed to run !?) - T257317 T326368

Dzahn updated the task description.

Mentioned in SAL (#wikimedia-operations) [2023-04-25T21:40:30Z] <mutante> gerrit1003 - chown -R gerrit2:gerrit2 /var/lib/gerrit2/review_site/ - T326368

Change 911941 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: make configurable whether service is running

Change 911941 abandoned by Dzahn:

[operations/puppet@production] gerrit: make configurable whether service is running


not needed

Change 914021 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add gerrit1003 to gerrit ssh_allowed hosts

Change 914021 merged by Dzahn:

[operations/puppet@production] gerrit: add gerrit1003 to gerrit ssh_allowed hosts

I wasn't aware of this task until yesterday (via T335730).

I'd like the new host to be added first as a replica rather than as an entirely new primary Gerrit server, and then do a switchover of the service from the current primary gerrit1001 to the new gerrit1003.

Notably, I would like to avoid carrying state from one host to another, since last time it caused multiple issues (we had obsolete leftover files that never got garbage collected by Puppet, and the caches filled the disk after a decade of filling up). In theory the replica can be used as a primary as is, albeit with cold caches at first, but I don't think that will cause problems in practice. Or we can do that for the gerrit2002 replacement?

Change 916639 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] gerrit: switch service name, turn new into current and current into old

Change 916637 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] lower TTL for reverse lookups

I'd like the new host to be added first as a replica rather than as an entirely new primary Gerrit server, and then do a switchover of the service from the current primary gerrit1001 to the new gerrit1003.

Are you suggesting to ONLY let Gerrit replicate and do no rsync at all, or to first let Gerrit replicate and then still rsync afterwards regardless? If the former, it would be different from how we did it last time; if the latter, wouldn't we copy those unwanted files anyway?

Change 916637 merged by Dzahn:

[operations/dns@master] lower TTL for reverse lookups

Change 916639 merged by Dzahn:

[operations/dns@master] gerrit: switch service IP, turn new into current and current into old

Upstream released a security update of Gerrit yesterday (3.5.6). I therefore upgraded gerrit1001 and gerrit2002 to the new version this morning, shortly after the train T336339.

I did not upgrade gerrit1003 since it was not in the dsh group of operations/puppet, which prompted me to create a new gerrit dsh group fully managed by Puppet ( ) and to switch the Gerrit deployment repo to use that file ( ). So now scap deploy runs on all three hosts.

I checked gerrit1003 which is publicly reachable via but can't ssh into it. Turns out it does not have any of the git repositories under /srv/gerrit/git and thus lacks any projects and users. I don't think I can do the Gerrit upgrade without them; we would need to rsync the git repositories from the primary gerrit1001 to gerrit1003 (they are in /srv/gerrit/git).

I made several edits to the migration plan doc that is transcluded here from See my comments there for details.

The main thing is:

  • changed the way we deploy the DNS change to avoid the problem that you can't merge in Gerrit when Gerrit is down :)
  • upped bwlimit to 10/20x faster after confirming we really have gigabit
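
For context on the bandwidth limit: as far as I understand rsync's size suffixes, `--bwlimit=100m` means 100 MiB per second, which is roughly 840 Mbit/s and therefore close to what a gigabit link can carry. The sketch below does that conversion and estimates transfer times. The function names are made up, and the 500 GiB data size is a hypothetical figure for illustration, not the actual size of the Gerrit data.

```shell
# Convert an rsync --bwlimit value given in MiB/s to Mbit/s (integer, rounded down).
mib_per_s_to_mbit_per_s() {
    echo $(( $1 * 1024 * 1024 * 8 / 1000000 ))
}

# Estimate the transfer time in whole minutes (rounded down)
# for SIZE_GIB GiB of data at RATE_MIB MiB/s.
transfer_minutes() {
    size_gib="$1"
    rate_mib="$2"
    echo $(( size_gib * 1024 / rate_mib / 60 ))
}

mib_per_s_to_mbit_per_s 100   # prints 838 (bwlimit=100m)
transfer_minutes 500 100      # prints 85  (hypothetical 500 GiB at 100 MiB/s)
transfer_minutes 500 10       # prints 853 (same data at a 10x lower limit)
```

The estimate ignores protocol overhead and disk speed, so it is a lower bound on the real duration rather than a prediction.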

And some details like how long we are planning to keep the old host around, a decom ticket etc.

..but can't ssh into it. Turns out it does not have any of the git repositories under /srv/gerrit/git and thus lacks any projects and users. I don't think I can do the Gerrit upgrade without them; we would need to rsync..

ACK! This is in progress. I am currently working on fixing this. We had already (slowly, over 4 days originally) synced data and things worked, but then today data got deleted again during our migration window. I am currently syncing again to get it back in order.

To speed things up so we can deploy asap (and before our new window tomorrow), I am syncing with bwlimit 100m right now. I will update you shortly.

@hashar @thcipriani status update:

I have freshly:

  • rsynced /srv/gerrit
  • rsynced /var/lib/gerrit2
  • for the lfs path change: copied /srv/gerrit/plugins/lfs to /srv/gerrit/data/lfs on gerrit1003; I used "cp" rather than "mv" for now, so the data currently exists in both locations
  • ensured it's all owned gerrit2:gerrit2
  • restarted gerrit service
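
One way to double-check the ownership step is to list anything under the synced trees that is not owned by gerrit2:gerrit2. This is a minimal sketch using standard find predicates; the function name is hypothetical and this is not necessarily how ownership was verified here.

```shell
# Print every path under DIR that is not owned by USER or not in GROUP.
# Empty output means the whole tree has the expected ownership.
# Arguments: DIR USER GROUP (names or numeric IDs).
list_wrong_owner() {
    find "$1" \( ! -user "$2" -o ! -group "$3" \) -print
}

# Intended use on gerrit1003 (paths and user/group taken from the steps above):
#   list_wrong_owner /srv/gerrit gerrit2 gerrit2
#   list_wrong_owner /var/lib/gerrit2 gerrit2 gerrit2
```

Because a later rsync run can bring back files with the source host's ownership, a check like this is most useful when repeated after the final sync and chown.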

Now I would like you to try deploying again.

The thing is, I currently see no repos in the gerrit-new web UI at all, even though everything was synced and it did show them before?!

Just realized one more thing we have to remember / add to the plan. After we switch we must activate replication from gerrit1003 and deactivate it from gerrit1001 (re: that recent issue where both gerrit1001 and gerrit1003 were replicating to gerrit2002 and we had to disable it on gerrit1003).

Change 918529 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] Revert "Revert "gerrit: switch service IP, turn new into current and current into old""

Change 918589 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: enable replication from gerrit1003, disable from gerrit1001

steps we have to add:

Change 918529 merged by Dzahn:

[operations/dns@master] Revert "Revert "gerrit: switch service IP, turn new into current and current into old""

Change 918589 merged by Dzahn:

[operations/puppet@production] gerrit: enable replication from gerrit1003, disable from gerrit1001

Change 919151 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] definition/ add Gerrit switchover IPs

Change 919151 merged by Ssingh:

[operations/homer/public@master] definition/ add Gerrit switchover IPs

This has now happened. is now on new hardware, a new IP and a new distro version.

Change 919226 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] acme_chief: add gerrit-old to list of allowed SANs from gerrit hosts

Change 919226 merged by Dzahn:

[operations/puppet@production] acme_chief: add gerrit-old to list of allowed SANs from gerrit hosts

Change 919244 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: disable monitoring for gerrit1001

Change 919246 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add parameter service_ensure, set to stopped on gerrit1001

Service is implemented on gerrit1003.

It is now the production server behind since today.

A few follow-ups will be part of decom'ing the old machine (T336427).

Change 919246 abandoned by Dzahn:

[operations/puppet@production] gerrit: add parameter service_ensure, set to stopped on gerrit1001


replaced in favor of

Change 919359 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: allow masking the service and do so on gerrit1001

Change 919402 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add gerrit1003 to hosts using KexAlgo ecdh-sha2-nistp521 for ssh

Mentioned in SAL (#wikimedia-operations) [2023-05-12T20:08:07Z] <mutante> gerrit1001 - systemctl mask gerrit T326368

Change 919405 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add gerrit1003 SSH host key known_hosts

Dzahn changed the status of subtask T336427: decom gerrit1001 from Stalled to Open. May 12 2023, 9:26 PM

Change 919405 merged by Dzahn:

[operations/puppet@production] gerrit: add gerrit1003 SSH host key known_hosts

Change 919359 merged by Dzahn:

[operations/puppet@production] gerrit: allow masking the service and do so on gerrit1001

Change 919244 merged by Dzahn:

[operations/puppet@production] gerrit: disable monitoring for gerrit1001

Change 919402 merged by Dzahn:

[operations/puppet@production] gerrit: add gerrit1003 to hosts using KexAlgo ecdh-sha2-nistp521 for ssh

Change 924608 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit/bacula: adjust Gerrit file paths to be backed up

Change 924608 merged by Dzahn:

[operations/puppet@production] gerrit/bacula: adjust Gerrit file paths to be backed up

Change 927280 had a related patch set uploaded (by Dzahn; author: Dzahn):

[integration/config@master] update IP in dockerfiles/maven-java8/gerrit_ssh_host_key

Change 927280 merged by jenkins-bot:

[integration/config@master] Dockerfiles: [maven-java8] Update IP

Mentioned in SAL (#wikimedia-releng) [2023-06-05T22:21:59Z] <James_F> Dockerfiles: [maven-java8] Update IP for T326368

Change 927281 had a related patch set uploaded (by Jforrester; author: Jforrester):

[integration/config@master] jjb: Update maven-java8-based jobs to images with new gerrit IP

Change 927281 merged by jenkins-bot:

[integration/config@master] jjb: Update maven-java8-based jobs to images with new gerrit IP

Change 928102 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] Revert "Dockerfiles: [maven-java8] Update IP"

Change 928102 merged by jenkins-bot:

[integration/config@master] Revert "Dockerfiles: [maven-java8] Update IP"