
setup gerrit2003 with gerrit service (gerrit on bookworm)
Closed, ResolvedPublic

Description

gerrit2003 is new hardware and on bookworm

  • prepare hiera data and puppet code to allow applying the production gerrit role without starting any services / no influence on production
  • apply the gerrit production role and check for puppet issues / missing packages etc
  • determine if this is resolved once it's a warm standby host or if we switch production to this because it's newer hardware

Details

Related Changes in Gerrit:
Repo               Branch      Lines +/-
operations/puppet  production  +2 -4
operations/puppet  production  +3 -2
operations/puppet  production  +1 -1
operations/puppet  production  +16 -8
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +0 -41
operations/puppet  production  +0 -27
operations/puppet  production  +1 -1
operations/puppet  production  +1 -0
operations/puppet  production  +7 -0
operations/puppet  production  +12 -0
operations/puppet  production  +4 -0
operations/puppet  production  +8 -3
operations/puppet  production  +1 -3
operations/puppet  production  +1 -1
operations/puppet  production  +1 -3
operations/puppet  production  +17 -3
operations/puppet  production  +12 -1
operations/puppet  production  +1 -0
operations/puppet  production  +5 -0
operations/puppet  production  +1 -1
operations/puppet  production  +1 -0
operations/puppet  production  +3 -0
operations/puppet  production  +17 -23
operations/puppet  production  +33 -0
operations/puppet  production  +7 -1
operations/puppet  production  +11 -0
operations/puppet  production  +4 -0

Related Objects

Status       Assigned
Open         None
Open         None
Open         None
Resolved     None
Open         None
Open         None
Open         None
Resolved     Jhancock.wm
Open         None
Open         None
In Progress  ABran-WMF
Open         None
Resolved     ABran-WMF
Resolved     ABran-WMF
Resolved     MatthewVernon
Resolved     LSobanski
Resolved     ABran-WMF
Open         ABran-WMF
Resolved     LSobanski
Resolved     hashar
Open         None
Resolved     hashar
Resolved     Dzahn
In Progress  ABran-WMF
Resolved     Dzahn
Open         None
Resolved     Marostegui
Resolved     Dzahn
Open         None

Event Timeline


Change #1070683 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add backup::host, gerrit::migration etc to insetup role

https://gerrit.wikimedia.org/r/1070683

Change #1070683 merged by Dzahn:

[operations/puppet@production] gerrit: add backup::host, gerrit::migration etc to insetup role

https://gerrit.wikimedia.org/r/1070683

Change #1072323 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add gerrit::proxy profile to insetup::gerrit role

https://gerrit.wikimedia.org/r/1072323

Change #1072323 merged by Dzahn:

[operations/puppet@production] gerrit: add gerrit::proxy profile to insetup::gerrit role

https://gerrit.wikimedia.org/r/1072323

Change #1073305 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit::proxy: files managed under /var/www/ require httpd

https://gerrit.wikimedia.org/r/1073305

Change #1073308 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit::proxy: fix link target for gerrit logo

https://gerrit.wikimedia.org/r/1073308

Change #1073308 merged by Dzahn:

[operations/puppet@production] gerrit::proxy: fix link target for gerrit logo

https://gerrit.wikimedia.org/r/1073308

Change #1073305 merged by Dzahn:

[operations/puppet@production] gerrit::proxy: ensure /var/www/ exists before files under it

https://gerrit.wikimedia.org/r/1073305

Change #1074275 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] acme_chief: authorize new machine gerrit2003 to fetch gerrit certs

https://gerrit.wikimedia.org/r/1074275

Change #1074275 merged by Dzahn:

[operations/puppet@production] acme_chief: authorize new machine gerrit2003 to fetch gerrit certs

https://gerrit.wikimedia.org/r/1074275

Change #1074477 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: include gerrit profile in insetup::gerrit for testing

https://gerrit.wikimedia.org/r/1074477

Change #1074498 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add acme_chief snippet to gerrit-setup role

https://gerrit.wikimedia.org/r/1074498

Change #1074498 merged by Dzahn:

[operations/puppet@production] gerrit: add acme_chief to gerrit-setup role

https://gerrit.wikimedia.org/r/1074498

gerrit2003 now has a working Apache-based gerrit::proxy with certs and no puppet errors.

What is still missing is the actual gerrit application, and we avoided adding any service IP.

Change #1077781 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: make it possible to not bind the service IP on a gerrit server

https://gerrit.wikimedia.org/r/1077781

Change #1077781 merged by Dzahn:

[operations/puppet@production] gerrit: make it possible to not bind the service IP on a gerrit server

https://gerrit.wikimedia.org/r/1077781

Change #1074477 merged by Dzahn:

[operations/puppet@production] gerrit: include gerrit profile in insetup::gerrit for testing

https://gerrit.wikimedia.org/r/1074477

Change #1078748 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: move passwords include from role to profile

https://gerrit.wikimedia.org/r/1078748

Change #1078748 merged by Dzahn:

[operations/puppet@production] gerrit: move passwords include from role to profile

https://gerrit.wikimedia.org/r/1078748

Change #1078752 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: sync lfs data also to new machine

https://gerrit.wikimedia.org/r/1078752

Mentioned in SAL (#wikimedia-operations) [2024-10-08T21:34:38Z] <mutante> gerrit2003 - sudo -u gerrit-deploy /usr/bin/scap deploy-local --repo gerrit/gerrit -D log_json:False (for some reason this fails in puppet but works manually) T372804 T257317 T317412

Change #1078759 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: avoid duplicate declaration error on first setup

https://gerrit.wikimedia.org/r/1078759

Change #1078759 merged by Dzahn:

[operations/puppet@production] gerrit: avoid duplicate declaration error on first setup

https://gerrit.wikimedia.org/r/1078759

Change #1079026 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: comment out creation of site dir in migration profile

https://gerrit.wikimedia.org/r/1079026

Change #1079026 merged by Dzahn:

[operations/puppet@production] gerrit: comment out creation of site dir in migration profile

https://gerrit.wikimedia.org/r/1079026

For the first time, puppet now runs cleanly on the new hardware, before it is in production.

Gerrit is also already deployed there. Everything is in place, except that no service IP is bound to the NIC, we don't sync lfs-data yet, and ?.

Notably, this also means gerrit on bookworm seems to work: no more puppet issues, the app is deployed, and the Java version is the same.

Mentioned in SAL (#wikimedia-operations) [2024-10-10T07:32:23Z] <hashar> Stopped gerrit service on gerrit2003.codfw.wmnet since it is not starting up properly | T372804

The gerrit process on gerrit2003 does not start properly and is flapping:

Oct 10 07:29:21 gerrit2003 systemd[1]: gerrit.service: Scheduled restart job, restart counter is at 19328.
Oct 10 07:29:30 gerrit2003 systemd[1]: gerrit.service: Scheduled restart job, restart counter is at 19329.
Oct 10 07:29:38 gerrit2003 systemd[1]: gerrit.service: Scheduled restart job, restart counter is at 19330.
Oct 10 07:29:47 gerrit2003 systemd[1]: gerrit.service: Scheduled restart job, restart counter is at 19331.
Oct 10 07:29:56 gerrit2003 systemd[1]: gerrit.service: Scheduled restart job, restart counter is at 19332.
Oct 10 07:30:04 gerrit2003 systemd[1]: gerrit.service: Scheduled restart job, restart counter is at 19333.
Oct 10 07:30:13 gerrit2003 systemd[1]: gerrit.service: Scheduled restart job, restart counter is at 19334.

I tried to stop it manually, but of course Puppet brings it back up. That is causing alerts as the service goes up and down continuously. The service and its monitoring should be disabled until it's ready.
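To put a number on the flapping, here is a minimal Python sketch that parses the journal lines quoted above and estimates the average time between restart attempts. The function names (`parse_restarts`, `mean_interval_seconds`) and the `year` parameter are hypothetical helpers for illustration, not part of any existing tooling, and the regular expression only assumes the line layout shown in this comment.

```python
import re
from datetime import datetime

# Matches systemd "Scheduled restart job" lines in the layout quoted above.
RESTART_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ systemd\[1\]: "
    r"(?P<unit>\S+): Scheduled restart job, "
    r"restart counter is at (?P<count>\d+)\.$"
)


def parse_restarts(lines, year=2024):
    """Return (timestamp, unit, counter) tuples for matching lines.

    Syslog-style timestamps carry no year, so it is passed in (assumption).
    """
    events = []
    for line in lines:
        match = RESTART_RE.match(line.strip())
        if match:
            ts = datetime.strptime(
                f"{year} {match['ts']}", "%Y %b %d %H:%M:%S"
            )
            events.append((ts, match["unit"], int(match["count"])))
    return events


def mean_interval_seconds(events):
    """Average seconds between consecutive restart events, or None."""
    if len(events) < 2:
        return None
    span = (events[-1][0] - events[0][0]).total_seconds()
    return span / (len(events) - 1)


# The seven lines from the journal excerpt above.
SAMPLE = [
    "Oct 10 07:29:21 gerrit2003 systemd[1]: gerrit.service: Scheduled restart job, restart counter is at 19328.",
    "Oct 10 07:29:30 gerrit2003 systemd[1]: gerrit.service: Scheduled restart job, restart counter is at 19329.",
    "Oct 10 07:29:38 gerrit2003 systemd[1]: gerrit.service: Scheduled restart job, restart counter is at 19330.",
    "Oct 10 07:29:47 gerrit2003 systemd[1]: gerrit.service: Scheduled restart job, restart counter is at 19331.",
    "Oct 10 07:29:56 gerrit2003 systemd[1]: gerrit.service: Scheduled restart job, restart counter is at 19332.",
    "Oct 10 07:30:04 gerrit2003 systemd[1]: gerrit.service: Scheduled restart job, restart counter is at 19333.",
    "Oct 10 07:30:13 gerrit2003 systemd[1]: gerrit.service: Scheduled restart job, restart counter is at 19334.",
]

if __name__ == "__main__":
    events = parse_restarts(SAMPLE)
    print(f"{len(events)} restarts, mean interval "
          f"{mean_interval_seconds(events):.1f}s")
```

On the excerpt, the seven restarts span 52 seconds, so the service was retrying about every 8.7 seconds. If that interval had been constant, a counter above 19,000 would correspond to roughly 46 hours of flapping, though that is only an extrapolation from this short sample.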

As a side note, I don't know what gerrit2003 is for. Is it a hardware refresh for gerrit2002?

Change #1079206 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] Disable gerrit monitoring on gerrit2003

https://gerrit.wikimedia.org/r/1079206

Change #1079206 abandoned by Hashar:

[operations/puppet@production] Disable gerrit monitoring on gerrit2003

Reason:

Thank you for having set the downtime!

https://gerrit.wikimedia.org/r/1079206

Change #1079206 restored by Dzahn:

[operations/puppet@production] Disable gerrit monitoring on gerrit2003

https://gerrit.wikimedia.org/r/1079206

Change #1079206 merged by Dzahn:

[operations/puppet@production] Disable gerrit monitoring on gerrit2003

https://gerrit.wikimedia.org/r/1079206

Change #1079358 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: set Hiera keys for nist_keys, nftables

https://gerrit.wikimedia.org/r/1079358

Change #1079358 merged by Dzahn:

[operations/puppet@production] gerrit: set Hiera keys for nist_keys, nftables

https://gerrit.wikimedia.org/r/1079358

Change #1079363 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit2003: move bind_serviceIP Hiera key host name level

https://gerrit.wikimedia.org/r/1079363

Change #1079363 merged by Dzahn:

[operations/puppet@production] gerrit2003: move bind_service_ip Hiera key host name level

https://gerrit.wikimedia.org/r/1079363

Change #1078752 merged by Dzahn:

[operations/puppet@production] gerrit: sync lfs data also to new machine

https://gerrit.wikimedia.org/r/1078752

Change #1063893 merged by Dzahn:

[operations/puppet@production] site: apply gerrit role on gerrit2003

https://gerrit.wikimedia.org/r/1063893

Change #1080379 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: remove Hiera keys on host gerrit2003 that are applied by role

https://gerrit.wikimedia.org/r/1080379

Change #1080379 merged by Dzahn:

[operations/puppet@production] gerrit: remove Hiera keys on host gerrit2003 that are applied by role

https://gerrit.wikimedia.org/r/1080379

Dzahn changed the task status from In Progress to Stalled. Oct 17 2024, 5:42 PM

Basically done.

All that is missing is we haven't assigned a service IP to this machine.

In puppet it is just set to not bind a service IP and the service is masked.
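As a hedged illustration of how the masked state could be verified on the host, the sketch below wraps `systemctl is-enabled`, which prints `masked` (or `masked-runtime`) for a masked unit. The helper names `is_masked_state` and `unit_is_masked` are hypothetical, and the default unit name `gerrit.service` is taken from the journal output earlier in this task.

```python
import subprocess

# States that "systemctl is-enabled" reports for a masked unit.
MASKED_STATES = {"masked", "masked-runtime"}


def is_masked_state(state: str) -> bool:
    """Classify the text printed by "systemctl is-enabled"."""
    return state.strip() in MASKED_STATES


def unit_is_masked(unit: str = "gerrit.service") -> bool:
    """Ask systemd whether the unit is masked (requires systemctl)."""
    # is-enabled exits non-zero for masked units, so do not raise on that.
    result = subprocess.run(
        ["systemctl", "is-enabled", unit],
        capture_output=True,
        text=True,
        check=False,
    )
    return is_masked_state(result.stdout)
```

Calling `unit_is_masked()` on gerrit2003 should return `True` as long as the unit stays masked; this only inspects systemd state and does not check the Puppet or Hiera settings that produce it.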

But it could be enabled now whenever we like. Also lfs data is already synced from prod server.

To be determined how exactly we use it next.

One way would be to assign a name like "gerrit-replica-b" to it.

Another would be to fail-over to this machine as the new production gerrit (since it's newer hardware and also newer OS version!) and then make the previous prod server a replica.

Change #1074488 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: delete temp gerrit setup role

https://gerrit.wikimedia.org/r/1074488

Change #1074488 merged by Dzahn:

[operations/puppet@production] gerrit: delete temp gerrit setup role

https://gerrit.wikimedia.org/r/1074488

Dzahn changed the task status from Stalled to In Progress. Oct 21 2024, 5:26 PM
Dzahn added a subscriber: Jelto.

Discussed in today's team meeting. This server is up and running with the production role on it now, so the task to set up the service on this machine is considered done.

There is just no service name attached to it and the gerrit service is masked. Monitoring is disabled.

How to use it will be determined in the upcoming work regarding Gerrit failover strategies.

But as has been pointed out by @Jelto the current Gerrit prod server is not that old either. The current replica is older.

It might make the most sense to turn this into a new replica, leave the current prod host as is and also keep using the current old replica, to have 3 Gerrit machines at the same time.

LFS data syncing from the prod server is also already set up and has run via the puppetized timer:

root@gerrit2003:/srv/gerrit/data/lfs# du -hs .
31G

Dzahn renamed this task from setup gerrit2003 with gerrit service to setup gerrit2003 with gerrit service (gerrit on bookworm). Oct 24 2024, 12:42 AM

Change #1087967 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add chown parameter to lfs data rsync, ensure daemon_user is used

https://gerrit.wikimedia.org/r/1087967

Change #1087967 merged by Dzahn:

[operations/puppet@production] gerrit: add chown parameter to lfs data rsync, ensure daemon_user is used

https://gerrit.wikimedia.org/r/1087967

We need a follow-up task to _actually start using_ this new server and fail over gerrit to it.

Change #1140520 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add a second replica, start replicating to gerrit2003

https://gerrit.wikimedia.org/r/1140520

Change #1140520 merged by Dzahn:

[operations/puppet@production] gerrit: add a second replica, start replicating to gerrit2003

https://gerrit.wikimedia.org/r/1140520

Change #1152782 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add ssh_host_rsa public key for gerrit2003

https://gerrit.wikimedia.org/r/1152782

Change #1152782 merged by Dzahn:

[operations/puppet@production] gerrit: add ssh_host_rsa public key for gerrit2003

https://gerrit.wikimedia.org/r/1152782

Change #1152810 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: introduce second daemon_user name

https://gerrit.wikimedia.org/r/1152810

Change #1152819 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: replace gerrit2003 RSA host key with ed25519 host key

https://gerrit.wikimedia.org/r/1152819

I have reverted the replica configuration since that broke GitHub replication and the new gerrit2003 host was misconfigured ( T395887#10879856 )

Change #1153159 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] Revert "gerrit: add a second replica, start replicating to gerrit2003"

https://gerrit.wikimedia.org/r/1153159

Change #1152819 merged by Dzahn:

[operations/puppet@production] gerrit: replace gerrit2003 RSA host key with ed25519 host key

https://gerrit.wikimedia.org/r/1152819

Change #1152810 abandoned by Dzahn:

[operations/puppet@production] gerrit: introduce second daemon_user name

Reason:

will make a new simpler patch rebased on top of a re-revert of the replication config change

https://gerrit.wikimedia.org/r/1152810

Since yesterday we are now replicating to the new machine gerrit2003 again.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153265

@ABran-WMF I wonder if you have thoughts on my original question on this ticket. Back in August 2024 I said:

"determine if this is resolved once it's a warm standby host or if we switch production to this because it's newer hardware".

At this stage, I think we'll still wait for a while:

  • I'll test things on gerrit2003 to make sure our next switchover goes smoothly. Maybe we can call it done when we're also done with T338470: Rename gerrit2 unix user to gerrit and assign a fixed uid, as we'll be running production on gerrit2003, after which I'll make gerrit2003 a warm standby again.
  • Also, I'll make gerrit run as --replica as well on gerrit2003 to align with our chosen pattern.

That makes sense to me. Sounds good. Thank you!

Change #1167838 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: enable gerrit.service and monitoring

https://gerrit.wikimedia.org/r/1167838

Change #1167838 merged by Arnaudb:

[operations/puppet@production] gerrit: enable gerrit.service and monitoring

https://gerrit.wikimedia.org/r/1167838

Dzahn changed the task status from Open to In Progress. Sep 26 2025, 4:00 PM
Dzahn raised the priority of this task from Medium to High.

I would argue that this is already resolved because clearly a gerrit exists on gerrit2003.

But we agreed to close this out once it's actually the production gerrit.

@ABran-WMF is working on this right now: --> https://gerrit.wikimedia.org/r/c/operations/puppet/+/1188351

I feel like this ticket is resolved regardless of fail-over or not. Clearly we have a gerrit service running on this machine and on the new OS; if we did not, we would not have even attempted to make it prod.

I am happy to be convinced otherwise and if you want to reopen it that's not a big deal to me. But all the remaining work has been happening on other tickets more specific to the failover anyways.