
upgrade gerrit servers to bullseye
Closed, Resolved, Public

Description

Currently we have 3 gerrit servers: gerrit1001, gerrit2002 and gerrit1003.

1001 and 2002 are on buster and in production. They shouldn't be on buster. -> T327068.

1003 is on bullseye and not yet in production; it is new hardware racked in T326366.

Separately there is T326368 to implement the service on gerrit1003.

This ticket is resolved when there are no more gerrit servers on buster in production, one way or another.

So it consists of the parts:

  • get T326368 resolved, so that gerrit1003 is in production
  • reimage gerrit2002, so that gerrit2002 is on bullseye and in production
  • decom gerrit1001 -> T336427

The first 2 steps can also be flipped around.

Event Timeline

Restricted Application added a subscriber: Aklapper. Apr 11 2023, 8:35 PM
LSobanski moved this task from Incoming to Backlog on the collaboration-services board.
Dzahn changed the task status from Open to In Progress. Apr 20 2023, 4:56 PM
Dzahn claimed this task.

basically duplicate of https://phabricator.wikimedia.org/T326368 with small variation that the cloud instance should also be upgraded and gerrit1001 be decom'ed

gerrit1003 is now the production server and on bullseye.

remaining: upgrade gerrit2002 to bullseye (implied in this task)
remaining: shut down gerrit1001 (decom subtask)

@thcipriani next is we need to reimage gerrit2002 with bullseye.. so downtime of the replica during an entire reinstall... or... some kind of switch-over.. We said first thing we want to do is get clients off of using the replica. So we will open a new ticket for just that.

Change 920761 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] extdist: switch git URLs from gerrit-replica to gerrit

https://gerrit.wikimedia.org/r/920761

Change 920765 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: make new lfs path the default and clean up

https://gerrit.wikimedia.org/r/920765

Change 920773 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit2002: mask gerrit service

https://gerrit.wikimedia.org/r/920773

the plan for gerrit2002:

  • on Thursday, May 25th:
  • schedule downtime, monitoring downtime
  • mask gerrit service on gerrit2002 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/920773
  • run re-image cookbook on gerrit2002
  • machine comes back up with bullseye, gerrit service stays masked, replica has no data
  • releng starts fresh replication from gerrit1003, gerrit service on gerrit2002 stays masked
  • releng monitors when replication is done, we revert the patch above, gerrit service on gerrit2002 comes up again
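The plan above depends on the gerrit service staying masked on gerrit2002 until replication has finished. A minimal sketch of how that precondition could be checked on the host before replication starts, assuming the systemd unit is simply named `gerrit` (the unit name and the helper `unit_is_masked` are illustrative assumptions, not taken from the puppet repo):

```python
import subprocess


def unit_is_masked(unit, run=subprocess.run):
    """Return True if `systemctl is-enabled <unit>` reports 'masked'.

    `run` is injectable so the check can be exercised without systemd.
    """
    result = run(
        ["systemctl", "is-enabled", unit],
        capture_output=True,
        text=True,
        check=False,  # is-enabled exits non-zero for masked units
    )
    return result.stdout.strip() == "masked"


# Hypothetical usage on the replica, before starting replication:
#   if not unit_is_masked("gerrit"):
#       raise SystemExit("gerrit is not masked, stop before replicating")
```

Running a check like this right after the first Puppet run would flag a service that came up unexpectedly.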

notes: codesearch is currently on gerrit, and not gerrit-replica, at least until after migration is done. (cc: @Ladsgroup @Legoktm )
note: extension distributor though is still actively on gerrit-replica and https://gerrit.wikimedia.org/r/c/operations/puppet/+/920761 remains to be decided

Change 920761 abandoned by Hashar:

[operations/puppet@production] extdist: switch git URLs from gerrit-replica to gerrit

Reason:

Let's keep it on the replica per my previous analysis. The tarballs will be stale while the replica is gone, which is not the end of the world given there are barely any commits made for release branches :)

https://gerrit.wikimedia.org/r/920761

Change 920765 merged by Dzahn:

[operations/puppet@production] gerrit: remove lfs_dir parameter, use hardcoded new default

https://gerrit.wikimedia.org/r/920765

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host gerrit2002.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host gerrit2002.wikimedia.org with OS bullseye completed:

  • gerrit2002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305251533_dzahn_3574334_gerrit2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 923378 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: update SSH host key for reimaged gerrit2002

https://gerrit.wikimedia.org/r/923378

Change 923378 merged by Dzahn:

[operations/puppet@production] gerrit: update SSH host key for reimaged gerrit2002

https://gerrit.wikimedia.org/r/923378

Mentioned in SAL (#wikimedia-releng) [2023-05-25T16:37:41Z] <hashar> ssh -p 29418 gerrit.wikimedia.org replication start --url gerrit2002 --all --wait # T334521

Mentioned in SAL (#wikimedia-releng) [2023-05-25T16:44:20Z] <hashar> gerrit2002: creating lucene indices: java -jar /var/lib/gerrit2/review_site/bin/gerrit.war reindex --index groups # T334521

gerrit2002 has been reimaged and is back up and running on bullseye, currently replication from gerrit1003 is ongoing.

Mentioned in SAL (#wikimedia-releng) [2023-05-25T17:30:56Z] <hashar> gerrit2002 replication eventually has completed at some point: Replication completed with some errors! # T334521

The full replication log which I triggered from my machine is P48560:
It ends with:

Replication completed with some errors!

And the errors:

Error: Failed replicate of refs/master to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/labs/private.git, reason: funny refname
Error: Failed replicate of refs/master to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/mediawiki/services/ores/deploy.git, reason: funny refname
Error: Failed replicate of refs/for2.4.4 to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/mediawiki/skins/Metrolook.git, reason: funny refname
Error: Failed replicate of refs/for3.0-beta-9 to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/mediawiki/skins/Metrolook.git, reason: funny refname
Error: Failed replicate of refs/master to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/operations/debs/kubeyaml.git, reason: funny refname
Error: Failed replicate of refs/wmf-192fix to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/operations/software/nginx.git, reason: funny refname
Error: Failed replicate of refs/master to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/operations/software/puppet-compiler.git, reason: funny refname

Which should be investigated: T337508
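As a starting point for that investigation, here is a small sketch that groups the error lines above by repository and flags refs sitting directly under `refs/` (such as `refs/master`). All seven failures have that shape. That git refuses to create such one-level refs on the receiving side is my reading of the "funny refname" message, not something confirmed in this task. The helper names are made up for illustration, and the sample is just the first two error lines.

```python
import re
from collections import defaultdict

# Matches error lines in the format shown in the replication log above.
ERROR_RE = re.compile(
    r"^Error: Failed replicate of (?P<ref>\S+) to "
    r"\S+?:(?P<path>\S+?), reason: (?P<reason>.+)$"
)


def parse_errors(log_text):
    """Group (ref, reason) pairs by the repository path on the replica."""
    by_repo = defaultdict(list)
    for line in log_text.splitlines():
        match = ERROR_RE.match(line.strip())
        if match:
            by_repo[match["path"]].append((match["ref"], match["reason"]))
    return dict(by_repo)


def is_one_level(ref):
    """True for refs like refs/master: a single component under refs/."""
    parts = ref.split("/")
    return len(parts) == 2 and parts[0] == "refs"


SAMPLE = """\
Error: Failed replicate of refs/master to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/labs/private.git, reason: funny refname
Error: Failed replicate of refs/master to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/mediawiki/services/ores/deploy.git, reason: funny refname
"""

errors = parse_errors(SAMPLE)
for repo, failures in errors.items():
    for ref, reason in failures:
        print(repo, ref, reason, "one-level" if is_one_level(ref) else "")
```

Feeding the full log from P48560 through `parse_errors` would give the per-repository list of refs to look at on the primary.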

gerrit2002.wikimedia.org seems to be up and operational, and the primary successfully replicates to it :]

I think that concludes our adventure. There are follow up actions T313553, T337502 and T337508 but they are more or less unrelated.

Thank you for the .plan!

Thank you as well @hashar :) Sounds good.

One follow-up: we did not actually merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/920773 to mask the service. We should have; then the gerrit service would not have come up on gerrit2002 during the initial Puppet run triggered by the cookbook. That is why it was up, which wasn't the original plan. My bad.

But you masked it manually, so.. seems alright.

Change 920773 abandoned by Dzahn:

[operations/puppet@production] gerrit2002: mask gerrit service

Reason:

should have been merged during migration and then reverted after. but was forgotten and now not needed anymore.

https://gerrit.wikimedia.org/r/920773

Declaring this resolved because:

  • both production servers with the gerrit role are on bullseye
  • the former gerrit prod server on buster has lost the gerrit role today
  • decom of gerrit1001 is still in progress but it's not a gerrit server anymore