gerrit1003 service implementation task
Closed, Resolved; Public

Description

At the request of serviceops all racking tasks for new hardware also have a sub-task for service ops implementation tracking.

Once parent task T326366 shows resolved, this can proceed via the service ops team.


topic branch with related patches: https://gerrit.wikimedia.org/r/q/topic:gerrit-bullseye


1# schedule and announce downtime
2# on gerrit1001: shortly before the scheduled downtime:
3# on gerrit1001, as root, in a screen: rsync -avp --delete --bwlimit=100m /var/lib/gerrit2/review_site/ rsync://gerrit1003.wikimedia.org/gerrit-var-lib/
4# on gerrit1001, as root, in a screen: rsync -avp --delete --bwlimit=100m /srv/gerrit/ rsync://gerrit1003.wikimedia.org/gerrit-data/
5# on gerrit1003: rsync -avp /srv/gerrit/plugins/lfs/ /srv/gerrit/data/lfs/
6# on gerrit1003: chown -R gerrit2:gerrit2 /var/lib/gerrit2
7# on gerrit1003: chown -R gerrit2:gerrit2 /srv/gerrit
8# scheduled downtime begins / IRC announcement
9# on cumin1001: sudo cookbook sre.hosts.downtime -r 'maintenance' -D 30 gerrit1001.wikimedia.org
10# on cumin1001: sudo cookbook sre.hosts.downtime -r 'maintenance' -H 1 gerrit1003.wikimedia.org
11# on icinga.wikimedia.org - manually schedule downtime for the checks connected to virtual server "gerrit.wikimedia.org". The cookbook does not find this virtual host.
12# on gerrit1003: disable puppet; stop gerrit? (sudo disable-puppet 'gerrit maintenance'; systemctl stop gerrit)
13# merge DNS change that removes gerrit-new and switches IP of gerrit.wikimedia.org - in web UI of gerrit(-old)
14# run authdns-update on ns0.wikimedia.org, see the diff but do NOT commit yet
15# on gerrit1001: disable puppet; stop gerrit! (sudo disable-puppet 'gerrit maintenance'; systemctl stop gerrit)
16# on gerrit1001, as root, in a screen: rsync -avp --delete --bwlimit=100m /var/lib/gerrit2/review_site/ rsync://gerrit1003.wikimedia.org/gerrit-var-lib/
17# on gerrit1001, as root, in a screen: rsync -avp --delete --bwlimit=100m /srv/gerrit/ rsync://gerrit1003.wikimedia.org/gerrit-data/
18# on gerrit1003: rsync -avp /srv/gerrit/plugins/lfs/ /srv/gerrit/data/lfs/
19# on gerrit1003: chown -R gerrit2:gerrit2 /var/lib/gerrit2
20# on gerrit1003: chown -R gerrit2:gerrit2 /srv/gerrit
21# on gerrit1003: start gerrit
22# say "yes" to authdns-update and actually merge DNS change that removes gerrit-new and switches IP of gerrit.wikimedia.org
23# wait 5 minutes
24# ..test https (https://gerrit.wikimedia.org in browser)
25# ..test ssh (e.g. ssh dzahn@gerrit-new.wikimedia.org -p 29418)
26# announce downtime is over
27# ensure gerrit1001 has puppet disabled and/or services are masked
28# grace period (how long?)
29# decom old host -> https://phabricator.wikimedia.org/T336427
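
The plan runs the same two rsync commands twice: once as a warm-up shortly before the downtime, and once more after gerrit has been stopped on gerrit1001. A minimal sketch of wrapping them so they can be previewed first is shown below. The DRY_RUN variable and the run and sync_to_new_host function names are hypothetical and not part of the plan; only the rsync commands, paths and host name are taken from the steps above.

```shell
#!/usr/bin/env bash
# Sketch only: prints the sync commands from the plan unless DRY_RUN=0 is set.
# DRY_RUN, run() and sync_to_new_host() are illustrative names (assumptions).
DRY_RUN="${DRY_RUN:-1}"
DEST="gerrit1003.wikimedia.org"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        # Preview mode: show the command instead of executing it.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

sync_to_new_host() {
    run rsync -avp --delete --bwlimit=100m /var/lib/gerrit2/review_site/ "rsync://${DEST}/gerrit-var-lib/"
    run rsync -avp --delete --bwlimit=100m /srv/gerrit/ "rsync://${DEST}/gerrit-data/"
}

sync_to_new_host
```

Run as-is, this only prints the two commands. Running it with DRY_RUN=0 on gerrit1001, as root and inside a screen, would execute them. The chown steps are left out on purpose because they run on gerrit1003, not on the source host.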


https://gerrit-new.wikimedia.org/r/

Details

Repo                     Branch      Lines +/-
integration/config       master      +1 -28
integration/config       master      +28 -1
integration/config       master      +6 -6
operations/puppet        production  +1 -1
operations/puppet        production  +1 -1
operations/puppet        production  +2 -0
operations/puppet        production  +25 -2
operations/puppet        production  +1 -0
operations/puppet        production  +4 -1
operations/puppet        production  +1 -0
operations/puppet        production  +4 -0
operations/homer/public  master      +2 -0
operations/puppet        production  +2 -2
operations/dns           master      +8 -8
operations/dns           master      +8 -8
operations/dns           master      +2 -2
operations/puppet        production  +1 -0
operations/puppet        production  +11 -3
operations/puppet        production  +1 -6
operations/puppet        production  +4 -0
operations/puppet        production  +2 -0
operations/puppet        production  +2 -2
operations/puppet        production  +1 -0
operations/puppet        production  +10 -10
operations/puppet        production  +6 -0
operations/puppet        production  +14 -15

Event Timeline

Change 909792 merged by Cwhite:

[operations/puppet@production] logstash: replace gerrit1001 with gerrit1003 in tests

https://gerrit.wikimedia.org/r/909792

Change 909796 merged by Dzahn:

[operations/puppet@production] gerrit: add host-based Hiera keys for gerrit1003

https://gerrit.wikimedia.org/r/909796

Change 910049 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add gerrit prod role to gerrit1003

https://gerrit.wikimedia.org/r/910049

Change 910064 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add gerrit1003 to rsync dest hosts when using prod role

https://gerrit.wikimedia.org/r/910064

Change 910064 merged by Dzahn:

[operations/puppet@production] gerrit: add gerrit1003 to rsync dest hosts when using prod role

https://gerrit.wikimedia.org/r/910064

Change 909791 merged by Dzahn:

[operations/puppet@production] replace gerrit1001 with gerrit1003 as ping target for blackbox smoke

https://gerrit.wikimedia.org/r/909791

Change 909790 merged by Dzahn:

[operations/puppet@production] acme_chief/gerrit certs: add gerrit1003 to hosts and gerrit-new to SNI

https://gerrit.wikimedia.org/r/909790

Change 909795 merged by Dzahn:

[operations/puppet@production] cloudgw: allow VMs to speak to new gerrit server (gerrit1003)

https://gerrit.wikimedia.org/r/909795

Change 910049 merged by Dzahn:

[operations/puppet@production] site: add gerrit prod role to gerrit1003

https://gerrit.wikimedia.org/r/910049

Mentioned in SAL (#wikimedia-operations) [2023-04-25T21:06:55Z] <mutante> adding production gerrit role to new machine gerrit1003 - monitoring downtimed - but it has a service IP that is going to be added by this and cant be downtimed ? (Bug: T326368)

Mentioned in SAL (#wikimedia-operations) [2023-04-25T21:14:07Z] <mutante> gerrit1003 - manually replacing deploy2002 with deploy1002 in /srv/deployment/gerrit/gerrit-cache/.config to fix initial scap deployment T257317 T326368

Mentioned in SAL (#wikimedia-operations) [2023-04-25T21:17:02Z] <mutante> gerrit1003 - mv /srv/gerrit/plugins/lfs /srv/gerrit/data/ T333143 T326368

Mentioned in SAL (#wikimedia-operations) [2023-04-25T21:19:20Z] <mutante> gerrit1003 - chown -R gerrit2:gerrit2 /srv/gerrit T333143 T326368

Mentioned in SAL (#wikimedia-releng) [2023-04-25T21:23:23Z] <mutante> gerrit1003 - sudo -u gerrit2 /usr/bin/scap deploy-local --repo gerrit/gerrit -D log_json:False (manually it works, but that's the same command that puppet is supposed to run !?) - T257317 T326368

Dzahn updated the task description.

Mentioned in SAL (#wikimedia-operations) [2023-04-25T21:40:30Z] <mutante> gerrit1003 - chown -R gerrit2:gerrit2 /var/lib/gerrit2/review_site/ - T326368

Change 911941 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: make configurable whether service is running

https://gerrit.wikimedia.org/r/911941

Change 911941 abandoned by Dzahn:

[operations/puppet@production] gerrit: make configurable whether service is running

Reason:

not needed

https://gerrit.wikimedia.org/r/911941

Change 914021 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add gerrit1003 to gerrit ssh_allowed hosts

https://gerrit.wikimedia.org/r/914021

Change 914021 merged by Dzahn:

[operations/puppet@production] gerrit: add gerrit1003 to gerrit ssh_allowed hosts

https://gerrit.wikimedia.org/r/914021

I wasn't aware of this task until yesterday (via T335730).

I'd like the new host to be added first as a replica rather than as an entirely new primary Gerrit server, and then do a switchover of the service from the current primary gerrit1001 to the new gerrit1003.

Notably, I would like to avoid carrying state from one host to another, since last time it caused multiple issues (we had obsolete leftover files that never got garbage collected by Puppet, and the caches filled the disk after a decade of filling up). In theory the replica can be used as a primary as is, albeit with cold caches at first, but I don't think that will cause problems in practice. Or we can do that for the gerrit2002 replacement?

Change 916639 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] gerrit: switch service name, turn new into current and current into old

https://gerrit.wikimedia.org/r/916639

Change 916637 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] lower TTL for gerrit.wikimedia.org reverse lookups

https://gerrit.wikimedia.org/r/916637

I'd like the new host to be added first as a replica rather than as an entirely new primary Gerrit server, and then do a switchover of the service from the current primary gerrit1001 to the new gerrit1003.

Are you suggesting to ONLY let Gerrit replicate and do no rsync at all, or to first let Gerrit replicate and then still rsync afterwards regardless? If the former, it would be different from how we did it last time; but if the latter, wouldn't we copy those unwanted files anyway?

Change 916637 merged by Dzahn:

[operations/dns@master] lower TTL for gerrit.wikimedia.org reverse lookups

https://gerrit.wikimedia.org/r/916637

Change 916639 merged by Dzahn:

[operations/dns@master] gerrit: switch service IP, turn new into current and current into old

https://gerrit.wikimedia.org/r/916639

Upstream released a security update of Gerrit yesterday (3.5.6). I thus upgraded gerrit1001 and gerrit2002 to the new version this morning, shortly after the train T336339.

I did not upgrade gerrit1003 since it was not in the dsh group of operations/puppet. That prompted me to create a new gerrit dsh group fully managed by Puppet ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/918481 ) and to switch the Gerrit deployment repo to use that file ( https://gerrit.wikimedia.org/r/c/operations/software/gerrit/+/918482 ). So now scap deploy runs on all three hosts.

I checked gerrit1003, which is publicly reachable via gerrit-new.wikimedia.org, but I can't ssh into it. It turns out it does not have any of the git repositories under /srv/gerrit/git and thus lacks any projects and users. I don't think I can do the Gerrit upgrade without them; we would need to rsync the git repositories from the primary gerrit1001 to gerrit1003 (they are in /srv/gerrit/git).
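
A quick way to confirm whether the repositories are present is to count them on disk. The sketch below assumes the usual Gerrit layout in which each project is a bare repository directory whose name ends in .git. The count_repos helper is a hypothetical name; /srv/gerrit/git is the path mentioned in the comment above.

```shell
#!/usr/bin/env bash
# Sketch: count bare repositories (directories named *.git) below a git directory.
# count_repos() is a hypothetical helper; the *.git naming is an assumption.
count_repos() {
    # -prune stops find from descending into a repository once it has matched.
    find "$1" -type d -name '*.git' -prune -print 2>/dev/null | wc -l | tr -d ' '
}

repos="$(count_repos /srv/gerrit/git)"
if [ "$repos" -eq 0 ]; then
    echo "no repositories found under /srv/gerrit/git" >&2
else
    echo "found $repos repositories under /srv/gerrit/git"
fi
```

Comparing the number printed on gerrit1001 and on gerrit1003 gives a coarse sanity check after a sync. It does not verify repository contents.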

I made several edits to the migration plan doc that is transcluded here from https://phabricator.wikimedia.org/P47782. See my comments there for details.

The main things are:

  • changed the way we deploy the DNS change to avoid the problem that you can't merge in Gerrit when Gerrit is down :)
  • raised bwlimit to 10-20x the previous value after confirming we really have gigabit
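
For a rough sense of what the bwlimit change means for the downtime window: with rsync 3.1 or later, --bwlimit=100m should be read as about 100 MiB/s, which is close to what a gigabit link can carry (1 Gbit/s is 125 MB/s before protocol overhead). The sketch below does the arithmetic. The transfer_minutes helper is hypothetical, and the 500 GiB data size and the 5 and 10 MiB/s comparison rates are made-up example values, since the real numbers are not stated in this task.

```shell
#!/usr/bin/env bash
# Sketch: estimate transfer time in minutes for a data size (GiB) at a rate (MiB/s).
# transfer_minutes() and the example sizes and rates are illustrative assumptions.
transfer_minutes() {
    # $1 = size in GiB, $2 = rate in MiB/s
    LC_ALL=C awk -v size_gib="$1" -v rate_mib="$2" \
        'BEGIN { printf "%.1f\n", (size_gib * 1024 / rate_mib) / 60 }'
}

for rate in 5 10 100; do
    echo "500 GiB at ${rate} MiB/s: $(transfer_minutes 500 "$rate") minutes"
done
```

The estimate covers a full copy only. The second rsync pass in the plan transfers just the files that changed since the first pass, so it should take much less time.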

And some details like how long we are planning to keep the old host around, a decom ticket etc.

..but can't ssh into it. Turns out it does not have any of the git repositories under /srv/gerrit/git and thus lack any projects and users. I don't think I can do the Gerrit upgrade without them, we would need to rsync..

ACK! This is in progress. I am currently working on fixing this. We had already (slowly, over 4 days originally) synced data and things worked but then today data got deleted again during our migration window. I am currently syncing again to get it back in order.

To speed things up so we can deploy asap (and before our new window tomorrow), I am syncing with bwlimit 100m right now. I will update you shortly.

@hashar @thcipriani status update:

I have freshly:

  • rsynced /srv/gerrit
  • rsynced /var/lib/gerrit2
  • for the lfs path change: copied /srv/gerrit/plugins/lfs to /srv/gerrit/data/lfs on gerrit1003; I did not "mv", just "cp" for now, so the data is currently in both locations
  • ensured it's all owned gerrit2:gerrit2
  • restarted gerrit service
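
The ownership step in the list above can be double-checked by listing anything that is not owned by the expected user and group. This is a minimal sketch; not_owned_by is a hypothetical helper name, while the paths and the gerrit2 user and group come from the steps above.

```shell
#!/usr/bin/env bash
# Sketch: print paths under a directory that are NOT owned by the expected user:group.
# not_owned_by() is an illustrative helper, not an existing tool.
not_owned_by() {
    # $1 = directory, $2 = expected user (name or numeric ID), $3 = expected group
    find "$1" \( ! -user "$2" -o ! -group "$3" \) -print
}

# Only runs on a host that has the directory and the gerrit2 user.
if [ -d /srv/gerrit ] && id gerrit2 >/dev/null 2>&1; then
    not_owned_by /srv/gerrit gerrit2 gerrit2 | head -n 20
fi
```

Empty output means every path matched. The same call with /var/lib/gerrit2 covers the second directory.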

Now I would like you to try deploying again.

The thing is, currently I see no repos in the gerrit-new web UI at all, even though everything was synced and it did show them before?!

Just realized one more thing we have to remember / add to the plan. After we switch we must activate replication from gerrit1003 and deactivate it from gerrit1001 (re: that recent issue where both gerrit1001 and gerrit1003 were replicating to gerrit2002 and we had to disable it on gerrit1003).

Change 918529 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] Revert "Revert "gerrit: switch service IP, turn new into current and current into old""

https://gerrit.wikimedia.org/r/918529

Change 918589 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: enable replication from gerrit1003, disable from gerrit1001

https://gerrit.wikimedia.org/r/918589

steps we have to add:

Change 918529 merged by Dzahn:

[operations/dns@master] Revert "Revert "gerrit: switch service IP, turn new into current and current into old""

https://gerrit.wikimedia.org/r/918529

Change 918589 merged by Dzahn:

[operations/puppet@production] gerrit: enable replication from gerrit1003, disable from gerrit1001

https://gerrit.wikimedia.org/r/918589

Change 919151 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] definition/static.net: add Gerrit switchover IPs

https://gerrit.wikimedia.org/r/919151

Change 919151 merged by Ssingh:

[operations/homer/public@master] definition/static.net: add Gerrit switchover IPs

https://gerrit.wikimedia.org/r/919151

This has now happened. gerrit.wikimedia.org is now on new hardware, a new IP and a new distro version.
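
After a switch like this, one way to confirm that clients see the new address is to poll DNS until the service name resolves to the expected IP. The sketch below is one possible approach, not what was actually run. The resolve and wait_for_ip function names are hypothetical, dig is assumed to be installed, and 192.0.2.10 is a documentation placeholder because the real service IP is not given in this task.

```shell
#!/usr/bin/env bash
# Sketch: poll DNS until a name resolves to an expected IPv4 address.
# resolve() and wait_for_ip() are illustrative names; dig is assumed to be available.
resolve() {
    dig +short "$1" A
}

wait_for_ip() {
    # $1 = name, $2 = expected IP, $3 = number of tries, $4 = seconds between tries
    local name="$1" expected="$2" tries="${3:-30}" delay="${4:-10}" i=0
    while [ "$i" -lt "$tries" ]; do
        # -x matches whole lines only, -F treats the IP as a fixed string.
        if resolve "$name" | grep -qxF "$expected"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example call (placeholder IP, replace with the real service address):
# wait_for_ip gerrit.wikimedia.org 192.0.2.10 30 10 && echo "DNS switched"
```

With the defaults of 30 tries and 10 seconds this waits up to five minutes, which matches the wait in the plan. Keeping the lookup in its own function makes it easy to swap dig for another resolver command.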

Change 919226 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] acme_chief: add gerrit-old to list of allowed SANs from gerrit hosts

https://gerrit.wikimedia.org/r/919226

Change 919226 merged by Dzahn:

[operations/puppet@production] acme_chief: add gerrit-old to list of allowed SANs from gerrit hosts

https://gerrit.wikimedia.org/r/919226

Change 919244 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: disable monitoring for gerrit1001

https://gerrit.wikimedia.org/r/919244

Change 919246 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add parameter service_ensure, set to stopped on gerrit1001

https://gerrit.wikimedia.org/r/919246

Service is implemented on gerrit1003.

As of today, it is the production server behind gerrit.wikimedia.org.

A few follow-ups will be part of decom'ing the old machine (T336427).

Change 919246 abandoned by Dzahn:

[operations/puppet@production] gerrit: add parameter service_ensure, set to stopped on gerrit1001

Reason:

replaced in favor of https://gerrit.wikimedia.org/r/c/operations/puppet/+/919359

https://gerrit.wikimedia.org/r/919246

Change 919359 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: allow masking the service and do so on gerrit1001

https://gerrit.wikimedia.org/r/919359

Change 919402 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add gerrit1003 to hosts using KexAlgo ecdh-sha2-nistp521 for ssh

https://gerrit.wikimedia.org/r/919402

Mentioned in SAL (#wikimedia-operations) [2023-05-12T20:08:07Z] <mutante> gerrit1001 - systemctl mask gerrit T326368

Change 919405 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add gerrit1003 SSH host key known_hosts

https://gerrit.wikimedia.org/r/919405

Dzahn changed the status of subtask T336427: decom gerrit1001 from Stalled to Open. May 12 2023, 9:26 PM

Change 919405 merged by Dzahn:

[operations/puppet@production] gerrit: add gerrit1003 SSH host key known_hosts

https://gerrit.wikimedia.org/r/919405

Change 919359 merged by Dzahn:

[operations/puppet@production] gerrit: allow masking the service and do so on gerrit1001

https://gerrit.wikimedia.org/r/919359

Change 919244 merged by Dzahn:

[operations/puppet@production] gerrit: disable monitoring for gerrit1001

https://gerrit.wikimedia.org/r/919244

Change 919402 merged by Dzahn:

[operations/puppet@production] gerrit: add gerrit1003 to hosts using KexAlgo ecdh-sha2-nistp521 for ssh

https://gerrit.wikimedia.org/r/919402

Change 924608 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit/bacula: adjust Gerrit file paths to be backed up

https://gerrit.wikimedia.org/r/924608

Change 924608 merged by Dzahn:

[operations/puppet@production] gerrit/bacula: adjust Gerrit file paths to be backed up

https://gerrit.wikimedia.org/r/924608

Change 927280 had a related patch set uploaded (by Dzahn; author: Dzahn):

[integration/config@master] update gerrit.wikimedia.org IP in dockerfiles/maven-java8/gerrit_ssh_host_key

https://gerrit.wikimedia.org/r/927280

Change 927280 merged by jenkins-bot:

[integration/config@master] Dockerfiles: [maven-java8] Update gerrit.wikimedia.org IP

https://gerrit.wikimedia.org/r/927280

Mentioned in SAL (#wikimedia-releng) [2023-06-05T22:21:59Z] <James_F> Dockerfiles: [maven-java8] Update gerrit.wikimedia.org IP for T326368

Change 927281 had a related patch set uploaded (by Jforrester; author: Jforrester):

[integration/config@master] jjb: Update maven-java8-based jobs to images with new gerrit IP

https://gerrit.wikimedia.org/r/927281

Change 927281 merged by jenkins-bot:

[integration/config@master] jjb: Update maven-java8-based jobs to images with new gerrit IP

https://gerrit.wikimedia.org/r/927281

Change 928102 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] Revert "Dockerfiles: [maven-java8] Update gerrit.wikimedia.org IP"

https://gerrit.wikimedia.org/r/928102

Change 928102 merged by jenkins-bot:

[integration/config@master] Revert "Dockerfiles: [maven-java8] Update gerrit.wikimedia.org IP"

https://gerrit.wikimedia.org/r/928102