
decom gerrit1001
Closed, Resolved, Public

Description

This will be the task to decommission gerrit1001 from service. We are switching to gerrit1003 tomorrow.

Then there will be a grace period and then this will become actionable.

Creating it now as part of the overall migration plan, since it is the plan's last checkbox.

Details

Repo                     Branch      Lines +/-
operations/puppet        production  +1 -3
operations/puppet        production  +3 -0
operations/puppet        production  +0 -10
integration/config       master      +1 -1
operations/puppet        production  +1 -1
operations/puppet        production  +6 -11
integration/config       master      +1 -28
integration/config       master      +28 -1
operations/dns           master      +0 -4
operations/homer/public  master      +2 -4
operations/puppet        production  +1 -1
operations/puppet        production  +0 -2
operations/puppet        production  +1 -1
operations/puppet        production  +0 -3
operations/puppet        production  +1 -1
operations/puppet        production  +2 -0
operations/puppet        production  +2 -4
operations/puppet        production  +25 -2
operations/puppet        production  +4 -1

Event Timeline

Dzahn changed the task status from Open to Stalled. May 10 2023, 5:46 PM

Change 919244 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: disable monitoring for gerrit1001

https://gerrit.wikimedia.org/r/919244

Change 919246 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add parameter service_ensure, set to stopped on gerrit1001

https://gerrit.wikimedia.org/r/919246

Change 919246 abandoned by Dzahn:

[operations/puppet@production] gerrit: add parameter service_ensure, set to stopped on gerrit1001

Reason:

replaced in favor of https://gerrit.wikimedia.org/r/c/operations/puppet/+/919359

https://gerrit.wikimedia.org/r/919246

Change 919359 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: allow masking the service and do so on gerrit1001

https://gerrit.wikimedia.org/r/919359

Change 919400 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: remove gerrit1001 as a source host for migrations

https://gerrit.wikimedia.org/r/919400

Change 919401 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: remove gerrit1001 from ssh_allowed hosts and acme_chief

https://gerrit.wikimedia.org/r/919401

Change 919403 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: remove gerrit1001 from .ssh/config

https://gerrit.wikimedia.org/r/919403

Change 919407 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: remove gerrit1001 from gerrit role, rm hiera host data

https://gerrit.wikimedia.org/r/919407

Change 919408 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] openstack: remove old Gerrit IP from cloudgw

https://gerrit.wikimedia.org/r/919408

Dzahn changed the task status from Stalled to Open. May 12 2023, 9:26 PM
LSobanski triaged this task as Medium priority. May 15 2023, 2:08 PM

Change 919359 merged by Dzahn:

[operations/puppet@production] gerrit: allow masking the service and do so on gerrit1001

https://gerrit.wikimedia.org/r/919359

Change 919400 merged by Dzahn:

[operations/puppet@production] gerrit: remove gerrit1001 as a source host for migrations

https://gerrit.wikimedia.org/r/919400

Change 919244 merged by Dzahn:

[operations/puppet@production] gerrit: disable monitoring for gerrit1001

https://gerrit.wikimedia.org/r/919244

Change 919403 merged by Dzahn:

[operations/puppet@production] gerrit: remove gerrit1001 from .ssh/config

https://gerrit.wikimedia.org/r/919403

Change 919401 merged by Dzahn:

[operations/puppet@production] gerrit: remove gerrit1001 from acme_chief, ssh known_hosts and firewall rules

https://gerrit.wikimedia.org/r/919401

Change 920749 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: remove gerrit1001

https://gerrit.wikimedia.org/r/920749

Change 920749 abandoned by Dzahn:

[operations/puppet@production] site: remove gerrit1001

Reason:

should also delete hiera data but is duplicated by https://gerrit.wikimedia.org/r/c/operations/puppet/+/919407

https://gerrit.wikimedia.org/r/920749

Change 919408 merged by Dzahn:

[operations/puppet@production] openstack: remove old Gerrit IP from cloudgw

https://gerrit.wikimedia.org/r/919408

@hashar Could you define which file system paths we need to keep from this host? I am assuming /srv/gerrit and /var/lib/gerrit2, since that is everything we ever copied between hosts.

Now.. I double checked and we have already copied everything under /srv/gerrit from 1001 to 1003. So for example there already is /srv/gerrit/cobalt which is a backup of the previous host and it exists on both 1001 and 1003.

Also we have these 3 backups of git repos there:

4.0K drwxr-xr-x 127 gerrit2 gerrit2 4.0K Oct 22  2019 git.2019-10-22
4.0K drwxr-xr-x 127 gerrit2 gerrit2 4.0K Oct 24  2019 git.2019-10-24
4.0K drwxrwxr-x 127 gerrit2 gerrit2 4.0K Apr  7  2020 git.2020-06-27.qchris.just-before-3.2-upgrade

These also exist on gerrit1003 and together use over 100GB (which is not a problem for disk space.. so far).

So in short.. this is already all copied, and given that, I really don't see a reason to keep gerrit1001 around any longer.

Finally I looked at Bacula: we back up /srv/gerrit/data, /srv/gerrit/git and /srv/gerrit/plugins, but NOT /var/lib/gerrit2 and not the other backups mentioned above.

Seems to me we should back up /var/lib/gerrit2 and the entire /srv/gerrit. That is what we copy for migrations.. so it should also be what we are able to restore in emergencies / the unlikely case both gerrit hosts are gone.
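
The coverage gap described above can be checked mechanically. The following is a minimal sketch, not the actual Bacula or Puppet configuration: the function name `uncovered_paths` and the list variables are hypothetical, and the path lists are copied from this comment. It reports which paths we care about are not covered by any entry of a file set.

```python
from pathlib import PurePosixPath


def uncovered_paths(needed, fileset):
    """Return the needed paths that no file set entry covers.

    A path counts as covered when it equals a file set entry or lives
    below one. Comparison is per path component, so /srv/gerrit/git does
    NOT cover /srv/gerrit/git.2019-10-22.
    """
    def covered(path):
        p = PurePosixPath(path)
        return any(
            p == PurePosixPath(entry) or PurePosixPath(entry) in p.parents
            for entry in fileset
        )

    return [path for path in needed if not covered(path)]


# File set as described above, before any change.
old_fileset = ["/srv/gerrit/data", "/srv/gerrit/git", "/srv/gerrit/plugins"]
# File set as proposed above (an illustration, not the merged patch).
proposed_fileset = ["/srv/gerrit", "/var/lib/gerrit2"]
# A few of the paths mentioned in this task.
needed = [
    "/srv/gerrit/git",
    "/srv/gerrit/git.2019-10-22",
    "/srv/gerrit/cobalt",
    "/var/lib/gerrit2",
]

print(uncovered_paths(needed, old_fileset))
print(uncovered_paths(needed, proposed_fileset))
```

Comparing path components rather than string prefixes matters here, because the old git copies share the prefix /srv/gerrit/git with the directory that was already backed up.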

Change 924608 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit/bacula: adjust Gerrit file paths to be backed up

https://gerrit.wikimedia.org/r/924608

Change 924608 merged by Dzahn:

[operations/puppet@production] gerrit/bacula: adjust Gerrit file paths to be backed up

https://gerrit.wikimedia.org/r/924608

based on T336427#8890714 shell access to gerrit1001 will be removed next week

Dzahn raised the priority of this task from Medium to High. Jun 2 2023, 9:42 PM

Mentioned in SAL (#wikimedia-releng) [2023-06-02T21:44:09Z] <mutante> based on T336427#8890714 pending a response, everything already copied to gerrit1003, and extra paths being added to Bacula... shell access to gerrit1001 will be removed next week

Change 927246 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/homer/public@master] remove old gerrit service IP from static definitions

https://gerrit.wikimedia.org/r/927246

Change 927246 merged by jenkins-bot:

[operations/homer/public@master] remove old gerrit service IP from static definitions

https://gerrit.wikimedia.org/r/927246

Mentioned in SAL (#wikimedia-cloud) [2023-06-05T18:42:46Z] <mutante> - access to old gerrit service IP (gerrit-old.wikimedia.org) for cloud IPs was removed with gerrit:927246 (homer deploy), T336427

Change 927267 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] delete gerrit-old.wikimedia.org

https://gerrit.wikimedia.org/r/927267

Change 927280 had a related patch set uploaded (by Dzahn; author: Dzahn):

[integration/config@master] update gerrit.wikimedia.org IP in dockerfiles/maven-java8/gerrit_ssh_host_key

https://gerrit.wikimedia.org/r/927280

Change 927280 merged by jenkins-bot:

[integration/config@master] Dockerfiles: [maven-java8] Update gerrit.wikimedia.org IP

https://gerrit.wikimedia.org/r/927280

Change 927267 merged by Dzahn:

[operations/dns@master] delete gerrit-old.wikimedia.org

https://gerrit.wikimedia.org/r/927267

Change 928102 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] Revert "Dockerfiles: [maven-java8] Update gerrit.wikimedia.org IP"

https://gerrit.wikimedia.org/r/928102

Change 928102 merged by jenkins-bot:

[integration/config@master] Revert "Dockerfiles: [maven-java8] Update gerrit.wikimedia.org IP"

https://gerrit.wikimedia.org/r/928102

Change 928183 had a related patch set uploaded (by Dzahn; author: Dzahn):

[integration/config@master] remove IPs from gerrit_ssh_host_key file

https://gerrit.wikimedia.org/r/928183

Change 919407 merged by Dzahn:

[operations/puppet@production] site: remove gerrit1001 from gerrit role, rm hiera host data

https://gerrit.wikimedia.org/r/919407

Change 928676 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: fix typo in gerrit1001 role assignment

https://gerrit.wikimedia.org/r/928676

Change 928676 merged by Dzahn:

[operations/puppet@production] site: fix typo in gerrit1001 role assignment

https://gerrit.wikimedia.org/r/928676

double checked: after recently expanding the Bacula backup set, we now have these things in there as well:

git.2019-10-22/
git.2019-10-24/
git.2020-06-27.qchris.just-before-3.2-upgrade/

and the entire /var/lib/gerrit2/ is now also backed up and wasn't before.

additionally the data has been copied to gerrit1003.

following that, shell access on gerrit1001 was revoked today

Change 928183 merged by jenkins-bot:

[integration/config@master] remove IPs from gerrit_ssh_host_key file

https://gerrit.wikimedia.org/r/928183

@hashar Still waiting for a reply here. The data had already been copied to the new server, then I added more to Bacula, so now the entire /srv/gerrit on gerrit1003 is in Bacula and it also includes these old git copies from 2019, like:

4.0K drwxr-xr-x 127 gerrit2 gerrit2 4.0K Oct 22  2019 git.2019-10-22
4.0K drwxr-xr-x 127 gerrit2 gerrit2 4.0K Oct 24  2019 git.2019-10-24
4.0K drwxrwxr-x 127 gerrit2 gerrit2 4.0K Apr  7  2020 git.2020-06-27.qchris.just-before-3.2-upgrade

Are those what you had in mind? If we can delete those one day on gerrit1003, that gives us space back. But the only way we could have lost anything from gerrit1001 is if it was outside both /srv/gerrit and /var/lib/gerrit2.

The old copies of git repositories (git-*), which were taken before a hardware switch and before the Gerrit 3.2 upgrade, are indeed part of what I want to keep. They might hold objects that are currently missing from the canonical repositories, and maybe I can recover those objects from there. It is a cookie to lick eventually.

the only way we could have lost anything from gerrit1001 is if it was outside both /srv/gerrit and /var/lib/gerrit2.

Yes that is what I want to check. Notably the home directories certainly have some handy scripts that should be upstreamed to Puppet.

Yes that is what I want to check. Notably the home directories certainly have some handy scripts that should be upstreamed to Puppet.

If /home directories on gerrit servers are expected to have handy scripts, let's just back up home dirs on gerrit servers.. and let's just copy the home dirs over from the old to the new server during migrations. It would have been easy to add.

Let's not say on the one hand that it's not worth it but on the other hand that it blocks this ticket.

Change 931680 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: backup /home on gerrit servers in Bacula

https://gerrit.wikimedia.org/r/931680

Change 931680 merged by Dzahn:

[operations/puppet@production] gerrit: backup /home on gerrit servers in Bacula

https://gerrit.wikimedia.org/r/931680

Mentioned in SAL (#wikimedia-operations) [2023-06-21T20:30:34Z] <mutante> gerrit1001 (formerly gerrit prod) - creating tarball of entire /home/ in /home/ and copying it over to gerrit1003 - simultaneously adding /home on gerrit servers to bacula from now on - T336427

backups of /home and /root created and they are here now:

[gerrit1003:/home] $ du -hs *.tar.gz
500M	gerrit1001-home-20230621.tar.gz
16K	gerrit1001-root-20230621.tar.gz
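
For future migrations, a tarball named like the ones above could be produced with a small script. This is a sketch only, not the command that was actually run here: the function `archive_dir` and its parameters are hypothetical, and only the `<host>-<label>-<YYYYMMDD>.tar.gz` naming is taken from the listing above.

```python
import tarfile
from datetime import date
from pathlib import Path


def archive_dir(src, dest_dir, host, label, day=None):
    """Pack directory `src` into <dest_dir>/<host>-<label>-<YYYYMMDD>.tar.gz.

    Returns the path of the created archive. `dest_dir` should be outside
    `src`, otherwise the archive would try to include itself.
    """
    day = day or date.today()
    target = Path(dest_dir) / f"{host}-{label}-{day:%Y%m%d}.tar.gz"
    with tarfile.open(target, "w:gz") as tar:
        # Store entries relative to the directory name, e.g. "home/...".
        tar.add(src, arcname=Path(src).name)
    return target


# Hypothetical usage; /srv/tmp is an assumed scratch location:
# archive_dir("/home", "/srv/tmp", "gerrit1001", "home")
```

Writing the archive to a separate destination directory avoids the archive picking up its own partially written file, which can happen when the tarball of a directory is created inside that same directory.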

@hashar @thcipriani See below, those old backups of git data from 2019 and all that.. made it to Bacula now. just fyi. You can feel more confident if we want to delete that from the actual servers some time.

As the last thing before actually destroying the machine I checked in Bacula, bconsole, if I can actually see all the things in /srv/gerrit (and not just /srv/gerrit/git like before!) now. And I can!

cwd is: /srv/gerrit/
$ ls
All-Users/
All-Users-2020-03-20.git/
T236443/
analytics-wmde-wd-wd_identifiedlandscape.git.2019-10-24/
cobalt/
data/
git/
git.2019-10-22/
git.2019-10-24/
git.2020-06-27.qchris.just-before-3.2-upgrade/
plugins/
replication/
wikimedia-fundraising-crm.2019-10-24.git/
$

The procedure to get there:

  • ssh backup1001
  • sudo bconsole
  • type "restore"
  • 5: Select the most recent backup for a client
  • select gerrit1001 from client list (client 131)
  • select file set "gerrit-repo-data" (1)
  • wait for it to build directory tree for a couple seconds
  • ls / cd to look around
  • use "mark" to select files to restore
  • type "done"
  • follow the rest of the wizard.. asking you where to restore it to.. etc
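
Some of the wizard answers above can be given up front, since the bconsole `restore` command accepts keywords such as `client=`, `fileset=`, `where=`, `current` and `select`. The sketch below only builds such a command line as a string and does not run anything. The function `restore_command` is hypothetical, the client name must match whatever the client list in bconsole shows (assumed here to be "gerrit1001" as in the steps above), and the restore target directory is made up for illustration.

```python
def restore_command(client, fileset, where):
    """Build a bconsole 'restore' line for the most recent backup of a client.

    'current' selects the most recent backup and 'select' opens the file
    selection tree, so ls / cd / mark / done still happen interactively.
    """
    for value in (client, fileset, where):
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"unsupported value for bconsole keyword: {value!r}")
    return f"restore client={client} fileset={fileset} where={where} current select"


# Client and file set names as seen in the steps above; the 'where'
# directory is an assumption, not a path used in this task.
print(restore_command("gerrit1001", "gerrit-repo-data", "/var/tmp/bacula-restores"))
```

The resulting line would be typed at the bconsole prompt instead of the first few menu choices; the rest of the wizard still asks for confirmation before anything is restored.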

I then found that we had even more files from the old-old Gerrit server "cobalt.wikimedia.org" in /srv/ but not in /srv/gerrit/ (which is backed up), even though there was also already /srv/gerrit/cobalt/git.

I made tarballs out of 3 large dirs and moved all of them under /srv/gerrit/cobalt, then rsynced it to gerrit1003:/srv/gerrit/cobalt/. So now:

4.0K drwxrwxr-x 127 gerrit2 gerrit2 4.0K Oct 11  2019 git
973M -rw-r--r--   1 root    root    973M Jun 21 22:00 home-cobalt.wikimedia.org.tar.gz
5.6G -rw-r--r--   1 root    root    5.6G Jun 21 22:04 srv-cobalt.wikimedia.org.tar.gz
2.1G -rw-r--r--   1 root    root    2.1G Jun 21 22:08 var-lib-gerrit2-cobalt.wikimedia.org.tar.gz
root@gerrit1003:/srv/gerrit/cobalt#
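
Before destroying the source host, it can be worth confirming that the tarballs arrived intact. The following is a sketch of one way to do that, assuming a checksum manifest is generated on each host and the two are then compared; it is not something that was run as part of this task, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(directory, pattern="*.tar.gz"):
    """Map file name to SHA-256 digest for matching files in a directory."""
    return {p.name: sha256_of(p) for p in sorted(Path(directory).glob(pattern))}


def mismatches(source_manifest, dest_manifest):
    """Names from the source that are missing or differ at the destination."""
    return sorted(
        name
        for name, digest in source_manifest.items()
        if dest_manifest.get(name) != digest
    )


# Hypothetical usage: run manifest("/srv/gerrit/cobalt") on both hosts,
# then compare the two dictionaries; an empty list means nothing differs.
```

Reading in chunks keeps memory use flat even for multi-gigabyte archives like the 5.6G tarball listed above.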

Mentioned in SAL (#wikimedia-operations) [2023-06-21T22:16:17Z] <mutante> destroying previous production gerrit server gerrit1001 - T336427

Disable and reset vlan on asw2-b8-eqiad:ge-8/0/29 for local eno1
Delete IP 208.80.154.136/26 on eno1
Delete IP 2620:0:861:2:208:80:154:136/64 on eno1
Unset DNS name for IP 10.65.3.102/16 on mgmt
[Netbox] Set status to Decommissioning, deleted all non-mgmt IPs,  updated switch interfaces (disabled, removed vlans, etc)
----- OUTPUT of 'configure exclus...re;rollback;exit' -----
Entering configuration mode
[edit interfaces ge-8/0/29]
-   description gerrit1001;
+   description DISABLED;
+   disable;

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: gerrit1001.wikimedia.org

  • gerrit1001.wikimedia.org (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 932021 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: remove gerrit1001 from site and gerrit2001 hiera data

https://gerrit.wikimedia.org/r/932021

Change 932021 merged by Dzahn:

[operations/puppet@production] gerrit: remove gerrit1001 from site and gerrit2001 hiera data

https://gerrit.wikimedia.org/r/932021

All our steps are done. We are now supposed to give the actual hardware back to dcops. Also I am giving the public IP back. So continued in T340077 from here.

Change 931714 had a related patch set uploaded (by Dzahn; author: Jcrespo):

[operations/puppet@production] gerrit: use default job defaults for home dir backup

https://gerrit.wikimedia.org/r/931714

Change 931714 merged by Dzahn:

[operations/puppet@production] gerrit: use default job defaults for home dir backup

https://gerrit.wikimedia.org/r/931714