
replacement for gerrit2001, decom gerrit2001
Closed, Resolved, Public

Description

gerrit2001 is out of warranty

https://netbox.wikimedia.org/dcim/devices/315/

Papaul points out that if something happens to it, it would take a while to fix, if it can be fixed at all.

Also we don't keep spares around anymore.

That, combined with the policy to keep things around after the warranty expires, seems suboptimal.

Can we get new hardware for it, or have general spares around again, please?

<time passes> <hardware request was granted, new hardware was delivered> T299081 <hardware was racked and got an OS> T299575 <time passes>

The new hardware has been installed and is named gerrit2002.

At this time we have these Gerrit hosts:

Host        Role                  Hardware status
gerrit1001  Gerrit Primary        OK
gerrit2001  Gerrit Replica        Out of warranty
gerrit2002  Gerrit Replica (new)  New

Details

Related Changes in Gerrit:
Repo                        Branch                 Lines +/-
operations/puppet           production             +0 -1
operations/puppet           production             +0 -10
operations/software/gerrit  deploy/wmf/stable-3.4  +0 -1
operations/puppet           production             +1 -4
operations/puppet           production             +0 -5
operations/puppet           production             +0 -10
operations/dns              master                 +4 -8
operations/puppet           production             +0 -10
operations/puppet           production             +2 -1
operations/puppet           production             +11 -0
operations/puppet           production             +1 -0
operations/software/gerrit  deploy/wmf/stable-3.4  +1 -0
operations/puppet           production             +5 -0
operations/dns              master                 +4 -0
operations/puppet           production             +15 -3
operations/puppet           production             +1 -0
operations/puppet           production             +1 -0
operations/puppet           production             +1 -0
operations/puppet           production             +4 -0
operations/dns              master                 +4 -1

Event Timeline

Dzahn triaged this task as Medium priority. Jan 16 2020, 9:53 PM
Dzahn updated the task description.
Dzahn added a project: Gerrit.

@Dzahn did this ever happen? Or should we maybe ping specific people to make it happen?

@Marostegui No, it's still open. I'll follow up on it soon.

@LSobanski @akosiaris This one is for consideration in a manager meeting with releng.

Dzahn added a subtask: Unknown Object (Task). Jan 13 2022, 8:59 PM

There is now a new procurement ticket: T299081

Papaul closed subtask Unknown Object (Task) as Resolved. Feb 15 2022, 6:40 PM

gerrit2001.wikimedia.org is a replica and can also be used as a spare to switch the primary service to. It also serves repos over gerrit-replica.wikimedia.org, which is used by various scripts and robots that mass-pull from all repositories for various use cases.

To replace the server, we can add the new one as a second replica and have the repositories replicated there (this should take a few hours at most). Then we can switch the DNS entry for gerrit-replica.wikimedia.org to the new host and, once validated, unconfigure gerrit2001 and decommission it.

The migration would require two public IPv4 addresses (one for the server, one for the Gerrit service).
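The "once validated" step above could be approached by comparing the refs both replicas advertise for a repository. The following is only a rough sketch under assumptions: the task does not say how validation was done, the helper names are made up for illustration, and the URLs passed to `ls_remote` would be placeholders for the old and new replica. It relies on `git ls-remote`, which prints one `<sha><TAB><ref>` pair per line.

```python
import subprocess


def parse_ls_remote(output: str) -> dict[str, str]:
    """Turn `git ls-remote` output ("<sha>\t<ref>" per line) into {ref: sha}."""
    refs = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        sha, ref = line.split("\t", 1)
        refs[ref] = sha
    return refs


def diff_refs(old: dict[str, str], new: dict[str, str]) -> dict[str, tuple]:
    """Return refs that are missing on one side or point at different commits."""
    return {
        ref: (old.get(ref), new.get(ref))
        for ref in sorted(old.keys() | new.keys())
        if old.get(ref) != new.get(ref)
    }


def ls_remote(url: str) -> dict[str, str]:
    """Fetch the refs a remote advertises (needs git and network access)."""
    result = subprocess.run(
        ["git", "ls-remote", url], check=True, capture_output=True, text=True
    )
    return parse_ls_remote(result.stdout)
```

An empty result from `diff_refs(ls_remote(OLD_URL), ls_remote(NEW_URL))` would suggest the two replicas agree for that repository; on a busy repository a small, short-lived difference could simply mean replication is still catching up.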

@brennen , @thcipriani and @dancy also have some knowledge about our Gerrit setup.

TLDR: replacing the server should not be too complicated.

Change 792281 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] add gerrit-replica-new secondary/service IPs

https://gerrit.wikimedia.org/r/792281

To replace the server, we can add the new one as a second replica and have the repositories replicated there (this should take a few hours at most). Then we can switch the DNS entry for gerrit-replica.wikimedia.org to the new host and, once validated, unconfigure gerrit2001 and decommission it.

The migration would require two public IPv4 addresses (one for the server, one for the Gerrit service).

I tried to add the new DNS name gerrit-replica-new.wikimedia.org, which nowadays means following a process similar but not identical to https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_to_manually_allocate_a_special_purpose_IP_address_in_Netbox

Unfortunately, it turned out that these are special cases: they cannot be added manually in DNS alone, nor in Netbox alone, and IPs can only migrate within the same row. gerrit2002 was not installed in the same row as gerrit2001.

So for this plan to work, it seems we have to ask dcops to move the physical host.
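One way to picture the row constraint is as a subnet membership check. This is a sketch under an assumption the task does not state, namely that each row has its own subnet; the row names and the RFC 5737 documentation ranges below are purely hypothetical and are not the real allocations, which live in Netbox.

```python
import ipaddress

# Hypothetical row-to-subnet mapping (RFC 5737 documentation addresses).
ROW_SUBNETS = {
    "row-a": ipaddress.ip_network("192.0.2.0/26"),
    "row-b": ipaddress.ip_network("192.0.2.64/26"),
}


def ip_usable_in_row(ip: str, row: str) -> bool:
    """Return True if the address falls inside the subnet assigned to the row."""
    return ipaddress.ip_address(ip) in ROW_SUBNETS[row]


def rows_for_ip(ip: str) -> list[str]:
    """List every row whose subnet contains the address."""
    addr = ipaddress.ip_address(ip)
    return [row for row, net in ROW_SUBNETS.items() if addr in net]
```

Under that assumption, a service IP allocated in the old host's row cannot follow the service to a host racked in a different row, which matches the outcome described above: either the host moves, or a new IP is allocated.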

Change 792281 abandoned by Dzahn:

[operations/dns@master] add gerrit-replica-new secondary/service IPs

Reason:

gerrit2002 is in the wrong row

https://gerrit.wikimedia.org/r/792281

reopened T299575 to move the host around

Dzahn changed the task status from Open to In Progress. Jul 19 2022, 3:27 AM
Dzahn raised the priority of this task from Medium to High.

Change 815395 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] add gerrit-replica-new.wikimedia.org, point to 208.80.153.109

https://gerrit.wikimedia.org/r/815395

Change 815396 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add gerrit role and hiera settings for replica to gerrit2002

https://gerrit.wikimedia.org/r/815396

Change 815397 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] acme_chief: add gerrit2002 to hosts allowed to fetch TLS certs

https://gerrit.wikimedia.org/r/815397

Change 815398 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add gerrit2002 to firewall rules for cluster support

https://gerrit.wikimedia.org/r/815398

Change 815400 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add gerrit2002 to puppetized known_hosts file

https://gerrit.wikimedia.org/r/815400

Change 815401 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add hiera data for a second replica

https://gerrit.wikimedia.org/r/815401

Change 815402 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] admin/gerrit: add gerrit shell admins on gerrit2002

https://gerrit.wikimedia.org/r/815402

Change 815402 merged by Vgutierrez:

[operations/puppet@production] admin/gerrit: add gerrit shell admins on gerrit2002

https://gerrit.wikimedia.org/r/815402

Change 815397 merged by Dzahn:

[operations/puppet@production] acme_chief: add gerrit2002 to hosts allowed to fetch TLS certs

https://gerrit.wikimedia.org/r/815397

Change 815398 merged by Dzahn:

[operations/puppet@production] gerrit: add gerrit2002 to firewall rules for cluster support

https://gerrit.wikimedia.org/r/815398

Change 815400 merged by Dzahn:

[operations/puppet@production] gerrit: add gerrit2002 to puppetized known_hosts file

https://gerrit.wikimedia.org/r/815400

Change 817841 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: add gerrit2002 to list of migration dest hosts

https://gerrit.wikimedia.org/r/817841

Change 817841 merged by Dzahn:

[operations/puppet@production] gerrit: turn gerrit2002 into a gerrit migration dest host

https://gerrit.wikimedia.org/r/817841

On gerrit2002 we now have the following, created by the migration class:

  • a group "gerrit2"
  • a user "gerrit2"
  • a directory /srv/gerrit
  • package rsync installed, /etc/default/rsync configured
  • 3 rsync fragments created (frag-gerrit-data, frag-gerrit-home, frag-gerrit-var-lib) that allow syncing specific paths
  • /srv/home-gerrit1001.wikimedia.org/ created
  • host added to gerrit "contacts"

The 3 rsync modules (config fragments) are:

  • gerrit-data: path: /srv/gerrit
  • gerrit-home: path: /srv/home-gerrit1001.wikimedia.org
  • gerrit-var-lib: path: /var/lib/gerrit2/review_site

The hosts_allow line is gerrit1001.wikimedia.org localhost for all of them.
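To make the shape of those three modules concrete, here is a small sketch that renders rsyncd.conf-style stanzas from the names, paths and allowed hosts listed above. The fragments Puppet actually writes are not shown in this task and may differ; in particular `read only = no` is an assumption, added because rsync daemon modules are read-only by default and the stated goal is to push data to this host.

```python
# Module names, paths and allowed hosts as listed in the comment above.
RSYNC_MODULES = {
    "gerrit-data": "/srv/gerrit",
    "gerrit-home": "/srv/home-gerrit1001.wikimedia.org",
    "gerrit-var-lib": "/var/lib/gerrit2/review_site",
}
HOSTS_ALLOW = ["gerrit1001.wikimedia.org", "localhost"]


def render_module(name: str, path: str, hosts_allow: list[str]) -> str:
    """Render one rsyncd.conf module stanza as text."""
    lines = [
        f"[{name}]",
        f"    path = {path}",
        f"    hosts allow = {' '.join(hosts_allow)}",
        "    read only = no",  # assumption: required for pushes to succeed
    ]
    return "\n".join(lines) + "\n"


def render_all() -> str:
    """Render every module, separated by blank lines."""
    return "\n".join(
        render_module(name, path, HOSTS_ALLOW)
        for name, path in RSYNC_MODULES.items()
    )
```

Calling `print(render_all())` shows the three stanzas one after another.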

Finally the firewall rule that has been created via puppet->ferm->iptables is ACCEPT tcp -- gerrit1001.wikimedia.org anywhere tcp dpt:rsync.

This means an rsyncd is listening on the new host and, with the current settings, is ready to have data pushed to it from the production Gerrit server gerrit1001.

Mentioned in SAL (#wikimedia-operations) [2022-07-28T18:28:19Z] <mutante> gerrit: rsyncing /home from prod gerrit1001 to /srv/home-gerrit1001.wikimedia.org on gerrit2002 new replica T243027 T313250
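For illustration, a push like the one logged above could be assembled as follows. This is a sketch, not the command that was actually run: the exact rsync invocation is not recorded in this task, the flag choice is an assumption, and `rsync_push_command` is a made-up helper. It only builds the argument list and defaults to a dry run.

```python
import shlex


def rsync_push_command(
    src_dir: str, dest_host: str, module: str, dry_run: bool = True
) -> list[str]:
    """Build an rsync command that pushes src_dir into a daemon module."""
    # A trailing slash makes rsync copy the directory contents, not the
    # directory itself, into the module path.
    src = src_dir.rstrip("/") + "/"
    cmd = ["rsync", "--archive", "--verbose"]
    if dry_run:
        cmd.append("--dry-run")
    cmd += [src, f"rsync://{dest_host}/{module}/"]
    return cmd


if __name__ == "__main__":
    # Would be run on gerrit1001, the only remote host in hosts_allow.
    cmd = rsync_push_command("/home", "gerrit2002.wikimedia.org", "gerrit-home")
    print(shlex.join(cmd))
```

The `gerrit-home` module maps to /srv/home-gerrit1001.wikimedia.org on gerrit2002, which is the destination named in the SAL entry.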

Change 815395 merged by Dzahn:

[operations/dns@master] add gerrit-replica-new.wikimedia.org, point to 208.80.153.104

https://gerrit.wikimedia.org/r/815395

new in DNS:

[authdns1001:~] $ host gerrit-replica-new.wikimedia.org
gerrit-replica-new.wikimedia.org has address 208.80.153.104
gerrit-replica-new.wikimedia.org has IPv6 address 2620:0:860:4:208:80:153:104
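In the output above, the last four groups of the IPv6 address repeat the decimal octets of the IPv4 address. A minimal sketch of that mapping, inferred only from this one record pair and not from any documented policy, with the prefix `2620:0:860:4` taken from the output and `embed_ipv4` a hypothetical helper name:

```python
import ipaddress


def embed_ipv4(prefix64: str, ipv4: str) -> ipaddress.IPv6Address:
    """Append the IPv4 octets, written as-is, as the last four IPv6 groups."""
    # Validating first rejects malformed input; decimal octets (0-255) are
    # always at most three digits, so each is a valid IPv6 group when
    # reinterpreted as hexadecimal text.
    octets = str(ipaddress.IPv4Address(ipv4)).split(".")
    return ipaddress.IPv6Address(f"{prefix64}:{':'.join(octets)}")


if __name__ == "__main__":
    print(embed_ipv4("2620:0:860:4", "208.80.153.104"))
```

Note that the octets are reused as text, so "208" becomes the group 0x208 rather than a numeric conversion of 208.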

Change 815396 merged by Dzahn:

[operations/puppet@production] gerrit: add hiera settings and IP for new replica gerrit2002

https://gerrit.wikimedia.org/r/815396

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host gerrit2002.wikimedia.org with OS buster

Mentioned in SAL (#wikimedia-operations) [2022-08-02T20:38:01Z] <mutante> re-imaging gerrit2002 with buster - because it's on bullseye, needs git-fat and that has not been ported to python3 yet which blocks upgrading gerrit machines otherwise T313250 T243027 T279509

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host gerrit2002.wikimedia.org with OS buster completed:

  • gerrit2002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208022036_dzahn_1022103_gerrit2002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 819760 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/software/gerrit@deploy/wmf/stable-3.4] Add gerrit2002.wikimedia.org to scap targets list

https://gerrit.wikimedia.org/r/819760

Change 819760 merged by jenkins-bot:

[operations/software/gerrit@deploy/wmf/stable-3.4] Add gerrit2002.wikimedia.org to scap targets list

https://gerrit.wikimedia.org/r/819760

Change 820185 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] acme_chief: add gerrit-replica-new to SNI list

https://gerrit.wikimedia.org/r/820185

Change 820185 merged by Dzahn:

[operations/puppet@production] acme_chief: add gerrit-replica-new to SNI list

https://gerrit.wikimedia.org/r/820185

Change 815401 merged by Dzahn:

[operations/puppet@production] gerrit: add hiera data for a second replica

https://gerrit.wikimedia.org/r/815401

Dzahn renamed this task from "replacement for gerrit2001" to "replacement for gerrit2001, decom gerrit2001". Aug 4 2022, 12:18 AM

Change 820248 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: decom gerrit2001

https://gerrit.wikimedia.org/r/820248

Change 820249 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: remove hiera data for old replica

https://gerrit.wikimedia.org/r/820249

Change 820250 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: remove gerrit2001, merge gerrit1001/2002 regex

https://gerrit.wikimedia.org/r/820250

Change 820474 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] Update the known host key for gerrit2002.wikimedia.org

https://gerrit.wikimedia.org/r/820474

Change 820474 merged by Jbond:

[operations/puppet@production] Update the known host key for gerrit2002.wikimedia.org

https://gerrit.wikimedia.org/r/820474

Change 820573 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerit: remove old replica on gerrit2001 from gerrit config

https://gerrit.wikimedia.org/r/820573

Change 820577 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] gerrit: switch gerrit-replica to new machine, remove replica-new

https://gerrit.wikimedia.org/r/820577

Change 820577 merged by Dzahn:

[operations/dns@master] gerrit: switch gerrit-replica to new machine, remove replica-new

https://gerrit.wikimedia.org/r/820577

Change 820573 merged by Dzahn:

[operations/puppet@production] gerit: remove old replica on gerrit2001 from gerrit config

https://gerrit.wikimedia.org/r/820573

Change 820249 abandoned by Dzahn:

[operations/puppet@production] gerrit: remove hiera data for old replica

Reason:

duplicate

https://gerrit.wikimedia.org/r/820249

Change 820248 merged by Dzahn:

[operations/puppet@production] gerrit: decom gerrit2001

https://gerrit.wikimedia.org/r/820248

cookbooks.sre.hosts.decommission executed by dzahn@cumin2002 for hosts: gerrit2001.wikimedia.org

  • gerrit2001.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by dzahn@cumin2002 for hosts: gerrit2001.wikimedia.org

  • gerrit2001.wikimedia.org (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • No DNS record found for the mgmt interface gerrit2001.mgmt.codfw.wmnet, trying the asset tag one: wmf6408.mgmt.codfw.wmnet
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Change 820250 merged by Dzahn:

[operations/puppet@production] site: remove gerrit2001, merge gerrit1001/2002 regex

https://gerrit.wikimedia.org/r/820250
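The merged host regex mentioned in that change is not quoted in this task. As a purely hypothetical illustration of the idea, a single pattern can match both remaining hosts while excluding the decommissioned one; the pattern below is an assumption and not the one in the Puppet repository.

```python
import re

# Hypothetical pattern: matches gerrit1001 and gerrit2002, not gerrit2001.
GERRIT_HOSTS = re.compile(r"^gerrit(1001|2002)\.wikimedia\.org$")


def is_gerrit_host(fqdn: str) -> bool:
    """Return True if the FQDN is one of the two remaining Gerrit hosts."""
    return GERRIT_HOSTS.match(fqdn) is not None


if __name__ == "__main__":
    for name in ("gerrit1001", "gerrit2001", "gerrit2002"):
        fqdn = f"{name}.wikimedia.org"
        print(fqdn, is_gerrit_host(fqdn))
```

Anchoring with `^` and `$` and escaping the dots keeps the pattern from matching lookalike names.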

gerrit2002 is now in production as https://gerrit-replica.wikimedia.org

gerrit2001 is shut down and fully decom'ed.

This is now handed over to dcops for the physical decom steps and continues at T315040.

Change 839551 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/software/gerrit@deploy/wmf/stable-3.4] Remove gerrit2001 from deployment targets

https://gerrit.wikimedia.org/r/839551

Change 839551 merged by jenkins-bot:

[operations/software/gerrit@deploy/wmf/stable-3.4] Remove gerrit2001 from deployment targets

https://gerrit.wikimedia.org/r/839551

Change 932021 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: remove gerrit1001 from site and gerrit2001 hiera data

https://gerrit.wikimedia.org/r/932021

Change 932021 merged by Dzahn:

[operations/puppet@production] gerrit: remove gerrit1001 from site and gerrit2001 hiera data

https://gerrit.wikimedia.org/r/932021

Change #1056996 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gerrit: drop gerrit-replica-new.wikimedia.org from list of replicas

https://gerrit.wikimedia.org/r/1056996

Change #1056996 merged by Dzahn:

[operations/puppet@production] gerrit: drop gerrit-replica-new.wikimedia.org from list of replicas

https://gerrit.wikimedia.org/r/1056996