upgrade releases hosts to bullseye
Closed, Resolved (Public)

Description

The hosts releases1002.eqiad.wmnet and releases2002.codfw.wmnet are on buster and should be replaced by bullseye VMs.

Event Timeline

Dzahn triaged this task as High priority.

Change 922501 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Create entry for new releases hosts

https://gerrit.wikimedia.org/r/922501

Change 922501 merged by EoghanGaffney:

[operations/puppet@production] Create entry for new releases hosts

https://gerrit.wikimedia.org/r/922501

Cookbook cookbooks.sre.hosts.reimage was started by eoghan@cumin1001 for host releases1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eoghan@cumin1001 for host releases1003.eqiad.wmnet with OS bullseye completed:

  • releases1003 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305231449_eoghan_3034498_releases1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by eoghan@cumin1001 for host releases2003.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eoghan@cumin1001 for host releases2003.codfw.wmnet with OS bullseye completed:

  • releases2003 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305231524_eoghan_3041676_releases2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@eoghan Thank you for working on this! :) Could you please add a VM request (https://wikitech.wikimedia.org/wiki/SRE/SRE_Team_requests#Virtual_machine_requests_(Production)) for the new VMs? While it's fine that we approve our own requests for something like this (replacing existing VMs in same role), they still use the paper trail for capacity planning etc. Sorry for the paperwork.

@eoghan We want to change the UID/GID for system user 'jenkins'. This exists on contint* but also releases*. Please wait another day or so before applying the production role to the new VMs. Otherwise we just have to immediately touch them again for https://gerrit.wikimedia.org/r/c/operations/puppet/+/917919.

I am planning to deploy that tomorrow and after that jenkins UID/GID will be just fine from the beginning on the new machines.

@Dzahn I've filled out the form, didn't know we needed to do that. Will keep it in mind for the future.

In terms of applying the production roles, I still need to add the second disk to each VM, so not ready to do the production roles yet. I'll check in with you before I apply them!

@eoghan I did that. Now when the puppet role is applied it will use the globally reserved uid/gid 924 for the jenkins user. You can go ahead. And thanks for filling out the paperwork for the VMs.

Change 924085 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Ensure rsync jobs get removed on the non-active machine

https://gerrit.wikimedia.org/r/924085

Change 924970 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] releases: Add new hosts to failover servers list

https://gerrit.wikimedia.org/r/924970

Change 924085 merged by Dzahn:

[operations/puppet@production] releases: Ensure rsync jobs get removed on the non-active machine

https://gerrit.wikimedia.org/r/924085

Change 925033 had a related patch set uploaded (by Dzahn; author: Reedy):

[operations/puppet@production] releases: clone repos/releng/release from gitlab

https://gerrit.wikimedia.org/r/925033

"releases: Ensure rsync jobs get removed on the non-active machine" was merged but then reverted. Sorry for the noise. See the comments, or chat to me about the details.

releases1003.eqiad.wmnet has a puppet failure:

Notice: /Stage[main]/Releases/Git::Clone[repos/releng/release]/Exec[git_clone_repos/releng/release]/returns: fatal: unable to access 'https://gerrit.wikimedia.org/r/repos/releng/release/': The requested URL returned error: 403

It tries to clone a repository from Gerrit but uses the Gitlab repository name (this repo has been migrated). It does not happen on the other hosts for some reason.

Anyway, looks like the fix is above: https://gerrit.wikimedia.org/r/c/operations/puppet/+/925033/

I went ahead and manually cloned the repository on releases1003:

sudo rm /srv/mediawiki/release-tools
sudo git clone https://gitlab.wikimedia.org/repos/releng/release.git /srv/mediawiki/release-tools

That has at least fixed Puppet on the host :)

I gave details on https://gerrit.wikimedia.org/r/c/operations/puppet/+/925033. In short, git::clone on a given directory does not update the remote URL of an existing checkout when the remote repository name or type (gerrit/gitlab) is changed.
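For illustration only (this was not run on the hosts): an existing checkout can be repointed at a new remote without re-cloning. The sketch below assumes the remote is named origin; ensure_origin is a hypothetical helper name, not part of git::clone or Puppet.

```shell
#!/bin/sh
# Sketch: repoint an existing checkout at a new remote URL instead of re-cloning.
# Assumption: the remote is called "origin". ensure_origin is a made-up helper name.

# ensure_origin DIR URL
# Prints "unchanged" if origin already points at URL, otherwise sets it and
# prints "updated". Returns non-zero if DIR has no origin remote.
ensure_origin() {
    dir=$1
    want=$2
    have=$(git -C "$dir" remote get-url origin 2>/dev/null) || {
        echo "no origin remote in $dir" >&2
        return 1
    }
    if [ "$have" = "$want" ]; then
        echo "unchanged"
    else
        git -C "$dir" remote set-url origin "$want" && echo "updated"
    fi
}
```

With the path and URL from this task, the call would be ensure_origin /srv/mediawiki/release-tools https://gitlab.wikimedia.org/repos/releng/release.git, probably under sudo given who owns the directory. This only changes the configured URL; a fetch or pull is still needed afterwards.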

I went ahead and manually cloned the repository on releases1003

Thank you! Here the response from T290260#8903412 applies as well.

I gave detail on https://gerrit.wikimedia.org/r/c/operations/puppet/+/925033

ACK, also added a question there.

Change 925033 merged by EoghanGaffney:

[operations/puppet@production] releases: clone repos/releng/release from gitlab

https://gerrit.wikimedia.org/r/925033

Change 924970 merged by EoghanGaffney:

[operations/puppet@production] releases: Add new hosts to failover servers list

https://gerrit.wikimedia.org/r/924970

Current status: The two new releases hosts are up and running. The next part is to deploy the jenkins instance that runs on the host, and test to see if that works.

The remaining things to do here are:

  • Pick a date for the switchover (Monday, 26th)
  • Email users who have access to the releases host [1], as well as the wikitech list to inform them of the upcoming switch of host
  • Put a banner up on the releases host warning of the impending change
  • Deploy and test the Jenkins instance on the new host [2]

[1] This was generated by reading the list of users who are in any group that allows shell access to the releases hosts
[2] Run scap deploy --environment releasing -l <hostname> -f, then check the UI via an ssh tunnel to the host and look at the logs
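A minimal sketch of the ssh tunnel mentioned in [2]. The port is an assumption: 8080 is Jenkins' stock default HTTP port, and the instance on the releases hosts may listen elsewhere. tunnel_cmd is a hypothetical helper that only prints the command so it can be checked before running.

```shell
#!/bin/sh
# Sketch: build the ssh command for a local port forward to the Jenkins UI.
# Assumption: Jenkins listens on localhost:8080 on the host (stock default);
# pass different ports if the real instance differs.

# tunnel_cmd HOST [LOCAL_PORT] [REMOTE_PORT]
tunnel_cmd() {
    host=$1
    local_port=${2:-8080}
    remote_port=${3:-8080}
    # -N: no remote command, forward only. -L: local port forward.
    printf 'ssh -N -L %s:localhost:%s %s\n' "$local_port" "$remote_port" "$host"
}
```

After running the printed command, the UI would be reachable in a local browser on the chosen local port for as long as the ssh session stays open.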

Change 931600 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] apt: Add jenkins packages to bullseye-wikimedia

https://gerrit.wikimedia.org/r/931600

Change 931600 merged by EoghanGaffney:

[operations/puppet@production] apt: Add jenkins packages to bullseye-wikimedia

https://gerrit.wikimedia.org/r/931600

We've picked Monday 26th June for the migration from releases1002 -> releases1003.

Before migration

Migration

Post migration (about two weeks after migration)

  • Check logs for ssh/http on the old hosts
  • Merge change to remove old hosts from image_builders in docker-registry-ha
  • Decommission old hosts

Notes:

[1] Jenkins is deployed on all releases hosts with the correct configuration, but is disabled by hand. During the migration, stop jenkins on releases1002 (systemctl stop jenkins; systemctl disable jenkins), and enable it on releases1003 (systemctl enable jenkins; systemctl start jenkins)
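The order in note [1] matters: Jenkins is stopped and disabled on the old host before it is enabled on the new one, so two instances are never active at once. A small sketch that prints the four commands in that order for review. Running them over ssh with sudo is an assumption, and switchover_plan is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the Jenkins switchover commands from note [1] in the order
# they should run. Nothing is executed; the output is meant to be reviewed.
# Assumption: commands are run over ssh with sudo on each host.

# switchover_plan OLD_HOST NEW_HOST
switchover_plan() {
    old=$1
    new=$2
    # First take Jenkins down on the old host and keep it from starting at boot.
    for action in stop disable; do
        printf 'ssh %s sudo systemctl %s jenkins\n' "$old" "$action"
    done
    # Only then enable and start it on the new host.
    for action in enable start; do
        printf 'ssh %s sudo systemctl %s jenkins\n' "$new" "$action"
    done
}
```

For this migration the call would be switchover_plan releases1002.eqiad.wmnet releases1003.eqiad.wmnet.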

Change 938889 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Remove references to releases1002/releases2002 for decom

https://gerrit.wikimedia.org/r/938889

Change 938889 merged by EoghanGaffney:

[operations/puppet@production] Remove references to releases1002/releases2002 for decom

https://gerrit.wikimedia.org/r/938889

Icinga downtime and Alertmanager silence (ID=0bd9d0bc-72b8-4503-a69b-7f9ebeb21b24) set by eoghan@cumin1001 for 5 days, 0:00:00 on 2 host(s) and their services with reason: Decommissioning prep

releases2002.codfw.wmnet,releases1002.eqiad.wmnet

The old hosts were shut down this morning. If there are no issues, I'll decommission them on Wednesday morning.

cookbooks.sre.hosts.decommission executed by eoghan@cumin1001 for hosts: releases2002.codfw.wmnet

  • releases2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox

cookbooks.sre.hosts.decommission executed by eoghan@cumin1001 for hosts: releases1002.eqiad.wmnet

  • releases1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

The hosts have been decommissioned!