
Q1: Install cp11[00-15] and rotate into production
Closed, Resolved · Public

Description

This task will track the setup of the following hosts, from reimage until they are serving live traffic in eqiad.

  • T342159 for information about naming, racking and other details.
  • T350179 for issues with PXE booting
  • This ticket for operations related to provisioning and rotating into production
  • T352253 for decommissioning
  • T352078 for hiera data consolidation

Common details

OS Distro: Bullseye (Debian 11)
Text hosts: cp1100-cp1107
Upload hosts: cp1108-cp1115

General plan

  1. Write the hiera configuration for all new hosts in eqiad (test with PCC that this is a NOOP for the other cp hosts in eqiad and in the other DCs)
  2. Reimage the first host without pooling it and check that everything is fine, confirming that the BIOS settings are correct (see T349314); a sample cookbook invocation follows this list
    • `sudo cumin 'cp11*' 'egrep -q "vmx|svm" /proc/cpuinfo && echo yes || echo no'`
    • `sudo cumin 'cp11*' 'grep -P "processor\s*:\s*95$" /proc/cpuinfo' # with HT there are 96 logical CPUs in total, so the highest processor index should be 95`
    • `sudo cumin 'cp11*' 'nvme list'`
  3. Reimage all the remaining new hosts without pooling them
  4. Swap old and new hosts, waiting 24h between each host (hosts in the text and upload clusters can be swapped in parallel)
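
A minimal sketch of the reimage invocation for the first host, run from a cumin host; the exact flags are an assumption based on the usual sre.hosts.reimage usage, so check the cookbook's --help before running:

  • `sudo cookbook sre.hosts.reimage --os bullseye -t T349244 cp1100 # reimage only: the host stays depooled and is pooled later via the swap procedure below`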

Host swap

Considering that all the "old" hosts use the multi-ats-backend configuration, we assume that the safest way to introduce the new servers without drastically reducing the hit rate on the old hosts is:

For each (text|upload) cluster:

0. (preparation): Set the weight on all new cp hosts even while they are still inactive (1 for cdn, 100 for ats-be).

  1. Depool $oldHost using confctl (e.g. `confctl select name=<oldhost>.eqiad.wmnet,service=cdn set/pooled=inactive`); a confctl sketch of one full swap follows this list.
    • ONLY THE "cdn" SERVICE should be set to pooled: inactive.
    • The ats-be service will remain pooled: yes. This allows the other old cp hosts to keep using $oldHost as a backend and thus preserves the hit rate.
    • The cdn service weight should be set to 0.
  2. Remove the downtime for $newHost.
  3. Pool $newHost using confctl (e.g. `confctl select name=<newhost>.eqiad.wmnet,service=cdn set/pooled=yes`).
    • The cdn service will be set to pooled: yes.
    • The ats-be service will be set to pooled: no (even though the latter isn't actually used by the new cp hosts).
  4. Wait 24h, monitoring the hit rate of the new server and the general behavior of the services in eqiad.
  5. Repeat the swap in the same way for all the remaining hosts, always waiting 24h between each one.
  6. When all the legacy hosts are depooled, also set the ats-be service on the new nodes to pooled: yes (for consistency only, as it is not used).
  7. Decommission the legacy hosts.
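
A minimal sketch of one swap using the confctl syntax shown above, with cp1075 -> cp1100 from the checklist below as an illustrative pair (selectors and weights should be double-checked against the actual cluster before running):

  • `confctl select name=cp1100.eqiad.wmnet,service=cdn set/weight=1 # preparation: set weights on the new host while it is still inactive`
  • `confctl select name=cp1100.eqiad.wmnet,service=ats-be set/weight=100`
  • `confctl select name=cp1075.eqiad.wmnet,service=cdn set/pooled=inactive # depool ONLY cdn on the old host; ats-be stays pooled to preserve the hit rate`
  • `confctl select name=cp1075.eqiad.wmnet,service=cdn set/weight=0`
  • `confctl select name=cp1100.eqiad.wmnet,service=cdn set/pooled=yes # only after removing the downtime on the new host`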

Per host setup checklist

cp1100:
  • Add host to manifests/site.pp
  • Add host to conftool-data/node/eqiad.yaml
  • Add host to hieradata/common.yaml
  • Add host to hieradata/common/cache.yaml
  • Create per-host hiera file with configuration for dual disk
  • Confirm host is actually reachable and ready for reimaging
  • OS Installation & initial puppet run via sre.hosts.reimage cookbook
  • Ensure the host is depooled
  • Depool "corresponding" old host (cp1075)
  • Pool this host
cp1101:
  • Add host to manifests/site.pp
  • Add host to conftool-data/node/eqiad.yaml
  • Add host to hieradata/common.yaml
  • Add host to hieradata/common/cache.yaml
  • Create per-host hiera file with configuration for dual disk
  • Confirm host is actually reachable and ready for reimaging
  • OS Installation & initial puppet run via sre.hosts.reimage cookbook
  • Ensure the host is depooled
  • Depool "corresponding" old host (cp1076)
  • Pool this host
cp1102:
  • Add host to manifests/site.pp
  • Add host to conftool-data/node/eqiad.yaml
  • Add host to hieradata/common.yaml
  • Add host to hieradata/common/cache.yaml
  • Create per-host hiera file with configuration for dual disk
  • Confirm host is actually reachable and ready for reimaging
  • OS Installation & initial puppet run via sre.hosts.reimage cookbook
  • Ensure the host is depooled
  • Depool "corresponding" old host (cp1077)
  • Pool this host
cp1103:
  • Add host to manifests/site.pp
  • Add host to conftool-data/node/eqiad.yaml
  • Add host to hieradata/common.yaml
  • Add host to hieradata/common/cache.yaml
  • Create per-host hiera file with configuration for dual disk
  • Confirm host is actually reachable and ready for reimaging
  • OS Installation & initial puppet run via sre.hosts.reimage cookbook
  • Ensure the host is depooled
  • Depool "corresponding" old host (cp1078)
  • Pool this host
cp1104:
  • Add host to manifests/site.pp
  • Add host to conftool-data/node/eqiad.yaml
  • Add host to hieradata/common.yaml
  • Add host to hieradata/common/cache.yaml
  • Create per-host hiera file with configuration for dual disk
  • Confirm host is actually reachable and ready for reimaging
  • OS Installation & initial puppet run via sre.hosts.reimage cookbook
  • Ensure the host is depooled
  • Depool "corresponding" old host (cp1079)
  • Pool this host
cp1105:
  • Add host to manifests/site.pp
  • Add host to conftool-data/node/eqiad.yaml
  • Add host to hieradata/common.yaml
  • Add host to hieradata/common/cache.yaml
  • Create per-host hiera file with configuration for dual disk
  • Confirm host is actually reachable and ready for reimaging
  • OS Installation & initial puppet run via sre.hosts.reimage cookbook
  • Ensure the host is depooled
  • Depool "corresponding" old host (cp1080)
  • Pool this host
cp1106:
  • Add host to manifests/site.pp
  • Add host to conftool-data/node/eqiad.yaml
  • Add host to hieradata/common.yaml
  • Add host to hieradata/common/cache.yaml
  • Create per-host hiera file with configuration for dual disk
  • Confirm host is actually reachable and ready for reimaging
  • OS Installation & initial puppet run via sre.hosts.reimage cookbook
  • Ensure the host is depooled
  • Depool "corresponding" old host (cp1081)
  • Pool this host
cp1107:
  • Add host to manifests/site.pp
  • Add host to conftool-data/node/eqiad.yaml
  • Add host to hieradata/common.yaml
  • Add host to hieradata/common/cache.yaml
  • Create per-host hiera file with configuration for dual disk
  • Confirm host is actually reachable and ready for reimaging
  • OS Installation & initial puppet run via sre.hosts.reimage cookbook
  • Ensure the host is depooled
  • Depool "corresponding" old host (cp1082)
  • Pool this host
cp1108:
  • Add host to manifests/site.pp
  • Add host to conftool-data/node/eqiad.yaml
  • Add host to hieradata/common.yaml
  • Add host to hieradata/common/cache.yaml
  • Create per-host hiera file with configuration for dual disk
  • Confirm host is actually reachable and ready for reimaging
  • OS Installation & initial puppet run via sre.hosts.reimage cookbook
  • Ensure the host is depooled
  • Depool "corresponding" old host (cp1083)
  • Pool this host
cp1109:
  • Add host to manifests/site.pp
  • Add host to conftool-data/node/eqiad.yaml
  • Add host to hieradata/common.yaml
  • Add host to hieradata/common/cache.yaml
  • Create per-host hiera file with configuration for dual disk
  • Confirm host is actually reachable and ready for reimaging
  • OS Installation & initial puppet run via sre.hosts.reimage cookbook
  • Ensure the host is depooled
  • Depool "corresponding" old host (cp1084)
  • Pool this host
cp1110:
  • Add host to manifests/site.pp
  • Add host to conftool-data/node/eqiad.yaml
  • Add host to hieradata/common.yaml
  • Add host to hieradata/common/cache.yaml
  • Create per-host hiera file with configuration for dual disk
  • Confirm host is actually reachable and ready for reimaging
  • OS Installation & initial puppet run via sre.hosts.reimage cookbook
  • Ensure the host is depooled
  • Depool "corresponding" old host (cp1085)
  • Pool this host
cp1111:
  • Add host to manifests/site.pp
  • Add host to conftool-data/node/eqiad.yaml
  • Add host to hieradata/common.yaml
  • Add host to hieradata/common/cache.yaml
  • Create per-host hiera file with configuration for dual disk
  • Confirm host is actually reachable and ready for reimaging
  • OS Installation & initial puppet run via sre.hosts.reimage cookbook
  • Ensure the host is depooled
  • Depool "corresponding" old host (cp1086)
  • Pool this host
cp1112:
  • Add host to manifests/site.pp
  • Add host to conftool-data/node/eqiad.yaml
  • Add host to hieradata/common.yaml
  • Add host to hieradata/common/cache.yaml
  • Create per-host hiera file with configuration for dual disk
  • Confirm host is actually reachable and ready for reimaging
  • OS Installation & initial puppet run via sre.hosts.reimage cookbook
  • Ensure the host is depooled
  • Depool "corresponding" old host (cp1087)
  • Pool this host
cp1113:
  • Add host to manifests/site.pp
  • Add host to conftool-data/node/eqiad.yaml
  • Add host to hieradata/common.yaml
  • Add host to hieradata/common/cache.yaml
  • Create per-host hiera file with configuration for dual disk
  • Confirm host is actually reachable and ready for reimaging
  • OS Installation & initial puppet run via sre.hosts.reimage cookbook
  • Ensure the host is depooled
  • Depool "corresponding" old host (cp1088)
  • Pool this host
cp1114:
  • Add host to manifests/site.pp
  • Add host to conftool-data/node/eqiad.yaml
  • Add host to hieradata/common.yaml
  • Add host to hieradata/common/cache.yaml
  • Create per-host hiera file with configuration for dual disk
  • Confirm host is actually reachable and ready for reimaging
  • OS Installation & initial puppet run via sre.hosts.reimage cookbook
  • Ensure the host is depooled
  • Depool "corresponding" old host (cp1089)
  • Pool this host
cp1115:
  • Add host to manifests/site.pp
  • Add host to conftool-data/node/eqiad.yaml
  • Add host to hieradata/common.yaml
  • Add host to hieradata/common/cache.yaml
  • Create per-host hiera file with configuration for dual disk
  • Confirm host is actually reachable and ready for reimaging
  • OS Installation & initial puppet run via sre.hosts.reimage cookbook
  • Ensure the host is depooled
  • Depool "corresponding" old host (cp1090)
  • Pool this host
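
For the "Ensure the host is depooled" step in each checklist above, the conftool state can be inspected directly; a sketch, with cp1100 as an illustrative host:

  • `confctl select name=cp1100.eqiad.wmnet get # both cdn and ats-be should report pooled=inactive (as set by the reimage cookbook, see the logs below) before the swap`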

Re-ordered Host Reimage

  • cp1101.eqiad.wmnet
  • cp1103.eqiad.wmnet
  • cp1105.eqiad.wmnet
  • cp1107.eqiad.wmnet
  • cp1108.eqiad.wmnet
  • cp1110.eqiad.wmnet
  • cp1112.eqiad.wmnet
  • cp1114.eqiad.wmnet

Event Timeline

There are a very large number of changes, so older changes are hidden.

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye executed with errors:

  • cp1107 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1105.eqiad.wmnet with OS bullseye completed:

  • cp1105 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311091914_sukhe_2269325_cp1105.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye executed with errors:

  • cp1110 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"cp1110.eqiad.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cache_text,service=ats-be"}
{"cp1110.eqiad.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cache_text,service=cdn"}

    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • No changes in confctl are needed to restore the previous state.
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye completed:

  • cp1107 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311091950_sukhe_1798293_cp1107.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye executed with errors:

  • cp1110 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye executed with errors:

  • cp1110 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye completed:

  • cp1110 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311092035_sukhe_2308427_cp1110.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
ssingh updated the task description.

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye executed with errors:

  • cp1112 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"cp1112.eqiad.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cache_text,service=ats-be"}
{"cp1112.eqiad.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cache_text,service=cdn"}

    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • No changes in confctl are needed to restore the previous state.
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye executed with errors:

  • cp1112 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye executed with errors:

  • cp1114 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"cp1114.eqiad.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cache_text,service=ats-be"}
{"cp1114.eqiad.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cache_text,service=cdn"}

    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • No changes in confctl are needed to restore the previous state.
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye executed with errors:

  • cp1114 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye executed with errors:

  • cp1114 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye completed:

  • cp1112 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311100108_sukhe_2435301_cp1112.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye completed:

  • cp1114 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311100139_sukhe_1961556_cp1114.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors:

  • cp1115 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors:

  • cp1115 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors:

  • cp1115 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors:

  • cp1115 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors:

  • cp1115 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye completed:

  • cp1115 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311101237_fabfur_2240161_cp1115.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-11-13T15:07:11Z] <fabfur> swapped cp1102 <-> cp1077 (T349244)

Mentioned in SAL (#wikimedia-operations) [2023-11-13T15:14:56Z] <fabfur> swapped cp1103 <-> cp1078 (T349244)

Mentioned in SAL (#wikimedia-operations) [2023-11-14T14:28:32Z] <fabfur> swapped cp1104 <-> cp1079 (T349244)

Mentioned in SAL (#wikimedia-operations) [2023-11-14T14:32:29Z] <fabfur> swapped cp1105 <-> cp1080 (T349244)

Mentioned in SAL (#wikimedia-operations) [2023-11-15T15:44:51Z] <fabfur> swapped cp1106 <-> cp1081 (T349244)

Mentioned in SAL (#wikimedia-operations) [2023-11-15T15:48:06Z] <fabfur> swapped cp1107 <-> cp1082 (T349244)

Mentioned in SAL (#wikimedia-operations) [2023-11-16T16:20:28Z] <fabfur> swapped cp1108 <-> cp1083 (T349244)

Mentioned in SAL (#wikimedia-operations) [2023-11-16T16:26:06Z] <fabfur> swapped cp1109 <-> cp1084 (T349244)

Icinga downtime and Alertmanager silence (ID=154babc2-d86e-4f5b-baf5-fb36e9d129e4) set by fabfur@cumin1001 for 14 days, 0:00:00 on 6 host(s) and their services with reason: Extending downtime for depooled cp hosts

cp[1110-1115].eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-11-20T15:44:41Z] <fabfur> swapped cp1110 <-> cp1085 (T349244)

Mentioned in SAL (#wikimedia-operations) [2023-11-20T15:48:24Z] <fabfur> swapped cp1111 <-> cp1086 (T349244)

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors:

  • cp1115 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye completed:

  • cp1115 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311211221_fabfur_733917_cp1115.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-11-21T14:42:01Z] <fabfur> swapped cp1112 <-> cp1087 (T349244)

Mentioned in SAL (#wikimedia-operations) [2023-11-21T14:44:19Z] <fabfur> swapped cp1113 <-> cp1088 (T349244)

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1113.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1113.eqiad.wmnet with OS bullseye completed:

  • cp1113 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311221628_fabfur_1510102_cp1113.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Mentioned in SAL (#wikimedia-operations) [2023-11-22T17:01:04Z] <fabfur> swapped cp1113 <-> cp1088 (T349244)

Change 976805 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] conftool-data: (temporary) remove cp1113

https://gerrit.wikimedia.org/r/976805

Change 976805 merged by Fabfur:

[operations/puppet@production] conftool-data: (temporary) remove cp1113

https://gerrit.wikimedia.org/r/976805

Change 976826 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] conftool-data: re-added cp1113

https://gerrit.wikimedia.org/r/976826

Change 976826 merged by Fabfur:

[operations/puppet@production] conftool-data: re-added cp1113

https://gerrit.wikimedia.org/r/976826

Mentioned in SAL (#wikimedia-operations) [2023-11-27T15:07:14Z] <fabfur> nfctl select name='cp10.*',service=ats-be set/pooled=inactive (cdn and ats-be not used anymore on these hosts) T349244

Mentioned in SAL (#wikimedia-operations) [2023-11-27T15:14:03Z] <fabfur> set pooled=yes on cp11.* hosts in eqiad T349244

Change 977702 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] decom cp1075-1090

https://gerrit.wikimedia.org/r/977702

Looping in @CDanis as the original author of the cp1075 hiera overrides.
Do you think we can safely remove them, or do we need to apply the same hiera data to another (new) cp host?


These can be removed!

Change 977702 merged by Fabfur:

[operations/puppet@production] decom cp1075-1090

https://gerrit.wikimedia.org/r/977702

Fabfur updated the task description.

All activities for this task have been completed; refer to the other linked tasks for more details on the decommissioning of the old hosts.