- Reimage a couple of scap proxies to be videoscaler nodes for row balance (currently all eqiad videoscaler nodes are in D8)
- Move all scap proxies to videoscaler nodes
- Remove appserver canaries from scap config
- Set pooled=inactive for all remaining appservers (except 1 per cluster to avoid alerts before complete turndown of the clusters)
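The last step above can be sketched with confctl. This is a hypothetical dry run, not the exact invocations used for this task: the `run()` wrapper only echoes commands, and the selector and hostname are illustrative (mw1364 is one of the hosts later left pooled).

```shell
# Dry-run sketch of the depool step; run() echoes instead of executing.
run() { echo "would run: $*"; }

# Depool every legacy appserver in a cluster...
run sudo confctl select 'dc=eqiad,cluster=appserver' set/pooled=inactive
# ...then re-pool a single host per cluster to avoid alerts before the
# clusters are fully turned down:
run sudo confctl select 'name=mw1364.eqiad.wmnet' set/pooled=yes
```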
Event Timeline
Change #1048376 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):
[operations/puppet@production] mediawiki: Reimage scap proxies as videoscalers
Change #1048377 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):
[operations/puppet@production] scap_proxies: move all proxies to videoscalers
Mentioned in SAL (#wikimedia-operations) [2024-06-24T09:34:10Z] <claime> Reimaging scap::proxies, mediawiki deployments may be unavailable - T368058
Change #1048376 merged by Clément Goubert:
[operations/puppet@production] mediawiki: Reimage scap proxies as videoscalers
Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1407.eqiad.wmnet with OS buster
Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1420.eqiad.wmnet with OS buster
Change #1049110 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):
[operations/puppet@production] videoscalers: Pool 2 former appservers
Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1420.eqiad.wmnet with OS buster completed:
- mw1420 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh buster OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406240959_cgoubert_762668_mw1420.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
Change #1048377 merged by Clément Goubert:
[operations/puppet@production] scap_proxies: move all proxies to videoscalers
Change #1049110 merged by Clément Goubert:
[operations/puppet@production] videoscalers: Pool 2 former appservers
Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1407.eqiad.wmnet with OS buster completed:
- mw1407 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh buster OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406241003_cgoubert_762618_mw1407.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
Mentioned in SAL (#wikimedia-operations) [2024-06-24T10:39:19Z] <claime> pooling mw1420.eqiad.wmnet,mw1407.eqiad.wmnet as videoscalers - T368058
Change #1049128 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):
[operations/puppet@production] appserver: Remove all canaries
Change #1049128 merged by Clément Goubert:
[operations/puppet@production] appserver: Remove all canaries
Mentioned in SAL (#wikimedia-operations) [2024-06-24T12:06:26Z] <claime> Setting all legacy appservers to inactive - T368058
Mentioned in SAL (#wikimedia-operations) [2024-06-24T12:07:39Z] <claime> Setting all legacy api_appservers to inactive - T368058
Mentioned in SAL (#wikimedia-operations) [2024-06-24T12:09:06Z] <claime> Downtiming all legacy api_appserver and appserver - T368058
Icinga downtime and Alertmanager silence (ID=e16c1946-af75-4fcc-a482-ad56961f3c0b) set by cgoubert@cumin1002 for 21 days, 0:00:00 on 31 host(s) and their services with reason: Waiting for reimage to kubernetes
mw[2268-2277,2307,2309,2365,2392-2393,2432-2433,2438-2439,2441].codfw.wmnet,mw[1364-1366,1373,1413,1417-1418].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet
Remaining pooled servers:
cgoubert@cumin1002:~$ sudo confctl select 'cluster=(api_appserver|appserver)' get | grep yes
{"mw2276.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2299.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw1364.eqiad.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=eqiad,cluster=appserver,service=nginx"}
{"mw1398.eqiad.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=eqiad,cluster=api_appserver,service=nginx"}
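The remaining-pooled check above can also be done programmatically. A minimal sketch, assuming the JSON-per-line format confctl emits; the first sample line is taken from the output above, the second (mw9999, "inactive") is made up for contrast:

```shell
# Count hosts still pooled in a saved confctl dump.
cat > /tmp/confctl.out <<'EOF'
{"mw2276.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw9999.codfw.wmnet": {"weight": 30, "pooled": "inactive"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
EOF
# grep -c counts matching lines; only pooled=yes hosts match.
grep -c '"pooled": "yes"' /tmp/confctl.out   # prints 1
```

Matching on the literal `"pooled": "yes"` key/value pair avoids the false positives a bare `grep yes` could hit elsewhere in the tags.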