Set all appservers to pooled=inactive in scap
Closed, Resolved · Public

Description

  • Reimage a couple of scap proxies to be videoscaler nodes for row balance (currently all eqiad videoscaler nodes are in row D8)
  • Move all scap proxies to videoscaler nodes
  • Remove appserver canaries from scap config
  • Set pooled=inactive for all remaining appservers (except one per cluster, to avoid alerts before the clusters are fully turned down)
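The depool step above can be done with conftool. A minimal sketch, assuming the selector syntax shown later in this task; the hostname kept pooled is illustrative (taken from this task's final pooled list), not the actual procedure used:

```shell
# Sketch only: set all remaining legacy appservers to inactive.
# This selector mirrors the confctl query used at the end of this task.
sudo confctl select 'cluster=(api_appserver|appserver)' set/pooled=inactive

# Re-pool one host per cluster to avoid alerts before full turndown.
# Example host taken from the remaining-pooled list in this task.
sudo confctl select 'name=mw1364.eqiad.wmnet' set/pooled=yes
```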

Event Timeline

Change #1048376 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mediawiki: Reimage scap proxies as videoscalers

https://gerrit.wikimedia.org/r/1048376

Change #1048377 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] scap_proxies: move all proxies to videoscalers

https://gerrit.wikimedia.org/r/1048377

Mentioned in SAL (#wikimedia-operations) [2024-06-24T09:34:10Z] <claime> Reimaging scap::proxies, mediawiki deployments may be unavailable - T368058

Change #1048376 merged by Clément Goubert:

[operations/puppet@production] mediawiki: Reimage scap proxies as videoscalers

https://gerrit.wikimedia.org/r/1048376

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1407.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1420.eqiad.wmnet with OS buster

Change #1049110 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] videoscalers: Pool 2 former appservers

https://gerrit.wikimedia.org/r/1049110

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1420.eqiad.wmnet with OS buster completed:

  • mw1420 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406240959_cgoubert_762668_mw1420.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change #1048377 merged by Clément Goubert:

[operations/puppet@production] scap_proxies: move all proxies to videoscalers

https://gerrit.wikimedia.org/r/1048377

Change #1049110 merged by Clément Goubert:

[operations/puppet@production] videoscalers: Pool 2 former appservers

https://gerrit.wikimedia.org/r/1049110

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1407.eqiad.wmnet with OS buster completed:

  • mw1407 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406241003_cgoubert_762618_mw1407.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-06-24T10:39:19Z] <claime> pooling mw1420.eqiad.wmnet,mw1407.eqiad.wmnet as videoscalers - T368058

Clement_Goubert changed the task status from Open to In Progress. Mon, Jun 24, 10:52 AM
Clement_Goubert triaged this task as Medium priority.
Clement_Goubert updated the task description.

Change #1049128 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] appserver: Remove all canaries

https://gerrit.wikimedia.org/r/1049128

Change #1049128 merged by Clément Goubert:

[operations/puppet@production] appserver: Remove all canaries

https://gerrit.wikimedia.org/r/1049128

Mentioned in SAL (#wikimedia-operations) [2024-06-24T12:06:26Z] <claime> Setting all legacy appservers to inactive - T368058

Mentioned in SAL (#wikimedia-operations) [2024-06-24T12:07:39Z] <claime> Setting all legacy api_appservers to inactive - T368058

Mentioned in SAL (#wikimedia-operations) [2024-06-24T12:09:06Z] <claime> Downtiming all legacy api_appserver and appserver - T368058

Icinga downtime and Alertmanager silence (ID=e16c1946-af75-4fcc-a482-ad56961f3c0b) set by cgoubert@cumin1002 for 21 days, 0:00:00 on 31 host(s) and their services with reason: Waiting for reimage to kubernetes

mw[2268-2277,2307,2309,2365,2392-2393,2432-2433,2438-2439,2441].codfw.wmnet,mw[1364-1366,1373,1413,1417-1418].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet

Remaining pooled servers:

cgoubert@cumin1002:~$ sudo confctl select 'cluster=(api_appserver|appserver)' get | grep yes
{"mw2276.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2299.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw1364.eqiad.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=eqiad,cluster=appserver,service=nginx"}
{"mw1398.eqiad.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=eqiad,cluster=api_appserver,service=nginx"}
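The confctl output above is one JSON object per line; stripping it down to bare hostnames takes a small pipeline. A minimal sketch against two of the lines shown, assuming the output format stays exactly as printed:

```shell
# Two sample lines copied verbatim from the confctl output above.
cat <<'EOF' > /tmp/confctl_out.txt
{"mw2276.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2299.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
EOF

# Keep only pooled=yes entries and print just the hostname key.
grep '"pooled": "yes"' /tmp/confctl_out.txt | sed -E 's/^\{"([^"]+)".*/\1/'
```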