
Switchover m3 master db1159 -> db1119
Closed, Resolved · Public

Description

Databases on m3: phabricator
When: TBD
Impact: Writes will be disabled for around 1 minute.

Failover process

OLD MASTER: db1159

NEW MASTER: db1119

  • Check configuration differences between new and old master

$ pt-config-diff h=db1159.eqiad.wmnet,F=/root/.my.cnf h=db1119.eqiad.wmnet,F=/root/.my.cnf
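pt-config-diff prints any settings that differ between the two servers, so the check can be gated in a script; a minimal sketch, assuming the tool's documented convention of a zero exit status when the configs match:

$ pt-config-diff h=db1159.eqiad.wmnet,F=/root/.my.cnf h=db1119.eqiad.wmnet,F=/root/.my.cnf || echo "differences found - review before proceeding"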

  • Silence alerts on all hosts: sudo cookbook sre.hosts.downtime --hours 1 -r "m3 master switchover T352149" 'A:db-section-m3'
  • Topology changes: move everything under db1119

db-switchover --timeout=15 --only-slave-move db1159.eqiad.wmnet db1119.eqiad.wmnet
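Once the replicas have been moved, it is worth confirming they now hang off db1119; a minimal check, assuming direct MySQL access from the cumin host and that report_host is set on the replicas so they appear in the output:

$ mysql -h db1119.eqiad.wmnet -e "SHOW SLAVE HOSTS;"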

run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg
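After the Puppet run regenerates the proxy config, a quick grep confirms the new master is referenced (assuming the backend is listed by hostname):

$ grep db1119 /etc/haproxy/conf.d/db-master.cfg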

  • Start the failover: !log Failover m3 from db1159 to db1119 - T352149
  • Set phabricator to read-only:
ssh phab1004
    sudo /srv/phab/phabricator/bin/config set cluster.read-only true
    # perform the DB switchover (next step) before re-enabling writes
    sudo /srv/phab/phabricator/bin/config set cluster.read-only false
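The flag can be spot-checked before re-enabling writes; a minimal check, assuming Phabricator's standard config CLI on phab1004:

    sudo /srv/phab/phabricator/bin/config get cluster.read-only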
  • DB switchover

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1159 db1119
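A quick sanity check after the switchover is that the read_only flag flipped on both hosts (the old master should now be read-only, the new one writable); a minimal sketch, assuming direct MySQL access from the cumin host:

$ for h in db1159 db1119; do mysql -h $h.eqiad.wmnet -e "SELECT @@hostname, @@global.read_only;"; done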

  • Reload haproxies
dbproxy1026:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1020:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
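The piped "show stat" output is CSV; to spot-check that the backend is UP and pointing at the new master, the interesting columns can be pulled out (a minimal sketch, assuming the standard haproxy stats CSV layout where fields 1, 2 and 18 are the proxy name, server name and status):

echo "show stat" | socat /run/haproxy/haproxy.sock stdio | cut -d, -f1,2,18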
  • Kill connections on the old master (db1159)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock
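pt-kill with --victims all --match-all terminates every client connection on the local socket, leaving the old master quiesced; afterwards a processlist check should show only replication and system threads (a minimal check, assuming socket access on db1159):

$ mysql -S /run/mysqld/mysqld.sock -e "SHOW PROCESSLIST;"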

  • Run Puppet on the old and new masters (db1159 and db1119) to update pt-heartbeat: sudo cumin 'db1119* or db1159*' 'run-puppet-agent -e "primary switchover T352149"'
  • Check services affected: phabricator
  • Clean the Orchestrator heartbeat table to remove the old master's entry, otherwise Orchestrator will show lag: delete from heartbeat where server_id=171966512;
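The server_id can be confirmed against the old master before deleting its row; a minimal sketch, assuming pt-heartbeat's default heartbeat.heartbeat table and that the delete runs on the new master so it replicates to the rest of the section:

$ mysql -h db1159.eqiad.wmnet -e "SELECT @@server_id;"
$ mysql -h db1119.eqiad.wmnet heartbeat -e "DELETE FROM heartbeat WHERE server_id=171966512;"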

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui added a project: Phabricator.

Just FYI, I will be putting phabricator in RO for a few seconds this week (early in a European morning) to switch over the master.

Mentioned in SAL (#wikimedia-operations) [2023-11-30T05:41:23Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on db[2134,2160].codfw.wmnet,db[1119,1159,1217].eqiad.wmnet with reason: m3 master switchover T352149

Mentioned in SAL (#wikimedia-operations) [2023-11-30T05:41:43Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2134,2160].codfw.wmnet,db[1119,1159,1217].eqiad.wmnet with reason: m3 master switchover T352149

Change 978721 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1119 to m3 master

https://gerrit.wikimedia.org/r/978721

Change 978721 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1119 to m3 master

https://gerrit.wikimedia.org/r/978721

Mentioned in SAL (#wikimedia-operations) [2023-11-30T05:47:24Z] <marostegui> Failover m3 from db1159 to db1119 - T352149

This was done. The RO time was just a few seconds.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1159.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1159.eqiad.wmnet with OS bookworm completed:

  • db1159 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311300608_marostegui_2070516_db1159.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-12-01T05:31:59Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on db[2134,2160].codfw.wmnet,db[1119,1159,1217].eqiad.wmnet with reason: m3 master switchover T352149

Mentioned in SAL (#wikimedia-operations) [2023-12-01T05:32:17Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2134,2160].codfw.wmnet,db[1119,1159,1217].eqiad.wmnet with reason: m3 master switchover T352149