Remove sanitarium hosts from codfw
Closed, ResolvedPublic

Description

Sanitarium hosts in codfw (db2186 and db2187) have never been used. They were originally set up for two reasons:

  1. Set up wikireplicas in codfw
  2. Serve as a failover in case the sanitarium hosts in eqiad failed.

It is pretty clear that #1 will never happen.
In the case of #2, it is a complex process: we would need to dig into the binlogs from each DC and then attempt to move the wikireplicas there. It is a very complex and error-prone change that could easily lead to data corruption, so if the sanitarium hosts broke, we would more likely try to fix them or even reclone them from their sanitarium master rather than risk the whole move to codfw.
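To make the complexity concrete, the failover would involve manually matching binlog coordinates across masters in different DCs, roughly along these lines (a hypothetical illustration, not a runbook; the binlog file name is a placeholder):

    # Capture where the replica stopped relative to its current master...
    sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Relay_Master_Log_File|Exec_Master_Log_Pos'
    # ...then inspect the corresponding binlog on the other DC's master to find
    # an equivalent position (placeholder file name):
    sudo mysqlbinlog --base64-output=decode-rows --verbose mysql-bin.001234 | less

Any mismatch in that mapping silently diverges the replicas, which is where the data-corruption risk comes from.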

Given that, I'd like to remove and repurpose the existing sanitarium hosts in codfw. This would give us two hosts back for production, as those hosts currently do nothing, and it would also get rid of the snowflake setup that the codfw sanitarium masters represent.

I'd like to hear some thoughts from @taavi @fnegri @Ladsgroup @FCeratto-WMF

Event Timeline

Yes, but for now remember that they are production. The test should be safe enough to run, but just keep that in mind.

I ran the cookbook in dry-run mode but that has limited utility. I suggest we move forward with this task removing them from prod use and then run restarts if possible.

You can restart one of them in codfw if you like now.

I could add a flag to filter which hosts to restart (it's meant to restart all hosts by default). I'll get back to this tomorrow.

Yeah, this is definitely needed.

I updated the script; can I run a restart on db2186 now?
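For illustration, the invocation being discussed would look roughly like this (the cookbook name sre.mysql.restart-replicas and the --hosts flag are hypothetical stand-ins; only spicerack's global --dry-run flag is a known feature):

    # Dry-run first to see what the cookbook would do without touching anything.
    sudo cookbook --dry-run sre.mysql.restart-replicas --hosts db2186.codfw.wmnet
    # Then the real run, limited to a single host via the proposed filter flag.
    sudo cookbook sre.mysql.restart-replicas --hosts db2186.codfw.wmnet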

@FCeratto-WMF let's not hijack this task - let's continue the testing conversation on the Gerrit change. This task isn't about testing that cookbook.

Since there are no objections, I will remove sanitarium hosts and will reconvert sanitarium masters to normal replicas in codfw.

Change #1151185 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2*: Remove sanitarium masters

https://gerrit.wikimedia.org/r/1151185

Change #1151185 merged by Marostegui:

[operations/puppet@production] db2*: Remove sanitarium masters

https://gerrit.wikimedia.org/r/1151185

Change #1151198 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Move db2186 to x1

https://gerrit.wikimedia.org/r/1151198

Completed depool of db2191 - Upgrading db2191.codfw.wmnet - marostegui@cumin1002

Upgrade of db2191.codfw.wmnet completed

Change #1151198 merged by Marostegui:

[operations/puppet@production] mariadb: Move db2186 to x1

https://gerrit.wikimedia.org/r/1151198

Started cloning db2191.codfw.wmnet to db2186.codfw.wmnet - marostegui@cumin1002
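The cloning itself is driven by WMF tooling; as a rough generic sketch of what a clone between two MariaDB hosts involves (assuming mariabackup, a stopped target instance, and the /srv datadir layout; this is not the exact cookbook):

    # On the source (db2191), stream a physical backup to the target (db2186).
    mariabackup --backup --stream=xbstream --user=root \
      | ssh db2186.codfw.wmnet 'xbstream -x -C /srv/sqldata.new'
    # On the target, apply the redo log and fix ownership before starting MariaDB.
    mariabackup --prepare --target-dir=/srv/sqldata.new
    chown -R mysql:mysql /srv/sqldata.new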

Change #1151217 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] instances.yaml: Add db2186 to dbctl

https://gerrit.wikimedia.org/r/1151217

Change #1151217 merged by Marostegui:

[operations/puppet@production] instances.yaml: Add db2186 to dbctl

https://gerrit.wikimedia.org/r/1151217

Mentioned in SAL (#wikimedia-operations) [2025-05-27T13:51:42Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Add db2186 to dbctl depooled T394884', diff saved to https://phabricator.wikimedia.org/P76488 and previous config saved to /var/cache/conftool/dbconfig/20250527-135141-marostegui.json
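The dbctl side of that commit is roughly the following (a sketch based on the dbctl commands documented on Wikitech; the instance definition itself comes from the instances.yaml Puppet change above, and the host stays depooled until explicitly pooled):

    # Review the pending configuration change, then commit it with a message.
    sudo dbctl config diff
    sudo dbctl config commit -m 'Add db2186 to dbctl depooled T394884'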

Start pool of db2191 gradually with 4 steps - Pool db2191.codfw.wmnet in after cloning - marostegui@cumin1002

Completed pool of db2191 gradually with 4 steps - Pool db2191.codfw.wmnet in after cloning - marostegui@cumin1002
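The "gradually with 4 steps" pattern repools a host at increasing percentages of its target weight so caches can warm up; a sketch of the equivalent manual commands, assuming dbctl's percentage-based pooling flag:

    # Repool in four stages rather than all at once.
    for pct in 25 50 75 100; do
        sudo dbctl instance db2191 pool -p "$pct"
        sudo dbctl config commit -m "Repool db2191 at ${pct}%"
        sleep 900   # roughly 15 minutes between steps
    done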

db2186 has been converted into an x1 slave - I'll give it till tomorrow to replicate before starting to pool it for the first time.
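Converting the host boils down to standard MariaDB replication setup against the x1 codfw master; a minimal sketch (the master hostname, coordinates, and credentials below are placeholders):

    # Point the host at the x1 master using coordinates captured during the clone.
    sudo mysql -e "CHANGE MASTER TO
        MASTER_HOST='x1-master.placeholder',
        MASTER_LOG_FILE='mysql-bin.001234',
        MASTER_LOG_POS=4,
        MASTER_USER='repl',
        MASTER_PASSWORD='********';
      START SLAVE;"
    # Verify replication is catching up before considering pooling.
    sudo mysql -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master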

Upgrade of db2186.codfw.wmnet completed

Change #1151416 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2186: Enable notifications

https://gerrit.wikimedia.org/r/1151416

Change #1151416 merged by Marostegui:

[operations/puppet@production] db2186: Enable notifications

https://gerrit.wikimedia.org/r/1151416

db2186 being pooled in x1

Change #1151629 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Move db2187 to x3 (s8)

https://gerrit.wikimedia.org/r/1151629

Change #1151629 merged by Marostegui:

[operations/puppet@production] mariadb: Move db2187 to x3 (s8)

https://gerrit.wikimedia.org/r/1151629

Finished cloning db2191.codfw.wmnet to db2186.codfw.wmnet - marostegui@cumin1002

Change #1151631 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] redact_sanitarium.sh: Remove db2186 db2187

https://gerrit.wikimedia.org/r/1151631

Change #1151631 merged by Marostegui:

[operations/puppet@production] redact_sanitarium.sh: Remove db2186 db2187

https://gerrit.wikimedia.org/r/1151631

Started cloning db2242.codfw.wmnet to db2187.codfw.wmnet - marostegui@cumin1002

Completed depool of db2242 - Depool db2242.codfw.wmnet to then clone it to db2187.codfw.wmnet - marostegui@cumin1002

Change #1151634 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] instances.yaml: Add db2187 to dbctl

https://gerrit.wikimedia.org/r/1151634

Change #1151634 merged by Marostegui:

[operations/puppet@production] instances.yaml: Add db2187 to dbctl

https://gerrit.wikimedia.org/r/1151634

Mentioned in SAL (#wikimedia-operations) [2025-05-28T10:20:16Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Add db2187 to dbctl depooled T394884', diff saved to https://phabricator.wikimedia.org/P76565 and previous config saved to /var/cache/conftool/dbconfig/20250528-102015-marostegui.json

Change #1151664 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] check_private_data_report: Remove db2186, db2187

https://gerrit.wikimedia.org/r/1151664

Change #1151664 merged by Marostegui:

[operations/puppet@production] check_private_data_report: Remove db2186, db2187

https://gerrit.wikimedia.org/r/1151664

Start pool of db2242 gradually with 4 steps - Pool db2242.codfw.wmnet in after cloning - marostegui@cumin1002

Change #1151676 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/software@master] events_sanitarium.sql: Remove db2186, db2187

https://gerrit.wikimedia.org/r/1151676

Change #1151676 merged by jenkins-bot:

[operations/software@master] events_sanitarium.sql: Remove db2186, db2187

https://gerrit.wikimedia.org/r/1151676

Completed pool of db2242 gradually with 4 steps - Pool db2242.codfw.wmnet in after cloning - marostegui@cumin1002

Finished cloning db2242.codfw.wmnet to db2187.codfw.wmnet - marostegui@cumin1002

db2186 needs some extra care: <jinxer-wm> FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter@s3.service on db2186:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed

I can have a look later.
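Triaging a failed unit like this one is standard systemd work (a generic sketch, using the unit name from the alert above):

    # Why did the unit fail, and what did it log?
    systemctl status wmf_auto_restart_prometheus-mysqld-exporter@s3.service
    journalctl -u wmf_auto_restart_prometheus-mysqld-exporter@s3.service -n 50
    # Once the root cause is addressed, clear the failed state so the alert resolves.
    sudo systemctl reset-failed wmf_auto_restart_prometheus-mysqld-exporter@s3.service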

Thanks - I will reimage this host.

I believe I fixed db2186, but be aware that the other hosts will require the same kind of cleanup (especially of systemd units and /etc/mysql/mysql.d/) too.
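The cleanup in question is roughly of this shape (a sketch; the per-section unit name is assumed from the alert above, since the host moved from serving s3 as a sanitarium master to serving x1):

    # Disable the stale per-section exporter unit left over from the old role.
    sudo systemctl disable --now prometheus-mysqld-exporter@s3.service
    sudo systemctl reset-failed
    # Then review /etc/mysql/mysql.d/ and remove whichever per-section config
    # files no longer match the host's new role.
    ls /etc/mysql/mysql.d/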

Thank you. db2187 will need it too, so I am thinking it is just easier/cleaner to reimage it while keeping the data.

Up to you, I can fix other hosts quickly, it is just a few commands.

Go for it if you have time! Thanks!

Change #1152046 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] site.pp: Remove codfw sanitarium masters

https://gerrit.wikimedia.org/r/1152046

Change #1152046 merged by Marostegui:

[operations/puppet@production] site.pp: Remove codfw sanitarium masters

https://gerrit.wikimedia.org/r/1152046

Mentioned in SAL (#wikimedia-operations) [2025-05-30T08:18:05Z] <fceratto@cumin1002> dbctl commit (dc=all): 'Depool db2187 for reimaging, see T394884', diff saved to https://phabricator.wikimedia.org/P76702 and previous config saved to /var/cache/conftool/dbconfig/20250530-081804-fceratto.json

Icinga downtime and Alertmanager silence (ID=3f1522fc-3d4f-4682-bff9-5f42de7bfff6) set by fceratto@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Reimaging

db2187.codfw.wmnet
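That downtime corresponds to an invocation along these lines (a sketch of the sre.hosts.downtime cookbook; exact flags may differ):

    # Downtime the host on Icinga and silence it on Alertmanager for a week.
    sudo cookbook sre.hosts.downtime --days 7 -r "Reimaging" 'db2187.codfw.wmnet'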

Change #1152249 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] db2187.yaml: disable notifications for reimage

https://gerrit.wikimedia.org/r/1152249

Change #1152249 merged by Federico Ceratto:

[operations/puppet@production] db2187.yaml: disable notifications for reimage

https://gerrit.wikimedia.org/r/1152249

Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1002 for host db2187.codfw.wmnet with OS bookworm
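The reimage invocation is roughly the following (a sketch inferred from the log line above; flag spelling may differ):

    # Reinstall the host with Debian bookworm, linked to this task.
    sudo cookbook sre.hosts.reimage --os bookworm -t T394884 db2187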

Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1002 for host db2187.codfw.wmnet with OS bookworm completed:

  • db2187 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505301203_fceratto_1341034_db2187.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change #1152262 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] db2187.yaml: Enable notifications after reimage

https://gerrit.wikimedia.org/r/1152262

Change #1152262 merged by Federico Ceratto:

[operations/puppet@production] db2187.yaml: Enable notifications after reimage

https://gerrit.wikimedia.org/r/1152262

Notifications enabled in Puppet, Icinga is green, downtime is gone. Pooling db2187 back in.

Start pool of db2187 gradually with 4 steps - Pooling in after reimage - fceratto@cumin1002

Completed pool of db2187 gradually with 4 steps - Pooling in after reimage - fceratto@cumin1002