Migrate orchestrator to Trixie
Closed, Resolved (Public)

Description

dborch1001.wikimedia.org is our primary host for orchestrator.
It needs to be migrated to Trixie or Bookworm.

As orchestrator is a critical piece of infra, we should probably spin up a different VM (dborch1002), do the migration there, and then switch over.
This will require grants for the new host across the fleet, but that's okay: once the migration is done, we can delete the old host's grants.

Orchestrator will probably need to be recompiled for Bookworm:

root@dborch1001:~# dpkg -l | grep orchestrator
ii  orchestrator                         3.2.6-3                            amd64        service web+cli
ii  orchestrator-client                  3.2.6-3                            amd64        client script

And I'd anticipate that golang will need the same.

We should probably give this more priority than other ongoing things, for obvious reasons.

Steps:

  • deploy VM
  • deploy orchestrator packages
  • set orchestrator_srv grants on db1215
  • set orchestrator grants on all instances
  • flip active/standby
  • test discovery
  • configure orchestrator.wikimedia.org CNAME
  • stop dborch1001 in Puppet
  • remove grants for dborch1001 in Puppet - see T416582#11676968
  • remove grants for orchestrator@dborch1001 on databases - see T416582#11676958
  • remove grants for orchestrator_srv@dborch1001 on db1215
  • deprovision dborch1001
  • remove dborch1001 from Puppet https://gerrit.wikimedia.org/r/c/operations/puppet/+/1278452

Event Timeline

Let's discuss this on Monday @FCeratto-WMF

This comment was removed by Marostegui.

Change #1237867 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/debs/orchestrator@master] Repackage for Bookworm

https://gerrit.wikimedia.org/r/1237867

Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1003 for host dborch1002.wikimedia.org with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1003 for host dborch1002.wikimedia.org with OS trixie executed with errors:

  • dborch1002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console dborch1002.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1003 for host dborch1002.wikimedia.org with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1003 for host dborch1002.wikimedia.org with OS trixie executed with errors:

  • dborch1002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console dborch1002.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Change #1238300 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] site.pp: Add dborch1002 as "insetup"

https://gerrit.wikimedia.org/r/1238300

Change #1238300 merged by Federico Ceratto:

[operations/puppet@production] site.pp: Add dborch1002 as "insetup"

https://gerrit.wikimedia.org/r/1238300

Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1003 for host dborch1002.wikimedia.org with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1003 for host dborch1002.wikimedia.org with OS trixie completed:

  • dborch1002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602101348_fceratto_1393016_dborch1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1238381 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] site.pp: Switch dborch1002 to orchestrator role

https://gerrit.wikimedia.org/r/1238381

Change #1238381 merged by Federico Ceratto:

[operations/puppet@production] site.pp: Switch dborch1002 to orchestrator role

https://gerrit.wikimedia.org/r/1238381

Change #1238750 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] mariadb: Allow orchestrator topology discovery

https://gerrit.wikimedia.org/r/1238750

Change #1238750 merged by Federico Ceratto:

[operations/puppet@production] mariadb: Allow orchestrator topology discovery

https://gerrit.wikimedia.org/r/1238750

Following our IRC chat, there must be some FW in between dborch1002 and db1215:

[09:59:36] marostegui@dborch1002:~$ sudo telnet db1215.eqiad.wmnet 3306
Trying 10.64.16.90...

In addition to that, remember to check and, if needed, add grants on db1215 so dborch1002 can run stuff there.

Change #1239328 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] ferm.pp: Adding dborch1002 so it can connect to db1215

https://gerrit.wikimedia.org/r/1239328

Change #1239328 merged by Marostegui:

[operations/puppet@production] ferm.pp: Adding dborch1002 so it can connect to db1215

https://gerrit.wikimedia.org/r/1239328

@FCeratto-WMF fixed the issue you were having with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1239328

root@dborch1002:~# telnet db1215 3306
Trying 10.64.16.90...
Connected to db1215.
Escape character is '^]'.
GHost '208.80.154.9' is not allowed to connect to this MariaDB serverConnection closed by foreign host.
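
The two telnet outcomes above tell different stories: a hang on "Trying..." points at a firewall dropping packets, while a connection that MariaDB then rejects points at missing grants. A minimal Python sketch of the same reachability check, using only the standard library. The function name and the default timeout are illustrative assumptions, not part of any tooling mentioned in this task:

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify TCP reachability of host:port.

    Returns "open" if the handshake completes, "timeout" if no answer
    arrives in time (typical of a firewall silently dropping packets),
    "refused" if the peer actively rejects the connection, and "error"
    for anything else (for example a DNS failure).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "error"
```

For example, `probe_tcp("db1215.eqiad.wmnet", 3306)` returning "timeout" would match the first telnet attempt, and "open" the second. An "open" result only proves the port is reachable; the MariaDB host-based access check still happens afterwards.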

Right now puppet is giving an error with:

Error: /Stage[main]/Profile::Idp::Client::Httpd/Profile::Idp::Client::Httpd::Site[orchestrator.wikimedia.org]/Acme_chief::Cert[orchestrator]/File[/etc/acmecerts/orchestrator]: Failed to generate additional resources using 'eval_generate': Error 403 on SERVER: access denied
Error: /Stage[main]/Profile::Idp::Client::Httpd/Profile::Idp::Client::Httpd::Site[orchestrator.wikimedia.org]/Acme_chief::Cert[orchestrator]/File[/etc/acmecerts/orchestrator]: Could not evaluate: Could not retrieve file metadata for puppet://acmechief2002.codfw.wmnet/acmedata/orchestrator: Error 403 on SERVER: access denied
Notice: /Stage[main]/Httpd/Service[apache2]: Dependency File[/etc/acmecerts/orchestrator] has failures: true
Warning: /Stage[main]/Httpd/Service[apache2]: Skipping because of failed dependencies
Warning: /Stage[main]/Httpd/Systemd::Override[apache2-after-network-online-target]/Systemd::Unit[apache2-apache2-after-network-online-target]/Exec[systemd daemon-reload for apache2.service (apache2-apache2-after-network-online-target)]: Skipping because of failed dependencies
Info: Stage[main]: Unscheduling all events on Stage[main]
Notice: Applied catalog in 18.35 seconds

Which is probably because of what I mentioned at: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1238750/comments/d611264a_049da495

So I've reverted https://gerrit.wikimedia.org/r/c/operations/puppet/+/1239328 so you can push that one along with the change to hieradata/common/profile/acme_chief.yaml - so I don't leave puppet in a weird state for the weekend, but at least you know now what is needed to allow that connection.

Change #1239606 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] mariadb: allow dborch1002 to db1215 connection

https://gerrit.wikimedia.org/r/1239606

Change #1239606 merged by Federico Ceratto:

[operations/puppet@production] mariadb: allow dborch1002 to db1215 connection

https://gerrit.wikimedia.org/r/1239606

Change #1239720 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] acme_chief: remove old dborch1001 node

https://gerrit.wikimedia.org/r/1239720

Change #1239720 merged by Federico Ceratto:

[operations/puppet@production] acme_chief: remove old dborch1001 node

https://gerrit.wikimedia.org/r/1239720

Change #1239725 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/dns@master] orchestrator: switch orchestrator.w.o to dborch1002

https://gerrit.wikimedia.org/r/1239725

Change #1239725 merged by Federico Ceratto:

[operations/dns@master] orchestrator: switch orchestrator.w.o to dborch1002

https://gerrit.wikimedia.org/r/1239725

What is the status? I see no update to the task after yesterday's operations/tests? Most of orchestrator works, but some sections like pc7 and pc8 are showing "seen 15h".

The setup is ongoing. I updated the grant management script to apply orchestrator grants for dborch1002 on the databases and switched the orchestrator CNAME to serve dborch1002. The new instance seems to be in good health in general and most sections are showing up healthy. I'm investigating sections pc7 and pc8 that are still showing red and the s* section showing amber due to "StatementAndRowLoggingReplicasStructureWarning".

The issue with pc7 and pc8 is fixed. Also the replica structure warning looks benign: orchestrator is detecting that some sections are configured with ROW, STATEMENT or MIXED replication across different hosts.

Yeah, that is to be expected. It's been the same with dborch1001

As a test, I'd suggest you delete pc7 and try to discover it again.

Deleting and re-adding the pc7 and test-s4 sections works, although I'm still seeing an error in the logs:
Feb 17 12:00:02 dborch1002 orchestrator[2102046]: ReadTopologyInstance(pc2017.codfw.wmnet:3306) show slave hosts: Error 1227: Access denied; you need (at least one of) the REPLICATION MASTER ADMIN privilege(s) for this operation
(yet the section has the same grants as others)

I was going to take a look but I noticed that orchestrator-client isn't installed.

root@dborch1001:~# dpkg -l | grep orchestr
ii  orchestrator                         3.2.6-3                            amd64        service web+cli
ii  orchestrator-client                  3.2.6-3                            amd64        client script




root@dborch1002:~# dpkg -l | grep orchestr
ii  orchestrator                         3.2.6-4                              amd64        service web+cli

Can we get it installed?

I found the issue:
MariaDB 10.5+ requires the REPLICATION MASTER ADMIN privilege for SHOW SLAVE HOSTS, which was not included in the orchestrator user's grant template (I will add it now).
dborch1001 was running simultaneously and connecting to the topology hosts without the updated grant, causing the errors to persist even after the grant was fixed on the hosts being tested.

I added the grants to pc7 and pc8 and rediscovered those sections and the error stopped showing on dborch1002 (I stopped dborch1001).

Change #1239927 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] orchestrator.sql.erb: Add replication master admin to orchestrator

https://gerrit.wikimedia.org/r/1239927

I installed orchestrator-client as well.
We'll need to deploy a different set of grants based on the MariaDB version (unless there is a set of grants that works consistently across all versions), so I might have to tweak the grant management tool to become version-aware.

Perhaps we should also consider if we want to remove some of the grants as we perform changes using cookbooks?
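
A version-aware grant step could look roughly like the sketch below. It assumes the only version-dependent privilege is the one identified in this thread (REPLICATION MASTER ADMIN for SHOW SLAVE HOSTS on MariaDB 10.5+), and it takes the "10.5+" threshold from the comment above rather than from a verified changelog. Both function names are hypothetical and this is not the actual grant management tool:

```python
import re
from typing import List, Tuple


def parse_mariadb_version(version: str) -> Tuple[int, int, int]:
    """Parse a server version string such as '10.11.6-MariaDB-log'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def extra_orchestrator_privileges(version: str) -> List[str]:
    """Return privileges to add on top of the base orchestrator grants.

    Assumption: 10.5 is the cut-off, as stated in this task.
    """
    major, minor, _patch = parse_mariadb_version(version)
    if (major, minor) >= (10, 5):
        return ["REPLICATION MASTER ADMIN"]
    return []
```

The version string would typically come from `SELECT VERSION()` on each instance. Keeping the version check in one small function means the template itself stays unchanged when, as here, the whole fleet runs a single version.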

That shouldn't be needed, all hosts are running 10.11, so you can go ahead and merge the patch I sent and deploy the new grant.

Change #1239927 merged by Federico Ceratto:

[operations/puppet@production] orchestrator.sql.erb: Add replication master admin to orchestrator

https://gerrit.wikimedia.org/r/1239927

orchestrator is running and not showing grant errors anymore. I updated the grant manager script at https://gitlab.wikimedia.org/ladsgroup/db-password-rotation/-/merge_requests/3 and sent you the output in a private paste. Later on we can recheck and streamline the grants across hosts and puppet.

I'd suggest we stop orchestrator on dborch1001 (disable notifications and disable the unit, otherwise puppet will start it) and leave it down for a week or two before decommissioning.

Change #1240228 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] orchestrator: disable service on dborch1001

https://gerrit.wikimedia.org/r/1240228

@FCeratto-WMF I am not sure if things are working properly. I hit "forget" on db2230 (test-s4 - which was broken anyway) and I reconnected it to db-test1003 (the current master) and it is not showing up in orchestrator (it should automatically show up after a few mins).
And I saw this in the logs:

2026-02-18T13:34:26.737755+00:00 dborch1002 orchestrator[2114906]: 2026-02-18 13:34:26 ERROR ReadTopologyInstance(db2230.codfw.wmnet:3306) show global status like 'Uptime': dial tcp 10.192.21.15:3306: i/o timeout
2026-02-18T13:34:26.737868+00:00 dborch1002 orchestrator[2114906]: ReadTopologyInstance(db2230.codfw.wmnet:3306) show global status like 'Uptime': dial tcp 10.192.21.15:3306: i/o timeout

Firewall rule somewhere for core hosts?:

root@dborch1002:~# telnet db2230.codfw.wmnet 3306
Trying 10.192.21.15...

Puppet has been disabled on db2230 since Feb 9 (primary switchover in test-s4 (no Phab task) - fceratto@cumin1003), so the firewall change to allow dborch1002 was never applied there.

Change #1240228 merged by Federico Ceratto:

[operations/puppet@production] orchestrator: disable service on dborch1001

https://gerrit.wikimedia.org/r/1240228

@FCeratto-WMF what do you plan to do with this? We shouldn't have puppet disabled for so long on hosts.

The testbed has been used for the ongoing work on the new cookbook in T373436 and for the switchover helper https://gitlab.wikimedia.org/repos/sre/wmfmariadbpy/-/merge_requests/17 for T409926 - the latter is quite "invasive" requiring puppet changes, CRs, monitoring change etc so the testbed has been running sometimes with inconsistent or broken replication as part of the tests.
I was planning to do more tests for T409926 when time allows, but if we are not happy with keeping puppet disabled I can reset the testbed and enable puppet.

Regarding dborch1002, I'm focusing on making sure it monitors production hosts correctly and letting it "bake in" for a couple of days before deprovisioning dborch1001.

I restored the replication on the testbed section and enabled puppet on the hosts. More details in https://phabricator.wikimedia.org/T373436

FCeratto-WMF renamed this task from "Migrate dborch1001 to Bookworm" to "Migrate orchestrator to Trixie". Feb 23 2026, 1:52 PM
FCeratto-WMF updated the task description.

I updated the task title for clarity as we are now running Trixie.
I shut down orchestrator on dborch1001 manually as puppet did not shut it down.
I used orchestrator on dborch1002 to monitor changes while testing https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1238368 and it was responsive, as expected.

Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1003 for host dborch1003.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1003 for host dborch1003.eqiad.wmnet with OS trixie executed with errors:

  • dborch1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console dborch1003.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

I installed orchestrator-client as well.

This needs to be in Puppet, not manually installed?

+1

Change #1244710 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] orchestrator: install orchestrator-client

https://gerrit.wikimedia.org/r/1244710

Change #1244710 merged by Federico Ceratto:

[operations/puppet@production] orchestrator: install orchestrator-client

https://gerrit.wikimedia.org/r/1244710

After last Tuesday's incident I quickly checked how many hosts still have the grants for the old dborch1001 user, and there are not many:

pc2017.codfw.wmnet
pc2018.codfw.wmnet
pc1018.eqiad.wmnet

I've dropped the grant from there - a final check should still be performed to make sure it is gone everywhere

Change #1248418 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] orchestrator.sql.erb: Remove grants from dborch1001

https://gerrit.wikimedia.org/r/1248418

Change #1248418 merged by Marostegui:

[operations/puppet@production] orchestrator.sql.erb: Remove grants from dborch1001

https://gerrit.wikimedia.org/r/1248418

I've pushed: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1248418
Pending: remove it from the rest of .sql.erb and add the grants for dborch1003 on those per-role .sql.erb

Blocked on finishing dborch1003 and the completion of this task: T317179

Mentioned in SAL (#wikimedia-operations) [2026-04-15T10:37:18Z] <fceratto@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 365 days, 0:00:00 on dborch1001.wikimedia.org with reason: T416582

Can we go ahead and decom dborch1001? The current version on 1002 seems to be working fine and I ran across dborch1001 as part of https://phabricator.wikimedia.org/T422596

We discussed the next steps and we are going to decommission it shortly.

Change #1278452 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] common, site, ferm: Remove dborch1001

https://gerrit.wikimedia.org/r/1278452

cookbooks.sre.hosts.decommission executed by fceratto@cumin1003 for hosts: dborch1001.wikimedia.org

  • dborch1001.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet server and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

@FCeratto-WMF after today's s5 switchover, dborch1002 started alerting with:

[07:24:31]  <+icinga-wm> PROBLEM - orchestrator resolve cache non-FQDNs on dborch1002 is CRITICAL: CRITICAL: 2 non-FQDN entries in orchestrator resolve cache: https://wikitech.wikimedia.org/wiki/Orchestrator

That's usually resolved by:

[05:20:58] marostegui@dborch1002:~$ sudo orchestrator -c reset-hostname-resolve-cache
2026-04-29 05:21:01 DEBUG Connected to orchestrator backend: orchestrator_srv:?@tcp(db1215.eqiad.wmnet:3306)/orchestrator?timeout=1s&readTimeout=30s&rejectReadOnly=false&interpolateParams=true
2026-04-29 05:21:01 DEBUG Orchestrator pool SetMaxOpenConns: 128
2026-04-29 05:21:01 DEBUG Initializing orchestrator
2026-04-29 05:21:01 INFO Connecting to backend db1215.eqiad.wmnet:3306: maxConnections: 128, maxIdleConns: 32
hostname resolve cache cleared

However after a few minutes it alerts again.
I am not sure whether something has changed on 1002, but we should probably either:

  1. ack this and just leave it there as we know 1002 will be decommissioned at some point
  2. check further.

I will leave this up to you

dborch1003 is also showing the same error, so it might need investigation

I'd suggest fixing it on both 1002 and 1003, in case it is making both fail, as they read from the same backend.

I ran this and it fixed the issue.

I've been looking at the hostname_unresolve, hostname_resolve and database_instance tables, but they only had FQDNs, not bare hostnames.

The logs show something is calling /api/tag/... without FQDN but it's then resolved:

Apr 28 05:09:14 dborch1003 orchestrator[19108]: [martini] Started GET /api/tag/db1173/3306 for 127.0.0.1:59854
Apr 28 05:09:14 dborch1003 orchestrator[19108]: 2026-04-28 05:09:14 DEBUG Hostname unresolved yet: db1173
Apr 28 05:09:14 dborch1003 orchestrator[19108]: 2026-04-28 05:09:14 DEBUG Cache hostname resolve db1173 as db1173.eqiad.wmnet
Apr 28 05:09:14 dborch1003 orchestrator[19108]: [martini] Completed 200 OK in 6.967703ms
Apr 28 11:17:20 dborch1003 orchestrator[19108]: [martini] Started GET /api/untag/db2192/3306 for 127.0.0.1:39786
Apr 28 11:17:20 dborch1003 orchestrator[19108]: 2026-04-28 11:17:20 DEBUG Hostname unresolved yet: db2192
Apr 28 11:17:20 dborch1003 orchestrator[19108]: 2026-04-28 11:17:20 DEBUG Cache hostname resolve db2192 as db2192.codfw.wmnet
Apr 28 11:17:20 dborch1003 orchestrator[19108]: [martini] Completed 200 OK in 8.889662ms
Apr 28 11:17:20 dborch1003 orchestrator[19108]: auditType:delete-instance-tag instance:db2192.codfw.wmnet:3306 cluster:db1230.eqiad.wmnet:3306 message:name=candidate
Apr 28 11:17:25 dborch1003 orchestrator[19108]: [martini] Started GET /api/tag/db2213/3306 for 127.0.0.1:39268
Apr 28 11:17:25 dborch1003 orchestrator[19108]: 2026-04-28 11:17:25 DEBUG Hostname unresolved yet: db2213
Apr 28 11:17:25 dborch1003 orchestrator[19108]: 2026-04-28 11:17:25 DEBUG Cache hostname resolve db2213 as db2213.codfw.wmnet
Apr 28 11:17:25 dborch1003 orchestrator[19108]: [martini] Completed 200 OK in 9.700992ms
Apr 29 05:12:11 dborch1003 orchestrator[19108]: [martini] Started GET /api/untag/db1210/3306 for 127.0.0.1:42758
Apr 29 05:12:11 dborch1003 orchestrator[19108]: 2026-04-29 05:12:11 DEBUG Hostname unresolved yet: db1210
Apr 29 05:12:11 dborch1003 orchestrator[19108]: 2026-04-29 05:12:11 DEBUG Cache hostname resolve db1210 as db1210.eqiad.wmnet
Apr 29 05:12:11 dborch1003 orchestrator[19108]: [martini] Completed 200 OK in 8.947804ms
Apr 29 05:12:11 dborch1003 orchestrator[19108]: auditType:delete-instance-tag instance:db1210.eqiad.wmnet:3306 cluster:db1210.eqiad.wmnet:3306 message:name=candidate
Apr 29 05:12:15 dborch1003 orchestrator[19108]: [martini] Started GET /api/tag/db1230/3306 for 127.0.0.1:59020
Apr 29 05:12:15 dborch1003 orchestrator[19108]: 2026-04-29 05:12:15 DEBUG Hostname unresolved yet: db1230
Apr 29 05:12:15 dborch1003 orchestrator[19108]: 2026-04-29 05:12:15 DEBUG Cache hostname resolve db1230 as db1230.eqiad.wmnet
Apr 29 05:12:15 dborch1003 orchestrator[19108]: [martini] Completed 200 OK in 6.459455ms
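
To find which callers pass bare hostnames, the log lines above can be filtered mechanically. A minimal Python sketch, assuming the `/api/<command>/<host>/<port>` path shape seen in this excerpt (other orchestrator endpoints may be laid out differently) and treating any host without a dot as a non-FQDN. The function name is hypothetical:

```python
import re
from typing import Iterable, List, Tuple

# Matches request lines such as:
#   [martini] Started GET /api/tag/db1173/3306 for 127.0.0.1:59854
API_CALL = re.compile(
    r"Started GET /api/(?P<cmd>[^/\s]+)/(?P<host>[^/\s]+)/(?P<port>\d+)\b"
)


def short_hostname_calls(lines: Iterable[str]) -> List[Tuple[str, str, int]]:
    """Return (command, host, port) for API calls whose host has no dot."""
    hits = []
    for line in lines:
        match = API_CALL.search(line)
        if match and "." not in match.group("host"):
            hits.append(
                (match.group("cmd"), match.group("host"), int(match.group("port")))
            )
    return hits
```

Feeding it the journal output for the orchestrator unit would list every tag/untag call made with a short name, which can then be matched by timestamp against the switchover tooling.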

Let's stop orchestrator on one of the orchestrator hosts, to avoid having two instances writing to the same backend, until we've finally migrated behind the CDN.

dborch1003 was in standby mode (dborch1002 being active), so it should not write to the shared DB, but I stopped the service for now anyway.

Mentioned in SAL (#wikimedia-operations) [2026-05-05T13:22:04Z] <fceratto@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on dborch1002.wikimedia.org with reason: T416582

Change #1278452 merged by Federico Ceratto:

[operations/puppet@production] common, site, ferm: Remove dborch1001

https://gerrit.wikimedia.org/r/1278452

FCeratto-WMF updated the task description.
FCeratto-WMF moved this task from Blocked to Done on the DBA board.