Migrate dborch1001 to Bookworm
Open, High, Public

Description

dborch1001.wikimedia.org is our primary host for orchestrator.
It needs to be migrated to Bookworm.

As orchestrator is a critical piece of infra, we should probably spin up a different VM (dborch1002), do the migration there, and then switch over.
This will require deploying grants for the new host across the fleet, but that's okay; once the migration is done, we can delete the ones that are no longer needed.
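
For illustration, the temporary grant on each database host could look roughly like this (a sketch only: the exact user name, host pattern, password handling, and privilege list live in the real grant template and are assumed here):

# Sketch: let the orchestrator user connect from the new host, mirroring
# whatever dborch1001 already has. Privileges below follow orchestrator's
# documented requirements; names and the password are placeholders.
sudo mysql -e "
  CREATE USER IF NOT EXISTS 'orchestrator'@'dborch1002.wikimedia.org'
    IDENTIFIED BY '<password>';
  GRANT SUPER, PROCESS, REPLICATION SLAVE, RELOAD ON *.*
    TO 'orchestrator'@'dborch1002.wikimedia.org';
  GRANT SELECT ON mysql.slave_master_info
    TO 'orchestrator'@'dborch1002.wikimedia.org';"

# Once the migration is done, the obsolete grant can be dropped:
sudo mysql -e "DROP USER 'orchestrator'@'dborch1001.wikimedia.org';"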

This will probably need to be recompiled on Bookworm:

root@dborch1001:~# dpkg -l | grep orchestrator
ii  orchestrator                         3.2.6-3                            amd64        service web+cli
ii  orchestrator-client                  3.2.6-3                            amd64        client script

And I'd anticipate that golang will need updating too.
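
For the rebuild itself, something along these lines should work (a sketch assuming stock Debian tooling; the actual WMF packaging pipeline, version suffix, and distribution name may differ):

# Sketch: rebuild the orchestrator source package in a Bookworm chroot.
sbuild-createchroot bookworm /srv/chroot/bookworm-amd64 http://deb.debian.org/debian
git clone "https://gerrit.wikimedia.org/r/operations/debs/orchestrator"
cd orchestrator
# Bump the version and target distribution (suffix/distro are assumptions):
dch --local '~bookworm1' -D bookworm-wikimedia "Rebuild for Bookworm"
sbuild -d bookworm --arch=amd64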

We should probably give this higher priority than other ongoing things, for obvious reasons.

Event Timeline

Let's discuss this on Monday, @FCeratto-WMF.

This comment was removed by Marostegui.

Change #1237867 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/debs/orchestrator@master] Repackage for Bookworm

https://gerrit.wikimedia.org/r/1237867

Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1003 for host dborch1002.wikimedia.org with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1003 for host dborch1002.wikimedia.org with OS trixie executed with errors:

  • dborch1002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console dborch1002.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1003 for host dborch1002.wikimedia.org with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1003 for host dborch1002.wikimedia.org with OS trixie executed with errors:

  • dborch1002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console dborch1002.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Change #1238300 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] site.pp: Add dborch1002 as "insetup"

https://gerrit.wikimedia.org/r/1238300

Change #1238300 merged by Federico Ceratto:

[operations/puppet@production] site.pp: Add dborch1002 as "insetup"

https://gerrit.wikimedia.org/r/1238300

Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1003 for host dborch1002.wikimedia.org with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1003 for host dborch1002.wikimedia.org with OS trixie completed:

  • dborch1002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202602101348_fceratto_1393016_dborch1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1238381 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] site.pp: Switch dborch1002 to orchestrator role

https://gerrit.wikimedia.org/r/1238381

Change #1238381 merged by Federico Ceratto:

[operations/puppet@production] site.pp: Switch dborch1002 to orchestrator role

https://gerrit.wikimedia.org/r/1238381

Change #1238750 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] mariadb: Allow orchestrator topology discovery

https://gerrit.wikimedia.org/r/1238750

Change #1238750 merged by Federico Ceratto:

[operations/puppet@production] mariadb: Allow orchestrator topology discovery

https://gerrit.wikimedia.org/r/1238750

Following our IRC chat, there must be some firewall in between dborch1002 and db1215:

[09:59:36] marostegui@dborch1002:~$ sudo telnet db1215.eqiad.wmnet 3306
Trying 10.64.16.90...

In addition to that, remember to check, and add if needed, the grants on db1215 so dborch1002 can run queries there.
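
Something like the following would confirm both halves (a sketch; the host names come from this task, everything else is assumed):

# From dborch1002: does the port answer at all? (firewall check)
telnet db1215.eqiad.wmnet 3306

# On db1215: is there a grant matching the new host? The user name and
# host pattern are assumptions for illustration.
sudo mysql -e "SELECT User, Host FROM mysql.user WHERE User = 'orchestrator';"
sudo mysql -e "SHOW GRANTS FOR 'orchestrator'@'dborch1002.wikimedia.org';"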

Change #1239328 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] ferm.pp: Adding dborch1002 so it can connect to db1215

https://gerrit.wikimedia.org/r/1239328

Change #1239328 merged by Marostegui:

[operations/puppet@production] ferm.pp: Adding dborch1002 so it can connect to db1215

https://gerrit.wikimedia.org/r/1239328

@FCeratto-WMF I fixed the issue you were having with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1239328:

root@dborch1002:~# telnet db1215 3306
Trying 10.64.16.90...
Connected to db1215.
Escape character is '^]'.
Host '208.80.154.9' is not allowed to connect to this MariaDB server
Connection closed by foreign host.

Right now puppet is failing with:

Error: /Stage[main]/Profile::Idp::Client::Httpd/Profile::Idp::Client::Httpd::Site[orchestrator.wikimedia.org]/Acme_chief::Cert[orchestrator]/File[/etc/acmecerts/orchestrator]: Failed to generate additional resources using 'eval_generate': Error 403 on SERVER: access denied
Error: /Stage[main]/Profile::Idp::Client::Httpd/Profile::Idp::Client::Httpd::Site[orchestrator.wikimedia.org]/Acme_chief::Cert[orchestrator]/File[/etc/acmecerts/orchestrator]: Could not evaluate: Could not retrieve file metadata for puppet://acmechief2002.codfw.wmnet/acmedata/orchestrator: Error 403 on SERVER: access denied
Notice: /Stage[main]/Httpd/Service[apache2]: Dependency File[/etc/acmecerts/orchestrator] has failures: true
Warning: /Stage[main]/Httpd/Service[apache2]: Skipping because of failed dependencies
Warning: /Stage[main]/Httpd/Systemd::Override[apache2-after-network-online-target]/Systemd::Unit[apache2-apache2-after-network-online-target]/Exec[systemd daemon-reload for apache2.service (apache2-apache2-after-network-online-target)]: Skipping because of failed dependencies
Info: Stage[main]: Unscheduling all events on Stage[main]
Notice: Applied catalog in 18.35 seconds

Which is probably because of what I mentioned at: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1238750/comments/d611264a_049da495

So I've reverted https://gerrit.wikimedia.org/r/c/operations/puppet/+/1239328 so you can re-submit it along with the change to hieradata/common/profile/acme_chief.yaml. That way I don't leave puppet in a weird state for the weekend, but at least you now know what is needed to allow that connection.
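
Once the acme_chief.yaml change is merged, a quick sanity check on dborch1002 could be (a sketch; run-puppet-agent is the usual WMF wrapper, a plain puppet agent -t would do the same):

# Re-run puppet and confirm the certificate material is now fetched.
sudo run-puppet-agent
ls -l /etc/acmecerts/orchestrator/
sudo systemctl status apache2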

Change #1239606 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] mariadb: allow dborch1002 to db1215 connection

https://gerrit.wikimedia.org/r/1239606

Change #1239606 merged by Federico Ceratto:

[operations/puppet@production] mariadb: allow dborch1002 to db1215 connection

https://gerrit.wikimedia.org/r/1239606

Change #1239720 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] acme_chief: remove old dborch1001 node

https://gerrit.wikimedia.org/r/1239720

Change #1239720 merged by Federico Ceratto:

[operations/puppet@production] acme_chief: remove old dborch1001 node

https://gerrit.wikimedia.org/r/1239720

Change #1239725 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/dns@master] orchestrator: switch orchestrator.w.o to dborch1002

https://gerrit.wikimedia.org/r/1239725

Change #1239725 merged by Federico Ceratto:

[operations/dns@master] orchestrator: switch orchestrator.w.o to dborch1002

https://gerrit.wikimedia.org/r/1239725

What is the status? I see no update to the task after yesterday's operations/tests. Most of orchestrator works, but some sections like pc7 and pc8 are showing "seen 15h".

The setup is ongoing. I updated the grant management script to apply orchestrator grants for dborch1002 on the databases, and switched the orchestrator CNAME to serve dborch1002. The new instance seems to be in good health overall and most sections are showing up healthy. I'm investigating sections pc7 and pc8, which are still showing red, and the s* sections, which are showing amber due to "StatementAndRowLoggingReplicasStructureWarning".
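
The DNS half of that switchover can be verified with dig, for example (record type assumed to be a CNAME per the comment above):

# Confirm orchestrator.wikimedia.org now resolves to the new host.
dig +short orchestrator.wikimedia.org CNAME
# Expected output (assumption based on this task): dborch1002.wikimedia.org.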

The issue with pc7 and pc8 is fixed. Also, the replica structure warning looks benign: orchestrator is detecting that some sections are configured with a mix of ROW, STATEMENT and MIXED replication formats across different hosts.
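
A quick way to see the mixed formats orchestrator is complaining about (a sketch; the host list is illustrative and would really come from the section inventory):

# Compare the binary log format across the replicas of one section.
for h in db1215.eqiad.wmnet db2230.codfw.wmnet; do
    echo -n "$h: "
    ssh "$h" "sudo mysql -NBe 'SELECT @@global.binlog_format;'"
done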

Yeah, that is to be expected. It was the same with dborch1001.

As a test, I'd suggest you delete pc7 and try to discover it again.
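
For reference, that delete/rediscover cycle can be driven through orchestrator's HTTP API (a sketch; localhost:3000 is orchestrator's default listen address, and using pc2017.codfw.wmnet as pc7's instance is an assumption):

# Forget an instance, then ask orchestrator to discover it again.
# /api/forget and /api/discover are standard orchestrator API endpoints.
curl -s http://localhost:3000/api/forget/pc2017.codfw.wmnet/3306
curl -s http://localhost:3000/api/discover/pc2017.codfw.wmnet/3306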

Deleting and re-adding the pc7 and test-s4 sections works, though I'm still seeing an error in the logs:
Feb 17 12:00:02 dborch1002 orchestrator[2102046]: ReadTopologyInstance(pc2017.codfw.wmnet:3306) show slave hosts: Error 1227: Access denied; you need (at least one of) the REPLICATION MASTER ADMIN privilege(s) for this operation
(yet the section has the same grants as the others)

I was going to take a look but I noticed that orchestrator-client isn't installed.

root@dborch1001:~# dpkg -l | grep orchestr
ii  orchestrator                         3.2.6-3                            amd64        service web+cli
ii  orchestrator-client                  3.2.6-3                            amd64        client script

root@dborch1002:~# dpkg -l | grep orchestr
ii  orchestrator                         3.2.6-4                              amd64        service web+cli

Can we get it installed?

> Deleting and re-adding the pc7 and test-s4 sections works, though I'm still seeing an error in the logs:
> Feb 17 12:00:02 dborch1002 orchestrator[2102046]: ReadTopologyInstance(pc2017.codfw.wmnet:3306) show slave hosts: Error 1227: Access denied; you need (at least one of) the REPLICATION MASTER ADMIN privilege(s) for this operation
> (yet the section has the same grants as the others)

I found the issue:
MariaDB 10.5+ requires the REPLICATION MASTER ADMIN privilege for SHOW SLAVE HOSTS, which was not included in the orchestrator user's grant template (I will add it now).
dborch1001 was running simultaneously and connecting to the topology hosts without the updated grant, causing the errors to persist even after the grant was fixed on the hosts being tested.

I added the grants to pc7 and pc8, rediscovered those sections, and the error stopped showing on dborch1002 (I stopped dborch1001).
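
The missing grant itself is a one-liner per affected host, roughly (a sketch matching the explanation above; the user name and host pattern are assumed from the grant template):

# MariaDB 10.5+ gates SHOW SLAVE HOSTS behind REPLICATION MASTER ADMIN.
sudo mysql -e "GRANT REPLICATION MASTER ADMIN ON *.*
               TO 'orchestrator'@'dborch1002.wikimedia.org';"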

Change #1239927 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] orchestrator.sql.erb: Add replication master admin to orchestrator

https://gerrit.wikimedia.org/r/1239927

I installed orchestrator-client as well.
We'll need to deploy a different set of grants based on the MariaDB version (unless there is a set of grants that works consistently across all versions), so I might have to tweak the grant management tool to become version-aware.
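
A minimal sketch of what version-aware deployment could look like (the real logic would live in the grant management tool; the user and host are assumptions):

# Apply the extra 10.5+ privilege only where the server supports it.
ver=$(sudo mysql -NBe "SELECT VERSION();")
case "$ver" in
    10.[5-9].*|10.1[0-9].*|11.*)
        sudo mysql -e "GRANT REPLICATION MASTER ADMIN ON *.*
                       TO 'orchestrator'@'dborch1002.wikimedia.org';"
        ;;
    *)
        echo "pre-10.5 server ($ver): skipping REPLICATION MASTER ADMIN"
        ;;
esac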

Perhaps we should also consider whether we want to remove some of the grants, since we perform changes using cookbooks?

That shouldn't be needed; all hosts are running 10.11, so you can go ahead and merge the patch I sent and deploy the new grant.

Change #1239927 merged by Federico Ceratto:

[operations/puppet@production] orchestrator.sql.erb: Add replication master admin to orchestrator

https://gerrit.wikimedia.org/r/1239927

orchestrator is running and not showing grant errors anymore. I updated the grant manager script at https://gitlab.wikimedia.org/ladsgroup/db-password-rotation/-/merge_requests/3 and sent you the output in a private paste. Later on we can recheck and streamline the grants across hosts and puppet.

I'd suggest we stop orchestrator on dborch1001 (disable notifications and disable the unit, otherwise puppet will start it) and leave it down for a week or two before decommissioning.
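
On dborch1001 that would amount to roughly the following (a sketch; the downtime cookbook name and flags are assumed):

# Stop the service and keep it from coming back on reboot.
sudo systemctl stop orchestrator
sudo systemctl disable orchestrator
# Masking guards against puppet restarting it until the patch below lands:
sudo systemctl mask orchestrator
# Silence alerts for the grace period (from a cumin host; flags assumed):
sudo cookbook sre.hosts.downtime --hours 336 -r "pending decommission" 'dborch1001.wikimedia.org'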

Change #1240228 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] orchestrator: disable service on dborch1001

https://gerrit.wikimedia.org/r/1240228

@FCeratto-WMF I am not sure things are working properly. I hit "forget" on db2230 (test-s4, which was broken anyway) and reconnected it to db-test1003 (the current master), but it is not showing up in orchestrator (it should automatically show up after a few minutes).
And I saw this on the logs:

2026-02-18T13:34:26.737755+00:00 dborch1002 orchestrator[2114906]: 2026-02-18 13:34:26 ERROR ReadTopologyInstance(db2230.codfw.wmnet:3306) show global status like 'Uptime': dial tcp 10.192.21.15:3306: i/o timeout
2026-02-18T13:34:26.737868+00:00 dborch1002 orchestrator[2114906]: ReadTopologyInstance(db2230.codfw.wmnet:3306) show global status like 'Uptime': dial tcp 10.192.21.15:3306: i/o timeout

Is there a firewall rule somewhere for core hosts? The connection just hangs:

root@dborch1002:~# telnet db2230.codfw.wmnet 3306
Trying 10.192.21.15...

Puppet has been disabled on db2230 since Feb 9 (reason: "primary switchover in test-s4 (no Phab task) - fcerrato@cumin1003"), so the firewall change to allow dborch1002 was never applied there.
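
So the fix is simply to re-enable and run puppet there (a sketch; enable-puppet/run-puppet-agent are the usual WMF wrappers, and enable-puppet needs the original disable message):

# On db2230: re-enable puppet so the pending ferm change gets applied.
sudo enable-puppet "primary switchover in test-s4 (no Phab task) - fcerrato@cumin1003"
sudo run-puppet-agent
# Then re-check reachability from dborch1002:
telnet db2230.codfw.wmnet 3306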