Core DB testbed on VMs
Closed, Resolved · Public

Description

Creation of a core DB testbed with 2-3 VMs in eqiad and 2 in codfw, with intra-DC and cross-DC replication just like the real clusters. Use small VMs (1 vCPU, 2 GB of RAM, 10 GB of disk space).

  • Acquire VMs
  • Configure them as MariaDB hosts
  • Set up replication & some tables with a tiny amount of data (see the sketch after this list)
  • Configure them in dbctl with @Ladsgroup when safe
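
For reference, replication between the test VMs follows the same pattern as production: each replica points at its local primary (intra-DC) or at the primary in the other DC (cross-DC). A minimal MariaDB sketch, with placeholder credentials and an assumed topology (db-test1001 as the eqiad primary) rather than the actual production setup:

-- On a replica, point at the primary; host, user and password
-- here are placeholders, not the production values.
CHANGE MASTER TO
  MASTER_HOST = 'db-test1001.eqiad.wmnet',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = '********',
  MASTER_USE_GTID = slave_pos;  -- resume from the recorded GTID position
START SLAVE;
SHOW SLAVE STATUS\G  -- confirm Slave_IO_Running / Slave_SQL_Running = Yes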

Support testing:

  • changes in Puppet configuration, MariaDB, etc.
  • dbctl changes
  • current and future cookbooks
  • OS upgrades
  • automated pool/depool, DC master failover and flip
  • auto_schema schema changes
  • future improvements in monitoring, heartbeat, etc.

Host setup:

| Host        | Decommission | Provisioning | Clone | On Zarcillo | Notes |
|-------------|--------------|--------------|-------|-------------|-------|
| db-test1001 |              |              |       |             |       |
| db-test1002 |              |              |       |             |       |
| db-test1003 |              |              |       |             |       |
| db-test2001 |              |              |       |             |       |
| db-test2002 |              |              |       |             |       |

Event Timeline

Change #1196910 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] site.pp: set role for db-test* hosts

https://gerrit.wikimedia.org/r/1196910

Change #1196910 merged by Federico Ceratto:

[operations/puppet@production] site.pp: set role for db-test* hosts

https://gerrit.wikimedia.org/r/1196910

Change #1197599 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db-test*: Change section

https://gerrit.wikimedia.org/r/1197599

Change #1197599 merged by Marostegui:

[operations/puppet@production] db-test*: Change section

https://gerrit.wikimedia.org/r/1197599

db-test1003 has been reimaged with trixie for T407472

Change #1201684 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/cookbooks@master] mysql.clone.py: support VMs

https://gerrit.wikimedia.org/r/1201684

Started cloning db2230.codfw.wmnet to db-test2001.codfw.wmnet - fceratto@cumin1002

Started cloning db1176.eqiad.wmnet to db2230.codfw.wmnet - fceratto@cumin1003

Started cloning db2230.codfw.wmnet to db-test2001.codfw.wmnet - fceratto@cumin1003

Icinga downtime and Alertmanager silence (ID=93a4d594-fc65-4747-b94f-404af8d0000b) set by fceratto@cumin1003 for 4:00:00 on 1 host(s) and their services with reason: Clone T400056

db2230.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-11-12T14:13:08Z] <fceratto@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2230.codfw.wmnet with reason: Clone T400056

Icinga downtime and Alertmanager silence (ID=7d5d54a9-81eb-4dce-bed5-dd3965b60dd1) set by fceratto@cumin1003 for 4:00:00 on 1 host(s) and their services with reason: Clone T400056

db-test2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-11-12T14:14:07Z] <fceratto@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db-test2001.codfw.wmnet with reason: Clone T400056

VM db-test1001.eqiad.wmnet rebooted by fceratto@cumin1003 with reason: Reboot

VM db-test1002.eqiad.wmnet rebooted by fceratto@cumin1003 with reason: Reboot

VM db-test2002.codfw.wmnet rebooted by fceratto@cumin1003 with reason: Reboot

Icinga downtime and Alertmanager silence (ID=14e5ede4-3b9a-44f0-8c7c-71166ecffcb3) set by fceratto@cumin1003 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Cloning

db-test1001.eqiad.wmnet

Started cloning db2230.codfw.wmnet to db-test1001.eqiad.wmnet - fceratto@cumin1003


cookbooks.sre.hosts.decommission executed by fceratto@cumin1003 for hosts: db-test2001.codfw.wmnet

  • db-test2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox

Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1003 for host db-test2001.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1003 for host db-test2001.codfw.wmnet with OS trixie completed:

  • db-test2001 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202511141634_fceratto_94958_db-test2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Started cloning db2230.codfw.wmnet to db-test2001.codfw.wmnet - fceratto@cumin1003

cookbooks.sre.hosts.decommission executed by fceratto@cumin1003 for hosts: db-test2002.codfw.wmnet

  • db-test2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook, run it manually

ERROR: some step on some host failed, check the bolded items above

Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1003 for host db-test2002.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1003 for host db-test2002.codfw.wmnet with OS trixie completed:

  • db-test2002 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202511171138_fceratto_1006226_db-test2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Started cloning db2230.codfw.wmnet to db-test2002.codfw.wmnet - fceratto@cumin1003

@FCeratto-WMF I fixed the lag on db-test1001 - it was a stale pt-heartbeat entry in the table; deleting it fixed it:

cumin2024@db-test1001.eqiad.wmnet[heartbeat]> select * from heartbeat;
+----------------------------+-----------+------------------------+-----------+-----------------------+---------------------+---------+------------+
| ts                         | server_id | file                   | position  | relay_master_log_file | exec_master_log_pos | shard   | datacenter |
+----------------------------+-----------+------------------------+-----------+-----------------------+---------------------+---------+------------+
| 2025-11-17T14:01:38.000820 | 171966607 | db1176-bin.000389      | 378375792 | NULL                  |                NULL | test-s4 | eqiad      |
| 2025-11-17T14:01:38.001370 | 171970587 | db-test1001-bin.000001 | 217495825 | db1176-bin.000389     |           378375792 | test-s4 | eqiad      |
| 2025-11-14T11:40:50.001650 | 180360463 | db2230-bin.000016      |    976545 | db1176-bin.000389     |           273103275 | test-s4 | codfw      |
+----------------------------+-----------+------------------------+-----------+-----------------------+---------------------+---------+------------+
3 rows in set (0.002 sec)

cumin2024@db-test1001.eqiad.wmnet[heartbeat]> delete from heartbeat where server_id=180360463;
Query OK, 1 row affected (0.008 sec)
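
For context on why a leftover row shows up as lag: the check derives delay from how fresh each heartbeat timestamp is, so an orphaned entry from a master that no longer writes (here the old db2230 server_id) just keeps aging. Something along these lines, as an illustrative sketch rather than the exact production query:

-- Apparent lag per heartbeat row; a row whose master no longer
-- writes (stale server_id) grows without bound and trips alerts.
SELECT server_id, shard, datacenter,
       TIMESTAMPDIFF(SECOND, ts, UTC_TIMESTAMP(6)) AS lag_seconds
FROM heartbeat.heartbeat
ORDER BY lag_seconds DESC;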

cookbooks.sre.hosts.decommission executed by fceratto@cumin1003 for hosts: db-test1002.eqiad.wmnet

  • db-test1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1003 for host db-test1002.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1003 for host db-test1002.eqiad.wmnet with OS trixie completed:

  • db-test1002 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202511171619_fceratto_1405626_db-test1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Started cloning db2230.codfw.wmnet to db-test1002.eqiad.wmnet - fceratto@cumin1003

cookbooks.sre.hosts.decommission executed by fceratto@cumin1003 for hosts: db-test1003.eqiad.wmnet

  • db-test1003.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1003 for host db-test1003.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1003 for host db-test1003.eqiad.wmnet with OS trixie completed:

  • db-test1003 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202511171732_fceratto_1477624_db-test1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Started cloning db2230.codfw.wmnet to db-test1003.eqiad.wmnet - fceratto@cumin1003

The 5 VMs are showing up on Zarcillo and replicating (https://zarcillo.wikimedia.org/ui/sections#test-s4) - the Prometheus metrics will start working as expected once the old metric series, created before the VM redeploy (under a different server_id), disappear.
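
For reference, the id those stale series are keyed on is just the instance's server_id, which changed when the VMs were redeployed; it can be checked directly on any of the hosts:

SELECT @@server_id;  -- the id pt-heartbeat stamps into its row; a redeploy
                     -- changes it, so series keyed on the old id go stale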

Reopening while fixing replication after host reboots

The issue was the same as the one at https://phabricator.wikimedia.org/T400056#11378959.
The fix:

MariaDB [heartbeat]> select * from heartbeat;
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+---------+------------+
| ts                         | server_id | file              | position  | relay_master_log_file | exec_master_log_pos | shard   | datacenter |
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+---------+------------+
| 2025-12-01T14:42:17.000760 | 171966607 | db1176-bin.000389 | 854707119 | NULL                  |                NULL | test-s4 | eqiad      |
| 2025-11-17T16:40:53.001810 | 180360463 | db2230-bin.000019 |  12086824 | db1176-bin.000389     |           382129335 | test-s4 | codfw      |
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+---------+------------+
2 rows in set (0.001 sec)
MariaDB [heartbeat]> delete from heartbeat where server_id='180360463';
Query OK, 1 row affected (0.007 sec)

MariaDB [heartbeat]>

Change #1201684 abandoned by Federico Ceratto:

[operations/cookbooks@master] mysql.clone.py: support VMs

Reason:

The VMs have been deployed. No need to merge this.

https://gerrit.wikimedia.org/r/1201684