
eqiad: VMs requested for Data Persistence automation and testbeds
Open, LowPublic

Description

Site/Location: eqiad and codfw
Number of systems: 3
Service: Data Persistence automation
Networking Requirements: internal IP addresses only
Processor Requirements: 1 vCPU
Memory: 2 GB of RAM
Disks: 10 GB of disk
Other Requirements: access to prod

Details

Related Changes in Gerrit:

Event Timeline

FCeratto-WMF renamed this task from eqiad: 1 VMs requested for Data Persistence automation to eqiad: VMs requested for Data Persistence automation and testbeds. Jul 16 2025, 1:18 PM
FCeratto-WMF removed akosiaris as the assignee of this task.
FCeratto-WMF updated the task description. (Show Details)
FCeratto-WMF added a subscriber: akosiaris.

Hi,

Thanks for tagging me in this one. This is more Infrastructure-Foundations territory these days, so I am adding the relevant people as well for their information.

That being said, this definitely looks doable. However, a couple of clarifying questions:

  • Public IPv4 addresses are a scarce resource. Per your note in parentheses, "(or one TCP port forwarded to an internal ipaddr)", do I understand correctly that what we are talking about is the need to expose a web interface under a public URL such as dbautomation.wikimedia.org? Or is this understanding incorrect?
  • The request is for 3 systems. Should these somehow be split across the intra DC availability zones? E.g. 1 VM in rack row A, 1 VM in rack row B, 1 VM in rack row C (we have 6 rack rows in eqiad; you can think of them as our AZ equivalent).

Hello, we are in the process of discussing the requirements in more detail within the team, but I think I can anticipate:

  • We can get away without public ipaddrs. (The web UI has been deployed on k8s in the meantime; I'll update the task description.)
  • The only remaining need is to have a few VMs to run a test core DB cluster, and it seems I can create them myself using the ganeti cookbook.
  • We don't have strict requirements around the intra DC availability zones. Ideally we would also add 2 or 3 VMs in codfw to implement inter-DC replication just like any core s* section.
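For reference, VM creation via the ganeti cookbook typically looks something like the sketch below. The cookbook name (sre.ganeti.makevm) exists in the SRE cookbook repo, but the flag names, values, and positional arguments shown here are assumptions from memory, not copied from this task; check them against the cookbook's --help before running anything:

```shell
# Sketch only: flag names and argument order are assumptions; verify with
#   sudo cookbook sre.ganeti.makevm --help
# Sized to match the task description (1 vCPU, 2 GB RAM, 10 GB disk,
# internal addressing only) for one hypothetical eqiad host.
sudo cookbook sre.ganeti.makevm \
    --vcpus 1 \
    --memory 2 \
    --disk 10 \
    --os trixie \
    --network private \
    eqiad_B db-test1001.eqiad.wmnet
```

This is run from a cumin host; the same invocation with a codfw group and a db-test2xxx hostname would cover the codfw side.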

Based on free capacity in the rows, it's best to use these rows (1. being the row with the most free capacity):

eqiad: 1. row B 2. row D and 3. row C
codfw: 1. row C 2. row B and 3. row D

And it would be best to use a naming scheme like db-testXXXX, so these can be told apart from the prod hosts by hostname alone?

And remember that before you create the first VM you need to designate a partman scheme in modules/profile/data/profile/installserver/preseed.yaml
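For illustration, an entry in that file pairs a hostname pattern with a partman recipe. The sketch below is hypothetical: the key format (hostname glob) and the recipe path are assumptions, so copy the shape of the existing entries in preseed.yaml rather than this fragment:

```yaml
# Hypothetical entry for the new test DB hosts in
# modules/profile/data/profile/installserver/preseed.yaml.
# Both the glob key and the recipe name are illustrative only.
db-test*: partman/standard.cfg
```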

Cool, thanks.

We don't have strict requirements around the intra DC availability zones.

Fair enough. I looked a bit into free capacity in all the row groups for both eqiad and codfw, and there is plenty across the board.

I can create them myself using the ganeti cookbook

I can offer help with that if you get stuck anywhere.

Ideally we would also add 2 or 3 VMs in codfw to implement inter-DC replication just like any core s* section

Sounds good to me. I suggest creating them from the start, then; it can happen in parallel anyway.

Change #1171597 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] Add MariaDB test-s8 section VMs

https://gerrit.wikimedia.org/r/1171597

I opened a puppet CR with the following setup:

db-test1001  eqiad  primary master
db-test1002  eqiad
db-test1003  eqiad
db-test2001  codfw  dc-master
db-test2002  codfw

Change #1171597 merged by Federico Ceratto:

[operations/puppet@production] Add MariaDB test-s8 section VMs

https://gerrit.wikimedia.org/r/1171597

Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1002 for host db-test2001.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1002 for host db-test2001.codfw.wmnet with OS trixie completed:

  • db-test2001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510131743_fceratto_2243428_db-test2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1002 for host db-test2002.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1002 for host db-test2002.codfw.wmnet with OS trixie completed:

  • db-test2002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510140938_fceratto_3504326_db-test2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1002 for host db-test1002.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1002 for host db-test1002.eqiad.wmnet with OS trixie completed:

  • db-test1002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510141043_fceratto_3611363_db-test1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1002 for host db-test1001.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1002 for host db-test1001.eqiad.wmnet with OS trixie completed:

  • db-test1001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510141418_fceratto_4028722_db-test1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1002 for host db-test1003.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1002 for host db-test1003.eqiad.wmnet with OS trixie completed:

  • db-test1003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510141604_fceratto_4156671_db-test1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1003 for host db-test1001.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1003 for host db-test1001.eqiad.wmnet with OS trixie completed:

  • db-test1001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202511131012_fceratto_3005470_db-test1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB