Site/Location: eqiad and codfw
Number of systems: 3
Service: Data Persistence automation
Networking Requirements: internal IP addresses only
Processor Requirements: 1 vCPU
Memory: 2 GB RAM
Disks: 10 GB
Other Requirements: access to prod
Description
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| Add MariaDB test-s8 section VMs | operations/puppet | production | +34 -0 |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| In Progress | | FCeratto-WMF | T384810 MariaDB lifetime management system |
| Resolved | | FCeratto-WMF | T400056 Core DB testbed on VMs |
| Open | | None | T390087 eqiad: VMs requested for Data Persistence automation and testbeds |
Event Timeline
Hi,
Thanks for tagging me in this one. This is more Infrastructure-Foundations territory these days, so I am adding the relevant people as well for their information.
That being said, this definitely looks doable. However, a couple of clarifying questions:
- Public IPv4 addresses are a scarce resource. Per your note in parentheses, "(or one TCP port forwarded to an internal ipaddr)", do I understand correctly that what we are talking about is the need to expose a web interface under some public URL, e.g. dbautomation.wikimedia.org? Or is this understanding incorrect?
- The request is for 3 systems. Should these somehow be split across the intra-DC availability zones? E.g. 1 VM in rack row A, 1 VM in rack row B, 1 VM in rack row C (we have 6 rack rows in eqiad; you can think of them as our AZ equivalent).
Hello, we are in the process of discussing the requirements in more detail within the team, but I think I can anticipate:
- We can get away without public ipaddrs. (The web UI has been deployed in k8s in the meantime; I'll update the task description.)
- The only remaining need is a few VMs to run a test core DB cluster, and it seems I can create them myself using the ganeti cookbook (see the sketch after this list).
- We don't have strict requirements around the intra-DC availability zones. Ideally we would also add 2 or 3 VMs in codfw to implement inter-DC replication, just like any core s* section.
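For reference, creating one of these VMs from a cumin host would look roughly like the following. This is only a sketch: the flag names, the row/group argument, and the OS choice are assumptions based on the `sre.ganeti.makevm` cookbook and should be checked against its `--help` before running.

```
# Sketch: create one eqiad test VM with the Ganeti cookbook.
# Sizes match the request (1 vCPU, 2 GB RAM, 10 GB disk); the location
# argument (eqiad_B) and flags are assumptions, verify with:
#   sudo cookbook sre.ganeti.makevm --help
sudo cookbook sre.ganeti.makevm \
    --vcpus 1 --memory 2 --disk 10 \
    --os trixie \
    eqiad_B db-test1001.eqiad.wmnet
```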
Based on free capacity in the rows, it's best to use these rows (1. being the row with the most free capacity):
- eqiad: 1. row B, 2. row D, 3. row C
- codfw: 1. row C, 2. row B, 3. row D
It would also be best to use a naming scheme like db-testXXXX, so that these can be told apart from the prod hosts by hostname alone.
And remember that before you create the first VM you need to designate a partman scheme in modules/profile/data/profile/installserver/preseed.yaml.
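As a rough illustration of the kind of entry that means (the key format and recipe names here are hypothetical; copy the structure from existing entries in preseed.yaml rather than from this example):

```yaml
# modules/profile/data/profile/installserver/preseed.yaml -- hypothetical
# sketch only: match db-test hosts to a partman recipe suited to small
# Ganeti VMs, following the pattern of neighbouring entries in the file.
db-test*:
  - partman/flat.cfg virtual.cfg
```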
Cool, thanks.
> We don't have strict requirements around the intra-DC availability zones.
Fair enough. I looked a bit into the free capacity of all the row groups for both eqiad and codfw, and there is plenty across the board in all of them.
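(For anyone wanting to check capacity themselves: on a Ganeti cluster master, the standard node listing shows free memory and disk per node together with its group, i.e. its row. A minimal sketch using stock `gnt-node` output fields:)

```
# On the Ganeti cluster master: list each node's group (row) and its
# free memory/disk, to see which rows have headroom for new VMs.
sudo gnt-node list -o name,group,mfree,dfree
```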
> I can create them myself using the ganeti cookbook
I can offer help with that if you get stuck anywhere.
> Ideally we would also add 2 or 3 VMs in codfw to implement inter-DC replication, just like any core s* section.
Sounds good to me. I suggest creating them from the start then; it can happen in parallel anyway.
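Once both sides exist, pointing the codfw dc-master at the eqiad primary is plain MariaDB replication. A minimal sketch, assuming a hypothetical `repl` account and GTID-based replication as used on the core sections (the actual account and settings are not part of this task):

```
# Run on db-test2001 (codfw dc-master). The 'repl' user and password are
# placeholders; MASTER_USE_GTID=slave_pos enables MariaDB GTID replication.
sudo mysql -e "
  CHANGE MASTER TO
    MASTER_HOST='db-test1001.eqiad.wmnet',
    MASTER_USER='repl',
    MASTER_PASSWORD='********',
    MASTER_USE_GTID=slave_pos;
  START SLAVE;"
```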
Change #1171597 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):
[operations/puppet@production] Add MariaDB test-s8 section VMs
I opened a puppet CR with the following setup:
- db-test1001 (eqiad, primary master)
- db-test1002 (eqiad)
- db-test1003 (eqiad)
- db-test2001 (codfw, dc-master)
- db-test2002 (codfw)
Change #1171597 merged by Federico Ceratto:
[operations/puppet@production] Add MariaDB test-s8 section VMs
Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1002 for host db-test2001.codfw.wmnet with OS trixie
Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1002 for host db-test2001.codfw.wmnet with OS trixie completed:
- db-test2001 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata (7) to Debian installer
- Set boot media to disk
- Host up (new fresh trixie OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510131743_fceratto_2243428_db-test2001.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
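For context, each of these runs is kicked off from a cumin host with an invocation along these lines (a sketch; the exact flags should be checked against the cookbook's `--help`):

```
# Reimage one test host onto Debian trixie; run from a cumin host.
# The --task-id flag (to log progress against this task) is an assumption.
sudo cookbook sre.hosts.reimage --os trixie --task-id T390087 db-test2002
```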
Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1002 for host db-test2002.codfw.wmnet with OS trixie
Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1002 for host db-test2002.codfw.wmnet with OS trixie completed:
- db-test2002 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata (7) to Debian installer
- Set boot media to disk
- Host up (new fresh trixie OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510140938_fceratto_3504326_db-test2002.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1002 for host db-test1002.eqiad.wmnet with OS trixie
Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1002 for host db-test1002.eqiad.wmnet with OS trixie completed:
- db-test1002 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata (7) to Debian installer
- Set boot media to disk
- Host up (new fresh trixie OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510141043_fceratto_3611363_db-test1002.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1002 for host db-test1001.eqiad.wmnet with OS trixie
Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1002 for host db-test1001.eqiad.wmnet with OS trixie completed:
- db-test1001 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata (7) to Debian installer
- Set boot media to disk
- Host up (new fresh trixie OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510141418_fceratto_4028722_db-test1001.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1002 for host db-test1003.eqiad.wmnet with OS trixie
Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1002 for host db-test1003.eqiad.wmnet with OS trixie completed:
- db-test1003 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata (7) to Debian installer
- Set boot media to disk
- Host up (new fresh trixie OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510141604_fceratto_4156671_db-test1003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
Cookbook cookbooks.sre.hosts.reimage was started by fceratto@cumin1003 for host db-test1001.eqiad.wmnet with OS trixie
Cookbook cookbooks.sre.hosts.reimage started by fceratto@cumin1003 for host db-test1001.eqiad.wmnet with OS trixie completed:
- db-test1001 (WARN)
- Downtimed on Icinga/Alertmanager
- Unable to disable Puppet, the host may have been unreachable
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata (7) to Debian installer
- Set boot media to disk
- Host up (new fresh trixie OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202511131012_fceratto_3005470_db-test1001.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB