eqiad/codfw: 6 VM request for Zuul upgrade project
Closed, Resolved, Public

Description

Cloud VPS Project Tested: zuul3 WMCS project
Site/Location: eqiad & codfw
Number of systems: 2
Service: zuul3 (main)
Networking Requirements: internal
Processor Requirements: 4 vCPUs
Memory: 8GB
Disks: 100GB
Other Requirements: n/a


Cloud VPS Project Tested: zuul3 WMCS project
Site/Location: eqiad & codfw
Number of systems: 2
Service: zuul3 (executor)
Networking Requirements: internal
Processor Requirements: 4 vCPUs
Memory: 2GB
Disks: 100GB
Other Requirements: n/a


Cloud VPS Project Tested: zuul3 WMCS project
Site/Location: eqiad & codfw
Number of systems: 2
Service: zuul3 (trusted runner)
Networking Requirements: internal
Processor Requirements: 4 vCPUs
Memory: 4GB
Disks: 100GB
Other Requirements: n/a


2 VMs - main zuul (8GB)

  • zuul1001
  • zuul2001

2 VMs - executor (2GB)

  • zuul1002
  • zuul2002

2 VMs - trusted runner (4GB)

  • zuul1003
  • zuul2003
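
For reference, the request above can be written out as a small inventory. This is only a sketch of the naming scheme as it appears in this task (first digit 1 for eqiad, 2 for codfw; last digit per role). The `VM` class, the `build_inventory` helper and reading "Processor Requirements: 4" as 4 vCPUs are assumptions made for illustration, not anything taken from Puppet or Netbox.

```python
from dataclasses import dataclass

# Site prefix digit -> data center, as seen in the hostnames of this task.
SITES = {1: "eqiad", 2: "codfw"}

# Role -> (host index, memory in GB), copied from the request above.
ROLES = {"main": (1, 8), "executor": (2, 2), "trusted runner": (3, 4)}

VCPUS_PER_VM = 4    # assumption: "Processor Requirements: 4" means 4 vCPUs
DISK_GB_PER_VM = 100


@dataclass(frozen=True)
class VM:
    fqdn: str
    role: str
    memory_gb: int


def build_inventory() -> list[VM]:
    """Return one VM per role and data center (3 roles x 2 sites)."""
    vms = []
    for role, (index, memory_gb) in ROLES.items():
        for digit, site in SITES.items():
            vms.append(VM(f"zuul{digit}00{index}.{site}.wmnet", role, memory_gb))
    return vms


inventory = build_inventory()
for vm in inventory:
    print(f"{vm.fqdn:24} {vm.role:15} {vm.memory_gb} GB")

total_memory_gb = sum(vm.memory_gb for vm in inventory)
total_vcpus = VCPUS_PER_VM * len(inventory)
total_disk_gb = DISK_GB_PER_VM * len(inventory)
print(f"total: {total_vcpus} vCPUs, {total_memory_gb} GB RAM, {total_disk_gb} GB disk")
```

Summing per VM rather than per role keeps the totals right if the two sites ever diverge in specs.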

Event Timeline

If we can pick freely, then let's use codfw/C.

I will do this, but for both data centers, not just one. We have said we do not want to create
single VMs anymore, and doing so would only lead to follow-up work later.

(These will almost certainly become machines in actual use, not a temporary proof of concept.)

Change #1147855 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add zuul VMs with collab-insetup-role

https://gerrit.wikimedia.org/r/1147855

Change #1147855 merged by Dzahn:

[operations/puppet@production] site: add zuul VMs with collab-insetup-role

https://gerrit.wikimedia.org/r/1147855

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host zuul2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host zuul2001.codfw.wmnet with OS bookworm executed with errors:

  • zuul2001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console zuul2001.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Change #1147878 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] installserver: add partman stanza for zuul* hosts

https://gerrit.wikimedia.org/r/1147878

Change #1147878 merged by Dzahn:

[operations/puppet@production] installserver: add partman stanza for zuul* hosts

https://gerrit.wikimedia.org/r/1147878

Thanks @Dzahn for picking this up. I left a comment in operations/puppet/+/1147855. I don't think we can use nftables because we most likely use docker/containers to run zuul. So we have to use the insetup role without nftables unfortunately.

Indeed

Change #1148359 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: switch zuul3 VMs to use ferm insetup role, not nftables

https://gerrit.wikimedia.org/r/1148359

Change #1148359 merged by Dzahn:

[operations/puppet@production] site: switch zuul3 VMs to use ferm insetup role, not nftables

https://gerrit.wikimedia.org/r/1148359

I don't think we can use nftables because we most likely use docker/containers to run zuul. So we have to use the insetup role without nftables unfortunately.

Yep, good point of course! Adjusted to ferm. Thanks

Dzahn renamed this task from codfw: 1VM request for zuul3+ to eqiad/codfw: 6 VM request for zuul3+. May 20 2025, 4:22 PM
Dzahn updated the task description.

After today's meetings with SRE-collab, releng and the Zuul author, we are updating this request to 3 VMs per DC, for a total of 6:

  • 1 machine for "main zuul components"
  • 1 machine as the executor
  • 1 machine to run trusted jobs

All with the same specs for now.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host zuul2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host zuul2001.codfw.wmnet with OS bullseye completed:

  • zuul2001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505201711_dzahn_3962313_zuul2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host zuul1001.eqiad.wmnet with OS bookworm

Dzahn updated the task description.

Change #1148411 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: broaden regex for zuul hosts to [12]00[1-3]

https://gerrit.wikimedia.org/r/1148411

Change #1148411 merged by Dzahn:

[operations/puppet@production] site: broaden regex for zuul hosts to [12]00[1-3]

https://gerrit.wikimedia.org/r/1148411
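
To illustrate what the broadened character classes cover, here is a minimal sketch in Python. Only the `[12]00[1-3]` part comes from the change title. The anchors, the domain suffix and the `is_zuul_host` helper are assumptions for the example, and the actual node expression in the Puppet change may look different.

```python
import re

# Hypothetical full pattern built around the "[12]00[1-3]" from the change title.
ZUUL_NODE = re.compile(r"^zuul[12]00[1-3]\.(eqiad|codfw)\.wmnet$")


def is_zuul_host(fqdn: str) -> bool:
    """Return True if the FQDN is matched by the sketched node regex."""
    return ZUUL_NODE.match(fqdn) is not None


requested = [
    "zuul1001.eqiad.wmnet", "zuul1002.eqiad.wmnet", "zuul1003.eqiad.wmnet",
    "zuul2001.codfw.wmnet", "zuul2002.codfw.wmnet", "zuul2003.codfw.wmnet",
]
matched = [host for host in requested if is_zuul_host(host)]
print(f"{len(matched)} of {len(requested)} requested hosts match")

# A fourth host per site would need the regex broadened again.
print("zuul1004 matches:", is_zuul_host("zuul1004.eqiad.wmnet"))
```

Note that a pattern of this shape does not tie the first digit to the site, so a name like zuul1001.codfw.wmnet would also match. That is harmless as long as no such host exists.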

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host zuul1001.eqiad.wmnet with OS bookworm completed:

  • zuul1001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505201756_dzahn_3969838_zuul1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host zuul1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host zuul1002.eqiad.wmnet with OS bookworm completed:

  • zuul1002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505201936_dzahn_3982637_zuul1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Dzahn updated the task description.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host zuul1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host zuul1003.eqiad.wmnet with OS bookworm completed:

  • zuul1003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505202232_dzahn_4011770_zuul1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

hashar renamed this task from eqiad/codfw: 6 VM request for zuul3+ to eqiad/codfw: 6 VM request for Zuul upgrade project. May 21 2025, 6:52 AM
Dzahn updated the task description.

6 VMs have been created:

2 VMs - main zuul (8GB)

  • zuul1001
  • zuul2001

2 VMs - executor (2GB)

  • zuul1002
  • zuul2002

2 VMs - trusted runner (4GB)

  • zuul1003
  • zuul2003

Change #1148902 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: separate zuul regex, make it clear what is doing what

https://gerrit.wikimedia.org/r/1148902

Change #1148902 merged by Dzahn:

[operations/puppet@production] site: separate zuul regex, make it clear what is doing what

https://gerrit.wikimedia.org/r/1148902

Change #1148930 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] zuul: create basic role/profile for zuul::man and install docker.io

https://gerrit.wikimedia.org/r/1148930

Change #1148930 merged by Dzahn:

[operations/puppet@production] zuul: create role/profile for new zuul main servers, install docker.io

https://gerrit.wikimedia.org/r/1148930

Change #1149474 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] zuul: let puppet manage docker service, install docker-compose

https://gerrit.wikimedia.org/r/1149474

Change #1149474 merged by Dzahn:

[operations/puppet@production] zuul: let puppet manage docker service, install docker-compose

https://gerrit.wikimedia.org/r/1149474

Change #1149476 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] zuul: enforce puppet7 on new zuul::main role

https://gerrit.wikimedia.org/r/1149476

Change #1149476 merged by Dzahn:

[operations/puppet@production] zuul: enforce puppet7 on new zuul::main role

https://gerrit.wikimedia.org/r/1149476

zuul2001 is running Bullseye, which seems like a mistake? The other five are on Bookworm.

Uhm, yeah, that is a mistake. Not sure how that happened, since I think I just went "cursor up" to edit my cookbook commands :)

thanks for pointing it out.
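
A mismatch like this can be spotted from the bot messages already in this task. The sketch below is one possible way to do it, assuming the message wording stays exactly as it appears here. The `last_imaged_os` and `mismatched` helpers are made up for this example and are not part of spicerack or the cookbooks.

```python
import re

# Matches only the "completed" summary line of a reimage, not the
# "was started" line or the "executed with errors" line.
COMPLETED = re.compile(
    r"Cookbook cookbooks\.sre\.hosts\.reimage started by \S+ "
    r"for host (?P<host>\S+) with OS (?P<os>\w+) completed:"
)


def last_imaged_os(lines):
    """Map each host to the OS of its most recent completed reimage."""
    result = {}
    for line in lines:
        found = COMPLETED.search(line)
        if found:
            result[found.group("host")] = found.group("os")
    return result


def mismatched(lines, expected="bookworm"):
    """Return hosts whose last completed reimage used another OS."""
    latest = last_imaged_os(lines)
    return sorted(host for host, os_name in latest.items() if os_name != expected)


# Messages copied from this task, in timeline order.
PREFIX = "Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host "
timeline = [
    PREFIX + "zuul2001.codfw.wmnet with OS bookworm executed with errors:",
    PREFIX + "zuul2001.codfw.wmnet with OS bullseye completed:",
    PREFIX + "zuul1001.eqiad.wmnet with OS bookworm completed:",
    PREFIX + "zuul1002.eqiad.wmnet with OS bookworm completed:",
    PREFIX + "zuul1003.eqiad.wmnet with OS bookworm completed:",
]
print(mismatched(timeline))
```

Only completed runs are counted, so the earlier failed bookworm attempt on zuul2001 does not hide the later bullseye install.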

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host zuul2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host zuul2001.codfw.wmnet with OS bookworm completed:

  • zuul2001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505271711_dzahn_3409845_zuul2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1151279 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/zuul: create skeleton role/profile for new zuul executors/runners

https://gerrit.wikimedia.org/r/1151279

zuul2001 has been reimaged with bookworm now.

Change #1151281 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] zuul::main: stop installing python docker-compose package

https://gerrit.wikimedia.org/r/1151281

Change #1151279 merged by Dzahn:

[operations/puppet@production] site/zuul: create skeleton role/profile for new zuul executors/runners

https://gerrit.wikimedia.org/r/1151279

Change #1151281 merged by Dzahn:

[operations/puppet@production] zuul::main: stop installing python docker-compose package

https://gerrit.wikimedia.org/r/1151281

Further setup from here on will be part of T395938, to avoid amending this VM request again and again.