
Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy)
Closed, ResolvedPublic

Description

Cloud VPS Project Tested: devtools (wip)
Site/Location: ulsfo, codfw, eqiad, magru, esams, drmrs, eqsin
Number of systems: 14 in total, 2 per DC
Service: tcp-proxy / gerrit-ssh-proxy / git-ssh-proxy (might be used for GitLab and others as well)
Networking Requirements: internal
Processor Requirements: 2
Memory: 2Gi
Disks: 20Gi
Other Requirements:

To move Gerrit behind the CDN (T365259) small proxy VMs are needed to properly tunnel the SSH tcp traffic to the correct active gerrit host.

The same setup is also needed for GitLab (on port 22), so it may make sense to use a more generic name like ssh-proxy instead of gerrit-ssh-proxy.


https://netbox.wikimedia.org/search/?q=tcp-proxy

  • tcp-proxy1001 (eqiad)
  • tcp-proxy1002 (eqiad)
  • tcp-proxy2001 (codfw)
  • tcp-proxy2002 (codfw)
  • tcp-proxy3001 (esams)
  • tcp-proxy3002 (esams)
  • tcp-proxy4001 (ulsfo)
  • tcp-proxy4002 (ulsfo)
  • tcp-proxy5001 (eqsin)
  • tcp-proxy5002 (eqsin)
  • tcp-proxy6001 (drmrs)
  • tcp-proxy6002 (drmrs)
  • tcp-proxy7001 (magru)
  • tcp-proxy7002 (magru)
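For reference, the host numbers in the list above follow the standard per-DC numbering (1xxx eqiad, 2xxx codfw, 3xxx esams, 4xxx ulsfo, 5xxx eqsin, 6xxx drmrs, 7xxx magru); a small sketch of that mapping:

```python
# Map a tcp-proxy host name to its data centre via the first digit of the
# host number, following the list above.
DC_BY_PREFIX = {
    1: "eqiad", 2: "codfw", 3: "esams", 4: "ulsfo",
    5: "eqsin", 6: "drmrs", 7: "magru",
}


def dc_of(hostname: str) -> str:
    # "tcp-proxy4002.ulsfo.wmnet" -> "4002" -> prefix 4 -> "ulsfo"
    number = hostname.split("tcp-proxy", 1)[1][:4]
    return DC_BY_PREFIX[int(number) // 1000]
```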

Event Timeline


Change #1198380 merged by Dzahn:

[operations/puppet@production] installserver: add partman for tcp-proxy VMs, standard

https://gerrit.wikimedia.org/r/1198380

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy2001.codfw.wmnet with OS trixie executed with errors:

  • tcp-proxy2001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console tcp-proxy2001.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host tcp-proxy1001.eqiad.wmnet with OS trixie executed with errors:

  • tcp-proxy1001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console tcp-proxy1001.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host tcp-proxy1001.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host tcp-proxy1001.eqiad.wmnet with OS trixie completed:

  • tcp-proxy1001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510231950_dzahn_1323305_tcp-proxy1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1198393 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add tcp-proxy in all 7 DCs

https://gerrit.wikimedia.org/r/1198393

Change #1198397 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/role: create placeholder role for tcpproxy

https://gerrit.wikimedia.org/r/1198397

As you can see from all the notifications, I had some issues creating the VMs, but:

tcp-proxy1001.eqiad.wmnet exists with the insetup role (OK to disable Puppet and use!).

tcp-proxy2001 had problems; I will get back to it tomorrow.

Some patches are still waiting: to continue creating VMs in the POPs, and to create a placeholder role/profile that just installs envoy and maybe the config @Jelto showed at T365259#11303486.

Change #1198393 merged by Dzahn:

[operations/puppet@production] site: add tcp-proxy in all 7 DCs

https://gerrit.wikimedia.org/r/1198393

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy4001.ulsfo.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy4001.ulsfo.wmnet with OS trixie completed:

  • tcp-proxy4001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510242134_dzahn_490264_tcp-proxy4001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy5001.eqsin.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy5001.eqsin.wmnet with OS trixie completed:

  • tcp-proxy5001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510271758_dzahn_1290263_tcp-proxy5001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy5002.eqsin.wmnet with OS trixie

Dzahn updated the task description.
Dzahn updated the task description.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy6001.drmrs.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy5002.eqsin.wmnet with OS trixie completed:

  • tcp-proxy5002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510271931_dzahn_1316593_tcp-proxy5002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy4002.ulsfo.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy6001.drmrs.wmnet with OS trixie completed:

  • tcp-proxy6001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510272029_dzahn_1330199_tcp-proxy6001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy4002.ulsfo.wmnet with OS trixie completed:

  • tcp-proxy4002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510272036_dzahn_1339403_tcp-proxy4002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy6002.drmrs.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy1002.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy1002.eqiad.wmnet with OS trixie completed:

  • tcp-proxy1002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510272129_dzahn_1354352_tcp-proxy1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy6002.drmrs.wmnet with OS trixie completed:

  • tcp-proxy6002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510272132_dzahn_1352631_tcp-proxy6002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy3001.esams.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy3002.esams.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy3001.esams.wmnet with OS trixie completed:

  • tcp-proxy3001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510272229_dzahn_1369414_tcp-proxy3001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy3002.esams.wmnet with OS trixie executed with errors:

  • tcp-proxy3002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console tcp-proxy3002.esams.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy3002.esams.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy3002.esams.wmnet with OS trixie executed with errors:

  • tcp-proxy3002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console tcp-proxy3002.esams.wmnet" to get a root shell, but depending on the failure this may not work.
Dzahn renamed this task from Site: 14 VMs request for gerrit-ssh-proxy to Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy). Oct 28 2025, 3:23 AM
Dzahn updated the task description.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy7001.magru.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy7002.magru.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy7001.magru.wmnet with OS trixie executed with errors:

  • tcp-proxy7001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console tcp-proxy7001.magru.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy7002.magru.wmnet with OS trixie completed:

  • tcp-proxy7002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510282044_dzahn_1635660_tcp-proxy7002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy7001.magru.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy2002.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy3002.esams.wmnet with OS trixie

All VMs exist now. --> https://netbox.wikimedia.org/search/?q=tcp-proxy

Some still need to be reimaged because they ran into issues during installation.

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy2002.codfw.wmnet with OS trixie completed:

  • tcp-proxy2002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510282309_dzahn_1675346_tcp-proxy2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy7001.magru.wmnet with OS trixie executed with errors:

  • tcp-proxy7001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console tcp-proxy7001.magru.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy3002.esams.wmnet with OS trixie executed with errors:

  • tcp-proxy3002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console tcp-proxy3002.esams.wmnet" to get a root shell, but depending on the failure this may not work.

Noticed other inconsistencies, like the following.

I can SSH to tcp-proxy2002; it exists and is fine:

[tcp-proxy2002:~] $ facter is_virtual
true

while ganeti thinks it does not exist:

[ganeti2033:~] $ sudo gnt-instance info tcp-proxy2002.codfw.wmnet
Failure: prerequisites not met for this operation:
error type: unknown_entity, error details:
Instance 'tcp-proxy2002.codfw.wmnet' not known

ganeti2033 is part of a new routed-Ganeti cluster:

elukey@ganeti2033:~$ sudo gnt-instance list 
Instance                 Hypervisor OS                  Primary_node           Status  Memory
netflow2004.codfw.wmnet  kvm        debootstrap+default ganeti2034.codfw.wmnet running   4.0G
rpki2003.codfw.wmnet     kvm        debootstrap+default ganeti2034.codfw.wmnet running   6.0G
testvm2006.codfw.wmnet   kvm        debootstrap+default ganeti2034.codfw.wmnet running   1.5G
testvm2007.codfw.wmnet   kvm        debootstrap+default ganeti2033.codfw.wmnet running   1.5G
testvm2008.wikimedia.org kvm        debootstrap+default ganeti2033.codfw.wmnet running   1.5G

From ganeti2048 everything looks good:

elukey@ganeti2048:~$ sudo gnt-instance list | grep tcp-proxy
tcp-proxy2001.codfw.wmnet           kvm        debootstrap+default ganeti2035.codfw.wmnet running   2.0G
tcp-proxy2002.codfw.wmnet           kvm        debootstrap+default ganeti2026.codfw.wmnet running   2.0G

tcp-proxy2001 is in a weird state in Netbox; I don't see any IPs associated with it:

https://netbox.wikimedia.org/search/?q=tcp-proxy2001

I didn't find any trace of what happened with the makevm cookbook, but at this point I think it is probably safer to just clean up everything and re-run makevm. Or is it better to fill in the missing bits in Netbox? Cc: @MoritzMuehlenhoff

And 3002 seems in a bad state too:

elukey@ganeti3005:~$ sudo gnt-instance console tcp-proxy3002.esams.wmnet
No ED25519 host key is known for ganeti03.svc.esams.wmnet and you have requested strict checking.
Host key verification failed.
Failure: command execution error:
Connection to console of instance tcp-proxy3002.esams.wmnet failed, please check cluster configuration

Tried to reimage tcp-proxy7001 while being attached to the gnt-console but I don't see any movement in it, so it seems that the VM is stuck in a weird state.


The VM is fine; this is a corner case with routed Ganeti: the DHCP relay which passes the VM's DHCP request to the install server doesn't work when the install server runs on the same Ganeti node as the VM being installed. This is actually fixed upstream; Arzhel worked with dnsmasq upstream, who was extremely helpful and added the necessary support in dnsmasq. But we still need to wait for the release of dnsmasq 2.92 to productionise it. All the finer details are at https://phabricator.wikimedia.org/T396864. In the meantime there's a workaround: I've shuffled the VM to a different node and kicked off a reimage of tcp-proxy7001.

And 3002 seems in a bad state too:

This was the same DHCP issue with routed Ganeti as on 7001; I've moved the VM and kicked off a reimage.

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: tcp-proxy2001.codfw.wmnet

  • tcp-proxy2001.codfw.wmnet (WARN)
    • Missing DNSName in Netbox for tcp-proxy2001, unable to verify it.
    • Missing DNS record for tcp-proxy2001.codfw.wmnet, the steps requiring DNS will fail.
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox

Thank you @elukey and @MoritzMuehlenhoff for looking at this. I suspected it was somehow related to these being on routed Ganeti, but had no further ideas besides retrying the reimage or recreating the VMs. For whatever reason, just reimaging the magru machines also did not work for me the other day. It was nice to see this resolved when coming back to work.

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host tcp-proxy2001.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host tcp-proxy2001.codfw.wmnet with OS trixie completed:

  • tcp-proxy2001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510291612_jmm_1884623_tcp-proxy2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

tcp-proxy2001 had problems; I will get back to it tomorrow.

This was unrelated to Ganeti. The VM was present, but looking at the logs, the reimage was started before the patch which configured the partman recipe was merged, so the reimage failed, and the cookbook's cleanup/rollback was apparently only partly triggered (the VM was still present, but no IP was assigned to it). I decommissioned what was left of it and re-created the VM; it now installed fine.

All working now except 3002.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy3002.esams.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy3002.esams.wmnet with OS trixie completed:

  • tcp-proxy3002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510292104_dzahn_1948622_tcp-proxy3002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Dzahn updated the task description.

Verified that all 14 VMs are up and that I can SSH to them.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy7001.magru.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy7001.magru.wmnet with OS trixie completed:

  • tcp-proxy7001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202511052158_dzahn_3994426_tcp-proxy7001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@MoritzMuehlenhoff Could you take a look one more time? After debugging some strange issues over at T408532#11347743, I realized the VMs in esams and magru only got a link-local fe80 IPv6 address on the interface.

Since esams and magru are on routed Ganeti (https://netbox.wikimedia.org/virtualization/cluster-groups/) and you mentioned the DHCP issue above, this seems related.

[cumin2002:~] $ sudo cumin 'tcp-proxy3*' 'ip a s | grep inet6'
...
(1) tcp-proxy3002.esams.wmnet
----- OUTPUT of 'ip a s | grep inet6' -----
    inet6 ::1/128 scope host noprefixroute
    inet6 fe80::10:80:2:10/64 scope link
    inet6 fe80::a800:ff:fec4:93c1/64 scope link proto kernel_ll
===== NODE GROUP =====
(1) tcp-proxy3001.esams.wmnet
----- OUTPUT of 'ip a s | grep inet6' -----
    inet6 ::1/128 scope host noprefixroute
    inet6 fe80::10:80:2:9/64 scope link
    inet6 fe80::a800:ff:fe8e:1631/64 scope link proto kernel_ll

--

[cumin2002:~] $ sudo cumin 'tcp-proxy7*' 'ip a s | grep inet6'
...
(1) tcp-proxy7001.magru.wmnet
----- OUTPUT of 'ip a s | grep inet6' -----
    inet6 ::1/128 scope host noprefixroute
    inet6 fe80::10:140:2:10/64 scope link
    inet6 fe80::a800:ff:fe6f:f4a4/64 scope link proto kernel_ll
===== NODE GROUP =====
(1) tcp-proxy7002.magru.wmnet
----- OUTPUT of 'ip a s | grep inet6' -----
    inet6 ::1/128 scope host noprefixroute
    inet6 fe80::10:140:2:11/64 scope link
    inet6 fe80::a800:ff:fe16:59d/64 scope link proto kernel_ll
$ ping6 tcp-proxy3001.esams.wmnet
PING tcp-proxy3001.esams.wmnet(tcp-proxy3001.esams.wmnet (2a02:ec80:300:103:10:80:2:9)) 56 data bytes
From ganeti3008.esams.wmnet (2a02:ec80:300:101:10:80:0:4) icmp_seq=1 Destination unreachable: Address unreachable

$ ping6 tcp-proxy3001.esams.wmnet
PING tcp-proxy3001.esams.wmnet(tcp-proxy3001.esams.wmnet (2a02:ec80:300:103:10:80:2:9)) 56 data bytes
From ganeti3006.esams.wmnet (2a02:ec80:300:101:10:80:0:5) icmp_seq=1 Destination unreachable: Address unreachable

$ ping6 tcp-proxy7001.magru.wmnet
PING tcp-proxy7001.magru.wmnet(tcp-proxy7001.magru.wmnet (2a02:ec80:700:103:10:140:2:10)) 56 data bytes
From ganeti7004.magru.wmnet (2a02:ec80:700:102:10:140:1:13) icmp_seq=1 Destination unreachable: Address unreachable

$ ping6 tcp-proxy7002.magru.wmnet
PING tcp-proxy7002.magru.wmnet(tcp-proxy7002.magru.wmnet (2a02:ec80:700:103:10:140:2:11)) 56 data bytes
From ganeti7003.magru.wmnet (2a02:ec80:700:101:10:140:0:12) icmp_seq=1 Destination unreachable: Address unreachable

I tried reimaging 7001 but it stayed the same. DNS is ok both ways (2a02:ec80:700:103:10:140:2:10 <--> tcp-proxy7001.magru.wmnet).

The IP is not bound on the interface, though.
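The symptom (only link-local fe80:: addresses, no global-scope address bound) can be checked quickly. A minimal helper of my own (not part of any existing tooling), assuming `ip -6 addr` output in the format shown above:

```shell
# Hedged sketch: filter `ip -6 addr` output down to global-scope
# addresses. Empty output reproduces the symptom seen on these VMs,
# where only link-local fe80:: addresses are bound.
global_v6() {
  awk '$1 == "inet6" && /scope global/ { print $2 }'
}
```

On a healthy host `ip -6 addr show | global_v6` prints the routed address; on the affected VMs it prints nothing.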

Change #1203052 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] autoinstall: routed Ganeti: fix ipv6 on trixie

https://gerrit.wikimedia.org/r/1203052

Change #1203052 merged by CDanis:

[operations/puppet@production] autoinstall: routed Ganeti: fix ipv6 on trixie

https://gerrit.wikimedia.org/r/1203052

On trixie, the attempt to read the v6 address from the qemu variables in late_command.sh isn't working, so the host gets configured for SLAAC instead (shown in P85100 from /var/log/installer/syslog), which doesn't work at all.
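The intended flow, as I read it, is: take the address from the qemu-provided data, and only if that genuinely fails should anything fall back. A hedged sketch (not the actual late_command.sh; the fw_cfg entry path is hypothetical):

```shell
# Hedged sketch: read the guest's IPv6 address from a qemu fw_cfg
# blob and signal failure cleanly so the caller can decide what to
# do, instead of silently ending up on SLAAC.
read_guest_ipv6() {
  blob="$1"   # e.g. /sys/firmware/qemu_fw_cfg/by_name/<entry>/raw (hypothetical)
  [ -r "$blob" ] || return 1
  addr=$(tr -d '\0\n' < "$blob")
  [ -n "$addr" ] || return 1
  printf '%s\n' "$addr"
}
```

The point of the sketch is the explicit failure path: the bug here was that the read failed silently and the installer carried on with SLAAC.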

But of course, the mere existence of the AAAA record for the host poisons most everything that isn't a web browser or libcurl. (On Linux, at least.)

Anyway, it turns out the issue was that on trixie, kernel modules are shipped as .ko.xz instead of plain .ko, which the script doesn't expect. I produced an ugly but minimal patch and, for now, live-hacked apt1002 and reimaged the four relevant VMs here.
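The pitfall generalizes: any script that looks for `name.ko` literally finds nothing on trixie. A minimal, hedged sketch of a tolerant lookup (the function name is mine, not the patch's):

```shell
# Hedged sketch: locate a kernel module whether it is shipped plain
# (.ko) or compressed (.ko.xz on trixie; .gz/.zst covered for good
# measure). Prints the first match, returns 1 if none is found.
find_module() {
  dir="$1"; name="$2"
  for ext in ko ko.xz ko.gz ko.zst; do
    if [ -e "$dir/$name.$ext" ]; then
      printf '%s\n' "$dir/$name.$ext"
      return 0
    fi
  done
  return 1
}
```

Checking the compressed extensions before giving up is the whole fix; tools like modprobe already handle compressed modules transparently, so only hand-rolled path handling breaks.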

Thanks a lot for figuring this out and fixing it!