Site: 2 VMs request for planet
Closed, Resolved (Public)

Description

Cloud VPS Project Tested: planet/production
Site/Location: eqiad/codfw
Number of systems: 2
Service: planet
Networking Requirements: internal
Processor Requirements: 1
Memory: 1GB
Disks: 20GB
Other Requirements: none

For T348392. Follow-up / replacement for T248863; the existing VMs are on buster.

Event Timeline

cookbook [GLOBAL_ARGS] sre.ganeti.makevm: error: argument --memory: Memory must be at least 1.5G

Oh really? Well then, 1.5G. But we used to have VMs with 256MB, didn't we?

sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 1.5G ...
..
error: argument --memory: invalid validate_memory value: '1.5G'
sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 1.5 ...
...
 error: argument --memory: Memory must be at least 1.5G
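The two different messages above look like the two ways an argparse `type=` callable can fail: a plain `ValueError` produces the generic "invalid validate_memory value" text, while an `argparse.ArgumentTypeError` carries its own message. A minimal sketch of that mechanism follows. It is not the actual sre.ganeti.makevm source: the body of `validate_memory`, the strict `<=` comparison (chosen only because it would reproduce the rejection of `1.5`), the `makevm-sketch` program name and the `try_parse` helper are all assumptions for illustration.

```python
import argparse

MIN_MEMORY_GB = 1.5  # assumption: threshold taken from the error text in the log


def validate_memory(value: str) -> float:
    """Hypothetical validator; NOT the real cookbook code."""
    # float("1.5G") raises ValueError, which argparse reports as
    # "invalid validate_memory value: '1.5G'" (it uses the function name).
    memory = float(value)
    # Assumption: a strict comparison, which would reject exactly 1.5
    # while still saying "at least 1.5G".
    if memory <= MIN_MEMORY_GB:
        raise argparse.ArgumentTypeError("Memory must be at least 1.5G")
    return memory


def build_parser() -> argparse.ArgumentParser:
    # exit_on_error=False (Python 3.9+) raises ArgumentError instead of exiting.
    parser = argparse.ArgumentParser(prog="makevm-sketch", exit_on_error=False)
    parser.add_argument("--vcpus", type=int, default=1)
    parser.add_argument("--memory", type=validate_memory, default=2.0)
    return parser


def try_parse(argv: list[str]) -> str:
    """Return the parsed memory value, or the argparse error text."""
    try:
        return str(build_parser().parse_args(argv).memory)
    except argparse.ArgumentError as err:
        return str(err)


if __name__ == "__main__":
    for arg in ("1.5G", "1.5", "2"):
        print(arg, "->", try_parse(["--vcpus", "1", "--memory", arg]))
```

Under these assumptions, a unit suffix trips the generic conversion error, and a bare `1.5` reaches the range check but fails it, which matches the order of messages seen above.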

Change 976858 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add planet[12]003 with insetup role

https://gerrit.wikimedia.org/r/976858

Change 976858 merged by Dzahn:

[operations/puppet@production] site: add planet[12]003 with insetup role

https://gerrit.wikimedia.org/r/976858

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host planet1003.eqiad.wmnet with OS bookworm

Change 976867 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] hieradata: set planet[12]003 to use puppet7

https://gerrit.wikimedia.org/r/976867

Change 976867 merged by Dzahn:

[operations/puppet@production] hieradata: set planet[12]003 to use puppet7

https://gerrit.wikimedia.org/r/976867

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host planet1003.eqiad.wmnet with OS bookworm executed with errors:

  • planet1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311222119_dzahn_1639445_planet1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host planet1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host planet2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host planet1003.eqiad.wmnet with OS bookworm executed with errors:

  • planet1003 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311222224_dzahn_1679959_planet1003.out
    • Unable to run puppet on config-master2001.codfw.wmnet,config-master1001.eqiad.wmnet to update configmaster.wikimedia.org with the new host SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host planet1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host planet2003.codfw.wmnet with OS bookworm executed with errors:

  • planet2003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311222305_dzahn_1693000_planet2003.out
    • Unable to run puppet on config-master2001.codfw.wmnet,config-master1001.eqiad.wmnet to update configmaster.wikimedia.org with the new host SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host planet2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host planet1003.eqiad.wmnet with OS bookworm executed with errors:

  • planet1003 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311230034_dzahn_1743638_planet1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host planet2003.codfw.wmnet with OS bookworm executed with errors:

  • planet2003 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311230044_dzahn_1749349_planet2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Dzahn changed the task status from Open to In Progress. Nov 27 2023, 6:07 PM

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host planet1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host planet1003.eqiad.wmnet with OS bookworm completed:

  • planet1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311271825_dzahn_476197_planet1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
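
For comparison: the later failed runs and the successful run list the same steps up to and including "Rebooted" (only the log file path differs), and diverge right after. The sketch below finds that first differing step. The step strings are copied from the reports above; the `first_divergence` helper and the list names are hypothetical, and the result only shows where the reports differ, not why the reimage failed.

```python
from itertools import zip_longest


def first_divergence(run_a, run_b):
    """Return (index, step_a, step_b) at the first differing step, or None."""
    for index, (step_a, step_b) in enumerate(zip_longest(run_a, run_b)):
        if step_a != step_b:
            return index, step_a, step_b
    return None


# Tail of a failed report, starting at the last shared step.
failed_tail = [
    "Rebooted",
    "The reimage failed, see the cookbook logs for the details",
]

# Tail of the successful report, starting at the same step.
passed_tail = [
    "Rebooted",
    "Automatic Puppet run was successful",
    "Forced a re-check of all Icinga services for the host",
    "Icinga status is optimal",
    "Icinga downtime removed",
    "Updated Netbox data from PuppetDB",
]

if __name__ == "__main__":
    print(first_divergence(failed_tail, passed_tail))
```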
Dzahn claimed this task.