ops-monitoring-bot (Operations Monitoring Bot)
User · Bot

User Details

User Since
Aug 12 2016, 1:45 PM (371 w, 2 d)
Roles
Bot
Availability
Available
LDAP User
Unknown
MediaWiki User
Unknown

Bot managed by SRE for automated interaction with Phabricator from monitoring tools.

Recent Activity

Thu, Sep 21

ops-monitoring-bot added a comment to T347032: Site: 1 VM request for apt-staging.

Cookbook cookbooks.sre.hosts.reimage started by eoghan@cumin1001 for host apt-staging2001.codfw.wmnet with OS bookworm executed with errors:

  • apt-staging2001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details
Thu, Sep 21, 5:27 PM · vm-requests, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T331713: Migrate restbase servers to Bullseye.

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2014.codfw.wmnet with OS bullseye completed:

  • restbase2014 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309211645_eevans_451030_restbase2014.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
Thu, Sep 21, 5:21 PM · Cassandra, Platform Engineering, Data-Persistence, SRE
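
The reimage comments in this feed reference per-run log files such as /var/log/spicerack/sre/hosts/reimage/202309211645_eevans_451030_restbase2014.out. A minimal sketch of splitting such a path into its parts follows. It assumes the file name has the shape <YYYYMMDDHHMM>_<user>_<number>_<host>.out, which is inferred only from the paths shown here and is not a documented format. The field names (started, user, run_id, host) and the function parse_reimage_log are hypothetical, the meaning of the numeric field is not stated in the feed, and the time zone of the timestamp is unknown.

```python
import re
from datetime import datetime
from typing import NamedTuple


class ReimageLog(NamedTuple):
    """Fields recovered from a reimage log file name (names are illustrative)."""
    started: datetime  # naive datetime; the time zone is not stated in the feed
    user: str
    run_id: int        # numeric field of unknown meaning (assumption: a run or process id)
    host: str


# Assumed shape: 12 digits, user without underscores, a number, host, ".out".
_LOG_NAME = re.compile(
    r"(?P<ts>\d{12})_(?P<user>[^_]+)_(?P<run_id>\d+)_(?P<host>.+)\.out$"
)


def parse_reimage_log(path: str) -> ReimageLog:
    """Split a reimage log path into timestamp, user, numeric id and host."""
    name = path.rsplit("/", 1)[-1]  # keep only the file name
    match = _LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized reimage log name: {name!r}")
    return ReimageLog(
        started=datetime.strptime(match["ts"], "%Y%m%d%H%M"),
        user=match["user"],
        run_id=int(match["run_id"]),
        host=match["host"],
    )


if __name__ == "__main__":
    example = (
        "/var/log/spicerack/sre/hosts/reimage/"
        "202309211645_eevans_451030_restbase2014.out"
    )
    print(parse_reimage_log(example))
```

Host names containing hyphens (for example cloudservices2004-dev) are handled because only underscores are treated as separators; a user name containing an underscore would not parse under this assumed pattern.
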
ops-monitoring-bot added a comment to T331713: Migrate restbase servers to Bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2014.codfw.wmnet with OS bullseye

Thu, Sep 21, 4:26 PM · Cassandra, Platform Engineering, Data-Persistence, SRE
ops-monitoring-bot added a comment to T347032: Site: 1 VM request for apt-staging.

Cookbook cookbooks.sre.hosts.reimage was started by eoghan@cumin1001 for host apt-staging2001.codfw.wmnet with OS bookworm

Thu, Sep 21, 4:11 PM · vm-requests, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T331713: Migrate restbase servers to Bullseye.

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase2013.codfw.wmnet with OS bullseye completed:

  • restbase2013 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309211416_eevans_420265_restbase2013.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
Thu, Sep 21, 2:41 PM · Cassandra, Platform Engineering, Data-Persistence, SRE
ops-monitoring-bot added a comment to T331713: Migrate restbase servers to Bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase2013.codfw.wmnet with OS bullseye

Thu, Sep 21, 1:59 PM · Cassandra, Platform Engineering, Data-Persistence, SRE
ops-monitoring-bot added a comment to T345709: Setup kubernetes20[25-53].

Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes2028.codfw.wmnet with OS bullseye completed:

  • kubernetes2028 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309211245_jiji_386204_kubernetes2028.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Sep 21, 1:05 PM · serviceops
ops-monitoring-bot added a comment to T345709: Setup kubernetes20[25-53].

Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes2028.codfw.wmnet with OS bullseye

Thu, Sep 21, 12:25 PM · serviceops
ops-monitoring-bot added a comment to T346892: cloudcontrol1007: move to new network setup.

cookbooks.sre.hosts.decommission executed by taavi@cumin1001 for hosts: cloudcontrol1007.wikimedia.org

  • cloudcontrol1007.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Thu, Sep 21, 9:00 AM · cloud-services-team (FY2023/2024-Q1), SRE, ops-eqiad, User-aborrero

Wed, Sep 20

ops-monitoring-bot added a comment to T340721: Build Debian packages for Bookworm.

Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1001 for host idm-test1001.wikimedia.org with OS bookworm completed:

  • idm-test1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309201331_slyngshede_78977_idm-test1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Wed, Sep 20, 1:49 PM · Bitu, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T340721: Build Debian packages for Bookworm.

Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1001 for host idm-test1001.wikimedia.org with OS bookworm

Wed, Sep 20, 1:12 PM · Bitu, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T346042: cloudservices1005: move to new setup.

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bullseye completed:

  • cloudservices1005 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309200913_aborrero_18549_cloudservices1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Wed, Sep 20, 9:55 AM · cloud-services-team (FY2023/2024-Q1), SRE, ops-eqiad, User-aborrero
ops-monitoring-bot added a comment to T346042: cloudservices1005: move to new setup.

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bullseye

Wed, Sep 20, 8:40 AM · cloud-services-team (FY2023/2024-Q1), SRE, ops-eqiad, User-aborrero
ops-monitoring-bot added a comment to T346042: cloudservices1005: move to new setup.

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bullseye executed with errors:

  • cloudservices1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details
Wed, Sep 20, 8:40 AM · cloud-services-team (FY2023/2024-Q1), SRE, ops-eqiad, User-aborrero
ops-monitoring-bot added a comment to T346042: cloudservices1005: move to new setup.

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1001 for host cloudservices1005.eqiad.wmnet with OS bullseye

Wed, Sep 20, 8:33 AM · cloud-services-team (FY2023/2024-Q1), SRE, ops-eqiad, User-aborrero
ops-monitoring-bot added a comment to T342214: update systems to use new puppetdb instance.

Icinga downtime and Alertmanager silence (ID=708cd0d4-307e-4f35-acfa-ddae4ae88236) set by jmm@cumin2002 for 5 days, 0:00:00 on 1 host(s) and their services with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway

puppetdb1002.eqiad.wmnet
Wed, Sep 20, 8:17 AM · Patch-For-Review, SRE-tools, netbox, Puppet-Infrastructure, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T342214: update systems to use new puppetdb instance.

Icinga downtime and Alertmanager silence (ID=11ec6d55-6d8f-4537-a398-4863d7f38c9c) set by jmm@cumin2002 for 5 days, 0:00:00 on 1 host(s) and their services with reason: Disable puppetdb/postgres/nginx on old nodes to ensure nothing hits them anyway

puppetdb2002.codfw.wmnet
Wed, Sep 20, 8:16 AM · Patch-For-Review, SRE-tools, netbox, Puppet-Infrastructure, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T340721: Build Debian packages for Bookworm.

Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1001 for host idm1001.wikimedia.org with OS bookworm completed:

  • idm1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309200724_slyngshede_4193245_idm1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Wed, Sep 20, 7:42 AM · Bitu, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T340721: Build Debian packages for Bookworm.

Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1001 for host idm1001.wikimedia.org with OS bookworm

Wed, Sep 20, 7:09 AM · Bitu, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T342892: Q1:rack/setup/install pki1002.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pki1002.eqiad.wmnet with OS bullseye completed:

  • pki1002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309200054_jhancock_3108220_pki1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Wed, Sep 20, 1:12 AM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342892: Q1:rack/setup/install pki1002.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pki1002.eqiad.wmnet with OS bullseye

Wed, Sep 20, 12:38 AM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops

Tue, Sep 19

ops-monitoring-bot added a comment to T346330: Sept 2023 Switchover Checklist: Services & Traffic.

kamila@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover: Services - T346330 completed.

Tue, Sep 19, 2:28 PM · serviceops, Datacenter-Switchover, SRE
ops-monitoring-bot added a comment to T346330: Sept 2023 Switchover Checklist: Services & Traffic.

kamila@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover: Services - T346330 started.

Tue, Sep 19, 2:01 PM · serviceops, Datacenter-Switchover, SRE
ops-monitoring-bot added a comment to T343520: decommission an-test-client1001.eqiad.wmnet.

cookbooks.sre.hosts.decommission executed by stevemunene@cumin1001 for hosts: an-test-client1001.eqiad.wmnet

  • an-test-client1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
Tue, Sep 19, 1:53 PM · Data-Platform-SRE, decommission-hardware
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1148.eqiad.wmnet with OS bullseye completed:

  • an-worker1148 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309191015_stevemunene_3947579_an-worker1148.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
Tue, Sep 19, 10:38 AM · Data-Platform-SRE
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1147.eqiad.wmnet with OS bullseye completed:

  • an-worker1147 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309191001_stevemunene_3943227_an-worker1147.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
Tue, Sep 19, 10:25 AM · Data-Platform-SRE
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1148.eqiad.wmnet with OS bullseye

Tue, Sep 19, 9:58 AM · Data-Platform-SRE
ops-monitoring-bot added a comment to T340721: Build Debian packages for Bookworm.

Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1001 for host idm2001.wikimedia.org with OS bookworm completed:

  • idm2001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202309190829_slyngshede_3921709_idm2001.out, asking the operator what to do
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202309190831_slyngshede_3921709_idm2001.out, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309190845_slyngshede_3921709_idm2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Tue, Sep 19, 9:48 AM · Bitu, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1147.eqiad.wmnet with OS bullseye

Tue, Sep 19, 9:42 AM · Data-Platform-SRE
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1146.eqiad.wmnet with OS bullseye completed:

  • an-worker1146 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309190813_stevemunene_3919727_an-worker1146.out
    • Unable to run puppet on puppetmaster2001.codfw.wmnet,puppetmaster1001.eqiad.wmnet to update configmaster.wikimedia.org with the new host SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
Tue, Sep 19, 8:36 AM · Data-Platform-SRE
ops-monitoring-bot added a comment to T340721: Build Debian packages for Bookworm.

Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1001 for host idm2001.wikimedia.org with OS bookworm

Tue, Sep 19, 8:11 AM · Bitu, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1146.eqiad.wmnet with OS bullseye

Tue, Sep 19, 7:57 AM · Data-Platform-SRE
ops-monitoring-bot added a comment to T345810: [openstack] Upgrade codfw hosts to bookworm.

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudservices2004-dev.codfw.wmnet with OS bookworm executed with errors:

  • cloudservices2004-dev (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181441_fnegri_3715108_cloudservices2004-dev.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details
Tue, Sep 19, 5:48 AM · cloud-services-team (FY2023/2024-Q1), Cloud-VPS

Mon, Sep 18

ops-monitoring-bot added a comment to T344198: Decommission wdqs100[3-5].

cookbooks.sre.hosts.decommission executed by ryankemper@cumin1001 for hosts: wdqs1004.eqiad.wmnet

  • wdqs1004.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Mon, Sep 18, 10:09 PM · Data-Platform-SRE
ops-monitoring-bot added a comment to T344198: Decommission wdqs100[3-5].

cookbooks.sre.hosts.decommission executed by ryankemper@cumin1001 for hosts: wdqs1003.eqiad.wmnet

  • wdqs1003.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Mon, Sep 18, 9:49 PM · Data-Platform-SRE
ops-monitoring-bot added a comment to T342862: Q1:rack/setup/install dbstore100[89].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host dbstore1008.eqiad.wmnet with OS bullseye completed:

  • dbstore1008 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309182016_jhancock_2787483_dbstore1008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
Mon, Sep 18, 8:30 PM · SRE, Data-Engineering, Data-Platform-SRE, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342862: Q1:rack/setup/install dbstore100[89].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host dbstore1009.eqiad.wmnet with OS bullseye completed:

  • dbstore1009 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309182009_jhancock_2787488_dbstore1009.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
Mon, Sep 18, 8:27 PM · SRE, Data-Engineering, Data-Platform-SRE, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342862: Q1:rack/setup/install dbstore100[89].

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host dbstore1009.eqiad.wmnet with OS bullseye

Mon, Sep 18, 7:22 PM · SRE, Data-Engineering, Data-Platform-SRE, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342862: Q1:rack/setup/install dbstore100[89].

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host dbstore1008.eqiad.wmnet with OS bullseye

Mon, Sep 18, 7:22 PM · SRE, Data-Engineering, Data-Platform-SRE, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1145.eqiad.wmnet with OS bullseye completed:

  • an-worker1145 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181617_stevemunene_3741759_an-worker1145.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
Mon, Sep 18, 4:41 PM · Data-Platform-SRE
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1144.eqiad.wmnet with OS bullseye completed:

  • an-worker1144 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181601_stevemunene_3736949_an-worker1144.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
Mon, Sep 18, 4:25 PM · Data-Platform-SRE
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1036.eqiad.wmnet with OS bullseye completed:

  • kubernetes1036 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181556_jhancock_2748447_kubernetes1036.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Mon, Sep 18, 4:13 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1145.eqiad.wmnet with OS bullseye

Mon, Sep 18, 4:02 PM · Data-Platform-SRE
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1047.eqiad.wmnet with OS bullseye completed:

  • kubernetes1047 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181536_jhancock_2742743_kubernetes1047.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
Mon, Sep 18, 3:57 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1038.eqiad.wmnet with OS bullseye completed:

  • kubernetes1038 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181534_jhancock_2742687_kubernetes1038.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Mon, Sep 18, 3:53 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1036.eqiad.wmnet with OS bullseye

Mon, Sep 18, 3:51 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1144.eqiad.wmnet with OS bullseye

Mon, Sep 18, 3:43 PM · Data-Platform-SRE
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1047.eqiad.wmnet with OS bullseye

Mon, Sep 18, 3:29 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1038.eqiad.wmnet with OS bullseye

Mon, Sep 18, 3:29 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1143.eqiad.wmnet with OS bullseye completed:

  • an-worker1143 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181504_stevemunene_3720801_an-worker1143.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
Mon, Sep 18, 3:25 PM · Data-Platform-SRE
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1142.eqiad.wmnet with OS bullseye completed:

  • an-worker1142 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181445_stevemunene_3718642_an-worker1142.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
Mon, Sep 18, 3:11 PM · Data-Platform-SRE
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1038.eqiad.wmnet with OS bullseye executed with errors:

  • kubernetes1038 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details
Mon, Sep 18, 2:54 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1038.eqiad.wmnet with OS bullseye

Mon, Sep 18, 2:54 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1143.eqiad.wmnet with OS bullseye

Mon, Sep 18, 2:47 PM · Data-Platform-SRE
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye executed with errors:

  • kubernetes1036 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details
Mon, Sep 18, 2:45 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye

Mon, Sep 18, 2:45 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1142.eqiad.wmnet with OS bullseye

Mon, Sep 18, 2:29 PM · Data-Platform-SRE
ops-monitoring-bot added a comment to T345810: [openstack] Upgrade codfw hosts to bookworm.

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudservices2004-dev.codfw.wmnet with OS bookworm

Mon, Sep 18, 2:18 PM · cloud-services-team (FY2023/2024-Q1), Cloud-VPS
ops-monitoring-bot added a comment to T345810: [openstack] Upgrade codfw hosts to bookworm.

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudbackup1001-dev.eqiad.wmnet with OS bookworm completed:

  • cloudbackup1001-dev (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181401_fnegri_3707187_cloudbackup1001-dev.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Mon, Sep 18, 2:18 PM · cloud-services-team (FY2023/2024-Q1), Cloud-VPS
ops-monitoring-bot added a comment to T345810: [openstack] Upgrade codfw hosts to bookworm.

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudbackup1001-dev.eqiad.wmnet with OS bookworm

Mon, Sep 18, 1:46 PM · cloud-services-team (FY2023/2024-Q1), Cloud-VPS
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1141.eqiad.wmnet with OS bullseye completed:

  • an-worker1141 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181224_stevemunene_3170186_an-worker1141.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Mon, Sep 18, 12:47 PM · Data-Platform-SRE
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1140.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1140 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181210_stevemunene_3044066_an-worker1140.out
    • The reimage failed, see the cookbook logs for the details
Mon, Sep 18, 12:24 PM · Data-Platform-SRE
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1141.eqiad.wmnet with OS bullseye

Mon, Sep 18, 12:08 PM · Data-Platform-SRE
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1140.eqiad.wmnet with OS bullseye

Mon, Sep 18, 11:53 AM · Data-Platform-SRE
ops-monitoring-bot added a comment to T346042: cloudservices1005: move to new setup.

cookbooks.sre.hosts.decommission executed by aborrero@cumin1001 for hosts: cloudservices1005.wikimedia.org

  • cloudservices1005.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Mon, Sep 18, 11:47 AM · cloud-services-team (FY2023/2024-Q1), SRE, ops-eqiad, User-aborrero

Fri, Sep 15

ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1047.eqiad.wmnet with OS bullseye executed with errors:

  • kubernetes1047 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details
Fri, Sep 15, 9:03 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1047.eqiad.wmnet with OS bullseye

Fri, Sep 15, 9:03 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye executed with errors:

  • kubernetes1036 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details
Fri, Sep 15, 8:59 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye

Fri, Sep 15, 8:59 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1038.eqiad.wmnet with OS bullseye executed with errors:

  • kubernetes1038 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details
Fri, Sep 15, 8:59 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1038.eqiad.wmnet with OS bullseye

Fri, Sep 15, 8:58 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye executed with errors:

  • kubernetes1036 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details
Fri, Sep 15, 8:58 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye

Fri, Sep 15, 8:58 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T340721: Build Debian packages for Bookworm.

Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1001 for host idm-test1001.wikimedia.org with OS bookworm completed:

  • idm-test1001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309151319_slyngshede_376144_idm-test1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
Fri, Sep 15, 1:41 PM · Bitu, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T340721: Build Debian packages for Bookworm.

Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1001 for host idm-test1001.wikimedia.org with OS bookworm

Fri, Sep 15, 1:03 PM · Bitu, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1138.eqiad.wmnet with OS bullseye completed:

  • an-worker1138 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309151112_stevemunene_350432_an-worker1138.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Fri, Sep 15, 11:37 AM · Data-Platform-SRE
ops-monitoring-bot added a comment to T332570: Upgrade hadoop workers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1138.eqiad.wmnet with OS bullseye

Fri, Sep 15, 10:56 AM · Data-Platform-SRE
ops-monitoring-bot added a comment to T331699: Migrate the r/w LDAP servers to Bookworm and MDB storage.

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ldap-replica2008.wikimedia.org with OS bookworm executed with errors:

  • ldap-replica2008 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309150850_jmm_1867768_ldap-replica2008.out
    • The reimage failed, see the cookbook logs for the details
Fri, Sep 15, 8:57 AM · Patch-For-Review, LDAP, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T331699: Migrate the r/w LDAP servers to Bookworm and MDB storage.

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ldap-replica2008.wikimedia.org with OS bookworm

Fri, Sep 15, 8:30 AM · Patch-For-Review, LDAP, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T331699: Migrate the r/w LDAP servers to Bookworm and MDB storage.

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ldap-replica2007.wikimedia.org with OS bookworm completed:

  • ldap-replica2007 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309150754_jmm_1856603_ldap-replica2007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Fri, Sep 15, 8:08 AM · Patch-For-Review, LDAP, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T331699: Migrate the r/w LDAP servers to Bookworm and MDB storage.

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ldap-replica2007.wikimedia.org with OS bookworm

Fri, Sep 15, 7:38 AM · Patch-For-Review, LDAP, Infrastructure-Foundations, SRE

Thu, Sep 14

ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1056.eqiad.wmnet with OS bullseye completed:

  • kubernetes1056 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309142334_jclark_215867_kubernetes1056.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
Thu, Sep 14, 11:49 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1056.eqiad.wmnet with OS bullseye

Thu, Sep 14, 11:10 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1031.eqiad.wmnet with OS bullseye completed:

  • kubernetes1031 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309142229_jhancock_1751701_kubernetes1031.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Thu, Sep 14, 11:03 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1030.eqiad.wmnet with OS bullseye completed:

  • kubernetes1030 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309142225_jhancock_1751696_kubernetes1030.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Thu, Sep 14, 10:45 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye completed:

  • kubernetes1034 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309142227_jhancock_1751727_kubernetes1034.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually
Thu, Sep 14, 10:45 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1031.eqiad.wmnet with OS bullseye

Thu, Sep 14, 10:06 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye

Thu, Sep 14, 10:06 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1030.eqiad.wmnet with OS bullseye

Thu, Sep 14, 10:06 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1032.eqiad.wmnet with OS bullseye completed:

  • kubernetes1032 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309142138_jhancock_1729404_kubernetes1032.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Thu, Sep 14, 9:55 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1035.eqiad.wmnet with OS bullseye completed:

  • kubernetes1035 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309142124_jhancock_1729936_kubernetes1035.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Thu, Sep 14, 9:42 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1039.eqiad.wmnet with OS bullseye completed:

  • kubernetes1039 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309142122_jhancock_1730675_kubernetes1039.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Thu, Sep 14, 9:40 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1037.eqiad.wmnet with OS bullseye completed:

  • kubernetes1037 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309142119_jhancock_1730280_kubernetes1037.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Thu, Sep 14, 9:38 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye executed with errors:

  • kubernetes1034 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details
Thu, Sep 14, 9:35 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye

Thu, Sep 14, 9:35 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye executed with errors:

  • kubernetes1034 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details
Thu, Sep 14, 9:34 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye

Thu, Sep 14, 9:34 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1034.eqiad.wmnet with OS bullseye executed with errors:

  • kubernetes1034 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details
Thu, Sep 14, 9:33 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1033.eqiad.wmnet with OS bullseye completed:

  • kubernetes1033 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309142117_jhancock_1729593_kubernetes1033.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Thu, Sep 14, 9:32 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T342533: Q1:rack/setup/install kubernetes10[27-56].

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1039.eqiad.wmnet with OS bullseye

Thu, Sep 14, 9:03 PM · SRE, serviceops, ops-eqiad, DC-Ops