ops-monitoring-bot (Operations Monitoring Bot)
User · Bot

User Details

User Since: Aug 12 2016, 1:45 PM (345 w, 2 d)
Roles: Bot
Availability: Available
LDAP User: Unknown
MediaWiki User: Unknown

Bot managed by SRE for automated interaction with Phabricator from monitoring tools.

Recent Activity

Today

ops-monitoring-bot created T333091: Degraded RAID on an-worker1132.
Sun, Mar 26, 2:06 PM · Data-Engineering, SRE, ops-eqiad

Fri, Mar 24

ops-monitoring-bot added a comment to T331695: Migrate the KDCs to Bullseye.

Icinga downtime and Alertmanager silence (ID=d3c0fbee-5db6-4389-b75e-415ed51c67bc) set by jmm@cumin2002 for 21 days, 0:00:00 on 1 host(s) and their services with reason: Non-functional, WIP for Bullseye update

krb2002.codfw.wmnet
Fri, Mar 24, 10:56 AM · Infrastructure-Foundations, SRE

Thu, Mar 23

ops-monitoring-bot added a comment to T332819: Site: 1 VM request for doc2002.

Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye completed:

  • doc2002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303232131_denisse_3186993_doc2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
Thu, Mar 23, 10:00 PM · SRE, vm-requests
ops-monitoring-bot added a comment to T332819: Site: 1 VM request for doc2002.

Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye

Thu, Mar 23, 9:31 PM · SRE, vm-requests
ops-monitoring-bot added a comment to T332819: Site: 1 VM request for doc2002.

Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye executed with errors:

  • doc2002 (FAIL)
    • The reimage failed, see the cookbook logs for the details
Thu, Mar 23, 8:42 PM · SRE, vm-requests
ops-monitoring-bot added a comment to T332819: Site: 1 VM request for doc2002.

Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye

Thu, Mar 23, 8:42 PM · SRE, vm-requests
ops-monitoring-bot added a comment to T332819: Site: 1 VM request for doc2002.

Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye executed with errors:

  • doc2002 (FAIL)
    • The reimage failed, see the cookbook logs for the details
Thu, Mar 23, 8:35 PM · SRE, vm-requests
ops-monitoring-bot added a comment to T332819: Site: 1 VM request for doc2002.

Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye

Thu, Mar 23, 8:34 PM · SRE, vm-requests
ops-monitoring-bot added a comment to T332819: Site: 1 VM request for doc2002.

cookbooks.sre.hosts.decommission executed by denisse@cumin1001 for hosts: doc2002

  • doc2002 (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
Thu, Mar 23, 7:28 PM · SRE, vm-requests
ops-monitoring-bot added a comment to T332013: Migrate kafka-main to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main2002.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
Thu, Mar 23, 4:17 PM · serviceops
ops-monitoring-bot added a comment to T332013: Migrate kafka-main to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main2002.codfw.wmnet with OS bullseye

Thu, Mar 23, 4:07 PM · serviceops
ops-monitoring-bot added a comment to T331702: Migrate mw_rc_irc servers to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host irc1002.wikimedia.org with OS bullseye completed:

  • irc1002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303231503_jmm_1164999_irc1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
Thu, Mar 23, 3:36 PM · Wikimedia-IRC-RC-Server, SRE-Unowned, SRE
ops-monitoring-bot added a comment to T331702: Migrate mw_rc_irc servers to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host irc1002.wikimedia.org with OS bullseye

Thu, Mar 23, 3:03 PM · Wikimedia-IRC-RC-Server, SRE-Unowned, SRE
ops-monitoring-bot added a comment to T331706: Migrate Mailman/lists to Bullseye/Bookworm.

Cookbook cookbooks.sre.ganeti.reimage started by jhathaway@cumin1001 for host lists1003.wikimedia.org with OS bullseye completed:

  • lists1003 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303231429_jhathaway_3111141_lists1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
Thu, Mar 23, 2:56 PM · Patch-For-Review, Wikimedia-Mailing-lists, SRE
ops-monitoring-bot added a comment to T331706: Migrate Mailman/lists to Bullseye/Bookworm.

Cookbook cookbooks.sre.ganeti.reimage was started by jhathaway@cumin1001 for host lists1003.wikimedia.org with OS bullseye

Thu, Mar 23, 2:29 PM · Patch-For-Review, Wikimedia-Mailing-lists, SRE
ops-monitoring-bot added a comment to T321309: Upgrade Traffic hosts to bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by sukhe@cumin2002 for host pybal-test2003.codfw.wmnet with OS bullseye completed:

  • pybal-test2003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303231355_sukhe_1115468_pybal-test2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
Thu, Mar 23, 2:21 PM · Patch-For-Review, Traffic, SRE
ops-monitoring-bot added a comment to T321309: Upgrade Traffic hosts to bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by sukhe@cumin2002 for host pybal-test2003.codfw.wmnet with OS bullseye

Thu, Mar 23, 1:55 PM · Patch-For-Review, Traffic, SRE
ops-monitoring-bot added a comment to T332584: Upgrade an-test-druid1001 to bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by btullis@cumin1001 for host an-test-druid1001.eqiad.wmnet with OS bullseye completed:

  • an-test-druid1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303231136_btullis_3072350_an-test-druid1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
Thu, Mar 23, 12:14 PM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
ops-monitoring-bot added a comment to T332013: Migrate kafka-main to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main2004.codfw.wmnet with OS bullseye completed:

  • kafka-main2004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303231108_elukey_3067129_kafka-main2004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Mar 23, 11:52 AM · serviceops
ops-monitoring-bot added a comment to T332584: Upgrade an-test-druid1001 to bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by btullis@cumin1001 for host an-test-druid1001.eqiad.wmnet with OS bullseye

Thu, Mar 23, 11:36 AM · Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
ops-monitoring-bot added a comment to T331702: Migrate mw_rc_irc servers to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host irc2002.wikimedia.org with OS bullseye completed:

  • irc2002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303231044_jmm_947400_irc2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
Thu, Mar 23, 11:16 AM · Wikimedia-IRC-RC-Server, SRE-Unowned, SRE
ops-monitoring-bot added a comment to T332013: Migrate kafka-main to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main2004.codfw.wmnet with OS bullseye

Thu, Mar 23, 11:08 AM · serviceops
ops-monitoring-bot added a comment to T331702: Migrate mw_rc_irc servers to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host irc2002.wikimedia.org with OS bullseye

Thu, Mar 23, 10:44 AM · Wikimedia-IRC-RC-Server, SRE-Unowned, SRE
ops-monitoring-bot added a comment to T332013: Migrate kafka-main to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main2005.codfw.wmnet with OS bullseye completed:

  • kafka-main2005 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303231001_elukey_3053469_kafka-main2005.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Mar 23, 10:38 AM · serviceops
ops-monitoring-bot added a comment to T332013: Migrate kafka-main to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main2005.codfw.wmnet with OS bullseye

Thu, Mar 23, 10:01 AM · serviceops
ops-monitoring-bot added a comment to T332819: Site: 1 VM request for doc2002.

Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye executed with errors:

  • doc2002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details
Thu, Mar 23, 5:37 AM · SRE, vm-requests
ops-monitoring-bot added a comment to T329363: Upgrade Hadoop test cluster to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by stevemunene@cumin1001 for host an-test-client1002.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-client1002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303221402_stevemunene_2834991_an-test-client1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details
Thu, Mar 23, 5:34 AM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
ops-monitoring-bot added a comment to T332819: Site: 1 VM request for doc2002.

Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye

Thu, Mar 23, 4:26 AM · SRE, vm-requests
ops-monitoring-bot added a comment to T332819: Site: 1 VM request for doc2002.

Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye executed with errors:

  • doc2002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details
Thu, Mar 23, 2:07 AM · SRE, vm-requests
ops-monitoring-bot added a comment to T332819: Site: 1 VM request for doc2002.

Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye

Thu, Mar 23, 12:57 AM · SRE, vm-requests
ops-monitoring-bot added a comment to T332819: Site: 1 VM request for doc2002.

Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye executed with errors:

  • doc2002 (FAIL)
    • The reimage failed, see the cookbook logs for the details
Thu, Mar 23, 12:57 AM · SRE, vm-requests
ops-monitoring-bot added a comment to T332819: Site: 1 VM request for doc2002.

Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host doc2002.codfw.wmnet with OS bullseye

Thu, Mar 23, 12:57 AM · SRE, vm-requests
ops-monitoring-bot added a comment to T332812: Site: 1 VM request for doc1003.

Cookbook cookbooks.sre.ganeti.reimage started by denisse@cumin1001 for host doc1003.eqiad.wmnet with OS bullseye completed:

  • doc1003 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303222346_denisse_2941187_doc1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
Thu, Mar 23, 12:10 AM · vm-requests, Infrastructure-Foundations, SRE

Wed, Mar 22

ops-monitoring-bot added a comment to T332812: Site: 1 VM request for doc1003.

Cookbook cookbooks.sre.ganeti.reimage was started by denisse@cumin1001 for host doc1003.eqiad.wmnet with OS bullseye

Wed, Mar 22, 11:46 PM · vm-requests, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T289657: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet.

cookbooks.sre.hosts.decommission executed by jhathaway@cumin1001 for hosts: dborch1002.wikimedia.org

  • dborch1002.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
Wed, Mar 22, 5:53 PM · Patch-For-Review, SRE, ops-eqiad, decommission-hardware
ops-monitoring-bot added a comment to T298959: Upgrade dborch1001 to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by jhathaway@cumin1001 for host dborch1001.wikimedia.org with OS bullseye completed:

  • dborch1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303221529_jhathaway_2852582_dborch1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
Wed, Mar 22, 3:58 PM · DBA
ops-monitoring-bot added a comment to T298959: Upgrade dborch1001 to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by jhathaway@cumin1001 for host dborch1001.wikimedia.org with OS bullseye

Wed, Mar 22, 3:29 PM · DBA
ops-monitoring-bot added a comment to T329363: Upgrade Hadoop test cluster to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by stevemunene@cumin1001 for host an-test-client1002.eqiad.wmnet with OS bullseye

Wed, Mar 22, 2:02 PM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
ops-monitoring-bot added a comment to T332013: Migrate kafka-main to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main1004.eqiad.wmnet with OS bullseye completed:

  • kafka-main1004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303220938_elukey_2764303_kafka-main1004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Wed, Mar 22, 10:16 AM · serviceops
ops-monitoring-bot added a comment to T329363: Upgrade Hadoop test cluster to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by stevemunene@cumin1001 for host an-test-client1002.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-client1002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details
Wed, Mar 22, 10:07 AM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
ops-monitoring-bot added a comment to T332013: Migrate kafka-main to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main1004.eqiad.wmnet with OS bullseye

Wed, Mar 22, 9:38 AM · serviceops
ops-monitoring-bot added a comment to T311687: Upgrade ganeti/eqiad to Bullseye.

Icinga downtime and Alertmanager silence (ID=dcc641f3-257f-4a0d-875d-85c9d542b7f8) set by jmm@cumin2002 for 3 days, 0:00:00 on 1 host(s) and their services with reason: Some tests with pybal/Bullseye

pybal-test2003.codfw.wmnet
Wed, Mar 22, 8:58 AM · Ganeti, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T329363: Upgrade Hadoop test cluster to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by stevemunene@cumin1001 for host an-test-client1002.eqiad.wmnet with OS bullseye

Wed, Mar 22, 8:52 AM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning

Tue, Mar 21

ops-monitoring-bot added a comment to T329363: Upgrade Hadoop test cluster to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by stevemunene@cumin1001 for host an-test-client1002.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-client1002 (FAIL)
    • The reimage failed, see the cookbook logs for the details
Tue, Mar 21, 9:30 PM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
ops-monitoring-bot added a comment to T329363: Upgrade Hadoop test cluster to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by stevemunene@cumin1001 for host an-test-client1002.eqiad.wmnet with OS bullseye

Tue, Mar 21, 9:21 PM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
ops-monitoring-bot added a comment to T329363: Upgrade Hadoop test cluster to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by stevemunene@cumin1001 for host an-test-client1002.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-client1002 (FAIL)
    • The reimage failed, see the cookbook logs for the details
Tue, Mar 21, 8:10 PM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
ops-monitoring-bot added a comment to T329363: Upgrade Hadoop test cluster to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by stevemunene@cumin1001 for host an-test-client1002.eqiad.wmnet with OS bullseye

Tue, Mar 21, 8:09 PM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
ops-monitoring-bot added a comment to T329363: Upgrade Hadoop test cluster to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by stevemunene@cumin1001 for host an-test-client1002.eqiad.wmnet with OS bullseye executed with errors:

  • an-test-client1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details
Tue, Mar 21, 7:52 PM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
ops-monitoring-bot added a comment to T326846: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye

Tue, Mar 21, 7:44 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T298959: Upgrade dborch1001 to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by jhathaway@cumin1001 for host dborch1002.wikimedia.org with OS bullseye executed with errors:

  • dborch1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303211852_jhathaway_2589599_dborch1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details
Tue, Mar 21, 7:41 PM · DBA
ops-monitoring-bot added a comment to T326846: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye

Tue, Mar 21, 7:17 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T298959: Upgrade dborch1001 to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by jhathaway@cumin1001 for host dborch1002.wikimedia.org with OS bullseye

Tue, Mar 21, 6:52 PM · DBA
ops-monitoring-bot added a comment to T329363: Upgrade Hadoop test cluster to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by stevemunene@cumin1001 for host an-test-client1002.eqiad.wmnet with OS bullseye

Tue, Mar 21, 6:38 PM · Patch-For-Review, Shared-Data-Infrastructure (Shared-Data-Infra Sprint 10), Data-Engineering-Planning
ops-monitoring-bot added a comment to T332013: Migrate kafka-main to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main1005.eqiad.wmnet with OS bullseye completed:

  • kafka-main1005 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303211510_elukey_2520937_kafka-main1005.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Tue, Mar 21, 3:52 PM · serviceops
ops-monitoring-bot added a comment to T326846: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye

Tue, Mar 21, 3:31 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T332013: Migrate kafka-main to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main1005.eqiad.wmnet with OS bullseye

Tue, Mar 21, 3:10 PM · serviceops
ops-monitoring-bot added a comment to T332013: Migrate kafka-main to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main1005.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-main1005 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • New OS is buster but bullseye was requested
    • The reimage failed, see the cookbook logs for the details
Tue, Mar 21, 3:02 PM · serviceops
ops-monitoring-bot added a comment to T332013: Migrate kafka-main to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main1005.eqiad.wmnet with OS bullseye

Tue, Mar 21, 2:38 PM · serviceops
ops-monitoring-bot added a comment to T332013: Migrate kafka-main to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host kafka-main1005.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-main1005 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
Tue, Mar 21, 2:37 PM · serviceops
ops-monitoring-bot added a comment to T332013: Migrate kafka-main to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host kafka-main1005.eqiad.wmnet with OS bullseye

Tue, Mar 21, 9:43 AM · serviceops

Mon, Mar 20

ops-monitoring-bot created T332649: Degraded RAID on db1154.
Mon, Mar 20, 11:08 PM · DBA, SRE, ops-eqiad
ops-monitoring-bot added a comment to T326846: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004.

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed with errors:

  • thanos-fe1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
Mon, Mar 20, 3:52 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T326846: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye

Mon, Mar 20, 2:56 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T326846: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004.

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye executed with errors:

  • ms-fe1013 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details
Mon, Mar 20, 2:53 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T326846: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye

Mon, Mar 20, 2:53 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T331700: Migrate cuminunpriv1001 to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host cuminunpriv1001.eqiad.wmnet with OS bullseye completed:

  • cuminunpriv1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303201317_jmm_2241129_cuminunpriv1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
Mon, Mar 20, 1:41 PM · Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T331700: Migrate cuminunpriv1001 to Bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host cuminunpriv1001.eqiad.wmnet with OS bullseye

Mon, Mar 20, 1:18 PM · Infrastructure-Foundations, SRE

Fri, Mar 17

ops-monitoring-bot added a comment to T326846: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004.

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye executed with errors:

  • ms-fe1013 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details
Fri, Mar 17, 1:59 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T326846: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye

Fri, Mar 17, 1:59 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops

Thu, Mar 16

ops-monitoring-bot added a comment to T331896: upgrade miscweb VMs to bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by dzahn@cumin2002 for host miscweb2003.codfw.wmnet with OS bullseye completed:

  • miscweb2003 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303162300_dzahn_2853858_miscweb2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
Thu, Mar 16, 11:31 PM · Patch-For-Review, serviceops-collab
ops-monitoring-bot added a comment to T331896: upgrade miscweb VMs to bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by dzahn@cumin1001 for host miscweb1003.eqiad.wmnet with OS bullseye completed:

  • miscweb1003 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303162301_dzahn_1167777_miscweb1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
Thu, Mar 16, 11:28 PM · Patch-For-Review, serviceops-collab
ops-monitoring-bot added a comment to T331896: upgrade miscweb VMs to bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by dzahn@cumin1001 for host miscweb1003.eqiad.wmnet with OS bullseye

Thu, Mar 16, 11:01 PM · Patch-For-Review, serviceops-collab
ops-monitoring-bot added a comment to T331896: upgrade miscweb VMs to bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by dzahn@cumin2002 for host miscweb2003.codfw.wmnet with OS bullseye

Thu, Mar 16, 11:00 PM · Patch-For-Review, serviceops-collab
ops-monitoring-bot added a comment to T326846: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004.

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed with errors:

  • thanos-fe1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
Thu, Mar 16, 6:37 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T326846: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye

Thu, Mar 16, 5:40 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T326846: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004.

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed with errors:

  • thanos-fe1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
Thu, Mar 16, 5:36 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T326846: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye

Thu, Mar 16, 5:30 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T326846: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with OS bullseye

Thu, Mar 16, 5:21 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T326363: mw2420-mw2451 service implementation tracking.

Icinga downtime and Alertmanager silence (ID=33992616-b446-4bc5-bf17-27cb8c47e8d7) set by cgoubert@cumin1001 for 1:00:00 on 32 host(s) and their services with reason: new_install

mw[2420-2451].codfw.wmnet
Thu, Mar 16, 2:50 PM · SRE, serviceops
ops-monitoring-bot added a comment to T326363: mw2420-mw2451 service implementation tracking.

Icinga downtime and Alertmanager silence (ID=f7f64d19-c64a-4fb5-a8ab-f3218dfd9862) set by cgoubert@cumin1001 for 1:00:00 on 32 host(s) and their services with reason: new_install

mw[2420-2451].codfw.wmnet
Thu, Mar 16, 12:08 PM · SRE, serviceops
ops-monitoring-bot added a comment to T326363: mw2420-mw2451 service implementation tracking.

Icinga downtime and Alertmanager silence (ID=17f33514-0b87-4f50-abfa-6cd2e1548410) set by cgoubert@cumin1001 for 5:00:00 on 32 host(s) and their services with reason: new_install

mw[2420-2451].codfw.wmnet
Thu, Mar 16, 11:17 AM · SRE, serviceops
ops-monitoring-bot added a comment to T326363: mw2420-mw2451 service implementation tracking.

Icinga downtime and Alertmanager silence (ID=c5ba1cf2-f027-43f9-8672-b4eb30f98ddc) set by cgoubert@cumin1001 for 1:00:00 on 32 host(s) and their services with reason: new_install

mw[2420-2451].codfw.wmnet
Thu, Mar 16, 10:33 AM · SRE, serviceops
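
The downtime comments above name their targets with the compact expression mw[2420-2451].codfw.wmnet and report 32 host(s). A minimal sketch that expands one such expression and checks the count, assuming the brackets denote an inclusive numeric range. The helper `expand_range` is hypothetical and is not part of Cumin or Spicerack; it handles only a single bracketed range.

```python
import re


def expand_range(expr: str) -> list[str]:
    """Expand one 'prefix[start-end]suffix' expression into hostnames.

    Assumes an inclusive numeric range; expressions without a
    bracketed range are returned unchanged.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", expr)
    if match is None:
        return [expr]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero padding, if any
    return [
        f"{prefix}{n:0{width}d}{suffix}"
        for n in range(int(start), int(end) + 1)
    ]


hosts = expand_range("mw[2420-2451].codfw.wmnet")
print(len(hosts), hosts[0], hosts[-1])
# prints: 32 mw2420.codfw.wmnet mw2451.codfw.wmnet
```

The count is 2451 - 2420 + 1 = 32, matching the figure in the bot's comments.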
ops-monitoring-bot added a comment to T331874: decommission db1105.eqiad.wmnet.

cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: db1105.eqiad.wmnet

  • db1105.eqiad.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Thu, Mar 16, 8:51 AM · SRE, ops-eqiad, decommission-hardware

Wed, Mar 15

ops-monitoring-bot added a comment to T321309: Upgrade Traffic hosts to bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host doh3002.wikimedia.org with OS bullseye completed:

  • doh3002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303151954_brett_1718980_doh3002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
Wed, Mar 15, 8:33 PM · Patch-For-Review, Traffic, SRE
ops-monitoring-bot added a comment to T321309: Upgrade Traffic hosts to bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host doh1002.wikimedia.org with OS bullseye completed:

  • doh1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303151948_brett_1715541_doh1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
Wed, Mar 15, 8:20 PM · Patch-For-Review, Traffic, SRE
ops-monitoring-bot added a comment to T321309: Upgrade Traffic hosts to bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host doh2002.wikimedia.org with OS bullseye completed:

  • doh2002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303151945_brett_1712110_doh2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
Wed, Mar 15, 8:17 PM · Patch-For-Review, Traffic, SRE
ops-monitoring-bot added a comment to T326420: Kafka-logging Bullseye Upgrades.

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1001 for host kafka-logging1001.eqiad.wmnet with OS bullseye completed:

  • kafka-logging1001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303151932_herron_854627_kafka-logging1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
Wed, Mar 15, 8:16 PM · SRE Observability (FY2022/2023-Q3), Observability-Logging, User-herron
ops-monitoring-bot added a comment to T321309: Upgrade Traffic hosts to bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host doh3002.wikimedia.org with OS bullseye

Wed, Mar 15, 7:54 PM · Patch-For-Review, Traffic, SRE
ops-monitoring-bot added a comment to T321309: Upgrade Traffic hosts to bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host doh3001.wikimedia.org with OS bullseye completed:

  • doh3001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303151914_brett_1685626_doh3001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
Wed, Mar 15, 7:53 PM · Patch-For-Review, Traffic, SRE
ops-monitoring-bot added a comment to T321309: Upgrade Traffic hosts to bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host doh1002.wikimedia.org with OS bullseye

Wed, Mar 15, 7:49 PM · Patch-For-Review, Traffic, SRE
ops-monitoring-bot added a comment to T321309: Upgrade Traffic hosts to bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host doh1001.wikimedia.org with OS bullseye completed:

  • doh1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303151917_brett_1687637_doh1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
Wed, Mar 15, 7:46 PM · Patch-For-Review, Traffic, SRE
ops-monitoring-bot added a comment to T321309: Upgrade Traffic hosts to bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host doh2002.wikimedia.org with OS bullseye

Wed, Mar 15, 7:45 PM · Patch-For-Review, Traffic, SRE
ops-monitoring-bot added a comment to T321309: Upgrade Traffic hosts to bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host doh2001.wikimedia.org with OS bullseye completed:

  • doh2001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303151916_brett_1686927_doh2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
Wed, Mar 15, 7:44 PM · Patch-For-Review, Traffic, SRE
ops-monitoring-bot added a comment to T321309: Upgrade Traffic hosts to bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host doh6002.wikimedia.org with OS bullseye completed:

  • doh6002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303151905_brett_1678110_doh6002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
Wed, Mar 15, 7:41 PM · Patch-For-Review, Traffic, SRE
ops-monitoring-bot added a comment to T326420: Kafka-logging Bullseye Upgrades.

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1001 for host kafka-logging1001.eqiad.wmnet with OS bullseye

Wed, Mar 15, 7:32 PM · SRE Observability (FY2022/2023-Q3), Observability-Logging, User-herron
ops-monitoring-bot added a comment to T321309: Upgrade Traffic hosts to bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host doh1001.wikimedia.org with OS bullseye

Wed, Mar 15, 7:17 PM · Patch-For-Review, Traffic, SRE
ops-monitoring-bot added a comment to T321309: Upgrade Traffic hosts to bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host doh2001.wikimedia.org with OS bullseye

Wed, Mar 15, 7:16 PM · Patch-For-Review, Traffic, SRE
ops-monitoring-bot added a comment to T321309: Upgrade Traffic hosts to bullseye.

Cookbook cookbooks.sre.ganeti.reimage started by brett@cumin2002 for host doh5002.wikimedia.org with OS bullseye completed:

  • doh5002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303151819_brett_1643746_doh5002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
Wed, Mar 15, 7:15 PM · Patch-For-Review, Traffic, SRE
ops-monitoring-bot added a comment to T321309: Upgrade Traffic hosts to bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host doh3001.wikimedia.org with OS bullseye

Wed, Mar 15, 7:14 PM · Patch-For-Review, Traffic, SRE
ops-monitoring-bot added a comment to T321309: Upgrade Traffic hosts to bullseye.

Cookbook cookbooks.sre.ganeti.reimage was started by brett@cumin2002 for host doh6002.wikimedia.org with OS bullseye

Wed, Mar 15, 7:05 PM · Patch-For-Review, Traffic, SRE