ops-monitoring-bot (Operations Monitoring Bot)
UserBot

User Details

User Since
Aug 12 2016, 1:45 PM (405 w, 2 d)
Roles
Bot
Availability
Available
LDAP User
Unknown
MediaWiki User
Unknown

Bot managed by SRE for automated interaction with Phabricator from monitoring tools.

Recent Activity

Yesterday

ops-monitoring-bot added a comment to T353878: Service implementation for elastic2087-2109.

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2090.codfw.wmnet with OS bullseye completed:

  • elastic2090 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405181836_ryankemper_2728901_elastic2090.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
Sat, May 18, 6:56 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14)
ops-monitoring-bot added a comment to T353878: Service implementation for elastic2087-2109.

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2090.codfw.wmnet with OS bullseye

Sat, May 18, 6:16 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14)
ops-monitoring-bot added a comment to T353878: Service implementation for elastic2087-2109.

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2090.codfw.wmnet with OS bullseye executed with errors:

  • elastic2090 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" elastic2090.codfw.wmnet to get a root shell, but depending on the failure this may not work.
Sat, May 18, 2:39 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14)
ops-monitoring-bot added a comment to T353878: Service implementation for elastic2087-2109.

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2090.codfw.wmnet with OS bullseye

Sat, May 18, 1:18 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14)
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye completed:

  • kafka-main2009 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405172346_pt1979_1649208_kafka-main2009.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Sat, May 18, 12:04 AM · SRE, ops-codfw, serviceops, DC-Ops

Fri, May 17

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Fri, May 17, 11:41 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2009 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.
Fri, May 17, 11:08 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-main1006 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main1006.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
Fri, May 17, 10:43 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Fri, May 17, 10:20 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye

Fri, May 17, 9:58 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-main1006 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main1006.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
Fri, May 17, 9:57 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye

Fri, May 17, 9:47 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2009 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.
Fri, May 17, 9:10 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Fri, May 17, 7:43 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2009 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.
Fri, May 17, 7:42 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Fri, May 17, 7:21 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2009 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.
Fri, May 17, 4:17 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Fri, May 17, 3:21 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.

cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: kubestagetcd[1004-1006].eqiad.wmnet

  • kubestagetcd1004.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
Fri, May 17, 12:56 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T331699: Migrate the r/w LDAP servers to Bookworm and MDB storage.

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ldap-replica1006.wikimedia.org

  • ldap-replica1006.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
Fri, May 17, 12:46 PM · Patch-For-Review, LDAP, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T331699: Migrate the r/w LDAP servers to Bookworm and MDB storage.

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ldap-replica1005.wikimedia.org

  • ldap-replica1005.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
Fri, May 17, 12:27 PM · Patch-For-Review, LDAP, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.

Icinga downtime and Alertmanager silence (ID=dd087345-70da-428c-8704-76433fe47872) set by jayme@cumin1002 for 2 days, 0:00:00 on 3 host(s) and their services with reason: decom

kubestagetcd[1004-1006].eqiad.wmnet
Fri, May 17, 12:24 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.

cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: kubestagemaster[1001-1002].eqiad.wmnet

  • kubestagemaster1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
Fri, May 17, 12:12 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T331699: Migrate the r/w LDAP servers to Bookworm and MDB storage.

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ldap-replica2008.wikimedia.org

  • ldap-replica2008.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
Fri, May 17, 12:07 PM · Patch-For-Review, LDAP, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.

Icinga downtime and Alertmanager silence (ID=d858a874-17ca-4ab5-8c9c-7fea35f1c823) set by jayme@cumin1002 for 2 days, 0:00:00 on 2 host(s) and their services with reason: decom

kubestagemaster[1001-1002].eqiad.wmnet
Fri, May 17, 11:51 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T331699: Migrate the r/w LDAP servers to Bookworm and MDB storage.

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ldap-replica2007.wikimedia.org

  • ldap-replica2007.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
Fri, May 17, 11:48 AM · Patch-For-Review, LDAP, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T325228: Migrate Dumps Snapshot hosts from Buster to Bullseye.

Host rebooted by btullis@cumin1002 with reason: Rebooting to pick up new kernel

Fri, May 17, 9:17 AM · Data-Platform-SRE (2024.05.06 - 2024.05.26), SRE, Data-Engineering, Dumps-Generation
ops-monitoring-bot added a comment to T325228: Migrate Dumps Snapshot hosts from Buster to Bullseye.

Host rebooted by btullis@cumin1002 with reason: Rebooting to pick up new kernel

Fri, May 17, 9:01 AM · Data-Platform-SRE (2024.05.06 - 2024.05.26), SRE, Data-Engineering, Dumps-Generation
ops-monitoring-bot created T365217: Degraded RAID on backup2010.
Fri, May 17, 4:31 AM · SRE, ops-codfw
ops-monitoring-bot created T365213: Degraded RAID on es2022.
Fri, May 17, 12:54 AM · DBA, SRE, ops-codfw

Thu, May 16

ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host contint2002.wikimedia.org with OS bullseye completed:

  • contint2002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405162014_dzahn_464740_contint2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
Thu, May 16, 8:33 PM · Release-Engineering-Team (Radar), collaboration-services
ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host contint2002.wikimedia.org with OS bullseye

Thu, May 16, 7:55 PM · Release-Engineering-Team (Radar), collaboration-services
ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS buster

Thu, May 16, 6:58 PM · Release-Engineering-Team (Radar), collaboration-services
ops-monitoring-bot added a comment to T355353: Q3:rack/setup/install dbprov100[56].

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-main1006 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main1006.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
Thu, May 16, 6:32 PM · Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye

Thu, May 16, 6:17 PM · Release-Engineering-Team (Radar), collaboration-services
ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS buster executed with errors:

  • contint2002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" contint2002.wikimedia.org to get a root shell, but depending on the failure this may not work.
Thu, May 16, 6:04 PM · Release-Engineering-Team (Radar), collaboration-services
ops-monitoring-bot added a comment to T355353: Q3:rack/setup/install dbprov100[56].

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye

Thu, May 16, 5:46 PM · Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS buster

Thu, May 16, 4:59 PM · Release-Engineering-Team (Radar), collaboration-services
ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye executed with errors:

  • contint2002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" contint2002.wikimedia.org to get a root shell, but depending on the failure this may not work.
Thu, May 16, 4:57 PM · Release-Engineering-Team (Radar), collaboration-services
ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye

Thu, May 16, 4:42 PM · Release-Engineering-Team (Radar), collaboration-services
ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye executed with errors:

  • contint2002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" contint2002.wikimedia.org to get a root shell, but depending on the failure this may not work.
Thu, May 16, 4:40 PM · Release-Engineering-Team (Radar), collaboration-services
ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye

Thu, May 16, 3:25 PM · Release-Engineering-Team (Radar), collaboration-services
ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye executed with errors:

  • contint2002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" contint2002.wikimedia.org to get a root shell, but depending on the failure this may not work.
Thu, May 16, 3:24 PM · Release-Engineering-Team (Radar), collaboration-services
ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye

Thu, May 16, 3:03 PM · Release-Engineering-Team (Radar), collaboration-services
ops-monitoring-bot added a comment to T364290: Upgrade s1 to MariaDB 10.6.

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2174.codfw.wmnet with OS bookworm completed:

  • db2174 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405161428_arnaudb_418176_db2174.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
Thu, May 16, 2:47 PM · DBA
ops-monitoring-bot added a comment to T364290: Upgrade s1 to MariaDB 10.6.

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2174.codfw.wmnet with OS bookworm

Thu, May 16, 2:08 PM · DBA
ops-monitoring-bot added a comment to T364290: Upgrade s1 to MariaDB 10.6.

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2176.codfw.wmnet with OS bookworm completed:

  • db2176 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405161337_arnaudb_407694_db2176.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
Thu, May 16, 1:59 PM · DBA
ops-monitoring-bot added a comment to T364289: Reimage external store hosts with Bookworm.

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host es1024.eqiad.wmnet with OS bookworm completed:

  • es1024 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405161331_marostegui_407297_es1024.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, May 16, 1:48 PM · DBA
ops-monitoring-bot added a comment to T364290: Upgrade s1 to MariaDB 10.6.

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2176.codfw.wmnet with OS bookworm

Thu, May 16, 1:17 PM · DBA
ops-monitoring-bot added a comment to T364289: Reimage external store hosts with Bookworm.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host es1024.eqiad.wmnet with OS bookworm

Thu, May 16, 1:12 PM · DBA
ops-monitoring-bot added a comment to T364435: Drop gu_salt from globaluser.

Cookbook cookbooks.sre.wikireplicas.update-views started by fnegri completed:

  • clouddb1021.eqiad.wmnet (PASS)
    • Ran Puppet agent
    • Ran 'maintain-views --all-databases --replace-all --auto-depool --table globaluser'
Thu, May 16, 10:48 AM · MW-1.43-notes (1.43.0-wmf.6; 2024-05-21), MediaWiki-Platform-Team (Radar), Patch-For-Review, MediaWiki-extensions-CentralAuth
ops-monitoring-bot added a comment to T364435: Drop gu_salt from globaluser.

Cookbook cookbooks.sre.wikireplicas.update-views run by fnegri: Started updating wiki replica views

Thu, May 16, 10:38 AM · MW-1.43-notes (1.43.0-wmf.6; 2024-05-21), MediaWiki-Platform-Team (Radar), Patch-For-Review, MediaWiki-extensions-CentralAuth
ops-monitoring-bot added a comment to T364289: Reimage external store hosts with Bookworm.

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host es1021.eqiad.wmnet with OS bookworm completed:

  • es1021 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405160808_marostegui_361726_es1021.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, May 16, 8:26 AM · DBA
ops-monitoring-bot added a comment to T364289: Reimage external store hosts with Bookworm.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host es1021.eqiad.wmnet with OS bookworm

Thu, May 16, 7:51 AM · DBA
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2009 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.
Thu, May 16, 1:12 AM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Thu, May 16, 12:28 AM · SRE, ops-codfw, serviceops, DC-Ops

Wed, May 15

ops-monitoring-bot added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host kafka-main1007.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-main1007 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main1007.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
Wed, May 15, 6:03 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host kafka-main1007.eqiad.wmnet with OS bullseye

Wed, May 15, 5:17 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2009 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.
Wed, May 15, 3:05 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2010.codfw.wmnet with OS bullseye completed:

  • kafka-main2010 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405151426_jhancock_2468124_kafka-main2010.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Wed, May 15, 2:44 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2008.codfw.wmnet with OS bullseye completed:

  • kafka-main2008 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405151424_jhancock_2468073_kafka-main2008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Wed, May 15, 2:41 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2007.codfw.wmnet with OS bullseye completed:

  • kafka-main2007 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405151422_jhancock_2467812_kafka-main2007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
Wed, May 15, 2:39 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2010.codfw.wmnet with OS bullseye

Wed, May 15, 2:02 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Wed, May 15, 2:02 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2008.codfw.wmnet with OS bullseye

Wed, May 15, 2:02 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2007.codfw.wmnet with OS bullseye

Wed, May 15, 2:02 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2006.codfw.wmnet with OS bullseye completed:

  • kafka-main2006 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405151325_jhancock_2409379_kafka-main2006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
Wed, May 15, 1:49 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2006.codfw.wmnet with OS bullseye

Wed, May 15, 1:04 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2006.codfw.wmnet with OS bullseye executed with errors:

  • kafka-main2006 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2006.codfw.wmnet to get a root shell, but depending on the failure this may not work.
Wed, May 15, 1:02 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2006.codfw.wmnet with OS bullseye

Wed, May 15, 1:01 PM · SRE, ops-codfw, serviceops, DC-Ops
ops-monitoring-bot added a comment to T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.

cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: kubestagetcd[2001-2003].codfw.wmnet

  • kubestagetcd2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
Wed, May 15, 12:58 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.

Icinga downtime and Alertmanager silence (ID=5c048aeb-57ce-4f8d-8159-53dcf8b5fb78) set by jayme@cumin1002 for 2 days, 0:00:00 on 3 host(s) and their services with reason: decom

kubestagetcd[2001-2003].codfw.wmnet
Wed, May 15, 12:23 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T319184: Move WMCS servers to 1 single NIC.

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm executed with errors:

  • cloudvirt1041 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cloudvirt1041.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
Wed, May 15, 11:52 AM · Patch-For-Review, User-aborrero, cloud-services-team, SRE, netops, Infrastructure-Foundations
ops-monitoring-bot added a comment to T319184: Move WMCS servers to 1 single NIC.

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm

Wed, May 15, 11:11 AM · Patch-For-Review, User-aborrero, cloud-services-team, SRE, netops, Infrastructure-Foundations
ops-monitoring-bot added a comment to T319184: Move WMCS servers to 1 single NIC.

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm executed with errors:

  • cloudvirt1041 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cloudvirt1041.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
Wed, May 15, 11:05 AM · Patch-For-Review, User-aborrero, cloud-services-team, SRE, netops, Infrastructure-Foundations
ops-monitoring-bot added a comment to T319184: Move WMCS servers to 1 single NIC.

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm

Wed, May 15, 10:52 AM · Patch-For-Review, User-aborrero, cloud-services-team, SRE, netops, Infrastructure-Foundations
ops-monitoring-bot added a comment to T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.

cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: kubestagemaster[2001-2002].codfw.wmnet

  • kubestagemaster2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
Wed, May 15, 9:53 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T364823: Upgrade r/w LDAP servers to Bullseye.

Icinga downtime and Alertmanager silence (ID=be009031-0cc0-4a4d-97a0-f4d990831efe) set by jmm@cumin2002 for 1:00:00 on 1 host(s) and their services with reason: OS update

seaborgium.wikimedia.org
Wed, May 15, 9:01 AM · LDAP, SRE, Infrastructure-Foundations

Tue, May 14

ops-monitoring-bot added a comment to T355353: Q3:rack/setup/install dbprov100[56].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye

Tue, May 14, 7:26 PM · Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T364480: Extend BGP peer automation via Netbox to include VMs.

Deployed homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.6.5 update to add modified wmf homer plugin - cmooney@cumin1002 - T364480

Tue, May 14, 4:08 PM · netops, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T364850: Deploy Phabricator/Phorge 2024-05-14.

Icinga downtime and Alertmanager silence (ID=6e2580b0-999e-4a68-87e7-c37d374c663f) set by aokoth@cumin1002 for 0:30:00 on 1 host(s) and their services with reason: Phorge update

phab1004.eqiad.wmnet
Tue, May 14, 3:04 PM · collaboration-services, User-brennen, Release-Engineering-Team (Yakisfaction), Phabricator (2024-05-14)
ops-monitoring-bot added a comment to T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.

VM kubestagemaster1005.eqiad.wmnet switching disk type to plain

Tue, May 14, 9:49 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.

VM kubestagemaster1004.eqiad.wmnet switching disk type to plain

Tue, May 14, 9:48 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.

VM kubestagemaster1003.eqiad.wmnet switching disk type to plain

Tue, May 14, 9:48 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster1005.eqiad.wmnet with OS bullseye completed:

  • kubestagemaster1005 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405140931_jayme_4192981_kubestagemaster1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Tue, May 14, 9:45 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster1004.eqiad.wmnet with OS bullseye completed:

  • kubestagemaster1004 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405140906_jayme_4192644_kubestagemaster1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Tue, May 14, 9:20 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster1003.eqiad.wmnet with OS bullseye completed:

  • kubestagemaster1003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405140904_jayme_4192355_kubestagemaster1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Tue, May 14, 9:18 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster1005.eqiad.wmnet with OS bullseye

Tue, May 14, 9:14 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T364823: Upgrade r/w LDAP servers to Bullseye.

Icinga downtime and Alertmanager silence (ID=34ac3b76-436c-436c-afc2-20387cde43fb) set by jmm@cumin2002 for 1:00:00 on 1 host(s) and their services with reason: OS update

serpens.wikimedia.org
Tue, May 14, 8:58 AM · LDAP, SRE, Infrastructure-Foundations
ops-monitoring-bot added a comment to T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster1004.eqiad.wmnet with OS bullseye

Tue, May 14, 8:52 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster1003.eqiad.wmnet with OS bullseye

Tue, May 14, 8:49 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bullseye executed with errors:

  • kubestagemaster2005 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405131705_jayme_4072335_kubestagemaster2005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kubestagemaster2005.codfw.wmnet to get a root shell, but depending on the failure this may not work.
Tue, May 14, 8:15 AM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T364296: Reimage db1215 and db2185 (zarcillo) to bookworm.

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host db2185.codfw.wmnet with OS bookworm completed:

  • db2185 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405140656_marostegui_4175353_db2185.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
Tue, May 14, 7:17 AM · DBA
ops-monitoring-bot added a comment to T364296: Reimage db1215 and db2185 (zarcillo) to bookworm.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host db2185.codfw.wmnet with OS bookworm

Tue, May 14, 6:35 AM · DBA
ops-monitoring-bot added a comment to T364296: Reimage db1215 and db2185 (zarcillo) to bookworm.

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host db2185.codfw.wmnet with OS bookworm executed with errors:

  • db2185 (FAIL)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" db2185.codfw.wmnet to get a root shell, but depending on the failure this may not work.
Tue, May 14, 6:34 AM · DBA
ops-monitoring-bot added a comment to T364296: Reimage db1215 and db2185 (zarcillo) to bookworm.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host db2185.codfw.wmnet with OS bookworm

Tue, May 14, 6:34 AM · DBA

Mon, May 13

ops-monitoring-bot added a comment to T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bullseye

Mon, May 13, 4:50 PM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bullseye executed with errors:

  • kubestagemaster2005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405131442_jayme_4048520_kubestagemaster2005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kubestagemaster2005.codfw.wmnet to get a root shell, but depending on the failure this may not work.
Mon, May 13, 4:46 PM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T363310: Site: codfw 1 VM request for staging-codfw kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster2004.codfw.wmnet with OS bullseye completed:

  • kubestagemaster2004 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405131435_jayme_4048471_kubestagemaster2004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Mon, May 13, 2:49 PM · Patch-For-Review, vm-requests, Infrastructure-Foundations, SRE, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bullseye

Mon, May 13, 2:25 PM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes