ops-monitoring-bot (Operations Monitoring Bot)
UserBot

Projects

Trusted-Contributors
Group

User Details

User Since: Aug 12 2016, 1:45 PM (405 w, 2 d)
Roles: Bot
Availability: Available
LDAP User: Unknown
MediaWiki User: Unknown

Bot managed by SRE for automated interaction with Phabricator from monitoring tools.

Recent Activity

Sat, May 18

ops-monitoring-bot added a comment to T353878: Service implementation for elastic2087-2109.

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2090.codfw.wmnet with OS bullseye completed:

elastic2090 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405181836_ryankemper_2728901_elastic2090.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Sat, May 18, 6:56 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14)
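
The cookbook reports in this feed share a simple shape: a header line of the form "host (STATUS)" followed by "- step" lines. A minimal sketch of how such a comment could be parsed is shown below. The names `HostReport` and `parse_reports` are hypothetical, the status set (PASS, WARN, FAIL) is only what appears in this feed, and none of this is part of the cookbook tooling itself.

```python
import re
from dataclasses import dataclass, field

# Header lines look like "elastic2090 (PASS)". The three statuses are the
# ones visible in this feed; the real tooling may emit others (assumption).
HEADER_RE = re.compile(r"^(?P<host>\S+) \((?P<status>PASS|WARN|FAIL)\)$")


@dataclass
class HostReport:
    """One per-host report block (hypothetical container for illustration)."""
    host: str
    status: str
    steps: list[str] = field(default_factory=list)


def parse_reports(text: str) -> list[HostReport]:
    """Collect "host (STATUS)" headers and the "- step" lines that follow them."""
    reports: list[HostReport] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER_RE.match(line)
        if match:
            current = HostReport(match["host"], match["status"])
            reports.append(current)
        elif line.startswith("- ") and current is not None:
            current.steps.append(line[2:])
        elif not line:
            continue  # blank lines do not end a report
        else:
            current = None  # any other text (e.g. a timestamp) ends the report
    return reports
```

With the reports parsed this way, the last entry in `steps` of a FAIL report is the point where the run stopped, which makes repeated failures on the same host easy to compare.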

ops-monitoring-bot added a comment to T353878: Service implementation for elastic2087-2109.

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2090.codfw.wmnet with OS bullseye

Sat, May 18, 6:16 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14)

ops-monitoring-bot added a comment to T353878: Service implementation for elastic2087-2109.

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2090.codfw.wmnet with OS bullseye executed with errors:

elastic2090 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" elastic2090.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Sat, May 18, 2:39 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14)

ops-monitoring-bot added a comment to T353878: Service implementation for elastic2087-2109.

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2090.codfw.wmnet with OS bullseye

Sat, May 18, 1:18 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14)

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye completed:

kafka-main2009 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405172346_pt1979_1649208_kafka-main2009.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully

Sat, May 18, 12:04 AM · SRE, ops-codfw, serviceops, DC-Ops

Fri, May 17

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Fri, May 17, 11:41 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

kafka-main2009 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Fri, May 17, 11:08 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye executed with errors:

kafka-main1006 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main1006.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Fri, May 17, 10:43 PM · SRE, serviceops, ops-eqiad, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Fri, May 17, 10:20 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye

Fri, May 17, 9:58 PM · SRE, serviceops, ops-eqiad, DC-Ops

ops-monitoring-bot added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye executed with errors:

kafka-main1006 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main1006.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Fri, May 17, 9:57 PM · SRE, serviceops, ops-eqiad, DC-Ops

ops-monitoring-bot added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye

Fri, May 17, 9:47 PM · SRE, serviceops, ops-eqiad, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

kafka-main2009 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Fri, May 17, 9:10 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Fri, May 17, 7:43 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

kafka-main2009 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Fri, May 17, 7:42 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Fri, May 17, 7:21 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

kafka-main2009 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Fri, May 17, 4:17 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Fri, May 17, 3:21 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.

cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: kubestagetcd[1004-1006].eqiad.wmnet

kubestagetcd1004.eqiad.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster eqiad to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster eqiad to Netbox

Fri, May 17, 12:56 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

ops-monitoring-bot added a comment to T331699: Migrate the r/w LDAP servers to Bookworm and MDB storage.

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ldap-replica1006.wikimedia.org

ldap-replica1006.wikimedia.org (PASS)
- Downtimed host on Icinga/Alertmanager
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster eqiad to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster eqiad to Netbox

Fri, May 17, 12:46 PM · Patch-For-Review, LDAP, Infrastructure-Foundations, SRE

ops-monitoring-bot added a comment to T331699: Migrate the r/w LDAP servers to Bookworm and MDB storage.

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ldap-replica1005.wikimedia.org

ldap-replica1005.wikimedia.org (PASS)
- Downtimed host on Icinga/Alertmanager
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster eqiad to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster eqiad to Netbox

Fri, May 17, 12:27 PM · Patch-For-Review, LDAP, Infrastructure-Foundations, SRE

ops-monitoring-bot added a comment to T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.

Icinga downtime and Alertmanager silence (ID=dd087345-70da-428c-8704-76433fe47872) set by jayme@cumin1002 for 2 days, 0:00:00 on 3 host(s) and their services with reason: decom

kubestagetcd[1004-1006].eqiad.wmnet

Fri, May 17, 12:24 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
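
Host expressions such as kubestagetcd[1004-1006].eqiad.wmnet stand for several hosts (the comment above says "3 host(s)"). The sketch below shows one way to expand a single numeric `[start-end]` range into individual hostnames. The function `expand_hosts` is hypothetical and covers only this one form; it is not the library the cookbooks use, and it does not handle the character-class style seen in task titles such as kafka-main200[6789].

```python
import re

# Matches exactly one "[start-end]" numeric range with optional text around it.
RANGE_RE = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<start>\d+)-(?P<end>\d+)\](?P<suffix>[^\[\]]*)$"
)


def expand_hosts(expr: str) -> list[str]:
    """Expand "name[1004-1006].domain" into a list of hostnames.

    Expressions without a bracketed range are returned unchanged as a
    single-element list.
    """
    match = RANGE_RE.match(expr)
    if match is None:
        return [expr]
    start, end = match["start"], match["end"]
    if int(start) > int(end):
        raise ValueError(f"descending range in {expr!r}")
    width = len(start)  # keep any zero padding used in the range start
    return [
        f"{match['prefix']}{number:0{width}d}{match['suffix']}"
        for number in range(int(start), int(end) + 1)
    ]
```

For example, `expand_hosts("kubestagemaster[1001-1002].eqiad.wmnet")` yields the two hostnames that the matching decommission report lists individually.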

ops-monitoring-bot added a comment to T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.

cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: kubestagemaster[1001-1002].eqiad.wmnet

kubestagemaster1001.eqiad.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster eqiad to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster eqiad to Netbox

Fri, May 17, 12:12 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

ops-monitoring-bot added a comment to T331699: Migrate the r/w LDAP servers to Bookworm and MDB storage.

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ldap-replica2008.wikimedia.org

ldap-replica2008.wikimedia.org (PASS)
- Downtimed host on Icinga/Alertmanager
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster codfw to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster codfw to Netbox

Fri, May 17, 12:07 PM · Patch-For-Review, LDAP, Infrastructure-Foundations, SRE

ops-monitoring-bot added a comment to T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.

Icinga downtime and Alertmanager silence (ID=d858a874-17ca-4ab5-8c9c-7fea35f1c823) set by jayme@cumin1002 for 2 days, 0:00:00 on 2 host(s) and their services with reason: decom

kubestagemaster[1001-1002].eqiad.wmnet

Fri, May 17, 11:51 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

ops-monitoring-bot added a comment to T331699: Migrate the r/w LDAP servers to Bookworm and MDB storage.

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ldap-replica2007.wikimedia.org

ldap-replica2007.wikimedia.org (PASS)
- Downtimed host on Icinga/Alertmanager
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster codfw to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster codfw to Netbox

Fri, May 17, 11:48 AM · Patch-For-Review, LDAP, Infrastructure-Foundations, SRE

ops-monitoring-bot added a comment to T325228: Migrate Dumps Snapshot hosts from Buster to Bullseye.

Host rebooted by btullis@cumin1002 with reason: Rebooting to pick up new kernel

Fri, May 17, 9:17 AM · Data-Platform-SRE (2024.05.06 - 2024.05.26), SRE, Data-Engineering, Dumps-Generation

ops-monitoring-bot added a comment to T325228: Migrate Dumps Snapshot hosts from Buster to Bullseye.

Host rebooted by btullis@cumin1002 with reason: Rebooting to pick up new kernel

Fri, May 17, 9:01 AM · Data-Platform-SRE (2024.05.06 - 2024.05.26), SRE, Data-Engineering, Dumps-Generation

ops-monitoring-bot created T365217: Degraded RAID on backup2010.

Fri, May 17, 4:31 AM · SRE, ops-codfw

ops-monitoring-bot created T365213: Degraded RAID on es2022.

Fri, May 17, 12:54 AM · DBA, SRE, ops-codfw

Thu, May 16

ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host contint2002.wikimedia.org with OS bullseye completed:

contint2002 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405162014_dzahn_464740_contint2002.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Thu, May 16, 8:33 PM · Release-Engineering-Team (Radar), collaboration-services

ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host contint2002.wikimedia.org with OS bullseye

Thu, May 16, 7:55 PM · Release-Engineering-Team (Radar), collaboration-services

ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS buster

Thu, May 16, 6:58 PM · Release-Engineering-Team (Radar), collaboration-services

ops-monitoring-bot added a comment to T355353: Q3:rack/setup/install dbprov100[56].

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye executed with errors:

kafka-main1006 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main1006.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Thu, May 16, 6:32 PM · Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops

ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye

Thu, May 16, 6:17 PM · Release-Engineering-Team (Radar), collaboration-services

ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS buster executed with errors:

contint2002 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" contint2002.wikimedia.org to get a root shell, but depending on the failure this may not work.

Thu, May 16, 6:04 PM · Release-Engineering-Team (Radar), collaboration-services

ops-monitoring-bot added a comment to T355353: Q3:rack/setup/install dbprov100[56].

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye

Thu, May 16, 5:46 PM · Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops

ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS buster

Thu, May 16, 4:59 PM · Release-Engineering-Team (Radar), collaboration-services

ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye executed with errors:

contint2002 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" contint2002.wikimedia.org to get a root shell, but depending on the failure this may not work.

Thu, May 16, 4:57 PM · Release-Engineering-Team (Radar), collaboration-services

ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye

Thu, May 16, 4:42 PM · Release-Engineering-Team (Radar), collaboration-services

ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye executed with errors:

contint2002 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" contint2002.wikimedia.org to get a root shell, but depending on the failure this may not work.

Thu, May 16, 4:40 PM · Release-Engineering-Team (Radar), collaboration-services

ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye

Thu, May 16, 3:25 PM · Release-Engineering-Team (Radar), collaboration-services

ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye executed with errors:

contint2002 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" contint2002.wikimedia.org to get a root shell, but depending on the failure this may not work.

Thu, May 16, 3:24 PM · Release-Engineering-Team (Radar), collaboration-services

ops-monitoring-bot added a comment to T334517: upgrade contint servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint2002.wikimedia.org with OS bullseye

Thu, May 16, 3:03 PM · Release-Engineering-Team (Radar), collaboration-services

ops-monitoring-bot added a comment to T364290: Upgrade s1 to MariaDB 10.6.

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2174.codfw.wmnet with OS bookworm completed:

db2174 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405161428_arnaudb_418176_db2174.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Thu, May 16, 2:47 PM · DBA

ops-monitoring-bot added a comment to T364290: Upgrade s1 to MariaDB 10.6.

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2174.codfw.wmnet with OS bookworm

Thu, May 16, 2:08 PM · DBA

ops-monitoring-bot added a comment to T364290: Upgrade s1 to MariaDB 10.6.

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2176.codfw.wmnet with OS bookworm completed:

db2176 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405161337_arnaudb_407694_db2176.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Thu, May 16, 1:59 PM · DBA

ops-monitoring-bot added a comment to T364289: Reimage external store hosts with Bookworm.

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host es1024.eqiad.wmnet with OS bookworm completed:

es1024 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405161331_marostegui_407297_es1024.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Thu, May 16, 1:48 PM · DBA

ops-monitoring-bot added a comment to T364290: Upgrade s1 to MariaDB 10.6.

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2176.codfw.wmnet with OS bookworm

Thu, May 16, 1:17 PM · DBA

ops-monitoring-bot added a comment to T364289: Reimage external store hosts with Bookworm.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host es1024.eqiad.wmnet with OS bookworm

Thu, May 16, 1:12 PM · DBA

ops-monitoring-bot added a comment to T364435: Drop gu_salt from globaluser.

Cookbook cookbooks.sre.wikireplicas.update-views started by fnegri completed:

clouddb1021.eqiad.wmnet (PASS)
- Ran Puppet agent
- Ran 'maintain-views --all-databases --replace-all --auto-depool --table globaluser'

Thu, May 16, 10:48 AM · MW-1.43-notes (1.43.0-wmf.6; 2024-05-21), MediaWiki-Platform-Team (Radar), Patch-For-Review, MediaWiki-extensions-CentralAuth

ops-monitoring-bot added a comment to T364435: Drop gu_salt from globaluser.

Cookbook cookbooks.sre.wikireplicas.update-views run by fnegri: Started updating wiki replica views

Thu, May 16, 10:38 AM · MW-1.43-notes (1.43.0-wmf.6; 2024-05-21), MediaWiki-Platform-Team (Radar), Patch-For-Review, MediaWiki-extensions-CentralAuth

ops-monitoring-bot added a comment to T364289: Reimage external store hosts with Bookworm.

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host es1021.eqiad.wmnet with OS bookworm completed:

es1021 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405160808_marostegui_361726_es1021.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Thu, May 16, 8:26 AM · DBA

ops-monitoring-bot added a comment to T364289: Reimage external store hosts with Bookworm.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host es1021.eqiad.wmnet with OS bookworm

Thu, May 16, 7:51 AM · DBA

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

kafka-main2009 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Generated Puppet certificate
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Thu, May 16, 1:12 AM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Thu, May 16, 12:28 AM · SRE, ops-codfw, serviceops, DC-Ops

Wed, May 15

ops-monitoring-bot added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host kafka-main1007.eqiad.wmnet with OS bullseye executed with errors:

kafka-main1007 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main1007.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Wed, May 15, 6:03 PM · SRE, serviceops, ops-eqiad, DC-Ops

ops-monitoring-bot added a comment to T363212: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host kafka-main1007.eqiad.wmnet with OS bullseye

Wed, May 15, 5:17 PM · SRE, serviceops, ops-eqiad, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with errors:

kafka-main2009 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2009.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Wed, May 15, 3:05 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2010.codfw.wmnet with OS bullseye completed:

kafka-main2010 (WARN)
- Downtimed on Icinga/Alertmanager
- Unable to disable Puppet, the host may have been unreachable
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405151426_jhancock_2468124_kafka-main2010.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully

Wed, May 15, 2:44 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2008.codfw.wmnet with OS bullseye completed:

kafka-main2008 (WARN)
- Downtimed on Icinga/Alertmanager
- Unable to disable Puppet, the host may have been unreachable
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405151424_jhancock_2468073_kafka-main2008.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully

Wed, May 15, 2:41 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2007.codfw.wmnet with OS bullseye completed:

kafka-main2007 (WARN)
- Downtimed on Icinga/Alertmanager
- Unable to disable Puppet, the host may have been unreachable
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405151422_jhancock_2467812_kafka-main2007.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Wed, May 15, 2:39 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2010.codfw.wmnet with OS bullseye

Wed, May 15, 2:02 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye

Wed, May 15, 2:02 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2008.codfw.wmnet with OS bullseye

Wed, May 15, 2:02 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2007.codfw.wmnet with OS bullseye

Wed, May 15, 2:02 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2006.codfw.wmnet with OS bullseye completed:

kafka-main2006 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405151325_jhancock_2409379_kafka-main2006.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Wed, May 15, 1:49 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2006.codfw.wmnet with OS bullseye

Wed, May 15, 1:04 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2006.codfw.wmnet with OS bullseye executed with errors:

kafka-main2006 (FAIL)
- Downtimed on Icinga/Alertmanager
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kafka-main2006.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Wed, May 15, 1:02 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kafka-main2006.codfw.wmnet with OS bullseye

Wed, May 15, 1:01 PM · SRE, ops-codfw, serviceops, DC-Ops

ops-monitoring-bot added a comment to T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.

cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: kubestagetcd[2001-2003].codfw.wmnet

kubestagetcd2001.codfw.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster codfw to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster codfw to Netbox

Wed, May 15, 12:58 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

ops-monitoring-bot added a comment to T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.

Icinga downtime and Alertmanager silence (ID=5c048aeb-57ce-4f8d-8159-53dcf8b5fb78) set by jayme@cumin1002 for 2 days, 0:00:00 on 3 host(s) and their services with reason: decom

kubestagetcd[2001-2003].codfw.wmnet

Wed, May 15, 12:23 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

ops-monitoring-bot added a comment to T319184: Move WMCS servers to 1 single NIC.

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm executed with errors:

cloudvirt1041 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cloudvirt1041.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Wed, May 15, 11:52 AM · Patch-For-Review, User-aborrero, cloud-services-team, SRE, netops, Infrastructure-Foundations

ops-monitoring-bot added a comment to T319184: Move WMCS servers to 1 single NIC.

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm

Wed, May 15, 11:11 AM · Patch-For-Review, User-aborrero, cloud-services-team, SRE, netops, Infrastructure-Foundations

ops-monitoring-bot added a comment to T319184: Move WMCS servers to 1 single NIC.

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm executed with errors:

cloudvirt1041 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cloudvirt1041.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Wed, May 15, 11:05 AM · Patch-For-Review, User-aborrero, cloud-services-team, SRE, netops, Infrastructure-Foundations

ops-monitoring-bot added a comment to T319184: Move WMCS servers to 1 single NIC.

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm

Wed, May 15, 10:52 AM · Patch-For-Review, User-aborrero, cloud-services-team, SRE, netops, Infrastructure-Foundations

ops-monitoring-bot added a comment to T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes.

cookbooks.sre.hosts.decommission executed by jayme@cumin1002 for hosts: kubestagemaster[2001-2002].codfw.wmnet

kubestagemaster2001.codfw.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster codfw to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster codfw to Netbox

Wed, May 15, 9:53 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes

ops-monitoring-bot added a comment to T364823: Upgrade r/w LDAP servers to Bullseye.

Icinga downtime and Alertmanager silence (ID=be009031-0cc0-4a4d-97a0-f4d990831efe) set by jmm@cumin2002 for 1:00:00 on 1 host(s) and their services with reason: OS update

seaborgium.wikimedia.org

Wed, May 15, 9:01 AM · LDAP, SRE, Infrastructure-Foundations

Tue, May 14

ops-monitoring-bot added a comment to T355353: Q3:rack/setup/install dbprov100[56].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye

Tue, May 14, 7:26 PM · Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops

ops-monitoring-bot added a comment to T364480: Extend BGP peer automation via Netbox to include VMs.

Deployed homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.6.5 update to add modified wmf homer plugin - cmooney@cumin1002 - T364480

Tue, May 14, 4:08 PM · netops, Infrastructure-Foundations, SRE

ops-monitoring-bot added a comment to T364850: Deploy Phabricator/Phorge 2024-05-14.

Icinga downtime and Alertmanager silence (ID=6e2580b0-999e-4a68-87e7-c37d374c663f) set by aokoth@cumin1002 for 0:30:00 on 1 host(s) and their services with reason: Phorge update

phab1004.eqiad.wmnet

Tue, May 14, 3:04 PM · collaboration-services, User-brennen, Release-Engineering-Team (Yakisfaction), Phabricator (2024-05-14)

ops-monitoring-bot added a comment to T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.

VM kubestagemaster1005.eqiad.wmnet switching disk type to plain

Tue, May 14, 9:49 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes

ops-monitoring-bot added a comment to T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.

VM kubestagemaster1004.eqiad.wmnet switching disk type to plain

Tue, May 14, 9:48 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes

ops-monitoring-bot added a comment to T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.

VM kubestagemaster1003.eqiad.wmnet switching disk type to plain

Tue, May 14, 9:48 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes

ops-monitoring-bot added a comment to T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster1005.eqiad.wmnet with OS bullseye completed:

kubestagemaster1005 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405140931_jayme_4192981_kubestagemaster1005.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Tue, May 14, 9:45 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes

ops-monitoring-bot added a comment to T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster1004.eqiad.wmnet with OS bullseye completed:

kubestagemaster1004 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405140906_jayme_4192644_kubestagemaster1004.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Tue, May 14, 9:20 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes

ops-monitoring-bot added a comment to T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster1003.eqiad.wmnet with OS bullseye completed:

kubestagemaster1003 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405140904_jayme_4192355_kubestagemaster1003.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Tue, May 14, 9:18 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes

ops-monitoring-bot added a comment to T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster1005.eqiad.wmnet with OS bullseye

Tue, May 14, 9:14 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes

ops-monitoring-bot added a comment to T364823: Upgrade r/w LDAP servers to Bullseye.

Icinga downtime and Alertmanager silence (ID=34ac3b76-436c-436c-afc2-20387cde43fb) set by jmm@cumin2002 for 1:00:00 on 1 host(s) and their services with reason: OS update

serpens.wikimedia.org

Tue, May 14, 8:58 AM · LDAP, SRE, Infrastructure-Foundations

ops-monitoring-bot added a comment to T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster1004.eqiad.wmnet with OS bullseye

Tue, May 14, 8:52 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes

ops-monitoring-bot added a comment to T364746: Site: eqiad 3 VM request for staging-eqiad kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster1003.eqiad.wmnet with OS bullseye

Tue, May 14, 8:49 AM · SRE, Infrastructure-Foundations, vm-requests, serviceops, Prod-Kubernetes, Kubernetes

ops-monitoring-bot added a comment to T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bullseye executed with errors:

kubestagemaster2005 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405131705_jayme_4072335_kubestagemaster2005.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kubestagemaster2005.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Tue, May 14, 8:15 AM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes

ops-monitoring-bot added a comment to T364296: Reimage db1215 and db2185 (zarcillo) to bookworm.

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host db2185.codfw.wmnet with OS bookworm completed:

db2185 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405140656_marostegui_4175353_db2185.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Tue, May 14, 7:17 AM · DBA

ops-monitoring-bot added a comment to T364296: Reimage db1215 and db2185 (zarcillo) to bookworm.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host db2185.codfw.wmnet with OS bookworm

Tue, May 14, 6:35 AM · DBA

ops-monitoring-bot added a comment to T364296: Reimage db1215 and db2185 (zarcillo) to bookworm.

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host db2185.codfw.wmnet with OS bookworm executed with errors:

db2185 (FAIL)
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" db2185.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Tue, May 14, 6:34 AM · DBA

ops-monitoring-bot added a comment to T364296: Reimage db1215 and db2185 (zarcillo) to bookworm.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host db2185.codfw.wmnet with OS bookworm

Tue, May 14, 6:34 AM · DBA

Mon, May 13

ops-monitoring-bot added a comment to T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bullseye

Mon, May 13, 4:50 PM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes

ops-monitoring-bot added a comment to T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bullseye executed with errors:

kubestagemaster2005 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405131442_jayme_4048520_kubestagemaster2005.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" kubestagemaster2005.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Mon, May 13, 4:46 PM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes

ops-monitoring-bot added a comment to T363310: Site: codfw 1 VM request for staging-codfw kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kubestagemaster2004.codfw.wmnet with OS bullseye completed:

kubestagemaster2004 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405131435_jayme_4048471_kubestagemaster2004.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Mon, May 13, 2:49 PM · Patch-For-Review, vm-requests, Infrastructure-Foundations, SRE, serviceops, Prod-Kubernetes, Kubernetes

ops-monitoring-bot added a comment to T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver.

Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kubestagemaster2005.codfw.wmnet with OS bullseye

Mon, May 13, 2:25 PM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes
