Today

ops-monitoring-bot added a comment to T335031: Move two GPUs from Hadoop to Lift Wing.

Icinga downtime and Alertmanager silence (ID=2ef51d27-4384-414f-9fdf-8fe7b4c93b00) set by elukey@cumin1001 for 1:00:00 on 1 host(s) and their services with reason: Host under maintenance

ml-serve1001.eqiad.wmnet
Mon, Jun 5, 3:33 PM · SRE, ops-eqiad
ops-monitoring-bot added a comment to T335777: Q4:rack/decom codfw unified decommission task.

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: lvs2009.codfw.wmnet

  • lvs2009.codfw.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Mon, Jun 5, 2:47 PM · Patch-For-Review, SRE, Traffic, ops-codfw
ops-monitoring-bot added a comment to T335031: Move two GPUs from Hadoop to Lift Wing.

Icinga downtime and Alertmanager silence (ID=43b4a369-edbc-4df6-b931-f35757b38bf1) set by elukey@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: Host under maintenance

ml-serve1001.eqiad.wmnet
Mon, Jun 5, 1:45 PM · SRE, ops-eqiad
ops-monitoring-bot added a comment to T335031: Move two GPUs from Hadoop to Lift Wing.

Icinga downtime and Alertmanager silence (ID=b4799674-ad70-4117-a653-cdeaad02c246) set by elukey@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: Host under maintenance

dse-k8s-worker1002.eqiad.wmnet
Mon, Jun 5, 1:44 PM · SRE, ops-eqiad

Sat, Jun 3

ops-monitoring-bot added a comment to T329363: Upgrade Hadoop test cluster to Bullseye.

Icinga downtime and Alertmanager silence (ID=36ba4f0a-3a73-43c0-81fd-7ab408de8929) set by elukey@cumin1001 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Host under testing/upgrade

an-test-worker1001.eqiad.wmnet
Sat, Jun 3, 1:41 PM · Data-Platform-SRE, Patch-For-Review, Shared-Data-Infrastructure (Q4 Wrap up), Data-Engineering-Planning
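
The downtime durations in these comments ("1:00:00", "2:00:00", "30 days, 0:00:00") look like the default string form of a Python `timedelta`. That is an inference from the format, not something the feed states. Under that assumption, a minimal sketch for turning them back into durations could look like this; `parse_duration` is a hypothetical helper, not part of any cookbook.

```python
import re
from datetime import timedelta

# Matches "H:MM:SS" with an optional "N day(s), " prefix,
# e.g. "1:00:00" or "30 days, 0:00:00".
_DURATION = re.compile(r"^(?:(\d+) days?, )?(\d+):(\d{2}):(\d{2})$")


def parse_duration(text):
    """Parse a duration string as it appears in the downtime comments."""
    match = _DURATION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = match.groups()
    return timedelta(
        days=int(days or 0),
        hours=int(hours),
        minutes=int(minutes),
        seconds=int(seconds),
    )


if __name__ == "__main__":
    for raw in ("1:00:00", "2:00:00", "30 days, 0:00:00"):
        print(raw, "->", parse_duration(raw).total_seconds(), "seconds")
```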

Fri, Jun 2

ops-monitoring-bot added a comment to T32383: update.php failed to add column "rev_sha1" to table "revision"..

Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: cp5016.eqsin.wmnet

Fri, Jun 2, 8:52 AM · MediaWiki-Installer
ops-monitoring-bot added a comment to T32383: update.php failed to add column "rev_sha1" to table "revision"..

Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: cp5015.eqsin.wmnet

Fri, Jun 2, 8:52 AM · MediaWiki-Installer
ops-monitoring-bot added a comment to T32383: update.php failed to add column "rev_sha1" to table "revision"..

Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: cp5014.eqsin.wmnet

Fri, Jun 2, 8:51 AM · MediaWiki-Installer
ops-monitoring-bot added a comment to T32383: update.php failed to add column "rev_sha1" to table "revision"..

Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: cp5013.eqsin.wmnet

Fri, Jun 2, 8:51 AM · MediaWiki-Installer
Restricted Application added a project to T32383: update.php failed to add column "rev_sha1" to table "revision".: Performance-Team.

Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 0 hosts:

Fri, Jun 2, 8:51 AM · MediaWiki-Installer

Thu, Jun 1

ops-monitoring-bot added a comment to T289882: Q1:(Need By: TBD) rack/setup/install cloudswift100[12].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS bullseye executed with errors:

  • cloudswift1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details
Thu, Jun 1, 5:05 PM · SRE, Infrastructure-Foundations, ops-eqiad, netops, cloud-services-team (Hardware), DC-Ops
ops-monitoring-bot added a comment to T289882: Q1:(Need By: TBD) rack/setup/install cloudswift100[12].

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS bullseye

Thu, Jun 1, 5:05 PM · SRE, Infrastructure-Foundations, ops-eqiad, netops, cloud-services-team (Hardware), DC-Ops
ops-monitoring-bot added a comment to T289882: Q1:(Need By: TBD) rack/setup/install cloudswift100[12].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS bullseye executed with errors:

  • cloudswift1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details
Thu, Jun 1, 4:55 PM · SRE, Infrastructure-Foundations, ops-eqiad, netops, cloud-services-team (Hardware), DC-Ops
ops-monitoring-bot added a comment to T289882: Q1:(Need By: TBD) rack/setup/install cloudswift100[12].

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS bullseye

Thu, Jun 1, 4:55 PM · SRE, Infrastructure-Foundations, ops-eqiad, netops, cloud-services-team (Hardware), DC-Ops
ops-monitoring-bot added a comment to T337828: cloudcontrol2004-dev: make it a cloudlb backend.

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye completed:

  • cloudcontrol2004-dev (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202306011610_aborrero_3154891_cloudcontrol2004-dev.out, asking the operator what to do
    • First Puppet run failed and the operator skipped it
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually
Thu, Jun 1, 4:53 PM · SRE, ops-codfw, User-aborrero, cloud-services-team (FY2022/2023-Q4)
ops-monitoring-bot added a comment to T289882: Q1:(Need By: TBD) rack/setup/install cloudswift100[12].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS bullseye executed with errors:

  • cloudswift1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details
Thu, Jun 1, 4:40 PM · SRE, Infrastructure-Foundations, ops-eqiad, netops, cloud-services-team (Hardware), DC-Ops
ops-monitoring-bot added a comment to T289882: Q1:(Need By: TBD) rack/setup/install cloudswift100[12].

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1002.eqiad.wmnet with OS bullseye

Thu, Jun 1, 4:40 PM · SRE, Infrastructure-Foundations, ops-eqiad, netops, cloud-services-team (Hardware), DC-Ops
ops-monitoring-bot added a comment to T289882: Q1:(Need By: TBD) rack/setup/install cloudswift100[12].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye completed:

  • cloudswift1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306011607_jhancock_3172209_cloudswift1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status failed -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Thu, Jun 1, 4:24 PM · SRE, Infrastructure-Foundations, ops-eqiad, netops, cloud-services-team (Hardware), DC-Ops
ops-monitoring-bot added a comment to T289882: Q1:(Need By: TBD) rack/setup/install cloudswift100[12].

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye

Thu, Jun 1, 4:01 PM · SRE, Infrastructure-Foundations, ops-eqiad, netops, cloud-services-team (Hardware), DC-Ops
ops-monitoring-bot added a comment to T289882: Q1:(Need By: TBD) rack/setup/install cloudswift100[12].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudswift1001 (FAIL)
    • The reimage failed, see the cookbook logs for the details
Thu, Jun 1, 4:00 PM · SRE, Infrastructure-Foundations, ops-eqiad, netops, cloud-services-team (Hardware), DC-Ops
ops-monitoring-bot added a comment to T289882: Q1:(Need By: TBD) rack/setup/install cloudswift100[12].

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye

Thu, Jun 1, 3:59 PM · SRE, Infrastructure-Foundations, ops-eqiad, netops, cloud-services-team (Hardware), DC-Ops
ops-monitoring-bot added a comment to T289882: Q1:(Need By: TBD) rack/setup/install cloudswift100[12].

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudswift1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details
Thu, Jun 1, 3:57 PM · SRE, Infrastructure-Foundations, ops-eqiad, netops, cloud-services-team (Hardware), DC-Ops
ops-monitoring-bot added a comment to T337828: cloudcontrol2004-dev: make it a cloudlb backend.

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye

Thu, Jun 1, 3:45 PM · SRE, ops-codfw, User-aborrero, cloud-services-team (FY2022/2023-Q4)
ops-monitoring-bot added a comment to T337828: cloudcontrol2004-dev: make it a cloudlb backend.

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudcontrol2004-dev (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details
Thu, Jun 1, 3:44 PM · SRE, ops-codfw, User-aborrero, cloud-services-team (FY2022/2023-Q4)
ops-monitoring-bot added a comment to T337828: cloudcontrol2004-dev: make it a cloudlb backend.

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye

Thu, Jun 1, 3:34 PM · SRE, ops-codfw, User-aborrero, cloud-services-team (FY2022/2023-Q4)
ops-monitoring-bot added a comment to T337828: cloudcontrol2004-dev: make it a cloudlb backend.

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudcontrol2004-dev (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details
Thu, Jun 1, 3:33 PM · SRE, ops-codfw, User-aborrero, cloud-services-team (FY2022/2023-Q4)
ops-monitoring-bot added a comment to T337828: cloudcontrol2004-dev: make it a cloudlb backend.

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye

Thu, Jun 1, 2:56 PM · SRE, ops-codfw, User-aborrero, cloud-services-team (FY2022/2023-Q4)
ops-monitoring-bot added a comment to T337828: cloudcontrol2004-dev: make it a cloudlb backend.

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudcontrol2004-dev (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details
Thu, Jun 1, 2:55 PM · SRE, ops-codfw, User-aborrero, cloud-services-team (FY2022/2023-Q4)
ops-monitoring-bot added a comment to T289882: Q1:(Need By: TBD) rack/setup/install cloudswift100[12].

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudswift1001.eqiad.wmnet with OS bullseye

Thu, Jun 1, 2:55 PM · SRE, Infrastructure-Foundations, ops-eqiad, netops, cloud-services-team (Hardware), DC-Ops
ops-monitoring-bot added a comment to T337828: cloudcontrol2004-dev: make it a cloudlb backend.

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye

Thu, Jun 1, 2:34 PM · SRE, ops-codfw, User-aborrero, cloud-services-team (FY2022/2023-Q4)
ops-monitoring-bot added a comment to T331300: Ensure WCQS/WDQS stack works on Bullseye.

Icinga downtime and Alertmanager silence (ID=389b7357-bed5-4b2f-8790-8d67f9ff7609) set by bking@cumin1001 for 20 days, 0:00:00 on 1 host(s) and their services with reason: attempting WDQS stack on bullseye

wdqs2021.codfw.wmnet
Thu, Jun 1, 1:04 PM · Data-Platform-SRE, Discovery-Search (Current work)
ops-monitoring-bot added a comment to T337828: cloudcontrol2004-dev: make it a cloudlb backend.

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudcontrol2004-dev (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
Thu, Jun 1, 10:45 AM · SRE, ops-codfw, User-aborrero, cloud-services-team (FY2022/2023-Q4)
ops-monitoring-bot added a comment to T337828: cloudcontrol2004-dev: make it a cloudlb backend.

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye

Thu, Jun 1, 9:49 AM · SRE, ops-codfw, User-aborrero, cloud-services-team (FY2022/2023-Q4)
ops-monitoring-bot added a comment to T337828: cloudcontrol2004-dev: make it a cloudlb backend.

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudcontrol2004-dev (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
Thu, Jun 1, 9:30 AM · SRE, ops-codfw, User-aborrero, cloud-services-team (FY2022/2023-Q4)
ops-monitoring-bot added a comment to T337828: cloudcontrol2004-dev: make it a cloudlb backend.

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye

Thu, Jun 1, 8:56 AM · SRE, ops-codfw, User-aborrero, cloud-services-team (FY2022/2023-Q4)

Wed, May 31

ops-monitoring-bot added a comment to T319477: Migrate doc hosts to Bullseye.

cookbooks.sre.hosts.decommission executed by eoghan@cumin1001 for hosts: doc2001.codfw.wmnet

  • doc2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
Wed, May 31, 11:14 AM · Patch-For-Review, Continuous-Integration-Infrastructure, serviceops-collab
ops-monitoring-bot added a comment to T319477: Migrate doc hosts to Bullseye.

cookbooks.sre.hosts.decommission executed by eoghan@cumin1001 for hosts: doc1002.eqiad.wmnet

  • doc1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
Wed, May 31, 10:11 AM · Patch-For-Review, Continuous-Integration-Infrastructure, serviceops-collab
ops-monitoring-bot added a comment to T337828: cloudcontrol2004-dev: make it a cloudlb backend.

cookbooks.sre.hosts.decommission executed by aborrero@cumin2002 for hosts: cloudcontrol2004-dev.wikimedia.org

  • cloudcontrol2004-dev.wikimedia.org (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Wed, May 31, 9:56 AM · SRE, ops-codfw, User-aborrero, cloud-services-team (FY2022/2023-Q4)
ops-monitoring-bot added a comment to T337269: decommission labstore100[45].eqiad.wmnet.

Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: labstore1005.eqiad.wmnet

Wed, May 31, 8:55 AM · SRE, ops-eqiad, cloud-services-team, decommission-hardware
ops-monitoring-bot added a comment to T337269: decommission labstore100[45].eqiad.wmnet.

Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: labstore1004.eqiad.wmnet

Wed, May 31, 8:55 AM · SRE, ops-eqiad, cloud-services-team, decommission-hardware
ops-monitoring-bot added a comment to T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options.

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-fe1009.eqiad.wmnet with OS bullseye completed:

  • ms-fe1009 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"ms-fe1009.eqiad.wmnet": {"weight": 40, "pooled": "yes"}, "tags": "dc=eqiad,cluster=swift,service=nginx"}
{"ms-fe1009.eqiad.wmnet": {"weight": 40, "pooled": "yes"}, "tags": "dc=eqiad,cluster=swift,service=swift-fe"}

    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305310822_mvernon_1082987_ms-fe1009.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'dc=eqiad,cluster=swift,service=nginx' set/pooled=yes
sudo confctl select 'dc=eqiad,cluster=swift,service=swift-fe' set/pooled=yes

    • Updated Netbox data from PuppetDB
Wed, May 31, 8:41 AM · SRE-swift-storage
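
In the comment above, each JSON line records the state a service had before the reimage, and the restore commands reuse that line's `tags` and `pooled` values. As a rough sketch of that relationship (not the cookbook's actual implementation), the commands can be derived from the logged lines as follows; `restore_commands` is a hypothetical helper, and it assumes every key other than `tags` is a hostname.

```python
import json


def restore_commands(lines):
    """Build one confctl restore command per logged JSON line."""
    commands = []
    for line in lines:
        record = json.loads(line)
        tags = record.pop("tags")
        # Assumption: after removing "tags", the remaining keys are hostnames
        # mapped to their previous state ("weight", "pooled").
        for _host, state in record.items():
            commands.append(
                f"sudo confctl select '{tags}' set/pooled={state['pooled']}"
            )
    return commands


# The two lines logged for ms-fe1009 in the comment above.
logged = [
    '{"ms-fe1009.eqiad.wmnet": {"weight": 40, "pooled": "yes"}, '
    '"tags": "dc=eqiad,cluster=swift,service=nginx"}',
    '{"ms-fe1009.eqiad.wmnet": {"weight": 40, "pooled": "yes"}, '
    '"tags": "dc=eqiad,cluster=swift,service=swift-fe"}',
]

if __name__ == "__main__":
    for command in restore_commands(logged):
        print(command)
```

The sketch only builds the command strings; running them is left to the operator, as the bot's message suggests.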
ops-monitoring-bot added a comment to T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-fe1009.eqiad.wmnet with OS bullseye

Wed, May 31, 8:04 AM · SRE-swift-storage

Tue, May 30

ops-monitoring-bot added a comment to T336564: cloudcontrol2005-dev: make it a cloudlb backend.

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudcontrol2005-dev.codfw.wmnet with OS bullseye completed:

  • cloudcontrol2005-dev (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305301303_aborrero_4112615_cloudcontrol2005-dev.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually
Tue, May 30, 3:15 PM · SRE, ops-codfw, cloud-services-team (FY2022/2023-Q4)
ops-monitoring-bot added a comment to T333614: Upgrade mwlog hosts to Bullseye.

Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1001 for host mwlog2002.codfw.wmnet with OS bullseye executed with errors:

  • mwlog2002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details
Tue, May 30, 2:16 PM · User-herron, SRE Observability (FY2022/2023-Q4)
ops-monitoring-bot added a comment to T321783: Setup an initial bookworm host pair with Puppetdb 7.

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host puppetdb1003.eqiad.wmnet with OS bookworm executed with errors:

  • puppetdb1003 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305300958_jmm_3953981_puppetdb1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details
Tue, May 30, 2:16 PM · Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options.

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe2009.codfw.wmnet with OS bullseye completed:

  • ms-fe2009 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"ms-fe2009.codfw.wmnet": {"weight": 40, "pooled": "no"}, "tags": "dc=codfw,cluster=swift,service=nginx"}
{"ms-fe2009.codfw.wmnet": {"weight": 40, "pooled": "no"}, "tags": "dc=codfw,cluster=swift,service=swift-fe"}

    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305301309_mvernon_4148502_ms-fe2009.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'dc=codfw,cluster=swift,service=nginx' set/pooled=no
sudo confctl select 'dc=codfw,cluster=swift,service=swift-fe' set/pooled=no

    • Updated Netbox data from PuppetDB
Tue, May 30, 1:50 PM · SRE-swift-storage
ops-monitoring-bot added a comment to T333614: Upgrade mwlog hosts to Bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1001 for host mwlog2002.codfw.wmnet with OS bullseye

Tue, May 30, 1:11 PM · User-herron, SRE Observability (FY2022/2023-Q4)
ops-monitoring-bot added a comment to T279637: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe2009.codfw.wmnet with OS bullseye

Tue, May 30, 12:48 PM · SRE-swift-storage
ops-monitoring-bot added a comment to T336564: cloudcontrol2005-dev: make it a cloudlb backend.

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudcontrol2005-dev.codfw.wmnet with OS bullseye

Tue, May 30, 12:14 PM · SRE, ops-codfw, cloud-services-team (FY2022/2023-Q4)
ops-monitoring-bot added a comment to T336491: Merge reimaging cookbooks.

cookbooks.sre.hosts.decommission executed by slyngshede@cumin1001 for hosts: testvm2006.codfw.wmnet

  • testvm2006.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw_test to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw_test to Netbox
Tue, May 30, 11:51 AM · Spicerack, SRE-tools, Infrastructure-Foundations
ops-monitoring-bot added a comment to T336491: Merge reimaging cookbooks.

Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1001 for host testvm2006.codfw.wmnet with OS bookworm completed:

  • testvm2006 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305301053_slyngshede_818964_testvm2006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Tue, May 30, 11:07 AM · Spicerack, SRE-tools, Infrastructure-Foundations
ops-monitoring-bot added a comment to T336491: Merge reimaging cookbooks.

Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1001 for host testvm2006.codfw.wmnet with OS bookworm

Tue, May 30, 9:59 AM · Spicerack, SRE-tools, Infrastructure-Foundations
ops-monitoring-bot added a comment to T321783: Setup an initial bookworm host pair with Puppetdb 7.

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host puppetdb1003.eqiad.wmnet with OS bookworm

Tue, May 30, 9:43 AM · Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T321783: Setup an initial bookworm host pair with Puppetdb 7.

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host puppetdb2003.codfw.wmnet with OS bookworm completed:

  • puppetdb2003 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305300811_jmm_3834209_puppetdb2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
Tue, May 30, 9:29 AM · Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T321783: Setup an initial bookworm host pair with Puppetdb 7.

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host puppetdb2003.codfw.wmnet with OS bookworm

Tue, May 30, 7:49 AM · Infrastructure-Foundations, SRE

Mon, May 29

ops-monitoring-bot added a comment to T337690: ProbeDown - vrts2001.

Icinga downtime and Alertmanager silence (ID=a7ca581f-c62c-43fa-8b59-2bdcb1fd56c6) set by eoghan@cumin1001 for 14 days, 0:00:00 on 1 host(s) and their services with reason: This is being worked on

vrts2001.codfw.wmnet
Mon, May 29, 3:19 PM · serviceops-collab
ops-monitoring-bot added a comment to T336036: Bring stat1009 into service.

Icinga downtime and Alertmanager silence (ID=e14a6c2c-888d-45b4-94a2-edc04252cc36) set by stevemunene@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Bringing stat1009 into service

stat1009.eqiad.wmnet
Mon, May 29, 2:18 PM · Data-Platform-SRE, Shared-Data-Infrastructure (Q4 Wrap up), Data-Engineering

Fri, May 26

ops-monitoring-bot added a comment to T337446: Rebuild sanitarium hosts.

Cookbook cookbooks.sre.wikireplicas.update-views for section s7 started by nskaggs executed with errors:

  • dbproxy1018.eqiad.wmnet (FAIL)
    • Confirmed clouddb1018.eqiad.wmnet is depooled from dbproxy1018.eqiad.wmnet
    • Could not confirm host is repooled
Fri, May 26, 3:31 PM · TaxonBot, Patch-For-Review, User-notice, cloud-services-team, Data-Engineering, Data-Services, DBA
ops-monitoring-bot added a comment to T337446: Rebuild sanitarium hosts.

Cookbook cookbooks.sre.wikireplicas.update-views run by nskaggs: Started updating wikireplica views

Fri, May 26, 3:08 PM · TaxonBot, Patch-For-Review, User-notice, cloud-services-team, Data-Engineering, Data-Services, DBA
ops-monitoring-bot added a comment to T326346: Q4:rack/setup/install dbproxy10[22-27]..

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1023 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
Fri, May 26, 12:47 PM · SRE, DBA, Data-Persistence, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T326346: Q4:rack/setup/install dbproxy10[22-27]..

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1023.eqiad.wmnet with OS bullseye

Fri, May 26, 11:51 AM · SRE, DBA, Data-Persistence, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T329366: Enable WarmParsoidParserCache on all wikis.

Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1016.eqiad.wmnet with OS buster completed:

  • parse1016 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305260859_jiji_3767559_parse1016.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
Fri, May 26, 9:28 AM · serviceops, Parsoid (Tracking), RESTbase Sunsetting
ops-monitoring-bot added a comment to T329366: Enable WarmParsoidParserCache on all wikis.

Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1014.eqiad.wmnet with OS buster completed:

  • parse1014 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305260856_jiji_3767538_parse1014.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
Fri, May 26, 9:26 AM · serviceops, Parsoid (Tracking), RESTbase Sunsetting
ops-monitoring-bot added a comment to T329366: Enable WarmParsoidParserCache on all wikis.

Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1013.eqiad.wmnet with OS buster completed:

  • parse1013 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305260854_jiji_3767513_parse1013.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
Fri, May 26, 9:23 AM · serviceops, Parsoid (Tracking), RESTbase Sunsetting
ops-monitoring-bot added a comment to T329366: Enable WarmParsoidParserCache on all wikis.

Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host parse1015.eqiad.wmnet with OS buster executed with errors:

  • parse1015 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305260854_jiji_3767551_parse1015.out
    • The reimage failed, see the cookbook logs for the details
Fri, May 26, 9:08 AM · serviceops, Parsoid (Tracking), RESTbase Sunsetting
ops-monitoring-bot added a comment to T329366: Enable WarmParsoidParserCache on all wikis.

Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1016.eqiad.wmnet with OS buster

Fri, May 26, 8:39 AM · serviceops, Parsoid (Tracking), RESTbase Sunsetting
ops-monitoring-bot added a comment to T329366: Enable WarmParsoidParserCache on all wikis.

Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1015.eqiad.wmnet with OS buster

Fri, May 26, 8:39 AM · serviceops, Parsoid (Tracking), RESTbase Sunsetting
ops-monitoring-bot added a comment to T329366: Enable WarmParsoidParserCache on all wikis.

Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1014.eqiad.wmnet with OS buster

Fri, May 26, 8:39 AM · serviceops, Parsoid (Tracking), RESTbase Sunsetting
ops-monitoring-bot added a comment to T329366: Enable WarmParsoidParserCache on all wikis.

Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host parse1013.eqiad.wmnet with OS buster

Fri, May 26, 8:39 AM · serviceops, Parsoid (Tracking), RESTbase Sunsetting

Thu, May 25

ops-monitoring-bot added a comment to T322937: Migrate row E/F network aggregation to dedicated Spine switches.

Icinga downtime and Alertmanager silence (ID=37545969-c51e-450d-9ef0-5fadfd151520) set by cmooney@cumin1001 for 0:30:00 on 3 host(s) and their services with reason: Migrate lsw1-e3-eqiad uplinks to spine

lsw1-e[1,3]-eqiad.mgmt,lsw1-f1-eqiad.mgmt
Thu, May 25, 4:11 PM · SRE, netops, Infrastructure-Foundations
ops-monitoring-bot added a comment to T334521: upgrade gerrit servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host gerrit2002.wikimedia.org with OS bullseye completed:

  • gerrit2002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305251533_dzahn_3574334_gerrit2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, May 25, 4:02 PM · Release-Engineering-Team (They Live 🕶️🧟), serviceops-collab
ops-monitoring-bot added a comment to T322937: Migrate row E/F network aggregation to dedicated Spine switches.

Icinga downtime and Alertmanager silence (ID=c43be552-7ced-4f58-99c1-a10b5984bf3a) set by cmooney@cumin1001 for 0:30:00 on 2 host(s) and their services with reason: Migrate lsw1-e2-eqiad uplink from lsw1-f1 to ssw1-f1

lsw1-e2-eqiad.mgmt,lsw1-f1-eqiad.mgmt
Thu, May 25, 3:57 PM · SRE, netops, Infrastructure-Foundations
ops-monitoring-bot added a comment to T322937: Migrate row E/F network aggregation to dedicated Spine switches.

Icinga downtime and Alertmanager silence (ID=8f44dd48-0cac-4bfd-907a-512dfa686d40) set by cmooney@cumin1001 for 0:30:00 on 2 host(s) and their services with reason: Migrate lsw1-e1-eqiad to cr1-eqiad link to ssw1-e1-eqiad

lsw1-e[1-2]-eqiad.mgmt
Thu, May 25, 3:34 PM · SRE, netops, Infrastructure-Foundations
ops-monitoring-bot added a comment to T326346: Q4:rack/setup/install dbproxy10[22-27]..

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye executed with errors:

  • dbproxy1022 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
Thu, May 25, 3:28 PM · SRE, DBA, Data-Persistence, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T322937: Migrate row E/F network aggregation to dedicated Spine switches.

Icinga downtime and Alertmanager silence (ID=cf76e0ba-8648-48a0-beed-fe7b60f79656) set by cmooney@cumin1001 for 0:30:00 on 2 host(s) and their services with reason: Migrate lsw1-e1-eqiad to cr2-eqiad link to ssw1-e1-eqiad

cr2-eqiad,lsw1-f1-eqiad.mgmt
Thu, May 25, 3:21 PM · SRE, netops, Infrastructure-Foundations
ops-monitoring-bot added a comment to T334521: upgrade gerrit servers to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host gerrit2002.wikimedia.org with OS bullseye

Thu, May 25, 3:14 PM · Release-Engineering-Team (They Live 🕶️🧟), serviceops-collab
ops-monitoring-bot added a comment to T322937: Migrate row E/F network aggregation to dedicated Spine switches.

Icinga downtime and Alertmanager silence (ID=03f7b2ab-bdea-4c56-ac41-3ec30004db4a) set by cmooney@cumin1001 for 0:30:00 on 2 host(s) and their services with reason: Migrate lsw1-e1-eqiad to cr1-eqiad link to ssw1-e1-eqiad

cr1-eqiad,lsw1-e1-eqiad.mgmt
Thu, May 25, 3:04 PM · SRE, netops, Infrastructure-Foundations
ops-monitoring-bot added a comment to T326346: Q4:rack/setup/install dbproxy10[22-27]..

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host dbproxy1022.eqiad.wmnet with OS bullseye

Thu, May 25, 2:32 PM · SRE, DBA, Data-Persistence, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T336564: cloudcontrol2005-dev: make it a cloudlb backend.

cookbooks.sre.hosts.decommission executed by aborrero@cumin2002 for hosts: cloudcontrol2005-dev.wikimedia.org

  • cloudcontrol2005-dev.wikimedia.org (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Thu, May 25, 10:42 AM · SRE, ops-codfw, cloud-services-team (FY2022/2023-Q4)

Tue, May 23

ops-monitoring-bot added a comment to T334435: upgrade releases hosts to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by eoghan@cumin1001 for host releases2003.codfw.wmnet with OS bullseye completed:

  • releases2003 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305231524_eoghan_3041676_releases2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Tue, May 23, 3:38 PM · Patch-For-Review, serviceops-collab
ops-monitoring-bot added a comment to T334435: upgrade releases hosts to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by eoghan@cumin1001 for host releases2003.codfw.wmnet with OS bullseye

Tue, May 23, 3:03 PM · Patch-For-Review, serviceops-collab
ops-monitoring-bot added a comment to T334435: upgrade releases hosts to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by eoghan@cumin1001 for host releases1003.eqiad.wmnet with OS bullseye completed:

  • releases1003 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305231449_eoghan_3034498_releases1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Tue, May 23, 3:02 PM · Patch-For-Review, serviceops-collab
ops-monitoring-bot added a comment to T334435: upgrade releases hosts to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by eoghan@cumin1001 for host releases1003.eqiad.wmnet with OS bullseye

Tue, May 23, 2:36 PM · Patch-For-Review, serviceops-collab
ops-monitoring-bot added a comment to T335424: kafkamon: upgrade to bullseye.

cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: kafkamon2002.codfw.wmnet

  • kafkamon2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
Tue, May 23, 2:05 PM · SRE Observability (FY2022/2023-Q4)
ops-monitoring-bot added a comment to T335424: kafkamon: upgrade to bullseye.

cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: kafkamon1002.eqiad.wmnet

  • kafkamon1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
Tue, May 23, 1:55 PM · SRE Observability (FY2022/2023-Q4)
ops-monitoring-bot added a comment to T336833: decommission db1122.eqiad.wmnet.

cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: db1122.eqiad.wmnet

  • db1122.eqiad.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Tue, May 23, 8:36 AM · SRE, ops-eqiad, decommission-hardware

Mon, May 22

ops-monitoring-bot added a comment to T337269: decommission labstore100[45].eqiad.wmnet.

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: labstore1004.eqiad.wmnet

  • labstore1004.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Host steps raised exception: 'NoneType' object has no attribute 'dns_name'
Mon, May 22, 8:55 PM · SRE, ops-eqiad, cloud-services-team, decommission-hardware
ops-monitoring-bot added a comment to T337269: decommission labstore100[45].eqiad.wmnet.

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: labstore1005.eqiad.wmnet

  • labstore1005.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Host steps raised exception: Cumin execution failed (exit_code=2)
Mon, May 22, 8:44 PM · SRE, ops-eqiad, cloud-services-team, decommission-hardware
ops-monitoring-bot added a comment to T336995: decommission bast2002.wikimedia.org.

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: bast2002

  • bast2002 (FAIL)
    • Unable to find/resolve the mgmt DNS record, using the IP instead: 10.193.1.207
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Failed to power off, manual intervention required: Remote IPMI for 10.193.1.207 failed (exit=1): b''
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Mon, May 22, 6:45 AM · SRE, ops-codfw, decommission-hardware
ops-monitoring-bot added a comment to T336725: decommission db1121.eqiad.wmnet.

cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: db1121.eqiad.wmnet

  • db1121.eqiad.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Mon, May 22, 6:41 AM · SRE, ops-eqiad, decommission-hardware

Sun, May 21

ops-monitoring-bot created T337174: Degraded RAID on backup2010.
Sun, May 21, 7:28 AM · Data-Persistence-Backup, SRE, ops-codfw

Fri, May 19

ops-monitoring-bot added a comment to T336036: Bring stat1009 into service.

Icinga downtime and Alertmanager silence (ID=58b1da63-7fca-4cbc-8725-c1ba80542dd9) set by stevemunene@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Bringing stat1009 into service

stat1009.eqiad.wmnet
Fri, May 19, 2:36 PM · Data-Platform-SRE, Shared-Data-Infrastructure (Q4 Wrap up), Data-Engineering
ops-monitoring-bot added a comment to T331297: Audit/update NIC firmware on Search Platform-owned Buster hosts.

Icinga downtime and Alertmanager silence (ID=401c16a2-5570-4b24-b856-a1d2685f312c) set by bking@cumin1001 for 4:00:00 on 1 host(s) and their services with reason: firmware update

wdqs1014.eqiad.wmnet
Fri, May 19, 2:17 PM · Discovery-Search (Current work)
ops-monitoring-bot added a comment to T322937: Migrate row E/F network aggregation to dedicated Spine switches.

Icinga downtime and Alertmanager silence (ID=c4ef01af-e7d5-458f-ae46-17500f124165) set by cmooney@cumin1001 for 0:30:00 on 1 host(s) and their services with reason: Move lvs1020 handoff port to row e/f from lsw1-f1 to ssw1-f1

lvs1020.eqiad.wmnet
Fri, May 19, 1:34 PM · SRE, netops, Infrastructure-Foundations
ops-monitoring-bot added a comment to T336995: decommission bast2002.wikimedia.org.

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: bast2002

  • bast2002 (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Fri, May 19, 10:55 AM · SRE, ops-codfw, decommission-hardware
ops-monitoring-bot added a comment to T335280: Drain and then decommission ms-be20[40-43].

cookbooks.sre.hosts.decommission executed by mvernon@cumin2002 for hosts: ms-be[2040-2043].codfw.wmnet

  • ms-be2040.codfw.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
Fri, May 19, 9:21 AM · SRE-swift-storage
ops-monitoring-bot added a comment to T335585: Decommission prometheus4001.

Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: prometheus4001.ulsfo.wmnet

Fri, May 19, 7:21 AM · SRE, ops-ulsfo, DC-Ops, SRE Observability (FY2022/2023-Q4), decommission-hardware

Thu, May 18

ops-monitoring-bot added a comment to T320508: Core routers: replace bootp with dhcp-relay.

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bullseye completed:

  • sretest1002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305181553_cmooney_1698881_sretest1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, May 18, 4:10 PM · SRE, netops, Infrastructure-Foundations
ops-monitoring-bot added a comment to T320508: Core routers: replace bootp with dhcp-relay.

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bullseye

Thu, May 18, 3:37 PM · SRE, netops, Infrastructure-Foundations
ops-monitoring-bot added a comment to T320508: Core routers: replace bootp with dhcp-relay.

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest1002.eqiad.wmnet with OS bookworm executed with errors:

  • sretest1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details
Thu, May 18, 3:25 PM · SRE, netops, Infrastructure-Foundations