
ops-monitoring-bot (Operations Monitoring Bot)
User · Bot


User Details

User Since
Aug 12 2016, 1:45 PM (413 w, 3 d)
Roles
Bot
Availability
Available
LDAP User
Unknown
MediaWiki User
Unknown

Bot managed by SRE for automated interaction with Phabricator from monitoring tools.
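Every entry in the feed below is a comment or task that this bot files through Phabricator's Conduit API. As a minimal sketch only, not the bot's actual implementation, and with a placeholder API token, posting a comment to an existing task with the standard maniphest.edit call could look like this:

```python
import requests

CONDUIT_EDIT = "https://phabricator.wikimedia.org/api/maniphest.edit"

def post_task_comment(task_id: str, comment: str, api_token: str) -> dict:
    """Add a comment to an existing task via Conduit's maniphest.edit call."""
    payload = {
        "api.token": api_token,               # e.g. "api-XXXX..." (placeholder)
        "objectIdentifier": task_id,          # e.g. "T336275"
        "transactions[0][type]": "comment",
        "transactions[0][value]": comment,
    }
    response = requests.post(CONDUIT_EDIT, data=payload, timeout=30)
    response.raise_for_status()
    result = response.json()
    if result.get("error_code"):              # Conduit reports errors in-band
        raise RuntimeError(f"Conduit error: {result['error_info']}")
    return result["result"]

# post_task_comment("T336275",
#                   "Icinga downtime and Alertmanager silence (...) set by ...",
#                   "api-XXXX")
```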

Recent Activity

Yesterday

ops-monitoring-bot added a comment to T336275: Upgrade Netbox to 4.x.

Icinga downtime and Alertmanager silence (ID=24d499e4-d334-4d4e-8fcd-fc9f2feed844) set by ayounsi@cumin1002 for 4 days, 0:00:00 on 1 host(s) and their services with reason: netbox upgrade prep work

netbox2003.codfw.wmnet
Mon, Jul 15, 3:32 PM · Patch-For-Review, Infrastructure-Foundations, netbox
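The silence ID quoted in entries like the one above is what Alertmanager returns when the silence is created. As a rough sketch, assuming Alertmanager's standard v2 HTTP API, an unauthenticated placeholder endpoint, and an `instance` matcher label (the production cookbooks wrap this differently), creating such a silence could look like:

```python
from datetime import datetime, timedelta, timezone

import requests

# Placeholder endpoint; not the production Alertmanager URL.
ALERTMANAGER_SILENCES = "http://alertmanager.example.org:9093/api/v2/silences"

def silence_host(fqdn: str, duration: timedelta, author: str, reason: str) -> str:
    """Create an Alertmanager silence for one host and return its ID."""
    now = datetime.now(timezone.utc)
    silence = {
        "matchers": [{"name": "instance", "value": fqdn, "isRegex": False}],
        "startsAt": now.isoformat(),
        "endsAt": (now + duration).isoformat(),
        "createdBy": author,
        "comment": reason,
    }
    response = requests.post(ALERTMANAGER_SILENCES, json=silence, timeout=10)
    response.raise_for_status()
    return response.json()["silenceID"]       # a UUID like the IDs in this feed

# silence_host("netbox2003.codfw.wmnet", timedelta(days=4),
#              "ayounsi@cumin1002", "netbox upgrade prep work")
```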
ops-monitoring-bot created T370062: Degraded RAID on aqs1013.
Mon, Jul 15, 3:15 PM · DC-Ops, SRE, ops-eqiad
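Entries like the one above are new tasks the bot files when a RAID controller reports a degraded array. Again as an illustrative sketch rather than the bot's real code: the same Conduit maniphest.edit call creates a task when no objectIdentifier is passed (project tags are left out here because they must be referenced by PHID):

```python
import requests

CONDUIT_EDIT = "https://phabricator.wikimedia.org/api/maniphest.edit"

def create_task(title: str, description: str, api_token: str) -> str:
    """Create a new Phabricator task and return its object name (e.g. "T370062")."""
    payload = {
        "api.token": api_token,                    # placeholder token
        "transactions[0][type]": "title",
        "transactions[0][value]": title,
        "transactions[1][type]": "description",
        "transactions[1][value]": description,
    }
    result = requests.post(CONDUIT_EDIT, data=payload, timeout=30).json()
    if result.get("error_code"):
        raise RuntimeError(result["error_info"])
    return f"T{result['result']['object']['id']}"

# create_task("Degraded RAID on aqs1013", "<controller output>", "api-XXXX")
```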
ops-monitoring-bot added a comment to T362033: Degraded RAID on aqs1013.

Icinga downtime and Alertmanager silence (ID=9483e0b8-53c7-4b67-8ac7-0ee42edaeba5) set by eevans@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Server swap — T362033

aqs1013.eqiad.wmnet
Mon, Jul 15, 2:50 PM · DC-Ops, Cassandra, SRE, ops-eqiad
ops-monitoring-bot added a comment to T362824: Q#:rack/setup/install dbproxy200[5-8].

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2005.codfw.wmnet with OS bookworm completed:

  • dbproxy2005 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407151358_pt1979_1933053_dbproxy2005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
Mon, Jul 15, 2:13 PM · DBA, SRE, ops-codfw, Data-Persistence, DC-Ops
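Each reimage report brackets the disruptive steps with "Downtimed the new host on Icinga/Alertmanager" and, once the checks pass, "Icinga downtime removed". Purely for illustration (the cookbooks drive Icinga through spicerack rather than writing to the command pipe directly, and the pipe path below is an assumption), those two steps map onto Icinga's classic external-command interface roughly like this:

```python
import time

# Icinga's external command pipe; the path is an assumption and varies per install.
ICINGA_CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"

def downtime_host(host: str, hours: float, author: str, comment: str) -> None:
    """Schedule a fixed downtime for a host and all of its services."""
    start = int(time.time())
    end = start + int(hours * 3600)
    commands = [
        f"SCHEDULE_HOST_DOWNTIME;{host};{start};{end};1;0;0;{author};{comment}",
        f"SCHEDULE_HOST_SVC_DOWNTIME;{host};{start};{end};1;0;0;{author};{comment}",
    ]
    with open(ICINGA_CMD_FILE, "w") as cmd_file:      # write to the command pipe
        for command in commands:
            cmd_file.write(f"[{start}] {command}\n")

def remove_host_downtime(host: str) -> None:
    """Drop all downtimes for a host, as in the "Icinga downtime removed" step."""
    now = int(time.time())
    with open(ICINGA_CMD_FILE, "w") as cmd_file:
        cmd_file.write(f"[{now}] DEL_DOWNTIME_BY_HOST_NAME;{host}\n")
```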
ops-monitoring-bot added a comment to T336275: Upgrade Netbox to 4.x.

Icinga downtime and Alertmanager silence (ID=60bc5c40-0301-4c29-907d-b4e0eb5e3cb3) set by ayounsi@cumin1002 for 4 days, 0:00:00 on 1 host(s) and their services with reason: netbox upgrade prep work

netbox1003.eqiad.wmnet
Mon, Jul 15, 1:52 PM · Patch-For-Review, Infrastructure-Foundations, netbox
ops-monitoring-bot added a comment to T336275: Upgrade Netbox to 4.x.

Icinga downtime and Alertmanager silence (ID=cc358df6-b5c1-490c-aad1-6454f09f0fc8) set by ayounsi@cumin1002 for 4 days, 0:00:00 on 1 host(s) and their services with reason: netbox upgrade prep work

netboxdb2003.codfw.wmnet
Mon, Jul 15, 1:50 PM · Patch-For-Review, Infrastructure-Foundations, netbox
ops-monitoring-bot added a comment to T362824: Q#:rack/setup/install dbproxy200[5-8].

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2005.codfw.wmnet with OS bookworm

Mon, Jul 15, 1:39 PM · DBA, SRE, ops-codfw, Data-Persistence, DC-Ops
ops-monitoring-bot added a comment to T367487: Update CAS to 7.0.

Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host idp2004.wikimedia.org with OS bookworm completed:

  • idp2004 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407150816_slyngshede_569604_idp2004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
Mon, Jul 15, 8:33 AM · Patch-For-Review, CAS-SSO, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T367487: Update CAS to 7.0.

Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host idp2004.wikimedia.org with OS bookworm

Mon, Jul 15, 8:00 AM · Patch-For-Review, CAS-SSO, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T367487: Update CAS to 7.0.

Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host idp1004.wikimedia.org with OS bookworm completed:

  • idp1004 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407150721_slyngshede_562764_idp1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Mon, Jul 15, 7:36 AM · Patch-For-Review, CAS-SSO, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T367487: Update CAS to 7.0.

Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host idp1004.wikimedia.org with OS bookworm

Mon, Jul 15, 7:06 AM · Patch-For-Review, CAS-SSO, Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T362824: Q#:rack/setup/install dbproxy200[5-8].

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host dbproxy2005.codfw.wmnet with OS bookworm

Mon, Jul 15, 5:12 AM · DBA, SRE, ops-codfw, Data-Persistence, DC-Ops

Fri, Jul 12

ops-monitoring-bot added a comment to T351074: Move servers from the appserver/api cluster to kubernetes.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1351.eqiad.wmnet with OS buster completed:

  • mw1351 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407121632_cgoubert_101912_mw1351.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
Fri, Jul 12, 5:06 PM · serviceops, MW-on-K8s
ops-monitoring-bot added a comment to T351074: Move servers from the appserver/api cluster to kubernetes.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1350.eqiad.wmnet with OS buster completed:

  • mw1350 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407121629_cgoubert_101834_mw1350.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
Fri, Jul 12, 5:03 PM · serviceops, MW-on-K8s
ops-monitoring-bot added a comment to T351074: Move servers from the appserver/api cluster to kubernetes.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1349.eqiad.wmnet with OS buster completed:

  • mw1349 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407121627_cgoubert_101741_mw1349.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
Fri, Jul 12, 5:00 PM · serviceops, MW-on-K8s
ops-monitoring-bot added a comment to T351074: Move servers from the appserver/api cluster to kubernetes.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1351.eqiad.wmnet with OS buster

Fri, Jul 12, 4:10 PM · serviceops, MW-on-K8s
ops-monitoring-bot added a comment to T351074: Move servers from the appserver/api cluster to kubernetes.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1350.eqiad.wmnet with OS buster

Fri, Jul 12, 4:09 PM · serviceops, MW-on-K8s
ops-monitoring-bot added a comment to T351074: Move servers from the appserver/api cluster to kubernetes.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1349.eqiad.wmnet with OS buster

Fri, Jul 12, 4:09 PM · serviceops, MW-on-K8s
ops-monitoring-bot added a comment to T351074: Move servers from the appserver/api cluster to kubernetes.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1351.eqiad.wmnet with OS buster completed:

  • mw1351 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407121533_cgoubert_88255_mw1351.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Fri, Jul 12, 4:01 PM · serviceops, MW-on-K8s
ops-monitoring-bot added a comment to T351074: Move servers from the appserver/api cluster to kubernetes.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1350.eqiad.wmnet with OS buster completed:

  • mw1350 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407121526_cgoubert_88236_mw1350.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
Fri, Jul 12, 4:00 PM · serviceops, MW-on-K8s
ops-monitoring-bot added a comment to T351074: Move servers from the appserver/api cluster to kubernetes.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1349.eqiad.wmnet with OS buster completed:

  • mw1349 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407121523_cgoubert_88189_mw1349.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
Fri, Jul 12, 3:58 PM · serviceops, MW-on-K8s
ops-monitoring-bot added a comment to T351074: Move servers from the appserver/api cluster to kubernetes.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1351.eqiad.wmnet with OS buster

Fri, Jul 12, 3:07 PM · serviceops, MW-on-K8s
ops-monitoring-bot added a comment to T351074: Move servers from the appserver/api cluster to kubernetes.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1350.eqiad.wmnet with OS buster

Fri, Jul 12, 3:06 PM · serviceops, MW-on-K8s
ops-monitoring-bot added a comment to T351074: Move servers from the appserver/api cluster to kubernetes.

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1349.eqiad.wmnet with OS buster

Fri, Jul 12, 3:06 PM · serviceops, MW-on-K8s

Thu, Jul 11

ops-monitoring-bot added a comment to T362824: Q#:rack/setup/install dbproxy200[5-8].

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbproxy2005.codfw.wmnet with OS bookworm completed:

  • dbproxy2005 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407111909_pt1979_2208237_dbproxy2005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
Thu, Jul 11, 8:29 PM · DBA, SRE, ops-codfw, Data-Persistence, DC-Ops
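This run also records "Updated Netbox status planned -> active", the point where the host's lifecycle state is flipped in Netbox. A small sketch of the same change using the pynetbox client, with a placeholder URL and token (the cookbooks use their own Netbox tooling), might look like:

```python
import pynetbox

# Placeholder URL and token; not the production Netbox instance.
nb = pynetbox.api("https://netbox.example.org", token="0123456789abcdef")

def mark_device_active(name: str) -> None:
    """Flip a device's Netbox status from 'planned' to 'active' after a reimage."""
    device = nb.dcim.devices.get(name=name)
    if device is None:
        raise ValueError(f"device {name!r} not found in Netbox")
    if device.status.value != "planned":   # status is a value/label pair
        return                             # nothing to do
    device.status = "active"
    device.save()

# mark_device_active("dbproxy2005")
```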
ops-monitoring-bot added a comment to T362824: Q#:rack/setup/install dbproxy200[5-8].

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbproxy2005.codfw.wmnet with OS bookworm

Thu, Jul 11, 6:52 PM · DBA, SRE, ops-codfw, Data-Persistence, DC-Ops
ops-monitoring-bot added a comment to T368646: Znuny LTS 6.5.9.

Cookbook cookbooks.sre.vrts.upgrade started by aokoth@cumin1002 executed with errors on VRTS host vrts1001.eqiad.wmnet

Thu, Jul 11, 6:18 PM · vrts, collaboration-services, Znuny
ops-monitoring-bot added a comment to T368646: Znuny LTS 6.5.9.

Cookbook cookbooks.sre.vrts.upgrade was started by aokoth@cumin1002 on VRTS host vrts1001.eqiad.wmnet

Thu, Jul 11, 6:15 PM · vrts, collaboration-services, Znuny
ops-monitoring-bot added a comment to T368646: Znuny LTS 6.5.9.

Cookbook cookbooks.sre.vrts.upgrade was started by aokoth@cumin1002 on VRTS host vrts1001.eqiad.wmnet

Thu, Jul 11, 6:00 PM · vrts, collaboration-services, Znuny
ops-monitoring-bot added a comment to T364417: deploy1003 implementation tracking.

Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host deploy1003.eqiad.wmnet with OS bullseye completed:

  • deploy1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407111551_akosiaris_4099339_deploy1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Jul 11, 5:28 PM · serviceops
ops-monitoring-bot added a comment to T364417: deploy1003 implementation tracking.

Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host deploy1003.eqiad.wmnet with OS bullseye

Thu, Jul 11, 3:36 PM · serviceops
ops-monitoring-bot created T369829: Degraded RAID on dumpsdata1007.
Thu, Jul 11, 2:37 PM · Data-Engineering, DC-Ops, SRE, ops-eqiad
ops-monitoring-bot added a comment to T365996: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad.

Icinga downtime and Alertmanager silence (ID=de50ae5f-fec9-4347-b2ef-225a3af373f6) set by cmooney@cumin1002 for 0:30:00 on 23 host(s) and their services with reason: JunOS upgrade lsw1-f1-eqiad

an-coord1004.eqiad.wmnet,an-mariadb1002.eqiad.wmnet,an-presto[1011-1012].eqiad.wmnet,an-worker[1144,1148,1155].eqiad.wmnet,backup1011.eqiad.wmnet,cephosd1004.eqiad.wmnet,db1193.eqiad.wmnet,dbproxy1027.eqiad.wmnet,dse-k8s-worker1007.eqiad.wmnet,dumpsdata1007.eqiad.wmnet,elastic[1096-1097,1106].eqiad.wmnet,kafka-jumbo1013.eqiad.wmnet,logstash1037.eqiad.wmnet,ml-cache1003.eqiad.wmnet,ms-be1070.eqiad.wmnet,ms-fe1014.eqiad.wmnet,thanos-fe1004.eqiad.wmnet,titan1001.eqiad.wmnet
Thu, Jul 11, 2:19 PM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
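The host list above uses the folded bracket notation emitted by cumin and ClusterShell; the "23 host(s)" figure is the size of the set once the ranges are expanded. Assuming the ClusterShell library is available, a fragment of the list can be expanded like this:

```python
from ClusterShell.NodeSet import NodeSet

# A fragment of the host list above, in the folded notation used in the comment.
folded = ("an-presto[1011-1012].eqiad.wmnet,"
          "an-worker[1144,1148,1155].eqiad.wmnet,"
          "elastic[1096-1097,1106].eqiad.wmnet")

nodes = NodeSet(folded)
print(len(nodes))      # 8 hosts in this fragment
for fqdn in sorted(nodes):
    print(fqdn)        # an-presto1011.eqiad.wmnet, an-presto1012.eqiad.wmnet, ...
```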
ops-monitoring-bot added a comment to T365996: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad.

Icinga downtime and Alertmanager silence (ID=1d5a6d4b-345e-4f18-8342-05572d6411e7) set by cmooney@cumin1002 for 0:30:00 on 23 host(s) and their services with reason: JunOS upgrade lsw1-f1-eqiad

an-coord1004.eqiad.wmnet,an-mariadb1002.eqiad.wmnet,an-presto[1011-1012].eqiad.wmnet,an-worker[1144,1148,1155].eqiad.wmnet,backup1011.eqiad.wmnet,cephosd1004.eqiad.wmnet,db1193.eqiad.wmnet,dbproxy1027.eqiad.wmnet,dse-k8s-worker1007.eqiad.wmnet,dumpsdata1007.eqiad.wmnet,elastic[1096-1097,1106].eqiad.wmnet,kafka-jumbo1013.eqiad.wmnet,logstash1037.eqiad.wmnet,ml-cache1003.eqiad.wmnet,ms-be1070.eqiad.wmnet,ms-fe1014.eqiad.wmnet,thanos-fe1004.eqiad.wmnet,titan1001.eqiad.wmnet
Thu, Jul 11, 2:12 PM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
ops-monitoring-bot added a comment to T365996: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad.

Icinga downtime and Alertmanager silence (ID=d7f08b17-a319-4077-a271-a0ef15a438a3) set by cmooney@cumin1002 for 0:30:00 on 4 host(s) and their services with reason: JunOS upgrade lsw1-f1-eqiad

lsw1-f1-eqiad,lsw1-f1-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt
Thu, Jul 11, 2:09 PM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
ops-monitoring-bot added a comment to T365996: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad.

Icinga downtime and Alertmanager silence (ID=9abb3472-bf69-45f5-8c93-e3c8cfbe9e4e) set by cmooney@cumin1002 for 0:50:00 on 1 host(s) and their services with reason: prep JunOS upgrade lsw1-f1-eqiad

lsw1-f1-eqiad.mgmt
Thu, Jul 11, 2:08 PM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
ops-monitoring-bot added a comment to T336275: Upgrade Netbox to 4.x.

Icinga downtime and Alertmanager silence (ID=649a3ed0-08fc-40b8-a899-14c7a81aaa41) set by ayounsi@cumin1002 for 4 days, 0:00:00 on 1 host(s) and their services with reason: netbox upgrade prep work

netboxdb2003.codfw.wmnet
Thu, Jul 11, 12:51 PM · Patch-For-Review, Infrastructure-Foundations, netbox
ops-monitoring-bot added a comment to T336275: Upgrade Netbox to 4.x.

Icinga downtime and Alertmanager silence (ID=4abde2ff-0621-44ff-ad19-09d19fe0d4a2) set by ayounsi@cumin1002 for 4 days, 0:00:00 on 1 host(s) and their services with reason: netbox upgrade prep work

netboxdb1003.eqiad.wmnet
Thu, Jul 11, 12:51 PM · Patch-For-Review, Infrastructure-Foundations, netbox
ops-monitoring-bot added a comment to T336275: Upgrade Netbox to 4.x.

Icinga downtime and Alertmanager silence (ID=df7b46ee-b552-4bdd-9b54-9bed50fb98cd) set by ayounsi@cumin1002 for 4 days, 0:00:00 on 1 host(s) and their services with reason: netbox upgrade prep work

netbox2003.codfw.wmnet
Thu, Jul 11, 11:29 AM · Patch-For-Review, Infrastructure-Foundations, netbox
ops-monitoring-bot added a comment to T336275: Upgrade Netbox to 4.x.

Icinga downtime and Alertmanager silence (ID=05ca8c35-9b32-4c3a-9b80-5e01ef75b7f9) set by ayounsi@cumin1002 for 4 days, 0:00:00 on 1 host(s) and their services with reason: netbox upgrade prep work

netbox1003.eqiad.wmnet
Thu, Jul 11, 11:29 AM · Patch-For-Review, Infrastructure-Foundations, netbox

Wed, Jul 10

ops-monitoring-bot added a comment to T369116: an-presto1004 has reduced total memory size.

Icinga downtime and Alertmanager silence (ID=df8ef7ff-f8eb-4122-8c2e-2d486a4690ab) set by btullis@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Shutting down to investigate RAM issue

an-presto1004.eqiad.wmnet
Wed, Jul 10, 3:36 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
ops-monitoring-bot added a comment to T365993: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad.

Icinga downtime and Alertmanager silence (ID=9475b2b6-bc5f-41f8-97d1-970eb62b38bc) set by cmooney@cumin1002 for 0:30:00 on 26 host(s) and their services with reason: JunOS upgrade lsw1-e1-eqiad

an-coord1003.eqiad.wmnet,an-mariadb1001.eqiad.wmnet,an-presto[1006-1007].eqiad.wmnet,an-worker[1142,1147,1153].eqiad.wmnet,backup1010.eqiad.wmnet,cephosd1001.eqiad.wmnet,db1190.eqiad.wmnet,dbproxy1026.eqiad.wmnet,dse-k8s-worker1005.eqiad.wmnet,dumpsdata1006.eqiad.wmnet,elastic[1089-1090,1104].eqiad.wmnet,kafka-jumbo1010.eqiad.wmnet,kubernetes1059.eqiad.wmnet,logstash1036.eqiad.wmnet,lvs[1013-1015].eqiad.wmnet,ml-cache1001.eqiad.wmnet,ms-be1068.eqiad.wmnet,ms-fe1012.eqiad.wmnet,stat1010.eqiad.wmnet
Wed, Jul 10, 3:24 PM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
ops-monitoring-bot added a comment to T365993: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad.

Icinga downtime and Alertmanager silence (ID=5386f05e-734c-49b0-a4c5-1acbef4c187a) set by cmooney@cumin1002 for 0:30:00 on 4 host(s) and their services with reason: JunOS upgrade lsw1-e1-eqiad

lsw1-e1-eqiad,lsw1-e1-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt
Wed, Jul 10, 3:23 PM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
ops-monitoring-bot added a comment to T365993: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad.

Icinga downtime and Alertmanager silence (ID=9ca0faf1-4b9d-4345-9bb8-9c7153e17163) set by cmooney@cumin1002 for 1:30:00 on 1 host(s) and their services with reason: prep JunOS upgrade lsw1-e1-eqiad

lsw1-e1-eqiad.mgmt
Wed, Jul 10, 2:08 PM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
ops-monitoring-bot added a comment to T369011: hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet.

Icinga downtime and Alertmanager silence (ID=3f55d01c-31c3-4e2a-8c13-b4c6da9484f8) set by cgoubert@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Hardware issue

kubernetes1051.eqiad.wmnet
Wed, Jul 10, 1:56 PM · SRE, ops-eqiad, DC-Ops, Prod-Kubernetes, serviceops
ops-monitoring-bot added a comment to T365503: Upgrade mariadb on analytics_meta from 10.4 to 10.6.

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-mariadb1001.eqiad.wmnet with OS bookworm completed:

  • an-mariadb1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407101318_btullis_3886281_an-mariadb1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
Wed, Jul 10, 1:35 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review
ops-monitoring-bot added a comment to T365503: Upgrade mariadb on analytics_meta from 10.4 to 10.6.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-mariadb1001.eqiad.wmnet with OS bookworm

Wed, Jul 10, 1:01 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review

Tue, Jul 9

ops-monitoring-bot added a comment to T336275: Upgrade Netbox to 4.x.

Deployed netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.6 to netbox-next - ayounsi@cumin1002 - T336275

Tue, Jul 9, 3:44 PM · Patch-For-Review, Infrastructure-Foundations, netbox
ops-monitoring-bot added a comment to T336275: Upgrade Netbox to 4.x.

Deployed netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.6 to netbox-next - ayounsi@cumin1002 - T336275

Tue, Jul 9, 3:14 PM · Patch-For-Review, Infrastructure-Foundations, netbox
ops-monitoring-bot added a comment to T365998: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f3-eqiad.

Icinga downtime and Alertmanager silence (ID=2a5cb43e-793c-4103-9499-369354315479) set by cmooney@cumin1002 for 0:40:00 on 27 host(s) and their services with reason: JunOS upgrade lsw1-e3-eqiad

an-presto1010.eqiad.wmnet,an-worker1154.eqiad.wmnet,backup1009.eqiad.wmnet,cephosd1003.eqiad.wmnet,db[1192,1198-1199,1204].eqiad.wmnet,druid1010.eqiad.wmnet,dse-k8s-worker1006.eqiad.wmnet,elastic[1093-1095].eqiad.wmnet,kafka-jumbo1012.eqiad.wmnet,kafka-stretch1001.eqiad.wmnet,kubernetes[1047-1051,1061].eqiad.wmnet,ml-serve1006.eqiad.wmnet,ms-be1074.eqiad.wmnet,mw[1491-1493].eqiad.wmnet,wdqs1015.eqiad.wmnet
Tue, Jul 9, 3:03 PM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
ops-monitoring-bot added a comment to T365998: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f3-eqiad.

Icinga downtime and Alertmanager silence (ID=39fcbcd0-8c16-4208-ac06-f4b442e55a54) set by cmooney@cumin1002 for 0:30:00 on 4 host(s) and their services with reason: JunOS upgrade lsw1-e3-eqiad

lsw1-e3-eqiad,lsw1-e3-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt
Tue, Jul 9, 3:00 PM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
ops-monitoring-bot added a comment to T365998: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f3-eqiad.

Icinga downtime and Alertmanager silence (ID=6a298ae5-e736-4051-8220-9ec4f352950a) set by cmooney@cumin1002 for 0:40:00 on 1 host(s) and their services with reason: prep JunOS upgrade lsw1-e3-eqiad

lsw1-e3-eqiad.mgmt
Tue, Jul 9, 2:54 PM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
ops-monitoring-bot added a comment to T336275: Upgrade Netbox to 4.x.

Deployed netbox to netbox-dev2003.codfw.wmnet with reason: Release v4.0.6 to netbox-next - ayounsi@cumin1002 - T336275

Tue, Jul 9, 2:26 PM · Patch-For-Review, Infrastructure-Foundations, netbox
ops-monitoring-bot added a comment to T331706: Migrate Mailman/lists to Bullseye/Bookworm.

cookbooks.sre.hosts.decommission executed by eoghan@cumin1002 for hosts: lists1001.wikimedia.org

  • lists1001.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
Tue, Jul 9, 12:01 PM · Patch-For-Review, collaboration-services, Wikimedia-Mailing-lists, SRE
ops-monitoring-bot added a comment to T336275: Upgrade Netbox to 4.x.

cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: netbox-dev2002.codfw.wmnet

  • netbox-dev2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
Tue, Jul 9, 7:42 AM · Patch-For-Review, Infrastructure-Foundations, netbox

Mon, Jul 8

ops-monitoring-bot added a comment to T309789: [ceph] Upgrade hosts to bullseye.

Host rebooted by dcaro@cumin1002 with reason: upgraded packages

Mon, Jul 8, 4:09 PM · cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Goal, Cloud-Services-Worktype-Maintenance, Cloud-Services-Origin-Team, User-dcaro
ops-monitoring-bot added a comment to T309789: [ceph] Upgrade hosts to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host cloudcephosd1011.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1011 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407081516_root_3515842_cloudcephosd1011.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Mon, Jul 8, 4:08 PM · cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Goal, Cloud-Services-Worktype-Maintenance, Cloud-Services-Origin-Team, User-dcaro
ops-monitoring-bot added a comment to T309789: [ceph] Upgrade hosts to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host cloudcephosd1011.eqiad.wmnet with OS bullseye

Mon, Jul 8, 2:59 PM · cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Goal, Cloud-Services-Worktype-Maintenance, Cloud-Services-Origin-Team, User-dcaro
ops-monitoring-bot added a comment to T365503: Upgrade mariadb on analytics_meta from 10.4 to 10.6.

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-mariadb1002.eqiad.wmnet with OS bookworm completed:

  • an-mariadb1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407081235_btullis_3488808_an-mariadb1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
Mon, Jul 8, 12:51 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review
ops-monitoring-bot added a comment to T365503: Upgrade mariadb on analytics_meta from 10.4 to 10.6.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-mariadb1002.eqiad.wmnet with OS bookworm

Mon, Jul 8, 12:19 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review

Fri, Jul 5

ops-monitoring-bot added a comment to T369011: hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet.

Icinga downtime and Alertmanager silence (ID=53cf057a-4641-401a-ab84-392d5d8f2444) set by cgoubert@cumin1002 for 4 days, 0:00:00 on 1 host(s) and their services with reason: Hardware issue

kubernetes1051.eqiad.wmnet
Fri, Jul 5, 11:30 AM · SRE, ops-eqiad, DC-Ops, Prod-Kubernetes, serviceops

Thu, Jul 4

ops-monitoring-bot added a comment to T309789: [ceph] Upgrade hosts to bullseye.

Host rebooted by dcaro@cumin1002 with reason: upgraded packages

Thu, Jul 4, 9:25 AM · cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Goal, Cloud-Services-Worktype-Maintenance, Cloud-Services-Origin-Team, User-dcaro
ops-monitoring-bot added a comment to T309789: [ceph] Upgrade hosts to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host cloudcephosd1009.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1009 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407040903_root_2725966_cloudcephosd1009.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Thu, Jul 4, 9:24 AM · cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Goal, Cloud-Services-Worktype-Maintenance, Cloud-Services-Origin-Team, User-dcaro
ops-monitoring-bot added a comment to T309789: [ceph] Upgrade hosts to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host cloudcephosd1009.eqiad.wmnet with OS bullseye

Thu, Jul 4, 8:45 AM · cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Goal, Cloud-Services-Worktype-Maintenance, Cloud-Services-Origin-Team, User-dcaro
ops-monitoring-bot added a comment to T364299: Make rc_id a bigint.

Icinga downtime and Alertmanager silence (ID=064f5419-e335-47d7-ba65-641c70236889) set by marostegui@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Long schema change

db1231.eqiad.wmnet
Thu, Jul 4, 5:08 AM · Schema-change-in-production, DBA
ops-monitoring-bot added a comment to T363399: Q4:rack/setup/install parsoidtest1001.

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye completed:

  • parsoidtest1001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407040029_dzahn_2656542_parsoidtest1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Thu, Jul 4, 12:44 AM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T363399: Q4:rack/setup/install parsoidtest1001.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye

Thu, Jul 4, 12:15 AM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops

Wed, Jul 3

ops-monitoring-bot added a comment to T363399: Q4:rack/setup/install parsoidtest1001.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye executed with errors:

  • parsoidtest1001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" parsoidtest1001.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
Wed, Jul 3, 10:36 PM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T363399: Q4:rack/setup/install parsoidtest1001.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye

Wed, Jul 3, 9:56 PM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot created T369229: Degraded RAID on db2161.
Wed, Jul 3, 9:43 PM · DBA, SRE, ops-codfw, DC-Ops
ops-monitoring-bot added a comment to T367512: Get test host connected to codfw row c/d lsw's.

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2002.codfw.wmnet with OS bookworm

Wed, Jul 3, 7:55 PM · DC-Ops, ops-codfw, SRE, netops, Infrastructure-Foundations
ops-monitoring-bot added a comment to T367512: Get test host connected to codfw row c/d lsw's.

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2002.codfw.wmnet with OS bookworm

Wed, Jul 3, 7:25 PM · DC-Ops, ops-codfw, SRE, netops, Infrastructure-Foundations
ops-monitoring-bot added a comment to T367512: Get test host connected to codfw row c/d lsw's.

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host sretest2002.codfw.wmnet with OS bookworm executed with errors:

  • sretest2002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" sretest2002.codfw.wmnet to get a root shell, but depending on the failure this may not work.
Wed, Jul 3, 7:24 PM · DC-Ops, ops-codfw, SRE, netops, Infrastructure-Foundations
ops-monitoring-bot added a comment to T367512: Get test host connected to codfw row c/d lsw's.

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2002.codfw.wmnet with OS bookworm

Wed, Jul 3, 7:19 PM · DC-Ops, ops-codfw, SRE, netops, Infrastructure-Foundations
ops-monitoring-bot added a comment to T367512: Get test host connected to codfw row c/d lsw's.

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2002.codfw.wmnet with OS bookworm

Wed, Jul 3, 6:54 PM · DC-Ops, ops-codfw, SRE, netops, Infrastructure-Foundations
ops-monitoring-bot added a comment to T369116: an-presto1004 has reduced total memory size.

Icinga downtime and Alertmanager silence (ID=f9599f38-9a28-4b43-8244-c2dd542891f0) set by btullis@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Cold booting to investigate RAM issue

an-presto1004.eqiad.wmnet
Wed, Jul 3, 4:47 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
ops-monitoring-bot added a comment to T363399: Q4:rack/setup/install parsoidtest1001.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye executed with errors:

  • parsoidtest1001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" parsoidtest1001.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
Wed, Jul 3, 2:54 PM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T364429: Q4:rack/setup/install an-conf100[4-6].

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-conf1006.eqiad.wmnet with OS bookworm executed with errors:

  • an-conf1006 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" an-conf1006.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
Wed, Jul 3, 2:40 PM · Patch-For-Review, SRE, Data-Engineering, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T364429: Q4:rack/setup/install an-conf100[4-6].

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-conf1005.eqiad.wmnet with OS bookworm executed with errors:

  • an-conf1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" an-conf1005.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
Wed, Jul 3, 2:40 PM · Patch-For-Review, SRE, Data-Engineering, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T364429: Q4:rack/setup/install an-conf100[4-6].

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-conf1004.eqiad.wmnet with OS bookworm executed with errors:

  • an-conf1004 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" an-conf1004.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
Wed, Jul 3, 2:40 PM · Patch-For-Review, SRE, Data-Engineering, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T363399: Q4:rack/setup/install parsoidtest1001.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye

Wed, Jul 3, 2:11 PM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T363399: Q4:rack/setup/install parsoidtest1001.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye executed with errors:

  • parsoidtest1001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" parsoidtest1001.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
Wed, Jul 3, 2:07 PM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T365994: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad.

Icinga downtime and Alertmanager silence (ID=11036a9f-0b48-4b07-9e63-571b4f67c201) set by cmooney@cumin1002 for 0:40:00 on 22 host(s) and their services with reason: JunOS upgrade lsw1-e2-eqiad

an-presto[1008-1009].eqiad.wmnet,an-worker1143.eqiad.wmnet,aqs1020.eqiad.wmnet,cephosd1002.eqiad.wmnet,db[1191,1196-1197].eqiad.wmnet,dbstore1008.eqiad.wmnet,druid1009.eqiad.wmnet,elastic[1091-1092].eqiad.wmnet,kafka-jumbo1011.eqiad.wmnet,kafka-logging1004.eqiad.wmnet,kubernetes1060.eqiad.wmnet,lvs1016.eqiad.wmnet,ml-serve1005.eqiad.wmnet,ms-be1069.eqiad.wmnet,wdqs[1018,1020].eqiad.wmnet,wikikube-worker[1007,1021].eqiad.wmnet
Wed, Jul 3, 2:01 PM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
ops-monitoring-bot added a comment to T365994: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad.

Icinga downtime and Alertmanager silence (ID=185956f6-b0e6-4a89-9e32-6a8223f5678e) set by cmooney@cumin1002 for 0:40:00 on 4 host(s) and their services with reason: JunOS upgrade lsw1-e2-eqiad

lsw1-e2-eqiad,lsw1-e2-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt
Wed, Jul 3, 1:58 PM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
ops-monitoring-bot added a comment to T365994: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad.

Icinga downtime and Alertmanager silence (ID=753739a5-e1fb-44b6-9174-f7b3a8c4b73b) set by jayme@cumin1002 for 1:20:00 on 3 host(s) and their services with reason: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2

kubernetes1060.eqiad.wmnet,wikikube-worker[1007,1021].eqiad.wmnet
Wed, Jul 3, 1:56 PM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
ops-monitoring-bot added a comment to T365994: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad.

Icinga downtime and Alertmanager silence (ID=c8dbb89d-640c-4078-bc10-bbbe9c30f3ef) set by cmooney@cumin1002 for 0:50:00 on 1 host(s) and their services with reason: prep JunOS upgrade lsw1-e2-eqiad

lsw1-e2-eqiad.mgmt
Wed, Jul 3, 1:53 PM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
ops-monitoring-bot added a comment to T364429: Q4:rack/setup/install an-conf100[4-6].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-conf1006.eqiad.wmnet with OS bookworm

Wed, Jul 3, 1:29 PM · Patch-For-Review, SRE, Data-Engineering, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T364429: Q4:rack/setup/install an-conf100[4-6].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-conf1005.eqiad.wmnet with OS bookworm

Wed, Jul 3, 1:29 PM · Patch-For-Review, SRE, Data-Engineering, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T364429: Q4:rack/setup/install an-conf100[4-6].

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-conf1004.eqiad.wmnet with OS bookworm

Wed, Jul 3, 1:29 PM · Patch-For-Review, SRE, Data-Engineering, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T363399: Q4:rack/setup/install parsoidtest1001.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host parsoidtest1001.eqiad.wmnet with OS bullseye

Wed, Jul 3, 1:22 PM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T367856: Cleanup revision table schema.

Icinga downtime and Alertmanager silence (ID=8806bd66-4c8a-4047-9bba-1c5cb25125be) set by marostegui@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Long schema change

db2207.codfw.wmnet
Wed, Jul 3, 5:23 AM · Data-Engineering, Schema-change-in-production, DBA, Data Products

Tue, Jul 2

ops-monitoring-bot added a comment to T369011: hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet.

Icinga downtime and Alertmanager silence (ID=1d5196ee-59a9-4e12-b2fc-c8c25de6ab16) set by cgoubert@cumin1002 for 20:00:00 on 1 host(s) and their services with reason: Hardware issue

kubernetes1051.eqiad.wmnet
Tue, Jul 2, 3:45 PM · SRE, ops-eqiad, DC-Ops, Prod-Kubernetes, serviceops
ops-monitoring-bot added a comment to T353464: Migrate wikikube control planes to hardware nodes.

cookbooks.sre.hosts.decommission executed by jiji@cumin1002 for hosts: kubetcd[2004-2006].codfw.wmnet

  • kubetcd2004.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
Tue, Jul 2, 3:06 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T353464: Migrate wikikube control planes to hardware nodes.

cookbooks.sre.hosts.decommission executed by jiji@cumin1002 for hosts: kubetcd[1004-1006].eqiad.wmnet

  • kubetcd1004.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
Tue, Jul 2, 2:51 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T309789: [ceph] Upgrade hosts to bullseye.

Host rebooted by dcaro@cumin1002 with reason: upgraded packages

Tue, Jul 2, 2:19 PM · cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Goal, Cloud-Services-Worktype-Maintenance, Cloud-Services-Origin-Team, User-dcaro
ops-monitoring-bot added a comment to T309789: [ceph] Upgrade hosts to bullseye.

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host cloudcephosd1008.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1008 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407021207_root_2343933_cloudcephosd1008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Tue, Jul 2, 2:15 PM · cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Goal, Cloud-Services-Worktype-Maintenance, Cloud-Services-Origin-Team, User-dcaro
ops-monitoring-bot added a comment to T353464: Migrate wikikube control planes to hardware nodes.

Icinga downtime and Alertmanager silence (ID=53db6080-f00f-4a86-ae49-cafba7047a9d) set by jiji@cumin1002 for 2 days, 0:00:00 on 6 host(s) and their services with reason: decom

kubetcd[2004-2006].codfw.wmnet,kubetcd[1004-1006].eqiad.wmnet
Tue, Jul 2, 2:12 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T353464: Migrate wikikube control planes to hardware nodes.

cookbooks.sre.hosts.decommission executed by jiji@cumin1002 for hosts: kubemaster[1001-1002].eqiad.wmnet

  • kubemaster1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
Tue, Jul 2, 1:21 PM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
ops-monitoring-bot added a comment to T351074: Move servers from the appserver/api cluster to kubernetes.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2032.codfw.wmnet with OS bullseye completed:

  • wikikube-worker2032 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407021249_cgoubert_2351589_wikikube-worker2032.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Tue, Jul 2, 1:09 PM · serviceops, MW-on-K8s
ops-monitoring-bot added a comment to T351074: Move servers from the appserver/api cluster to kubernetes.

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2033.codfw.wmnet with OS bullseye completed:

  • wikikube-worker2033 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202407021245_cgoubert_2351640_wikikube-worker2033.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
Tue, Jul 2, 1:04 PM · serviceops, MW-on-K8s