
Improve the InterfaceSpeedError alert
Open, MediumPublic

Description

Follow-up from T351862: InterfaceSpeedError. Changes to make the alert more actionable for DCops:

  • The dashboard link points to an unrelated dashboard
  • The description reads "on alert1001:9100 has the wrong speed: 1.2e+07". It should use a more readable notation, like "Speed: 100Mb/s"
  • The task title could be a bit more explicit (maybe mention the host)
  • The runbook link should point to https://wikitech.wikimedia.org/wiki/Monitoring/check_eth#InterfaceSpeedError (to be more specific) and have steps to follow when it triggers
  • Most of the information is repeated twice, and it's unclear which bit is the important information as opposed to all the metadata.
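
For the speed notation point, here is a rough sketch of the kind of conversion meant. It assumes the raw value in the alert is a link speed in bytes per second, so that the rounded 1.2e+07 corresponds to roughly 100Mb/s; that unit is an assumption, not something confirmed in this task. The function name `format_speed` is hypothetical and is not part of the existing alert or any library.

```python
def format_speed(bytes_per_second: float) -> str:
    """Render a link speed given in bytes/s as a short bits/s string.

    Assumption: the input is in bytes per second. Decimal (SI) prefixes
    are used, as is conventional for network link speeds.
    """
    bits = bytes_per_second * 8
    # Try the largest unit first and stop at the first one that fits.
    for unit, factor in (("Gb/s", 1e9), ("Mb/s", 1e6), ("Kb/s", 1e3)):
        if bits >= factor:
            # ":g" drops trailing zeros, so 100.0 is rendered as "100".
            return f"{bits / factor:g}{unit}"
    return f"{bits:g}b/s"


if __name__ == "__main__":
    print("Speed:", format_speed(12_500_000))
```

With an unrounded input of 12,500,000 bytes/s this gives "100Mb/s". The alert itself would probably do the equivalent inside its annotation template rather than in Python; the sketch only illustrates the target notation.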

Event Timeline

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1053.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1053.eqiad.wmnet with OS bookworm completed:

  • cloudvirt1053 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406210916_aborrero_181938_cloudvirt1053.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Sorry for the reimage messages, I copy-pasted the wrong Phab ticket ID.