Q4: magru VM tracking task
Closed, Resolved (Public)

Description

magru01 (ready, running on ganeti7001/7003):

  • netflow7001.magru.wmnet
  • doh7001.wikimedia.org
  • durum7001.magru.wmnet
  • install7001.wikimedia.org
  • ncredir7001.magru.wmnet

magru02 (ready, running on ganeti7002/7004):

  • doh7002.wikimedia.org
  • durum7002.magru.wmnet
  • ncredir7002.magru.wmnet
  • bast7001.wikimedia.org
  • prometheus7001.magru.wmnet

Event Timeline

MoritzMuehlenhoff updated the task description.

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host durum7001.magru.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host durum7001.magru.wmnet with OS bookworm completed:

  • durum7001 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405021616_sukhe_4120555_durum7001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host doh7001.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host doh7001.wikimedia.org with OS bookworm completed:

  • doh7001 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405021911_sukhe_4141526_doh7001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change #1026729 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add install7001 to site.pp

https://gerrit.wikimedia.org/r/1026729

Change #1026729 merged by Muehlenhoff:

[operations/puppet@production] Add install7001 to site.pp

https://gerrit.wikimedia.org/r/1026729

Change #1026786 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add bast7001 to site.pp

https://gerrit.wikimedia.org/r/1026786

Change #1026786 merged by Muehlenhoff:

[operations/puppet@production] Add bast7001 to site.pp

https://gerrit.wikimedia.org/r/1026786

Change #1026787 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] preseed: Extend globbing for bast and prometheus to cover magru

https://gerrit.wikimedia.org/r/1026787

Change #1026787 merged by Muehlenhoff:

[operations/puppet@production] preseed: Extend globbing for bast and prometheus to cover magru

https://gerrit.wikimedia.org/r/1026787

Mentioned in SAL (#wikimedia-operations) [2024-05-03T10:32:03Z] <jmm@cumin2002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add bast7001 - jmm@cumin2002 - T364016"

Mentioned in SAL (#wikimedia-operations) [2024-05-03T10:33:43Z] <jmm@cumin2002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add bast7001 - jmm@cumin2002 - T364016"

Change #1026824 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Make bast7001 a bastion

https://gerrit.wikimedia.org/r/1026824

Change #1026824 merged by Muehlenhoff:

[operations/puppet@production] Make bast7001 a bastion

https://gerrit.wikimedia.org/r/1026824

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host durum7002.magru.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host doh7002.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host durum7002.magru.wmnet with OS bookworm completed:

  • durum7002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405031127_sukhe_72384_durum7002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host doh7002.wikimedia.org with OS bookworm completed:

  • doh7002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405031141_sukhe_77196_doh7002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1028229 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Make install7001 an installserver

https://gerrit.wikimedia.org/r/1028229

Change #1028229 merged by Muehlenhoff:

[operations/puppet@production] Make install7001 an installserver

https://gerrit.wikimedia.org/r/1028229

Change #1028236 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[labs/private@master] Add dummy keytab for install7001

https://gerrit.wikimedia.org/r/1028236

Change #1028236 merged by Muehlenhoff:

[labs/private@master] Add dummy keytab for install7001

https://gerrit.wikimedia.org/r/1028236

Change #1028245 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Enable install7001 as webproxy in magru

https://gerrit.wikimedia.org/r/1028245

Change #1028245 merged by Muehlenhoff:

[operations/dns@master] Enable install7001 as webproxy in magru

https://gerrit.wikimedia.org/r/1028245

Change #1028456 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] site: add prometheus7001

https://gerrit.wikimedia.org/r/1028456

Change #1028456 merged by Filippo Giunchedi:

[operations/puppet@production] site: add prometheus7001

https://gerrit.wikimedia.org/r/1028456

Change #1028464 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] wmnet: add prometheus.svc.magru

https://gerrit.wikimedia.org/r/1028464

Change #1028501 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] site: provision prometheus7001 with insetup

https://gerrit.wikimedia.org/r/1028501

Change #1028502 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: use datacenters for snmp_exporter

https://gerrit.wikimedia.org/r/1028502

Change #1028503 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] grafana: add magru prometheus

https://gerrit.wikimedia.org/r/1028503

Change #1028504 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] trafficserver: add prometheus-magru.w.o

https://gerrit.wikimedia.org/r/1028504

Change #1028501 merged by Filippo Giunchedi:

[operations/puppet@production] site: provision prometheus7001 with insetup

https://gerrit.wikimedia.org/r/1028501

cookbooks.sre.hosts.decommission executed by filippo@cumin1002 for hosts: prometheus7001.magru.wmnet

  • prometheus7001.magru.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster magru02 to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster magru02 to Netbox

I tried installing prometheus7001 today with help from @Muehlenhoff, but there's no console, and a PXE/TFTP interaction with install7001 is suspected. I'll hold off on further steps until VMs can be installed.

Change #1028707 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Revert "hiera: update installserver for magru"

https://gerrit.wikimedia.org/r/1028707

Change #1028707 merged by Muehlenhoff:

[operations/puppet@production] Revert "hiera: update installserver for magru"

https://gerrit.wikimedia.org/r/1028707

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: install7001.wikimedia.org

  • install7001.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster magru01 to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster magru01 to Netbox

Change #1028502 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: use datacenters for snmp_exporter

https://gerrit.wikimedia.org/r/1028502

Change #1028759 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: assemble snmp.yml when updating modules

https://gerrit.wikimedia.org/r/1028759

Change #1028759 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: assemble snmp.yml when updating modules

https://gerrit.wikimedia.org/r/1028759

cookbooks.sre.hosts.decommission executed by filippo@cumin1002 for hosts: prometheus7001.magru.wmnet

  • prometheus7001.magru.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster magru02 to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster magru02 to Netbox

cookbooks.sre.hosts.decommission executed by filippo@cumin1002 for hosts: prometheus7001.magru.wmnet

  • prometheus7001.magru.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster magru02 to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster magru02 to Netbox

Change #1028848 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Revert "site: provision prometheus7001 with insetup"

https://gerrit.wikimedia.org/r/1028848

Change #1028848 merged by Filippo Giunchedi:

[operations/puppet@production] Revert "site: provision prometheus7001 with insetup"

https://gerrit.wikimedia.org/r/1028848

Change #1028464 merged by Filippo Giunchedi:

[operations/dns@master] wmnet: add prometheus.svc.magru

https://gerrit.wikimedia.org/r/1028464

Mentioned in SAL (#wikimedia-operations) [2024-05-07T14:50:19Z] <godog> silence site=magru alerts during prometheus7001 - T364016

Change #1028503 merged by Filippo Giunchedi:

[operations/puppet@production] grafana: add magru prometheus

https://gerrit.wikimedia.org/r/1028503

Change #1028504 merged by Filippo Giunchedi:

[operations/puppet@production] trafficserver: add prometheus-magru.w.o

https://gerrit.wikimedia.org/r/1028504

Mentioned in SAL (#wikimedia-operations) [2024-05-07T15:13:17Z] <godog> remove accidentally set site!=magru silence, add site=magru silence instead - T364016
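
For readers unfamiliar with the difference between the two silences mentioned above, here is a minimal sketch of how an equality matcher (`site=magru`) and a negated matcher (`site!=magru`) select alerts. The `matches` helper, the `HostDown` alert name, and the non-magru site values are hypothetical and used only for illustration; this is not the Alertmanager implementation, just the label comparison it is assumed to perform, with a missing label treated as an empty string.

```python
def matches(labels, name, op, value):
    """Return True if an alert's labels satisfy one equality-style matcher."""
    # Assumption: a missing label compares as the empty string.
    actual = labels.get(name, "")
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    raise ValueError(f"unsupported operator: {op}")


# Hypothetical alerts, one per example site label.
alerts = [{"alertname": "HostDown", "site": s} for s in ("eqiad", "codfw", "magru")]

# site=magru silences only the magru alerts (the intended silence).
intended = [a["site"] for a in alerts if matches(a, "site", "=", "magru")]

# site!=magru silences every alert outside magru (the accidental silence).
accidental = [a["site"] for a in alerts if matches(a, "site", "!=", "magru")]

print(intended)
print(accidental)
```

Under these assumptions the negated matcher selects the complement of the intended set, which is why the accidental silence had to be removed and replaced.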

From my POV, Prometheus in magru is live and working; see also https://prometheus-magru.wikimedia.org/