Observability Bookworm upgrades
Open, In Progress, Needs Triage, Public

Description

Tracking task for o11y Bookworm upgrades

  • Alert (2 hosts) T333615
  • Grafana (2 hosts) T352665
  • Centrallog (2 hosts)
  • Logstash (30 hosts)
  • Prometheus (4 hosts)
  • Kafka-logging (6 hosts)
  • Kafkamon (2 hosts)
  • mwlog (2 hosts)

Event Timeline

andrea.denisse changed the task status from Open to In Progress. Jan 8 2024, 4:24 PM

Change #1014057 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] beta-logs: add ssd-0[123] host configs

https://gerrit.wikimedia.org/r/1014057

Change #1014057 merged by Cwhite:

[operations/puppet@production] beta-logs: add ssd-0[123] host configs

https://gerrit.wikimedia.org/r/1014057

Change #1014062 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] beta-logs: replace logging-logstash-01 with -03

https://gerrit.wikimedia.org/r/1014062

Change #1014062 merged by Cwhite:

[operations/puppet@production] beta-logs: replace logging-logstash-01 with -03

https://gerrit.wikimedia.org/r/1014062

Change #1014063 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: enable openjdk-17 support

https://gerrit.wikimedia.org/r/1014063

Change #1014063 merged by Cwhite:

[operations/puppet@production] logstash: enable openjdk-17 support

https://gerrit.wikimedia.org/r/1014063

Change #1014064 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: introduce java_package option

https://gerrit.wikimedia.org/r/1014064

Change #1014064 merged by Cwhite:

[operations/puppet@production] logstash: introduce java_package option

https://gerrit.wikimedia.org/r/1014064

Change #1014664 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] beta-logs: move jobs host duties to logging-logstash-03

https://gerrit.wikimedia.org/r/1014664

Change #1014664 merged by Cwhite:

[operations/puppet@production] beta-logs: move jobs host duties to logging-logstash-03

https://gerrit.wikimedia.org/r/1014664

Cookbook cookbooks.sre.hosts.reimage was started by cwhite@cumin2002 for host logging-hd2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by cwhite@cumin2002 for host logging-hd2001.codfw.wmnet with OS bookworm executed with errors:

  • logging-hd2001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" logging-hd2001.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by cwhite@cumin2002 for host logging-hd2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by cwhite@cumin2002 for host logging-hd2001.codfw.wmnet with OS bookworm executed with errors:

  • logging-hd2001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" logging-hd2001.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by cwhite@cumin2002 for host logging-hd2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by cwhite@cumin2002 for host logging-hd2001.codfw.wmnet with OS bookworm completed:

  • logging-hd2001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404022205_cwhite_3900766_logging-hd2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage was started by cwhite@cumin2002 for host logging-hd2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by cwhite@cumin2002 for host logging-hd2003.codfw.wmnet with OS bookworm executed with errors:

  • logging-hd2003 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" logging-hd2003.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by cwhite@cumin2002 for host logging-hd2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by cwhite@cumin2002 for host logging-hd2002.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by cwhite@cumin2002 for host logging-hd2003.codfw.wmnet with OS bookworm completed:

  • logging-hd2003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404030003_cwhite_4011899_logging-hd2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by cwhite@cumin2002 for host logging-hd2002.codfw.wmnet with OS bookworm executed with errors:

  • logging-hd2002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" logging-hd2002.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by cwhite@cumin2002 for host logging-hd2002.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by cwhite@cumin2002 for host logging-hd2002.codfw.wmnet with OS bookworm completed:

  • logging-hd2002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404030119_cwhite_4083457_logging-hd2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB