Put centrallog1002 in service
Closed, ResolvedPublic

Description

Host was installed/racked in T313858: Q1:rack/setup/install centrallog1002

  • Apply partman standard software raid recipe #882718
    • Reimage centrallog1002
  • Apply role::syslog::centralserver on centrallog1002 #881939
  • Add centrallog1002 to the kafka-jumbo allow list #882747
  • Add centrallog1002 to eqiad anycast_neighbors #882724
  • Sync centrallog1001 to centrallog1002 #882760
    • Show transfer progress when using quickdatacopy #889239
  • Add centrallog1002 as eqiad TLS rsyslog destination #882761
  • Add centrallog1002 as logsource for logstash tests #882762

Issues found:

  • The sync took many hours to complete (11:48:23 elapsed)

Event Timeline

Change 881939 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] centrallog: apply role::syslog::centralserver on centrallog instances

https://gerrit.wikimedia.org/r/881939

It looks like the host hasn't been provisioned with the correct partman recipe: some drives are not partitioned (sdc and sdd in the output below).

centrallog1002:~$ sudo vgs
  VG  #PV #LV #SN Attr   VSize   VFree  
  vg0   1   3   0 wz--n- 893.84g 178.77g
centrallog1002:~$ cat /proc/partitions 
major minor  #blocks  name

   8        0  937692504 sda
   8        1     291840 sda1
   8        2  937399296 sda2
   8       32  937692504 sdc
   8       48  937692504 sdd
   8       16  937692504 sdb
   8       17     291840 sdb1
   8       18  937399296 sdb2
   9        0  937267200 md0
 253        0     999424 dm-0
 253        1   78123008 dm-1
 253        2  670683136 dm-2
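
The check above can be sketched in code. This is a minimal illustration, not part of the task's tooling: the function name `unpartitioned_disks` is hypothetical, and it assumes SCSI-style `sd*` device names as they appear in this output (NVMe or other naming schemes would need a different pattern). It parses `/proc/partitions` text and reports whole disks with no numbered partitions.

```python
import re

# Sample copied from the /proc/partitions output above.
SAMPLE = """\
major minor  #blocks  name

   8        0  937692504 sda
   8        1     291840 sda1
   8        2  937399296 sda2
   8       32  937692504 sdc
   8       48  937692504 sdd
   8       16  937692504 sdb
   8       17     291840 sdb1
   8       18  937399296 sdb2
   9        0  937267200 md0
 253        0     999424 dm-0
 253        1   78123008 dm-1
 253        2  670683136 dm-2
"""


def unpartitioned_disks(text):
    """Return whole sd* disks that have no numbered partitions (hypothetical helper)."""
    names = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have four fields and start with a numeric major number;
        # this skips the header row and the blank line.
        if len(fields) == 4 and fields[0].isdigit():
            names.append(fields[3])

    # Whole disks are names like "sda"; partitions add a numeric suffix ("sda1").
    disks = [n for n in names if re.fullmatch(r"sd[a-z]+", n)]
    return sorted(
        d for d in disks
        if not any(re.fullmatch(re.escape(d) + r"\d+", n) for n in names)
    )


print(unpartitioned_disks(SAMPLE))  # ['sdc', 'sdd']
```

On a live host the same function could be fed the contents of `/proc/partitions` read from disk. A disk with no partitions is not necessarily unused (it could be given whole to LVM or RAID), so the result is a hint to investigate rather than proof of a wrong recipe.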

Thanks @fgiunchedi, I've sent patch #882718 to apply the standard partman software RAID recipe.

Change 882724 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/homer/public@master] centrallog1002: Add to eqiad anycast_neighbors

https://gerrit.wikimedia.org/r/882724

Change 882747 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] centrallog: Add centrallog1002 as Kafka broker

https://gerrit.wikimedia.org/r/882747

andrea.denisse changed the task status from Open to In Progress.Jan 23 2023, 9:49 PM

Change 882760 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] centrallog: Sync centrallog1001 to centrallog1002

https://gerrit.wikimedia.org/r/882760

Change 882761 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] rsyslog: Add centrallog1002 as eqiad TLS rsyslog destination

https://gerrit.wikimedia.org/r/882761

Change 882762 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] logstash: Add centrallog1002 as logsource for logstash tests

https://gerrit.wikimedia.org/r/882762

Change 882718 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] centrallog: Apply partman standard software raid recipe

https://gerrit.wikimedia.org/r/882718

Change 882718 merged by Andrea Denisse:

[operations/puppet@production] centrallog: Apply partman standard software raid recipe

https://gerrit.wikimedia.org/r/882718

Cookbook cookbooks.sre.hosts.reimage was started by denisse@cumin1001 for host centrallog1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by denisse@cumin1001 for host centrallog1002.eqiad.wmnet with OS bullseye executed with errors:

  • centrallog1002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • The reimage failed, see the cookbook logs for the details

Change 882762 merged by Cwhite:

[operations/puppet@production] logstash: Add centrallog1002 as logsource for logstash tests

https://gerrit.wikimedia.org/r/882762

Cookbook cookbooks.sre.hosts.reimage was started by denisse@cumin1001 for host centrallog1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by denisse@cumin1001 for host centrallog1002.eqiad.wmnet with OS bullseye executed with errors:

  • centrallog1002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by denisse@cumin1001 for host centrallog1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by denisse@cumin1001 for host centrallog1002.eqiad.wmnet with OS bullseye completed:

  • centrallog1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202301251920_denisse_3238543_centrallog1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 881939 merged by Andrea Denisse:

[operations/puppet@production] centrallog: apply role::syslog::centralserver on centrallog instances

https://gerrit.wikimedia.org/r/881939

Change 882724 merged by Andrea Denisse:

[operations/homer/public@master] centrallog1002: Add to eqiad anycast_neighbors

https://gerrit.wikimedia.org/r/882724

Change 882747 merged by Andrea Denisse:

[operations/puppet@production] centrallog: Add centrallog1002 to the kafka-jumbo allow list

https://gerrit.wikimedia.org/r/882747

Change 882760 merged by Andrea Denisse:

[operations/puppet@production] centrallog: Sync centrallog1001 to centrallog1002

https://gerrit.wikimedia.org/r/882760

Change 882761 merged by Andrea Denisse:

[operations/puppet@production] rsyslog: Add centrallog1002 as eqiad TLS rsyslog destination

https://gerrit.wikimedia.org/r/882761

Change 887812 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] centrallog: Add centrallog1001 to quickdatacopy allow hosts

https://gerrit.wikimedia.org/r/887812

Change 887812 abandoned by Andrea Denisse:

[operations/puppet@production] centrallog: Enable auto_ferm_ipv6 to quickdatacopy

Reason:

Closing in favor of achieving the same result using Hiera lookups

https://gerrit.wikimedia.org/r/887812

Change 889231 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] quickdatacopy: Add option to show progress during transfer

https://gerrit.wikimedia.org/r/889231

Change 889239 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] centrallog: Show transfer progress when using quickdatacopy

https://gerrit.wikimedia.org/r/889239

Change 889239 merged by Andrea Denisse:

[operations/puppet@production] centrallog: Show transfer progress when using quickdatacopy

https://gerrit.wikimedia.org/r/889239

andrea.denisse updated the task description.