
Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of kafka-jumbo101[0-5]

Hostname / Racking / Installation Details

Hostnames: kafka-jumbo101[0-5]
Racking Proposal: Please spread these as evenly as possible between rows, taking into account where kafka-jumbo100[6-9] currently are too.
Networking Setup: Number of Connections: 1, Speed: 10G, Vlan: Private, AAAA records: Y
Partitioning/RAID: HW RAID: Y, Partman recipe and/or desired RAID level: reuse-parts.cfg partman/custom/reuse-kafka-jumbo.cfg (see how the existing kafka-jumbo hosts are already configured in netboot.cfg; a sketch of the mapping is shown below)
OS Distro: Bullseye
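
For reference, a rough sketch of what the netboot.cfg entry might look like, assuming the same case-statement format used for the existing kafka-jumbo hosts (the host pattern shown here is illustrative, and the recipe names are taken from the line above; check against the current file in the install_server module of operations/puppet):

  # netboot.cfg entry (sketch, not verbatim)
  kafka-jumbo101[0-5]) echo reuse-parts.cfg partman/custom/reuse-kafka-jumbo.cfg ;; \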

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below (a sketch of the corresponding commands follows the checklists).

kafka-jumbo1010:
  • - receive in system on procurement task T303447 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.cfg and site.pp role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
kafka-jumbo1011:
  • - receive in system on procurement task T303447 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.cfg and site.pp role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
kafka-jumbo1012:
  • - receive in system on procurement task T303447 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.cfg and site.pp role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
kafka-jumbo1013:
  • - receive in system on procurement task T303447 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.cfg and site.pp role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
kafka-jumbo1014:
  • - receive in system on procurement task T303447 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.cfg and site.pp role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
kafka-jumbo1015:
  • - receive in system on procurement task T303447 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.cfg and site.pp role(insetup) (cp systems use role(insetup::nofirm) instead).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
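
A hedged sketch of the commands behind these checklist steps, run from a cumin host with kafka-jumbo1010 as an example (the exact flags, device queries, and commit messages are assumptions and may differ from what was actually used):

  # Propagate the mgmt and production DNS records added in Netbox.
  sudo cookbook sre.dns.netbox "Add DNS records for kafka-jumbo101[0-5]"

  # Review and push the Netbox-generated switch port configuration;
  # Homer takes a device query, an action (diff/commit), and a message.
  homer 'lsw1-e1-eqiad*' diff
  homer 'lsw1-e1-eqiad*' commit "Configure switch port for kafka-jumbo1010"

  # Once the netboot.cfg and site.pp changes are merged, reimage the host.
  # Additional flags (e.g. for first-time installs) may be required.
  sudo cookbook sre.hosts.reimage --os bullseye kafka-jumbo1010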

Event Timeline


@Ottomata @wiki_willy We are at full capacity in the 10G racks in rows A, B, C, and D. Taking into account kafka-jumbo100[6-9], three of those are already in one row.

My proposal is (server: rack):
kafka-jumbo1010: E1
kafka-jumbo1011: E2
kafka-jumbo1012: E3
kafka-jumbo1013: F1
kafka-jumbo1014: F2
kafka-jumbo1015: F3

Hi, I think that is fine. Spreading over at least 3 rows is pretty good. 1007 is in row C, and 1008 and 1009 are in row D.

kafka-jumbo1010: E1 U17 Port 17 CableID 20220240
kafka-jumbo1011: E2 U19 Port 19 CableID 20220239
kafka-jumbo1012: E3 U19 Port 19 CableID 20220231
kafka-jumbo1013: F1 U17 Port 17 CableID 20220213
kafka-jumbo1014: F2 U19 Port 19 CableID 20220232
kafka-jumbo1015: F3 U19 Port 19 CableID 20220214

RobH mentioned this in Unknown Object (Task). Jul 27 2022, 4:24 PM
RobH added a parent task: Unknown Object (Task). Jul 28 2022, 2:59 PM
Jclark-ctr updated the task description.
Jclark-ctr added a subscriber: Cmjohnson.
Netbox updated with the new hosts:
 kafka-jumbo1010: E1 U17 Port 17 CableID 20220240
 kafka-jumbo1011: E2 U19 Port 19 CableID 20220239
 kafka-jumbo1012: E3 U19 Port 19 CableID 20220231
 kafka-jumbo1013: F1 U17 Port 17 CableID 20220213
 kafka-jumbo1014: F2 U19 Port 19 CableID 20220232
 kafka-jumbo1015: F3 U19 Port 19 CableID 20220214

Not sure what happened, but there are many outstanding diffs on switches:

[edit interfaces xe-0/0/17]
-   description "kafka-jumbo1010 {#20220240}";
+   description "WMF10609 {#20220240}";

---------------
Changes for 1 devices: ['lsw1-e2-eqiad.mgmt.eqiad.wmnet']

[edit interfaces xe-0/0/19]
-   description "kafka-jumbo1011 {#20220239}";
+   description "WMF10610 {#20220239}";

---------------
Changes for 1 devices: ['lsw1-e3-eqiad.mgmt.eqiad.wmnet']

[edit interfaces xe-0/0/19]
-   description "kafka-jumbo1012 {#20220231}";
+   description "WMF10611 {#20220231}";

---------------
Changes for 1 devices: ['lsw1-f1-eqiad.mgmt.eqiad.wmnet']

[edit interfaces xe-0/0/17]
-   description "kafka-jumbo1013 {#20220213}";
+   description "WMF10606 {#20220213}";

---------------
Changes for 1 devices: ['lsw1-f2-eqiad.mgmt.eqiad.wmnet']

[edit interfaces xe-0/0/19]
-   description "kafka-jumbo1014 {#20220232}";
+   description "WMF10607 {#20220232}";

---------------
Changes for 1 devices: ['lsw1-f3-eqiad.mgmt.eqiad.wmnet']

[edit interfaces xe-0/0/19]
-   description "kafka-jumbo1015 {#20220214}";
+   description "WMF10608 {#20220214}";

And looking at for example https://netbox.wikimedia.org/dcim/devices/4310/changelog/ vs. https://netbox.wikimedia.org/dcim/devices/4449/changelog/ there has been lots of manual edits on those devices.
So I'm not sure what's correct, but we shouldn't have un-named hosts (e.g. WMF10609) with their switch port configured.

Additionally, they're triggering the Netbox report test_mgmt_dns_hostname:

WMF10609 (WMF10609) Invalid management interface DNS (kafka-jumbo1010.mgmt.eqiad.wmnet != WMF10609.mgmt.eqiad.wmnet)
WMF10610 (WMF10610) Invalid management interface DNS (kafka-jumbo1011.mgmt.eqiad.wmnet != WMF10610.mgmt.eqiad.wmnet)
WMF10611 (WMF10611) Invalid management interface DNS (kafka-jumbo1012.mgmt.eqiad.wmnet != WMF10611.mgmt.eqiad.wmnet)
WMF10606 (WMF10606) Invalid management interface DNS (kafka-jumbo1013.mgmt.eqiad.wmnet != WMF10606.mgmt.eqiad.wmnet)
WMF10607 (WMF10607) Invalid management interface DNS (kafka-jumbo1014.mgmt.eqiad.wmnet != WMF10607.mgmt.eqiad.wmnet)
WMF10608 (WMF10608) Invalid management interface DNS (kafka-jumbo1015.mgmt.eqiad.wmnet != WMF10608.mgmt.eqiad.wmnet)

In the short term this should be fixed ASAP.

For the longer term, I'd like to know how to prevent such events from happening again. Could you detail what led to those manual changes and, if possible, share your take on what should change in our process or automation?
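
A minimal sketch of the short-term fix, assuming the devices in Netbox are first renamed back to their kafka-jumbo hostnames (or otherwise corrected), after which the pending switch diffs should disappear:

  # Re-run Homer against each affected leaf switch and confirm the diff
  # is now empty before committing anything.
  homer 'lsw1-e2-eqiad*' diff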

ayounsi raised the priority of this task from Medium to High. Oct 18 2022, 9:31 AM

@ayounsi these servers were removed from the racks because Dell had sent the wrong configuration. New servers were installed and took over those names; after removing the information in Netbox I did run the cookbook. They were not in decom status, so that script was not run, but it might have prevented those alerts.

I don't fully understand your comment above.

I see that the alerts above are gone, but are now replaced with:

test_enabled_not_connected
xe-0/0/17 	Interface enabled but not connected on lsw1-e1-eqiad (WMF11404)
xe-0/0/19 	Interface enabled but not connected on lsw1-e2-eqiad (WMF11403)
xe-0/0/19 	Interface enabled but not connected on lsw1-e3-eqiad (WMF11405)
xe-0/0/17 	Interface enabled but not connected on lsw1-f1-eqiad (WMF11407)
xe-0/0/19 	Interface enabled but not connected on lsw1-f2-eqiad (WMF11408)
xe-0/0/19 	Interface enabled but not connected on lsw1-f3-eqiad (WMF11409)

And Homer still has outstanding diffs, for example:

Changes for 1 devices: ['lsw1-e1-eqiad.mgmt.eqiad.wmnet']

[edit interfaces xe-0/0/17]
-   description "kafka-jumbo1010 {#20220240}";

Hi, checking in, any updates here?

Thank you!

Also CC @BTullis and @Stevemunene

These servers are racked and waiting to be imaged. Not sure if @Papaul or @RobH can assist with imaging them and getting them handed over.

@Jclark-ctr - I can take on the initial server imaging, if that helps you out. I know that you've got SLAs in place and whatnot, but from our perspective I don't think there's any specific urgency for these servers to be handed over. Let me know if you'd like any help or input.

@BTullis thank you, I will take over this task.

@Jclark-ctr I have no node in Netbox with the name kafka-jumbo1013, but I do have a node wmf10606 with purchase date 2022-06-07 that is set to offline in Netbox. Can you please track down where kafka-jumbo1013 is and update Netbox for me?

Thanks

@Ottomata @BTullis what HW RAID are we using for those servers?
Thanks

Papaul updated the task description.

@Papaul corrected Netbox; it was entered under asset tag WMF10621.

Papaul updated the task description.

@Ottomata @BTullis what HW RAID are we using for those servers?
Thanks

Hi @Papaul - could we have the following RAID configuration please?

  • Hardware RAID 1 for the O/S on the 2 x 480 GB SSDs in the flex bay
  • Hardware RAID10 for the 12 x 4 TB nearline HDDs

image.png (122×437 px, 15 KB)

If I remember correctly, since this is one of the H750 RAID controllers, we need to define the RAID10 first, followed by the RAID1, in order for the O/S to be installed to /dev/sda.

I think that the partman/custom/kafka-jumbo.cfg partition recipe should work for these hosts to set up the partitioning, but please feel free to reach out early if there are any problems with it. Thanks.
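
For reference, a quick post-install check of the resulting layout (a sketch; it assumes the RAID1 virtual disk is exposed as /dev/sda and the RAID10 virtual disk as /dev/sdb, as intended above):

  # The ~446 GB RAID1 volume should hold the OS partitions on sda, and
  # the ~22 TB RAID10 volume should end up mounted on /srv.
  lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
  df -h / /srv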

Change 859107 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/puppet@production] Add kafka-jumbo101[0-5] to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/859107

Change 859107 merged by Papaul:

[operations/puppet@production] Add kafka-jumbo101[0-5] to site.pp and netboot.cfg

https://gerrit.wikimedia.org/r/859107

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kafka-jumbo1010.eqiad.wmnet with OS bullseye

@BTullis thanks for the update. It looks like we have an issue with the partman recipe; can you please take a look and let me know? Thanks.

────────────────────┤ [!] Partition disks ├───────────

  383.6 GB is too small
  You asked for 383.6 GB to be used for guided partitioning, but the
  selected partitioning recipe requires at least 6.0 TB.

      <Go Back>                                        <Continue>

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-jumbo1010.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-jumbo1010 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

@Ottomata the SSD looks like the first disk, /dev/sda; below is what I have:

Virtual Disk 238: RAID1, 446.625GB, Ready
Virtual Disk 239: RAID10, 21.829TB, Ready

Papaul removed Papaul as the assignee of this task. Dec 1 2022, 4:32 PM

Is it OK if I have a crack at this @Papaul?

(Sorry, I missed your ping to me above)

WARNING: This comment relates to a different host from the one originally mentioned in this ticket, namely kafka-stretch2002. However, the cause seems to be a problem that affects many of these new hosts with the H750 RAID controller and two different tiers of drives, so I'm leaving the comment in place with this proviso.

OK, I'm starting to look into this now.

First of all, checking the two servers in codfw we can see that kafka-stretch2001 is set up correctly:

image.png (338×931 px, 77 KB)

However, kafka-stretch2002 has the drives the wrong way around. Although the large drives are assigned to /srv, which is correct, the order of the drives is incorrect, so that /dev/sdb is the boot drive.

image.png (337×956 px, 77 KB)

I'm sure I wrote a note in Wikitech about how I changed this boot order previously, so I'll search for that now.
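
A minimal check of the drive ordering from the installed OS, assuming the controller exposes the two virtual disks as sda and sdb:

  # On a correctly ordered host the smaller RAID1 virtual disk is sda;
  # on kafka-stretch2002 the sizes appear swapped.
  lsblk -d -o NAME,SIZE,MODEL /dev/sda /dev/sdb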

Icinga downtime and Alertmanager silence (ID=0e9efa15-8be5-4b76-ad0c-4cdddb24836e) set by btullis@cumin1001 for 3:00:00 on 1 host(s) and their services with reason: Accessing BIOS on kafka-stretch2002

kafka-stretch2002.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=2a6ae5e9-eb12-4391-a29f-7bd50197f53a) set by btullis@cumin1001 for 3:00:00 on 1 host(s) and their services with reason: Accessing BIOS on kafka-stretch2001

kafka-stretch2001.codfw.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch2002.codfw.wmnet with OS bullseye


Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch2002.codfw.wmnet with OS bullseye executed with errors:

  • kafka-stretch2002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-stretch1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1010.eqiad.wmnet with OS bullseye completed:

  • kafka-jumbo1010 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212151645_btullis_1840258_kafka-jumbo1010.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

I have written up a note about these H750 based controllers here: https://wikitech.wikimedia.org/wiki/Raid_setup#Hosts_with_PERC_H750_RAID_Controllers

However, whilst carrying out these installs recently I have noticed that the behaviour of the cards isn't very predictable. Sometimes I have had to enter the same RAID configuration twice before it started working correctly.

I have noticed that there is a recent firmware upgrade for these PERC H750 cards that is marked Urgent by Dell and was released on November 14th.
https://www.dell.com/support/home/en-uk/drivers/driversdetails?driverid=n31cj

The release notes mention at least a couple of potential fixes for this kind of behaviour, so I'm going to go ahead with a firmware upgrade on kafka-jumbo1011 before proceeding any further with the installation.

image.png (869×1 px, 237 KB)
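
A hedged way to confirm the controller firmware level from the OS afterwards, assuming Dell's perccli64 utility is installed on the host (the iDRAC UI shows the same information):

  # The controller summary for /c0 includes the running PERC H750 firmware version.
  sudo perccli64 /c0 show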

The firmware upgrade was successful. I inadvertently upgraded the iDRAC as well to version 6 but downgraded it again after I read this:
https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Urgent_Firmware_Revision_Notices:

iDrac shouldn't upgrade to 6.00.00.00 (breaks https mgmt access), cap at 5.10.30.00.
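
To double-check that the iDRAC is back on the 5.10.x line after the downgrade, something like the following should work (a sketch; the one-shot invocation over SSH to the mgmt interface is an assumption, and racadm getversion can equally be run from the iDRAC's own shell):

  # getversion reports the iDRAC firmware and BIOS versions.
  ssh root@kafka-jumbo1011.mgmt.eqiad.wmnet racadm getversion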

Next I have:

  • cleared the RAID controller configuration
  • configured the RAID10 volume
  • configured the RAID1 volume
  • configured the RAID1 volume to be the boot volume

Now I'm starting the reimage cookbook for kafka-jumbo1011:
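
That is, roughly (a sketch assuming the same invocation as the earlier runs; extra flags may have been used):

  sudo cookbook sre.hosts.reimage --os bullseye kafka-jumbo1011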

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1011.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1011.eqiad.wmnet with OS bullseye completed:

  • kafka-jumbo1011 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212201209_btullis_3068787_kafka-jumbo1011.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1012.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1012.eqiad.wmnet with OS bullseye completed:

  • kafka-jumbo1012 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212201438_btullis_3097841_kafka-jumbo1012.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1013.eqiad.wmnet with OS bullseye completed:

  • kafka-jumbo1013 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212220938_btullis_3559741_kafka-jumbo1013.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1014.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1014.eqiad.wmnet with OS bullseye completed:

  • kafka-jumbo1014 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212221132_btullis_3581519_kafka-jumbo1014.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1015.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1015.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-jumbo1015 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1015.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1015.eqiad.wmnet with OS bullseye executed with errors:

  • kafka-jumbo1015 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1015.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1015.eqiad.wmnet with OS bullseye completed:

  • kafka-jumbo1015 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202212221651_btullis_3641511_kafka-jumbo1015.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)
BTullis updated the task description.
BTullis moved this task from Blocked to Racking Tasks on the ops-eqiad board.