
cloudcontrol2004-dev: make it a cloudlb backend
Closed, Resolved, Public

Description

This task is to track the work to make cloudcontrol2004-dev a cloudlb backend.

This host is a bit special: it is racked in codfw D1 and needs to be relocated and connected to cloudsw1-b1-codfw.

Steps are:

  • decommission
  • re-rack / reconnect to cloudsw
  • update Netbox and follow the other steps described in the wikitech page
  • reimage

Event Timeline

aborrero triaged this task as Medium priority. (May 31 2023, 9:01 AM)
aborrero moved this task from Backlog to Next on the User-aborrero board.

Change 924899 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] codfw1dev: remove traces of cloudcontrol2004-dev.wikimedia.org

https://gerrit.wikimedia.org/r/924899

cookbooks.sre.hosts.decommission executed by aborrero@cumin2002 for hosts: cloudcontrol2004-dev.wikimedia.org

  • cloudcontrol2004-dev.wikimedia.org (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 924899 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] codfw1dev: remove traces of cloudcontrol2004-dev.wikimedia.org

https://gerrit.wikimedia.org/r/924899

@aborrero server has been re-racked in B1 - U21 and connected to the cloudsw-b1 switch on port ge-1/0/21.

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye

Change 925683 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudcontrol2004-dev: put it into service with new IP address

https://gerrit.wikimedia.org/r/925683

Change 925683 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudcontrol2004-dev: put it into service with new IP address

https://gerrit.wikimedia.org/r/925683

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudcontrol2004-dev (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye

@aborrero server has been re-racked in B1 - U21 and connected to the cloudsw-b1 switch on port ge-1/0/21.

Hi @Jhancock.wm. Unfortunately, we're having some trouble getting this server reimaged. We have it set up correctly in Netbox, but when we try to bring it up, switch port 21 (ge-0/0/21) stays hard down. We upgraded the firmware on the 1G embedded NIC from 21.60.2 to 22.31.6, but the symptoms remain the same.

Can you check the cabling and switch port to make sure it all looks OK? If it does, it might be worth swapping out the cable and the copper SFP module; perhaps one or the other is faulty.

Thanks.

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudcontrol2004-dev (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

@cmooney That was my bad. I didn't get it seated all the way. You should be good now!

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudcontrol2004-dev (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye

In the Debian installer, I'm getting a failure related to the RAID:

image.png (470×660 px, 36 KB)

Upon investigation I found that:

Jun  1 15:02:54 partman-auto-raid: mdadm: cannot open /dev/sda2: No such file or directory
Jun  1 15:02:54 partman-auto-raid: Error creating array /dev/md0

It is true that no such device exists:

~ # ls /dev/s*
/dev/sda       /dev/sdb1      /dev/sdc       /dev/stderr    /dev/stdout
/dev/sdb       /dev/sdb2      /dev/snapshot  /dev/stdin
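
The same comparison can be scripted when the listing gets longer. This is only a minimal sketch: `missing_devices` is a hypothetical helper, and the expected RAID members are an assumption (`/dev/sda2` comes from the mdadm error above, `/dev/sdb2` is a guess at the second member, not taken from the actual partman recipe).

```python
# Device listing pasted from the installer shell (output of `ls /dev/s*`).
LISTING = """\
/dev/sda       /dev/sdb1      /dev/sdc       /dev/stderr    /dev/stdout
/dev/sdb       /dev/sdb2      /dev/snapshot  /dev/stdin
"""


def missing_devices(listing, expected):
    """Return the expected device nodes that are absent from an `ls` listing."""
    present = set(listing.split())
    return [dev for dev in expected if dev not in present]


# Assumed RAID members: /dev/sda2 is named in the mdadm error,
# /dev/sdb2 is a guess for illustration only.
EXPECTED = ["/dev/sda2", "/dev/sdb2"]

print(missing_devices(LISTING, EXPECTED))
```

With the listing above, only `/dev/sda2` is reported missing, which matches the mdadm error.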

More data. The disk seems detected by the installer at boot:

# grep -i sda /var/log/syslog
Jun  1 15:01:49 kernel: [   12.571010] sd 0:0:0:0: [sda] Attached SCSI removable disk

However:

# lvmdiskscan 
  /dev/sdb1 [     285.00 MiB] 
  /dev/sdb2 [      <1.75 TiB] 
  /dev/sdc  [      <1.75 TiB] 
  1 disk
  2 partitions
  0 LVM physical volume whole disks
  0 LVM physical volumes

It's odd, fdisk only detects two drives, but they are sdb and sdc:

/var/log # fdisk -l 
Disk /dev/sdb: 1.75 TiB, 1920383410176 bytes, 3750748848 sectors
Disk model: MZ7KH1T9HAJR0D3 
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 441D646F-CE36-4D44-9B83-0CFDB1D9E951

Device      Start        End    Sectors  Size Type
/dev/sdb1    2048     585727     583680  285M BIOS boot
/dev/sdb2  585728 3750748159 3750162432  1.7T Linux RAID


Disk /dev/sdc: 1.75 TiB, 1920383410176 bytes, 3750748848 sectors
Disk model: MZ7KH1T9HAJR0D3 
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Presumably they are the two that should be sda and sdb, which is what partman is looking for. There is an sda on the system, but I'm not sure what it is; it might be the last entry here:

/var/log # lsscsi
[4:0:0:0]       disk    ATA     MZ7KH1T9HAJR0D3 HF56
[3:0:0:0]       disk    ATA     MZ7KH1T9HAJR0D3 HF56
[0:0:0:0]       disk    Linux   MAS022  0399

NOTE: I had to run anna-install fdisk-udeb in the Debian installer rescue environment to make fdisk available.
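
To pick the odd entry out of the lsscsi output without eyeballing it, something like the following could work. It is a rough sketch: `non_ata_disks` is a hypothetical helper, it assumes the output is pasted in as text, and it assumes a plain whitespace split is good enough (model strings containing spaces would break it).

```python
# lsscsi output pasted from the installer shell.
LSSCSI = """\
[4:0:0:0]       disk    ATA     MZ7KH1T9HAJR0D3 HF56
[3:0:0:0]       disk    ATA     MZ7KH1T9HAJR0D3 HF56
[0:0:0:0]       disk    Linux   MAS022  0399
"""


def non_ata_disks(output):
    """Return (address, vendor, model) for disk entries whose vendor is not ATA."""
    found = []
    for line in output.splitlines():
        fields = line.split()
        # Skip blank or short lines and anything that is not a disk entry.
        if len(fields) < 4 or fields[1] != "disk":
            continue
        address, _, vendor, model = fields[:4]
        if vendor != "ATA":
            found.append((address, vendor, model))
    return found


print(non_ata_disks(LSSCSI))
```

Here the only entry flagged is the one at `[0:0:0:0]` with vendor `Linux`, which is the candidate for the unexpected sda.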

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudcontrol2004-dev (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye

Based on this, I believe the last device may have been a virtual USB device. I connected to the iDRAC web GUI and it said "virtual media connected", so I clicked to disconnect, after which that device is no longer showing in lsscsi.

Hopefully that means if we try again the other two drives will become sda and sdb, and the partman recipe will work.
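
The reasoning behind that expectation can be illustrated with a toy model. This is a simplification, not how the kernel is guaranteed to behave: it assumes sdX letters are handed out strictly in discovery order, and `assign_sd_names` and the disk labels are made up for illustration.

```python
from string import ascii_lowercase


def assign_sd_names(disks_in_probe_order):
    """Map /dev/sdX names to disks, assuming letters follow discovery order."""
    return {
        f"/dev/sd{letter}": disk
        for letter, disk in zip(ascii_lowercase, disks_in_probe_order)
    }


# With the iDRAC virtual media attached (SCSI host 0 in the lsscsi output),
# the virtual device is discovered first and takes the first letter.
with_virtual = assign_sd_names(["virtual media", "SSD 1", "SSD 2"])

# With the virtual media disconnected, the two SSDs move up one letter each.
without_virtual = assign_sd_names(["SSD 1", "SSD 2"])

print(with_virtual)
print(without_virtual)
```

Under that assumption the first case matches what the installer showed (SSDs on sdb and sdc), and the second case gives the sda/sdb layout that partman is looking for.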

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudcontrol2004-dev (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin2002 for host cloudcontrol2004-dev.codfw.wmnet with OS bullseye completed:

  • cloudcontrol2004-dev (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202306011610_aborrero_3154891_cloudcontrol2004-dev.out, asking the operator what to do
    • First Puppet run failed and the operator skipped it
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually