
(Need By: TBD) rack/setup/install backup2003
Closed, Resolved, Public

Description

This task will track the racking, setup, and OS installation of backup2003

Hostname / Racking / Installation Details

Hostnames: backup2003
Racking Proposal: 10G, avoid sharing a rack (or row if possible) with backup200[12].
Networking/Subnet/VLAN/IP: 10G single production link, 1G mgmt link
Partitioning/Raid: backup1001 and backup1002 have software RAID1 for the OS SSDs and RAID6 for each array. If this is a new spec, just having the HDs in RAID6 would be enough. HW RAID requires manual DC ops setup; we can take care of the installation using the partman recipe (custom/backup-format.cfg).
OS Distro: buster

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

backup2003: C4 U3/U4 xe-4/0/2

  • - receive in system on procurement task T271231 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer to commit
    • end on-site specific steps
  • - operations/puppet update - this should include updates to install_server dhcp and netboot, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via wmf-auto-reimage or wmf-auto-reimage-host
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH added a parent task: Unknown Object (Task). Feb 8 2021, 8:09 PM
RobH mentioned this in Unknown Object (Task).
RobH moved this task from Backlog to Racking Tasks on the ops-codfw board.
RobH removed a subscriber: RobH.

This hasn't arrived yet, right? It would be useful to have one large capacity system for next week's test, but this is unrelated to the final, intended usage of backup1003. So I will work with some other system for that.

When processing this, please don't worry about the OS installation; I will take care of it. Please set up the network (and the parameters only you can get on puppet, such as the MAC) and the HW RAID with all HDs in RAID6.

@jcrespo why do you have to do the easy part? :)

So, ideally, the SSDs would be detected as sda and sdb, and then the recipe custom/backup-format.cfg would work, but I would like to install it myself in case it fails, so I don't waste your time.

Change 668129 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Add MAC adress for backup2003

https://gerrit.wikimedia.org/r/668129

Change 668129 merged by Papaul:
[operations/puppet@production] DHCP: Add MAC adress for backup2003

https://gerrit.wikimedia.org/r/668129

Change 668137 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] Add backup2003 to site.pp

https://gerrit.wikimedia.org/r/668137

Change 668137 merged by Papaul:
[operations/puppet@production] Add backup2003 to site.pp

https://gerrit.wikimedia.org/r/668137

Papaul updated the task description.
Papaul added a subscriber: Papaul.

@jcrespo all yours

Change 668346 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] install_Server: Apply custom/backup-format.cfg to backup[12]003

https://gerrit.wikimedia.org/r/668346

Change 668346 merged by Jcrespo:
[operations/puppet@production] install_Server: Apply custom/backup-format.cfg to backup[12]003

https://gerrit.wikimedia.org/r/668346

Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts:

backup2003.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202103040927_jynus_15216_backup2003_codfw_wmnet.log.

Completed auto-reimage of hosts:

['backup2003.codfw.wmnet']

Of which those FAILED:

['backup2003.codfw.wmnet']

Change 668352 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dhcp: Removing extra space after MAC address to discard boot issues

https://gerrit.wikimedia.org/r/668352

Change 668352 merged by Jcrespo:
[operations/puppet@production] dhcp: Removing extra space after MAC address to discard boot issues

https://gerrit.wikimedia.org/r/668352
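
The patch above removes a stray space after the MAC address in the DHCP configuration. A minimal sketch of a check that would flag that kind of entry before it is committed, assuming MAC addresses are written as six colon-separated hex octets; the helper name `check_mac` is hypothetical and not part of operations/puppet, and the example address is the client MAC that appears in the PXE output further down this task:

```python
import re

# Strict pattern: six colon-separated hex octets and nothing else.
MAC_RE = re.compile(r"[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2}){5}")


def check_mac(value):
    """Return a list of problems found in a MAC address string (empty if clean)."""
    problems = []
    # Whitespace around the value is reported separately from format errors.
    if value != value.strip():
        problems.append("leading or trailing whitespace")
    if not MAC_RE.fullmatch(value.strip()):
        problems.append("not six colon-separated hex octets")
    return problems


if __name__ == "__main__":
    print(check_mac("2c:ea:7f:94:71:5a"))    # clean entry
    print(check_mac("2c:ea:7f:94:71:5a "))   # trailing space, as in the patch
```

Whether the extra space was actually the cause of the boot failure is not established here; the commit message itself only says it was removed to rule it out.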

Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts:

backup2003.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202103040959_jynus_19993_backup2003_codfw_wmnet.log.

Hey, please help me,

The server doesn't boot with PXE:

Booting from BRCM MBA Slot 0400 v21.6.0

Broadcom UNDI PXE-2.1 v21.6.0
Copyright (C) 2000-2020 Broadcom Corporation
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.
PXE-E61: Media test failure, check cable
PXE-M0F: Exiting Broadcom PXE ROM.

Booting from Hard drive C:

No boot device available.
Current boot mode is set to BIOS.
Please ensure compatible bootable media is available.
Use the system setup program to change the boot mode as needed.

Strike F1 to retry boot, F2 for system setup, F10 for Lifecycle Controller, F11
for boot manager.
Note that in F2/F10/F11 cases a system reboot will be initiated

There are normally just a few reasons why that happens:

  • Network/netbox/dns misconfiguration
  • Cable disconnected (note I checked and it shows link/10Gb negotiated on the interface with the matching mac)
  • Configured to boot the wrong MAC (I checked and it matches the one configured)
  • Something wrong with the BIOS/hw

Completed auto-reimage of hosts:

['backup2003.codfw.wmnet']

Of which those FAILED:

['backup2003.codfw.wmnet']

Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts:

backup2003.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202103041640_jynus_24092_backup2003_codfw_wmnet.log.

PXE worked after I:

  1. Disabled PXE on the 1Gb device.
  2. Enabled PXE on the 10Gb device
  3. Rebooted
  4. After reboot, the 10Gb one appeared on the boot sequence.

Completed auto-reimage of hosts:

['backup2003.codfw.wmnet']

Of which those FAILED:

['backup2003.codfw.wmnet']

Sadly, while the installer now works correctly, booting from the local disk fails after the install, and the host falls back to network boot:

Booting from Hard drive C:

Booting from BRCM MBA Slot B300 v214.0.241.0

Broadcom UNDI PXE-2.1 v214.0.241.0
Copyright (C) 2000-2019 Broadcom Limited
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.

CLIENT MAC ADDR: 2C EA 7F 94 71 5A  GUID: 4C4C4544-005A-3110-8043-C2C04F384233
CLIENT IP: 10.192.32.35  MASK: 255.255.252.0  DHCP IP: 208.80.153.51
GATEWAY IP: 10.192.32.1
      
PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al

leading to an endless boot/restart cycle.

Either the BIOS has disabled booting from the SSDs, or the recipe is failing to partition the disks or install grub correctly.

Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts:

backup2003.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202103041738_jynus_1336_backup2003_codfw_wmnet.log.

Completed auto-reimage of hosts:

['backup2003.codfw.wmnet']

Of which those FAILED:

['backup2003.codfw.wmnet']

I found the issue. It was the same as with the NIC, but with the drives: only one disk can be set as "bootable", so the host was trying to boot from the HW RAID, not the SW RAID. I changed the "bootable" flag in the BIOS to the first SSD drive and it finally booted into the installed OS.

We will likely need to manually set the second SSD as "bootable" if the first disk of the SW RAID gets degraded. It is a bit silly to have only 1 bootable disk; even my PC's BIOS allows more than 1 disk in the boot list.
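
Since the "bootable" flag would have to be moved by hand if the first SSD fails, it may help to notice a degraded SW RAID early. A minimal sketch, assuming the OS mirror is a Linux md array and that its state is read from text in the usual /proc/mdstat format; the function name and the sample input are illustrative and were not taken from backup2003:

```python
import re


def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose status line shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # Array header line, e.g. "md0 : active raid1 sdb1[1] sda1[0]"
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status line, e.g. "... [2/2] [UU]" or "... [2/1] [U_]"
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


# Illustrative sample only: md0 is healthy, md1 has lost one member.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      1000000 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[0]
      2000000 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a real host the input would come from reading /proc/mdstat; a non-empty result would be the cue to check which SSD the BIOS is currently set to boot from.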

Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts:

backup2003.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202103041822_jynus_10015_backup2003_codfw_wmnet.log.

Completed auto-reimage of hosts:

['backup2003.codfw.wmnet']

and were ALL successful.

ssh backup2003.codfw.wmnet
Linux backup2003 4.19.0-14-amd64 #1 SMP Debian 4.19.171-2 (2021-01-30) x86_64
Debian GNU/Linux 10 (buster)
backup2003 is a Host being setup for later application of a role (insetup)

@jcrespo what you can try is to first create a HW RAID 0 on the first SSD, then another HW RAID 0 on the second SSD. Once that is done, create a HW RAID 6 on the other 24 HDDs. In this case the OS will see the 2 SSDs as sda and sdb.
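
As a rough illustration of the layout suggested above, the usable capacity of the 24-disk RAID 6 can be worked out with the standard rule that two disks' worth of space goes to parity. This is a sketch only: the per-disk size is not stated in this task, so the result is expressed in units of one disk, and the function name is hypothetical.

```python
def raid6_usable(disks, disk_size):
    """Usable capacity of a RAID 6 array: two disks' worth is used for parity."""
    # Most controllers require at least 4 member disks for RAID 6.
    if disks < 4:
        raise ValueError("RAID 6 needs at least 4 disks")
    return (disks - 2) * disk_size


if __name__ == "__main__":
    # Per-disk size is not given in the task, so pass 1 to get the
    # answer in "disks' worth" of capacity.
    print(raid6_usable(24, 1))  # 22
```

The two single-disk RAID 0 volumes for the SSDs add no redundancy on their own; in this proposal the redundancy for the OS would still come from the software RAID1 on top of them.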

It was not an OS issue (the OS already saw the disks as sda and sdb), but a BIOS/boot issue, fixed with T274185#6883969.

Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts:

backup2003.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202103050826_jynus_26490_backup2003_codfw_wmnet.log.

Completed auto-reimage of hosts:

['backup2003.codfw.wmnet']

and were ALL successful.