
(Need by: TBD) rack/setup/install cloudelastic100[56]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of cloudelastic100[56].

Hostname / Racking / Installation Details

Hostnames: cloudelastic100[56]
Racking Proposal: These servers should be spread evenly across rows, alongside cloudelastic100[1-4]
Current racking:
cloudelastic1001: A2
cloudelastic1002: B2
cloudelastic1003: C2
cloudelastic1004: D2

New servers (proposal; any placement is fine as long as the two servers are not in the same row):
cloudelastic1005: A
cloudelastic1006: B
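
The racking constraint above can be checked mechanically. The following is a minimal sketch, not part of any existing tooling: the helper names (`row_of`, `rows_are_distinct`, `hosts_per_row`) are hypothetical, and it assumes the row is the leading letter of a rack label such as `A2`.

```python
from collections import Counter

# Rack labels copied from the task description; the helpers are hypothetical.
EXISTING = {
    "cloudelastic1001": "A2",
    "cloudelastic1002": "B2",
    "cloudelastic1003": "C2",
    "cloudelastic1004": "D2",
}
PROPOSED = {"cloudelastic1005": "A", "cloudelastic1006": "B"}


def row_of(rack: str) -> str:
    """Return the row letter, assuming it is the first character of the label."""
    return rack.strip()[0].upper()


def rows_are_distinct(placement: dict) -> bool:
    """True when no two hosts in the placement share a row."""
    rows = [row_of(rack) for rack in placement.values()]
    return len(rows) == len(set(rows))


def hosts_per_row(*placements: dict) -> dict:
    """Count hosts per row across one or more placements."""
    counts = Counter()
    for placement in placements:
        counts.update(row_of(rack) for rack in placement.values())
    return dict(counts)


if __name__ == "__main__":
    print("new hosts in distinct rows:", rows_are_distinct(PROPOSED))
    print("hosts per row after racking:", hosts_per_row(EXISTING, PROPOSED))
```

With the proposal as written, rows A and B end up with two hosts each and rows C and D with one each.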

Networking/Subnet/VLAN/IP: 10G, single network port connection, same vlan as existing cloudelastic servers
Partitioning/Raid: RAID10 software: raid10-gpt-srv-lvm-ext4-6disks.cfg

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudelastic1005:

  • - receive in system on procurement task T233720
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

cloudelastic1006:

  • - receive in system on procurement task T233720
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged

Once the system(s) above have had all checkbox steps completed, this task can be resolved.

Event Timeline

RobH added a parent task: Unknown Object (Task).
RobH mentioned this in Unknown Object (Task).
Jclark-ctr added a subscriber: Jclark-ctr.

cloudelastic1005: rack A4, U34, WMF5337, switchport 29

cloudelastic1006: rack B4, U23, WMF5338, switchport 41

Change 591377 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns for cloudelastic100[56]

https://gerrit.wikimedia.org/r/591377

Change 591377 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns for cloudelastic100[56]

https://gerrit.wikimedia.org/r/591377

Change 592743 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding production dns for cloudelastic1005-6

https://gerrit.wikimedia.org/r/592743

Change 592743 merged by Cmjohnson:
[operations/dns@master] Adding production dns for cloudelastic1005-6

https://gerrit.wikimedia.org/r/592743

@Cmjohnson thanks a lot for the work! If you have time to prioritize these two nodes over the next few days, it would be super helpful (the elastic cluster is under some pressure during reindexing, and two more nodes will help a lot). I can take over from the OS install onward if you are busy!

Change 593613 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding netboot.cfg and dhcpd for cloudelastic100[56]

https://gerrit.wikimedia.org/r/593613

Change 593613 merged by Elukey:
[operations/puppet@production] Adding netboot.cfg and dhcpd for cloudelastic100[56]

https://gerrit.wikimedia.org/r/593613

Change 594093 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Assign role insetup to cloudelastic100[5,6]

https://gerrit.wikimedia.org/r/594093

Change 594093 merged by Elukey:
[operations/puppet@production] Assign role insetup to cloudelastic100[5,6]

https://gerrit.wikimedia.org/r/594093

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

cloudelastic1005.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/202005040735_elukey_97184_cloudelastic1005_wikimedia_org.log.
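
When comparing several reimage attempts, it can help to pull the fields out of the log filename. This is only a sketch: the field meanings (a `YYYYMMDDHHMM` timestamp, the user, a numeric run identifier, then the FQDN with dots replaced by underscores) are inferred from the paths in this task rather than from the script's documentation, and `parse_reimage_log` is a hypothetical name.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# Assumed layout: <YYYYMMDDHHMM>_<user>_<numeric id>_<fqdn with underscores>.log
LOG_NAME = re.compile(r"^(\d{12})_([^_]+)_(\d+)_(.+)\.log$")


def parse_reimage_log(path: str) -> dict:
    """Split a wmf-auto-reimage log path into its (assumed) components."""
    name = PurePosixPath(path).name
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected log name: {name}")
    stamp, user, run_id, host = match.groups()
    return {
        "started": datetime.strptime(stamp, "%Y%m%d%H%M"),
        "user": user,
        "run_id": int(run_id),
        # Assumes hostnames contain no underscores of their own.
        "host": host.replace("_", "."),
    }


if __name__ == "__main__":
    print(parse_reimage_log(
        "/var/log/wmf-auto-reimage/"
        "202005040735_elukey_97184_cloudelastic1005_wikimedia_org.log"
    ))
```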

Completed auto-reimage of hosts:

['cloudelastic1005.wikimedia.org']

Of which those FAILED:

['cloudelastic1005.wikimedia.org']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

cloudelastic1005.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/202005040736_elukey_97196_cloudelastic1005_wikimedia_org.log.

Completed auto-reimage of hosts:

['cloudelastic1005.wikimedia.org']

Of which those FAILED:

['cloudelastic1005.wikimedia.org']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

cloudelastic1006.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/202005040854_elukey_105244_cloudelastic1006_wikimedia_org.log.

Completed auto-reimage of hosts:

['cloudelastic1006.wikimedia.org']

Of which those FAILED:

['cloudelastic1006.wikimedia.org']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

cloudelastic1005.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/202005040911_elukey_107851_cloudelastic1005_wikimedia_org.log.

@Cmjohnson, @Jclark-ctr I am unable to DHCP 1005/1006: I don't see any DHCP REQUEST landing on either install1003 or install2003. I checked 1005's system config and noticed that the NIC config is the following:

  • NIC in Slot 2 Port 1 - B0:26:28:CD:07:C0 - Boot option PXE - No link speed registered
  • NIC in Slot 2 Port 2 - B0:26:28:CD:07:C1 - Boot option None - 10Gbps link speed registered

Boot is attempted only on the NIC in Slot 2 Port 1, but I think the other port is the one connected to the switch. I am not sure what to do in these cases: is it OK to set Boot Option PXE on the second port, or do we need to swap cables?
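
The mismatch described above (PXE enabled on a port with no link, link present on a port without PXE) can be expressed as a small check. This is an illustrative sketch only: the `NicPort` structure and `pxe_mismatches` helper are hypothetical, and the values are typed in by hand from the two bullet points rather than read from the host.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class NicPort:
    name: str
    mac: str
    boot_option: str           # "PXE" or "None", as shown in the system config
    link_gbps: Optional[int]   # None when no link speed is registered


# Values copied from the comment above.
PORTS = [
    NicPort("Slot 2 Port 1", "B0:26:28:CD:07:C0", "PXE", None),
    NicPort("Slot 2 Port 2", "B0:26:28:CD:07:C1", "None", 10),
]


def pxe_mismatches(ports: List[NicPort]) -> Tuple[List[str], List[str]]:
    """Return (PXE ports without link, linked ports without PXE)."""
    pxe_no_link = [p.name for p in ports
                   if p.boot_option == "PXE" and p.link_gbps is None]
    link_no_pxe = [p.name for p in ports
                   if p.boot_option != "PXE" and p.link_gbps is not None]
    return pxe_no_link, link_no_pxe


if __name__ == "__main__":
    pxe_no_link, link_no_pxe = pxe_mismatches(PORTS)
    print("PXE enabled but no link:", pxe_no_link)
    print("Link up but PXE disabled:", link_no_pxe)
```

Both lists being non-empty at once is the signature of a cable in the wrong port or of PXE configured on the wrong port.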

Completed auto-reimage of hosts:

['cloudelastic1005.wikimedia.org']

Of which those FAILED:

['cloudelastic1005.wikimedia.org']

@elukey Thanks, it looks like the DAC cable is in the wrong NIC port. This will require an on-site visit.

The DAC cable was in the wrong NIC port and has been switched. Confirmed with @elukey that it is working now.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

cloudelastic1005.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/202005051445_elukey_67550_cloudelastic1005_wikimedia_org.log.

15:03:13 | cloudelastic1005.wikimedia.org | WARNING: unable to verify that BIOS boot parameters are back to normal, got:
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 8000020000
 Boot Flags :
   - Boot Flag Valid
   - Options apply to only next boot
   - BIOS PC Compatible (legacy) boot
   - Boot Device Selector : No override
   - Console Redirection control : System Default
   - Lock Out Sleep Button
   - BIOS verbosity : Request console redirection be enabled
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

Completed auto-reimage of hosts:

['cloudelastic1005.wikimedia.org']

and were ALL successful.
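
For reference, the "Boot parameter data" value in the warning above is the five-byte IPMI boot flags parameter (parameter 5). The sketch below decodes only the first two bytes, following my understanding of the IPMI 2.0 layout; `decode_boot_flags` is a hypothetical helper, the device table is deliberately partial, and the remaining bytes (console redirection, verbosity, mux control) are left undecoded.

```python
# Partial table of boot device selector values (bits 5:2 of the second byte).
BOOT_DEVICES = {
    0x0: "No override",
    0x1: "Force PXE",
    0x2: "Force boot from default hard drive",
    0x5: "Force boot from default CD/DVD",
    0x6: "Force boot into BIOS setup",
}


def decode_boot_flags(data_hex: str) -> dict:
    """Decode bytes 1 and 2 of an IPMI boot flags value such as '8000020000'."""
    raw = bytes.fromhex(data_hex)
    if len(raw) != 5:
        raise ValueError("expected exactly 5 bytes of boot flag data")
    byte1, byte2 = raw[0], raw[1]
    selector = (byte2 >> 2) & 0x0F
    return {
        "valid": bool(byte1 & 0x80),       # bit 7: boot flags valid
        "persistent": bool(byte1 & 0x40),  # bit 6: 0 = next boot only
        "efi": bool(byte1 & 0x20),         # bit 5: 0 = legacy, 1 = EFI
        "boot_device": BOOT_DEVICES.get(selector, f"other (0x{selector:x})"),
    }


if __name__ == "__main__":
    print(decode_boot_flags("8000020000"))
```

For `8000020000` this gives valid flags, next boot only, legacy boot, and no boot device override, which agrees with the first four "Boot Flags" lines printed in the warning.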

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

cloudelastic1006.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/202005051508_elukey_71925_cloudelastic1006_wikimedia_org.log.

Completed auto-reimage of hosts:

['cloudelastic1006.wikimedia.org']

and were ALL successful.

elukey updated the task description.
09:09:53 | druid1002.eqiad.wmnet | WARNING: unable to verify that BIOS boot parameters are back to normal, got:
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 0004000000
 Boot Flags :
   - Boot Flag Invalid
   - Options apply to only next boot
   - BIOS PC Compatible (legacy) boot
   - Boot Device Selector : Force PXE
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

Change 602469 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] cloudelastic: bring new nodes into service

https://gerrit.wikimedia.org/r/602469