(Need by: 2020-04-02) rack/setup/install relforge100[34]
Closed, Resolved (Public)

Description

This task will track the racking and setup of relforge100[34].eqiad.wmnet.

Hostnames: relforge100[34]
Racking Proposal: Two different rows so the hosts don't share a row, 10G network rack. No other constraints/considerations.
Networking/Subnet/VLAN/IP: 10G. Only one network port connection. In the Analytics network zone (not sure what it means in terms of VLAN).
Partitioning/Raid: Software RAID, raid10-gpt-srv-lvm-ext4.cfg.

relforge1003:

  • - receive in system on procurement task T232649
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

relforge1004:

  • - receive in system on procurement task T232649
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run (with role:spare)
  • - host state in netbox set to staged
  • - handoff for service implementation
  • - service implementer changes from 'staged' status to 'active' status in netbox

Event Timeline

RobH triaged this task as Medium priority. Jan 2 2020, 9:44 PM
wiki_willy renamed this task from rack/setup/install relforge100[34] to (No Need By Date) rack/setup/install relforge100[34]. Jan 2 2020, 11:34 PM

regarding the racking proposal

1G network rack.

These should be 10G (I verified PO contains 10G cards as well).

RobH renamed this task from (No Need By Date) rack/setup/install relforge100[34] to (Need by: TBD) rack/setup/install relforge100[34]. Feb 24 2020, 9:10 PM
Jclark-ctr added a subscriber: Jclark-ctr.

relforge1003 A2 U34 Port 34
relforge1004 B2 U31 Port 37

Change 601356 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding all dns entries for relforge1003/1004. removed old asset tag entry

https://gerrit.wikimedia.org/r/601356

Change 601356 merged by Cmjohnson:
[operations/dns@master] Adding all dns entries for relforge1003/1004. removed old asset tag entry

https://gerrit.wikimedia.org/r/601356

Cmjohnson updated the task description.

Network switches have been updated but I disabled the ports until after the bios is set up and ready for imaging.

Change 603440 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Add relforge100[34] to netboot cfg and dhcpd file

https://gerrit.wikimedia.org/r/603440

Change 603440 merged by Cmjohnson:
[operations/puppet@production] Add relforge100[34] to netboot cfg and dhcpd file

https://gerrit.wikimedia.org/r/603440

wiki_willy renamed this task from (Need by: TBD) rack/setup/install relforge100[34] to (Need by: 2020-04-02) rack/setup/install relforge100[34]. Jun 8 2020, 8:22 PM

Change 603634 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Adding relforge100[34] to site.pp role insetup

https://gerrit.wikimedia.org/r/603634

Change 603634 merged by Cmjohnson:
[operations/puppet@production] Adding relforge100[34] to site.pp role insetup

https://gerrit.wikimedia.org/r/603634

I am having an issue with these. While running the script it gives me an error about IPMI. These are HP servers and I do not know of any manual IPMI setting.

Cmjohnson added subscribers: Dzahn, Cmjohnson.

@Dzahn Could you try to image one of these and let me know if you spot a setting I missed? I am able to log in to the iLO via the mgmt network, but when I use the install script on cumin I get an IPMI connection error.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

relforge1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202006261902_cmjohnson_11868_relforge1003_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

relforge1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202006261904_cmjohnson_12109_relforge1004_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['relforge1003.eqiad.wmnet']

Of which those FAILED:

['relforge1003.eqiad.wmnet']

Completed auto-reimage of hosts:

['relforge1004.eqiad.wmnet']

Of which those FAILED:

['relforge1004.eqiad.wmnet']

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

relforge1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202006301553_cmjohnson_20344_relforge1004_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['relforge1004.eqiad.wmnet']

Of which those FAILED:

['relforge1004.eqiad.wmnet']

Change 608664 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Updating relforge1003-4 netboot.cfg

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608664

Change 608664 merged by Cmjohnson:
[operations/puppet@production] Updating relforge1003-4 netboot.cfg

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608664

Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts:

relforge1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202006301631_cmjohnson_20663_relforge1004_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['relforge1004.eqiad.wmnet']

Of which those FAILED:

['relforge1004.eqiad.wmnet']

@Gehel I am getting a partman error. Is the partman recipe given correct? The raid10-4dev is not working.

┌────────────────────┤ [!!] Partition disks ├─────────────────────┐
│                                                                 │
│                   Error while setting up RAID                   │
│ An unexpected error occurred while setting up a preseeded RAID  │
│ configuration.                                                  │
│                                                                 │
│ Check /var/log/syslog or see virtual console 4 for the details. │
│                                                                 │
│     <Go Back>                                    <Continue>     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

@Cmjohnson: looking at the quote that was validated (T232649#5681830), it specified 8x 1.92T SSDs (4 SSDs per server). The scanned packing slip (T232649#5939682) only shows 4 SSDs. I assume that there are only 2 SSDs per server? This isn't the end of the world on our side, we can live with the reduced disk space. But if we paid for additional SSDs and did not get them, we should look into it.

If we only have 2 SSDs, we should move to raid1-2dev instead.

Hi @Gehel - when I look at the packing slip, it looks like it separates the quantity for internal components per server. Since there were qty=2 servers in the order, it's really a total of 8x SSDs, 16x RAM, 4x power supplies, etc. Thanks, Willy

So it looks like there is a real issue with the raid config. @ryankemper, can you have a look?

(Meant to comment earlier, but I'm looking into the cause of the RAID failure)

EDIT: Picking this back up on Monday

@RKemper a few pointers for your investigation:

  • you can use install_console $fqdn_of_server to connect to the server during install stage, that should allow you to have a look at the logs
  • you can reimage the server with wmf-auto-reimage-host (https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reimage) if you need to re-run that partman step
  • the last cleanup of our partman configs was done by @fgiunchedi, he might have an additional idea or two

I had a very quick look at the host from the mgmt console, looks like at least the disks don't show up with the right lettering. sda is missing and sdb through sde are present instead, hope that helps!

Going through dmesg, I see:

[    5.799850] scsi 0:0:0:0: Direct-Access     Generic- SD/MMC CRW       1.00 PQ: 0 ANSI: 6
[    5.804493] sd 0:0:0:0: [sda] Attached SCSI removable disk

Do we have an SD reader in that server? Or did someone plug a removable drive? Can we disable that device and go back to the standard sda-sdd disk mapping? Or do we need a new partman config for this case?
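
To make the check above repeatable, here is a minimal sketch that scans dmesg output for disks the kernel reports as removable. The function name `removable_disks` and the regular expression are illustrative assumptions, not part of any existing WMF tooling; the sample input is the two dmesg lines quoted above.

```python
import re

# Matches kernel lines such as:
#   [    5.804493] sd 0:0:0:0: [sda] Attached SCSI removable disk
# Group 1 is the device name, group 2 is set only for removable devices.
ATTACH_RE = re.compile(r"sd \S+ \[(sd[a-z]+)\] Attached SCSI (removable )?disk")


def removable_disks(dmesg_text):
    """Return the device names that dmesg reports as removable disks."""
    found = []
    for line in dmesg_text.splitlines():
        match = ATTACH_RE.search(line)
        if match and match.group(2):
            found.append(match.group(1))
    return found


sample = """\
[    5.799850] scsi 0:0:0:0: Direct-Access     Generic- SD/MMC CRW       1.00 PQ: 0 ANSI: 6
[    5.804493] sd 0:0:0:0: [sda] Attached SCSI removable disk
"""

print(removable_disks(sample))
```

If the result includes sda, the SD reader has taken the first device letter and the SSDs are shifted to sdb through sde, which is the situation a partman recipe written for sda-sdd cannot handle.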

I rebooted relforge1003, hit F9 to enter the system config menu, and followed System Configuration > BIOS/Platform Configuration (RBSU) > System Options > USB Options. There is a config setting for the SD reader; I disabled it and was able to complete a PXE boot. Nice catch @Gehel!

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['relforge1003.eqiad.wmnet', 'relforge1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202009071412_elukey_7772.log.

Completed auto-reimage of hosts:

['relforge1003.eqiad.wmnet', 'relforge1004.eqiad.wmnet']

Of which those FAILED:

['relforge1003.eqiad.wmnet', 'relforge1004.eqiad.wmnet']

Change 625646 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix relforge100[3,4] definitions in site.pp

https://gerrit.wikimedia.org/r/625646

Change 625646 merged by Elukey:
[operations/puppet@production] Fix relforge100[3,4] definitions in site.pp

https://gerrit.wikimedia.org/r/625646

Current layout:

elukey@relforge1003:~$ df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                   63G     0   63G   0% /dev
tmpfs                  13G   18M   13G   1% /run
/dev/mapper/vg0-root   73G  1.4G   68G   2% /
tmpfs                  63G     0   63G   0% /dev/shm
tmpfs                 5.0M     0  5.0M   0% /run/lock
tmpfs                  63G     0   63G   0% /sys/fs/cgroup
/dev/mapper/vg0-srv   3.4T   89M  3.2T   1% /srv
tmpfs                  13G     0   13G   0% /run/user/13926

elukey@relforge1003:~$ sudo lvs
  LV   VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root vg0 -wi-ao----  74.50g
  srv  vg0 -wi-ao----   3.42t
  swap vg0 -wi-ao---- 976.00m
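
As a sanity check on the layout above, the following sketch compares the logical volume sizes with what a four-disk RAID10 should provide. It assumes the four SSDs per server are nominal 1.92 TB (decimal) drives, as in the quote discussed earlier, that all four sit in a single RAID10 array, and that lvs reports binary units (g = GiB, t = TiB). The helper name `raid10_usable_bytes` is illustrative only.

```python
TIB = 2**40
GIB = 2**30
MIB = 2**20


def raid10_usable_bytes(disk_count, disk_bytes):
    """RAID10 mirrors every block, so half of the raw capacity is usable."""
    return disk_count * disk_bytes // 2


# Assumption: 4 SSDs of nominal 1.92 TB (decimal terabytes) per server.
disk_bytes = 1_920 * 10**9
expected_tib = raid10_usable_bytes(4, disk_bytes) / TIB

# Sizes copied from the lvs output above (root, srv, swap).
lv_total_bytes = 74.50 * GIB + 3.42 * TIB + 976.00 * MIB
lv_total_tib = lv_total_bytes / TIB

print(f"expected usable: {expected_tib:.2f} TiB")
print(f"sum of LVs:      {lv_total_tib:.2f} TiB")
```

Both figures come out at roughly 3.49 TiB, which is consistent with all four SSDs being part of the array rather than only two.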

Set both hosts to "Staged" in netbox.

elukey updated the task description.

@RKemper @Gehel does the layout work for you? If so I think this task is done :)

I'd like to make sure that @RKemper knows what happened here before closing the task. Including making sure he has the proper access to change BIOS settings himself next time.

@RKemper not sure how familiar you are with the magical world of serial consoles, so I'll add a few links and then you decide what to keep :)

The relforge nodes are HP; you can see it from netbox or when you connect to the serial console. The other vendor is DELL, and the two use different commands. You can find a summary in https://wikitech.wikimedia.org/wiki/Platform-specific_documentation.

I went on cumin1001, opened a tmux and ssh-ed to root@relforge1003.mgmt.eqiad.wmnet. The root@ and .mgmt parts are very important bits! At this point I got a password prompt; the password needed is in the management file of pwstore. Then I was able to use iLO, the interface that lets you control the host remotely. To change any BIOS setting, you first need a view of what's happening on the host, using vsp (DELL's alternative is console com2). I checked and the host was still in the Debian installer, so I used the command to detach from the serial console (https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/HP_Documentation#Connecting_to_Serial_Console) and forced a reboot with power reset.

Then I got back into the serial console and waited: at some point during boot you can press keys like F12 (forcing a PXE boot) or F9 (enter system setup), and I chose the latter. Then I navigated to the SD card reader setting, disabled it, and exited (the prompt will ask you if you want to save changes, etc.). Then I rebooted again, and done :)
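
Since the root@ and .mgmt parts are easy to get wrong, here is a small sketch that derives the management ssh target from a host's FQDN. The function name `mgmt_ssh_target` is a hypothetical helper for illustration, and it assumes the management name is always formed by inserting `mgmt` after the short hostname, as in the example above.

```python
def mgmt_ssh_target(fqdn, user="root"):
    """Build the ssh target for a host's management interface.

    Assumes the mgmt name is <short hostname>.mgmt.<rest of the domain>.
    """
    host, _, domain = fqdn.partition(".")
    if not host or not domain:
        raise ValueError(f"expected a fully qualified name, got {fqdn!r}")
    return f"{user}@{host}.mgmt.{domain}"


print(mgmt_ssh_target("relforge1003.eqiad.wmnet"))
```

For relforge1003.eqiad.wmnet this yields root@relforge1003.mgmt.eqiad.wmnet, the same target used in the walkthrough above.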

I am removing the ops-eqiad tag from this task. If you need an on-site task please create a new ticket.

Service implementation will follow in T262211.