rack/setup/install cloudstore1008 & cloudstore1009
Closed, ResolvedPublic

Description

This task will track the racking, setup, and installation of labstore1008 and labstore1009. These will replace the functionality of labstore1003. Both hosts come with arrays; name them labstore1008-array1 and labstore1009-array1 to match the array naming used for the rest of the labstore hosts.

Racking Plan: These can go in ANY rack and row (no restrictions) other than they should be in different rows from one another, and ideally a different rack than labstore1003 (just for redundancy during transition of services to the new hosts.) If one has to share with labstore1003, it is acceptable.

cloudstore1008 + cloudstore1008-array1:

  • - change hostname labels on server and array from labstore1008 to cloudstore1008
  • - receive in system on procurement task T186931
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, public vlan)
    • end on-site specific steps
  • - production dns entries added - public vlan
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation - stretch
  • - puppet accept/initial run
  • - handoff for service implementation to cloud-services-team

cloudstore1009 + cloudstore1009-array1:

  • - change hostname labels on server and array from labstore1009 to cloudstore1009
  • - receive in system on procurement task T186931
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, public vlan)
    • end on-site specific steps
  • - production dns entries added - public vlan
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation - stretch
  • - puppet accept/initial run
  • - handoff for service implementation to cloud-services-team

Event Timeline

@chasemp I do not have 2 adjacent 10G racks, and I do not have space in 2 10G racks in the same row (maybe row D). Second, I am not 100% sure I can get the 10G NICs to work without investing a lot of time. If you want, try to get it going on your labnet1003/1004 first, but right now I do not have 10G space in the same row. See https://phabricator.wikimedia.org/T193196

OK, then let's do 1G for now to get this moving? Thanks @Cmjohnson

faidon raised the priority of this task from Medium to High.Jun 8 2018, 2:35 PM

Change 439287 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns for labstore1008/9

https://gerrit.wikimedia.org/r/439287

The RAID has been set up as well: RAID 10 with a 256k stripe on both the server and the disk shelf.

I think this is ready for OS install and such? I spoke with @Bstorm who is going to take this on and may need some advice.

We still need to add the MAC address to the dhcp file and to netboot.cfg. I just enabled the switch ports, so once the other items are completed, both hosts are ready for install.

Hmm. I'm coming up dry on how to find the MAC address in all the things here. labstore1008/9.mgmt.eqiad.wmnet doesn't seem to work yet.

Hrm, that's odd... DNS is set up and I set up iDRAC. I'm wondering if I forgot to connect the green mgmt cable, something I may have done in my rush to get them going. I will double check tomorrow when I am at the data center.

Change 439287 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns for labstore1008/9

https://gerrit.wikimedia.org/r/439287

@Bstorm the DNS patch was not merged. Fixed now... feel free to take over.

For my reference, if nothing else: labstore1008

Embedded NIC MAC Addresses:
NIC.Integrated.1-1-1    Ethernet                = D0:94:66:26:D5:6A
                        iSCSI                   = D0:94:66:26:D5:6B
                        FIP                     = D0:94:66:26:D5:6B
                        WWN                     = 20:00:D0:94:66:26:D5:6B
                        WWPN                    = 20:01:D0:94:66:26:D5:6B
NIC.Integrated.1-2-1    Ethernet                = D0:94:66:26:D5:6C
                        iSCSI                   = D0:94:66:26:D5:6D
                        FIP                     = D0:94:66:26:D5:6D
                        WWN                     = 20:00:D0:94:66:26:D5:6D
                        WWPN                    = 20:01:D0:94:66:26:D5:6D
NIC.Integrated.1-3-1    Ethernet                = D0:94:66:26:D5:6E
                        iSCSI                   = D0:94:66:26:D5:6F
                        FIP                     = D0:94:66:26:D5:6F
                        WWN                     = 20:00:D0:94:66:26:D5:6F
                        WWPN                    = 20:01:D0:94:66:26:D5:6F
NIC.Integrated.1-4-1    Ethernet                = D0:94:66:26:D5:70
                        iSCSI                   = D0:94:66:26:D5:71
                        FIP                     = D0:94:66:26:D5:71
                        WWN                     = 20:00:D0:94:66:26:D5:71
                        WWPN                    = 20:01:D0:94:66:26:D5:71

And labstore1009:

Embedded NIC MAC Addresses:
NIC.Integrated.1-1-1    Ethernet                = D0:94:66:2D:94:3B
                        iSCSI                   = D0:94:66:2D:94:3C
                        FIP                     = D0:94:66:2D:94:3C
                        WWN                     = 20:00:D0:94:66:2D:94:3C
                        WWPN                    = 20:01:D0:94:66:2D:94:3C
NIC.Integrated.1-2-1    Ethernet                = D0:94:66:2D:94:3D
                        iSCSI                   = D0:94:66:2D:94:3E
                        FIP                     = D0:94:66:2D:94:3E
                        WWN                     = 20:00:D0:94:66:2D:94:3E
                        WWPN                    = 20:01:D0:94:66:2D:94:3E
NIC.Integrated.1-3-1    Ethernet                = D0:94:66:2D:94:3F
                        iSCSI                   = D0:94:66:2D:94:40
                        FIP                     = D0:94:66:2D:94:40
                        WWN                     = 20:00:D0:94:66:2D:94:40
                        WWPN                    = 20:01:D0:94:66:2D:94:40
NIC.Integrated.1-4-1    Ethernet                = D0:94:66:2D:94:41
                        iSCSI                   = D0:94:66:2D:94:42
                        FIP                     = D0:94:66:2D:94:42
                        WWN                     = 20:00:D0:94:66:2D:94:42
                        WWPN                    = 20:01:D0:94:66:2D:94:42
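For anyone repeating this step, here is a minimal sketch of pulling just the Ethernet MACs out of a listing like the ones above and rendering a DHCP host stanza. The stanza layout follows generic ISC dhcpd syntax and is an assumption; it is not copied from the dhcpd file in operations/puppet. The function names and the `fqdn` parameter are hypothetical.

```python
import re
from typing import Dict

# Matches rows such as:
#   NIC.Integrated.1-1-1    Ethernet                = D0:94:66:26:D5:6A
# The iSCSI/FIP/WWN/WWPN rows do not start with a NIC name, so they are skipped.
NIC_LINE = re.compile(
    r"^(NIC\.[\w.-]+)\s+Ethernet\s+=\s+((?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2})$"
)

# First two NICs of labstore1008, as pasted in the comment above.
SAMPLE = """\
Embedded NIC MAC Addresses:
NIC.Integrated.1-1-1    Ethernet                = D0:94:66:26:D5:6A
                        iSCSI                   = D0:94:66:26:D5:6B
                        FIP                     = D0:94:66:26:D5:6B
NIC.Integrated.1-2-1    Ethernet                = D0:94:66:26:D5:6C
                        iSCSI                   = D0:94:66:26:D5:6D
"""


def ethernet_macs(listing: str) -> Dict[str, str]:
    """Map each NIC name to its Ethernet MAC address."""
    macs = {}
    for line in listing.splitlines():
        match = NIC_LINE.match(line.strip())
        if match:
            macs[match.group(1)] = match.group(2).upper()
    return macs


def dhcp_host_entry(hostname: str, mac: str, fqdn: str) -> str:
    """Render a generic ISC dhcpd host stanza (layout is an assumption)."""
    return (
        f"host {hostname} {{\n"
        f"    hardware ethernet {mac.lower()};\n"
        f"    fixed-address {fqdn};\n"
        f"}}"
    )
```

Which NIC to use depends on which port is actually cabled, so the MAC passed to `dhcp_host_entry` should be checked against the link state rather than assumed to be the first one listed.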

Change 441247 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] install: Add labstore1008 & labstore1009

https://gerrit.wikimedia.org/r/441247

Change 441247 merged by Bstorm:
[operations/puppet@production] install: Add labstore1008 & labstore1009

https://gerrit.wikimedia.org/r/441247

On labstore1008, that setup didn't work. It might be that the 10G interface is the one plugged in. I'll try that one instead.

Change 441311 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] Change labstore1008 to 10g interface

https://gerrit.wikimedia.org/r/441311

Change 441311 merged by Bstorm:
[operations/puppet@production] Change labstore1008 to 10g interface

https://gerrit.wikimedia.org/r/441311

@Cmjohnson Could you do me a favor and cancel out of the blasted lifecycle controller setup view on the console of labstore1008? I cannot find a way to get out of that without disabling the controller from here, and it is locking up things I need to do.

Nevermind, I found a way that worked (HTML5 remote console in the GUI)

Got it to attempt PXE on the interface that is actually plugged in, however, dhcp failed.

Current status: labstore1009 appears to not be plugged in on any port.

labstore1008 is plugged in on a port, and that port will attempt PXE when enabled, but no DHCP traffic was received on install1002 when I was running a dump and DHCP fails. Seems like something isn't set up on the switch?

These should honestly also be renamed/relabeled to cloudstore1008/9 before they are put in production.

So the current status is the stuff above (the problem with 1008 is quite possibly the network card issue from T199125, which I will try to confirm), and these need to be renamed to cloudstore1008 and cloudstore1009. The 1009 box didn't show any network connection on its cards when I checked in the management interface (which is connected).

No, this is an onboard 10G card. It doesn't seem to be able to reach install1002 for DHCP only, and otherwise isn't getting to the same boot step from what I can tell.

@RobH Can you help with the installs, please? Right now labstore1008 will not work because it's on the new switch in A5. If it's needed immediately, I can move it to the old asw2-a5 switch with an SFP-T. Check to make sure the correct NIC is set for PXE. They do have onboard 10G NICs that could be set as the primary.

RobH renamed this task from rack/setup/install labstore1008 & labstore1009 to rack/setup/install cloudstore1008 & cloudstore1009.Aug 23 2018, 5:08 PM
RobH updated the task description.

Ok, I've synced up with @Cmjohnson and have the next steps to bring these online:

  • - @Cmjohnson to replace hostname labels from labstore100[89] (and labstore100[89]-array1) to cloudstore100[89] (and cloudstore100[89]-array1)
  • - @Cmjohnson to migrate the network connections for both cloudstore1008 and cloudstore1009 back to asw-a-eqiad. They are presently on the non-working stack asw2-a-eqiad.
  • - @Cmjohnson to update this task with the new port information and then reassign this over to @RobH

Then I'll take over and work on the network connectivity, vlan assignments, and production dns assignments prior to installation.

@Cmjohnson is off Friday (tomorrow) so this on-site work won't happen until Monday.

Both cloudstores are updated

ge-0/0/14 up up cloudstore1008
ge-6/0/17 up up cloudstore1009

hey @RobH @Cmjohnson what is the status of this? Could we move forward with OS installation?

Change 462755 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] Change mgmt dns from labstore1008/9 to cloudstore1008/9

https://gerrit.wikimedia.org/r/462755

Change 462755 merged by Cmjohnson:
[operations/dns@master] Change mgmt dns from labstore1008/9 to cloudstore1008/9

https://gerrit.wikimedia.org/r/462755

Change 462773 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding production dns for cloudstore100[89]

https://gerrit.wikimedia.org/r/462773

Please let me know which partman recipe you want. The current labstore1006/7 recipe is dumps-distribution-100x.cfg.

Change 462776 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: Change the names of labstore1008/9 to cloudstore1008/9

https://gerrit.wikimedia.org/r/462776

It's already there, just under the wrong names. Fixing that.

Change 462776 merged by Bstorm:
[operations/puppet@production] cloudstore: Change the names of labstore1008/9 to cloudstore1008/9

https://gerrit.wikimedia.org/r/462776

Change 462773 merged by Cmjohnson:
[operations/dns@master] Adding production dns for cloudstore100[89]

https://gerrit.wikimedia.org/r/462773

So at this point, we've got the naming solid (and have added prod DNS). However, we still have the problem that cloudstore1008 appears to send no traffic over the network to install1002 and does not do DHCP, though one interface shows a link up, while cloudstore1009 shows no link active. This is the real blocker that this is stuck on.

@Bstorm cloudstore1008 and 1009 were in the wrong vlans on their switch ports. I updated the ports. You should be able to get the installer now.

For cloudstore1008, I updated asw2-a5-eqiad to put this server in the public vlan. Everything was accepted like normal but when I display inheritance it's not showing up in that vlan. When I search the ports in the public vlan the port shows as being there.

cmjohnson@asw2-a5-eqiad# show interfaces ge-0/0/14 | display inheritance
description cloudstore1008;
## '9192' was inherited from group 'access-port'
mtu 9192;
## '0' was inherited from group 'access-port'
unit 0 {
    ## 'ethernet-switching' was inherited from group 'access-port'
    family ethernet-switching {
        ## 'access' was inherited from group 'access-port'
        port-mode access;
    }
}

and then

cmjohnson@asw2-a5-eqiad# show interface-range vlan-public1-a-eqiad
member xe-0/0/1;
member ge-0/0/14;
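The membership check described above can be sketched as a small parser over the `show interface-range` output. This is only an illustration under assumptions: the function name is hypothetical, and it handles the plain `member` lines shown here, not `member-range` entries.

```python
from typing import Set

# Output pasted in the comment above.
SAMPLE = """\
member xe-0/0/1;
member ge-0/0/14;
"""


def interface_range_members(config_text: str) -> Set[str]:
    """Collect the interfaces named on 'member <ifname>;' lines."""
    members = set()
    for line in config_text.splitlines():
        parts = line.strip().rstrip(";").split()
        # Only plain two-token 'member <ifname>' lines are handled here.
        if len(parts) == 2 and parts[0] == "member":
            members.add(parts[1])
    return members


def port_in_range(config_text: str, port: str) -> bool:
    """Return True if the port is an explicit member of the range."""
    return port in interface_range_members(config_text)
```

With the output above, `port_in_range(SAMPLE, "ge-0/0/14")` is true, which matches the observation that the port is listed in the public vlan range even though the inheritance display does not show the vlan.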

Change 463300 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Fixing dhcpd file to match correct mac

https://gerrit.wikimedia.org/r/463300

Change 463300 merged by Cmjohnson:
[operations/puppet@production] Fixing dhcpd file to match correct mac

https://gerrit.wikimedia.org/r/463300

Both of these servers are able to be installed. assigning to @Bstorm

[Screenshot: Screen Shot 2018-10-10 at 1.16.25 PM.png]

cloudstore1008 appears to be stuck here. That's quite interesting, since it seems to have the installer.

When you're initiating an install, do you see a GET for autoinstall/preseed.cfg showing up in the nginx access.log on install1002? This would show us whether it's really locking up hard early or whether it's just an output error of some kind.

Is the error specific to cloudstore1008, or is it also reproducible with 1009? That would rule out a single broken piece of hardware.

To rule out an error specific to the 4.9 stretch kernel with that hardware, we could also try a test installation with our 4.14-based installer image. It's not supported for production, but it could be useful to narrow things down. For that, basically apply the reverse of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465434/

The EDD error is likely a red herring (it's simply the last output before things break), but it might still be worth a shot just in case. For that, temporarily add edd=off to /srv/tftpboot/stretch-installer/pxelinux.cfg/ttyS1-115200 (at the end of the last line, where the kernel options are passed).
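As a rough sketch of that temporary edit, the helper below appends a kernel option to the last `append` line of a pxelinux-style config. The sample config is hypothetical and is not the real contents of ttyS1-115200; the function name is also an assumption.

```python
# Hypothetical pxelinux-style config, for illustration only.
SAMPLE = """\
default install
label install
    kernel debian-installer/amd64/linux
    append initrd=debian-installer/amd64/initrd.gz console=ttyS1,115200n8
"""


def add_kernel_option(config_text: str, option: str) -> str:
    """Append `option` to the last 'append' line unless it is already present."""
    lines = config_text.splitlines()
    append_lines = [
        i for i, line in enumerate(lines)
        if line.strip().lower().startswith("append ")
    ]
    if not append_lines:
        raise ValueError("no 'append' line found in config")
    last = append_lines[-1]
    # Skip the edit if the option is already one of the kernel arguments.
    if option not in lines[last].split():
        lines[last] = f"{lines[last].rstrip()} {option}"
    return "\n".join(lines) + "\n"
```

Because the function leaves the line alone when the option is already there, running it twice gives the same result, which makes it easier to revert the temporary change later.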

I agree on the EDD message. Thanks for the pointers, I'll go poke more things shortly.

Finally coming back to this. The exact same condition is true on cloudstore1009. Will check for the GET.
If that all looks fine, I'll look at the alternate kernel.

Yes, the timestamps for GET requests for the right scripts on install1002 are there when I reset one of the servers.
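The check for installer GET requests can be sketched as a filter over access log lines. This assumes the log uses nginx's default combined format; the sample lines, addresses, and function name are made up for illustration and are not taken from install1002.

```python
import re
from typing import List, Tuple

# client, identity, user, [timestamp], "METHOD path protocol", status
REQUEST = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3})')

# Made-up log lines in combined format (documentation-range addresses).
SAMPLE = """\
192.0.2.10 - - [10/Oct/2018:13:10:02 +0000] "GET /autoinstall/preseed.cfg HTTP/1.1" 200 1234 "-" "debian-installer"
192.0.2.11 - - [10/Oct/2018:13:11:40 +0000] "GET /index.html HTTP/1.1" 404 162 "-" "curl/7.52.1"
"""


def preseed_requests(log_text: str, needle: str = "preseed.cfg") -> List[Tuple[str, str, str, int]]:
    """Return (timestamp, client, path, status) for GET requests matching `needle`."""
    hits = []
    for line in log_text.splitlines():
        match = REQUEST.match(line)
        if match and match.group(3) == "GET" and needle in match.group(4):
            hits.append(
                (match.group(2), match.group(1), match.group(4), int(match.group(5)))
            )
    return hits
```

Comparing the returned timestamps with the time a server was reset gives the same signal as reading the log by hand: if the requests line up, the installer got far enough to fetch its preseed file.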

Change 470871 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: try installing the 4.14 kernel on the new cloudstore servers

https://gerrit.wikimedia.org/r/470871

Change 470871 merged by Bstorm:
[operations/puppet@production] cloudstore: try installing the 4.14 kernel on the new cloudstore servers

https://gerrit.wikimedia.org/r/470871

Aaaand same freeze when installing on 4.14. That's fun. I can try the kernel on cloudstore1009 as well, but cloudstore1009 so far behaves the same as 08 in general.

I've seen the same lockup effect in the past when there was contention between the BIOS and Linux for the serial port. This happened when the serial port redirect settings were misconfigured and e.g. set up for "redirect after boot" and directed to COM1, while Linux was also set up for ttyS0. I'd recommend verifying the BIOS settings against our docs on wikitech if you haven't already!

Oh thanks! I'll take a look at that. I figure it must either be a BIOS config or possibly kernel option issue.

Redirection settings are confirmed correct. Looking around other settings in the docs.

Nothing. I guess this is just more digging, then, unless both systems are somehow broken.

Change 473566 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: install with the default kernel for cloudstore1008

https://gerrit.wikimedia.org/r/473566

Change 473566 merged by Bstorm:
[operations/puppet@production] cloudstore: install with the default kernel for cloudstore1008

https://gerrit.wikimedia.org/r/473566

This was just stuck at a prompt. Stupid mistake, the output after that stage of boot was redirected to the other console. Proceeding.

Bstorm updated the task description.