
rack/setup/install labvirt102[12]
Closed, ResolvedPublic

Description

This task will track the racking and setup of labvirt102[12].eqiad.wmnet. These are essentially the same as the rest of our labvirts, using the same row (B) and same vlan/os as the rest of labvirt servers.

Special Note: Please ensure the virtualization options for the CPUs on these hosts are ENABLED; this is the opposite of the majority of the fleet.

Racking Proposal: These MUST be in row B of eqiad, with other labvirts.

labvirt1021:

  • - receive in system on procurement task T178937
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, labs vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept
  • - initial run failure due to kernel version mismatch, leaving for @chasemp to fix, install_console from neodymium allows connection
  • - handoff for service implementation

labvirt1022:

  • - receive in system on procurement task T178937
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, labs vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - handoff for service implementation

Event Timeline

RobH triaged this task as Medium priority. Jan 2 2018, 5:21 PM

Change 401793 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns labvirt1021/22

https://gerrit.wikimedia.org/r/401793

Change 401793 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns labvirt1021/22

https://gerrit.wikimedia.org/r/401793

These seem pretty close, any chance they are on the agenda for early next week? We are looking at a potential resource crunch for CPU and these would be heartwarming to have ready to go.

Change 402361 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding production dns labvirt1021/22

https://gerrit.wikimedia.org/r/402361

Change 402361 merged by Cmjohnson:
[operations/dns@master] Adding production dns labvirt1021/22

https://gerrit.wikimedia.org/r/402361

Change 402417 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] Add new labvirts to netboot and dhcpd

https://gerrit.wikimedia.org/r/402417

Change 402417 merged by Cmjohnson:
[operations/puppet@production] Add new labvirts to netboot and dhcpd

https://gerrit.wikimedia.org/r/402417

These are hitting the install server but not receiving the image. Chasemp or RobH, can you take a look at this please? They were received with 10G NICs, which I turned off, and I set the 1G NIC to PXE. Please verify that everything looks okay.

@Andrew said he would have a minute to take a look and that this sounded vaguely familiar

OK -- on 1021 in System Setup:Device Settings I see one NIC with four ports:

Integrated NIC 1 Port 1: Intel(R) Ethernet 10G 4P X520/I350 rNDC -            
 24:6E:96:8D:B3:A0                                                            
Integrated NIC 1 Port 2: Intel(R) Ethernet 10G 4P X520/I350 rNDC -            
 24:6E:96:8D:B3:A2                                                            
Integrated NIC 1 Port 3: Intel(R) Gigabit 4P X520/I350 rNDC -                 
 24:6E:96:8D:B3:A4                                                            
Integrated NIC 1 Port 4: Intel(R) Gigabit 4P X520/I350 rNDC -                 
 24:6E:96:8D:B3:A5

Is that everything, or is there also an external NIC?

What I would expect is for 1 and 2 to be disabled, and for 3 and 4 to show that they're connected (3 for eth0 and 4 for eth1). That's not how it looks to me, which makes me think I'm missing something.

In System Bios I see "Integrated Network Card 1" which is shown as 'enabled'. I assume that's the same nic 1 that I'm seeing in Device Settings... I don't so far see how to selectively disable ports on that one card without disabling the whole card.

I powered these off for the moment, just to cut down on dhcp noise.

Note from T184909 that these were spamming the DHCP server

I've seen this on install1002 via syslog, both servers seem to send multiple DHCPDISCOVER requests every minute, to which the dhcp server answers correctly.

We should take a hard look at what is going on there, something seems definitely wrong.
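As a rough way to quantify the DHCP noise described above, a sketch along these lines could count DHCPDISCOVER lines per client MAC in dhcpd syslog output. The function name `count_dhcpdiscover` is hypothetical, and the line format is assumed to match the dhcpd entries quoted elsewhere in this task.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count syslog lines of the form
 *   "... dhcpd: DHCPDISCOVER from <mac> via <relay>"
 * that were sent by the given MAC address (lower-case, colon-separated).
 * Hypothetical helper; assumes the dhcpd log format seen on install1002. */
size_t count_dhcpdiscover(const char *const *lines, size_t n, const char *mac)
{
    static const char marker[] = "DHCPDISCOVER from ";
    size_t count = 0;
    size_t i;

    assert(lines != NULL && mac != NULL);

    for (i = 0; i < n; i++) {
        const char *p = strstr(lines[i], marker);
        if (p == NULL)
            continue;                 /* not a DISCOVER line */
        p += strlen(marker);          /* p now points at the client MAC */
        if (strncmp(p, mac, strlen(mac)) == 0)
            count++;
    }
    return count;
}
```

Feeding it one minute of log lines at a time gives a per-minute DISCOVER rate for each host, which makes it easier to tell a retry loop from normal PXE traffic.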

ping @RobH it seemed like you had some ideas here during the meeting today, could you coordinate with @Andrew on our side if we can help?

So, it seems these were not ordered with the right kind of network card.

Ideally, we keep the onboard 4x1GB and add a second broadcom dual port 10G. However, these two systems were ordered with "Intel X520 DP 10Gb DA/SFP+, + I350 DP 1Gb Ethernet Network Daughter Card"

I think these cards are where the 4x1GB usually goes, need to sync with @Cmjohnson to confirm. We may need to do some procurement correction, and order additional proper nics, or try these out somehow.

Mainly I want to know if there are any copper network ports, and if not, I suppose we have ONLY the optic 1Gb port we could put an sfp-t on if needed. Need to coordinate with Chris in real time via irc.

One will go in A4 but we have to move or remove rdb1003. I don't know who owns that. The other will go in a 10G rack in row D. I am working on the network refresh now and will do it next.

Is it possible this is a comment for another task? :) I'm wondering if we have the right nics here, these don't necessarily need 10G nics and probably shouldn't have them.

@chasemp yes this is for another task..sorry I do not have an update for
why your labvirts are not working. Maybe @RobH has made some progress.

No worries, cheers and thanks for your help

@RobH or @Cmjohnson any luck figuring out what the NIC situation is here? We will have to figure out something fairly soon if we need to order different NICs.

So for some reason (WMCS bad luck!), these seem to have been ordered with Intel NIC daughter cards. We have had Intel NICs only in the distant past, 99% of our 10G fleet is on QLogic (née Broadcom) these days. We still have kernel command-line options in the puppet tree to make those work with our optics, and it's very likely that we'd be able to make these work somehow.

While it's a relatively safe bet, I don't think it's worth the effort of trying to make them work, or the risk of them creating any trouble and having these be snowflakes, so I'd be inclined to just pay a small amount ($200 per system or so?) to buy the standard-issued Broadcom/QLogic daughter-card NICs. Any disagreements?

Any disagreements?

Nope, please and thank you. Really appreciate you and DC Ops working through our puzzles.

Chris,

Can you install the two NICs that came in on T188297 into labvirt102[12]? They will replace the current Intel daughter cards with these new qlogic ones. Once the swap is done, you can assign back to me for followup.

Thanks!

Change 419524 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] updating labvirt102[12] mac addresses

https://gerrit.wikimedia.org/r/419524

Change 419524 merged by RobH:
[operations/puppet@production] updating labvirt102[12] mac addresses

https://gerrit.wikimedia.org/r/419524

Ok, the new network cards are installed. They have the first 2 ports as 10G, and the second 2 as 1G. When I go to boot labvirt1021, it PXE boots, and hits the install server, but then doesn't actually load the image:

install server syslog:

Mar 14 20:58:09 install1002 dhcpd: DHCPDISCOVER from 18:66:da:bb:31:76 via 10.64.20.2
Mar 14 20:58:09 install1002 dhcpd: DHCPOFFER on 10.64.20.40 to 18:66:da:bb:31:76 via 10.64.20.2
Mar 14 20:58:09 install1002 dhcpd: DHCPDISCOVER from 18:66:da:bb:31:76 via 10.64.20.3
Mar 14 20:58:09 install1002 dhcpd: DHCPOFFER on 10.64.20.40 to 18:66:da:bb:31:76 via 10.64.20.3
Mar 14 20:58:13 install1002 dhcpd: DHCPREQUEST for 10.64.20.40 (208.80.154.22) from 18:66:da:bb:31:76 via 10.64.20.3
Mar 14 20:58:13 install1002 dhcpd: DHCPACK on 10.64.20.40 to 18:66:da:bb:31:76 via 10.64.20.3
Mar 14 20:58:13 install1002 dhcpd: DHCPREQUEST for 10.64.20.40 (208.80.154.22) from 18:66:da:bb:31:76 via 10.64.20.2
Mar 14 20:58:13 install1002 dhcpd: DHCPACK on 10.64.20.40 to 18:66:da:bb:31:76 via 10.64.20.2
Mar 14 20:58:13 install1002 atftpd[514]: Serving lpxelinux.0 to 10.64.20.40:2070
Mar 14 20:58:13 install1002 atftpd[514]: Serving lpxelinux.0 to 10.64.20.40:2071

The actual system shows the PXE post, and then:

Booting from QLogic MBA Slot 0102 v7.14.2

QLogic UNDI PXE-2.1 v7.14.2
Copyright (C) 2016 QLogic Corporation
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.

CLIENT MAC ADDR: 18 66 DA BB 31 76 GUID: 4C4C4544-0059-5310-804A-B8C04F384D32
CLIENT IP: 10.64.20.40 MASK: 255.255.255.0 DHCP IP: 208.80.154.22
GATEWAY IP: 10.64.20.1

PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al
Failed to load ldlinux.c32
Boot failed: press a key to retry, or wait for reset...

Also labvirt1022 has the exact same behavior.

Odd that it suddenly sits and (seemingly) does nothing at this point. Is this due to it being a 4 port card and we're using the 3rd of 4?

Googling for the issue presents some interesting results, but most refer to bugs that have since been fixed (according to their relevant bugtracker links)

https://lists.debian.org/debian-user/2016/02/msg00295.html

This happens for both stretch and jessie.

Change 419594 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] setting labvirt102[12] to tftp load (not http)

https://gerrit.wikimedia.org/r/419594

Change 419594 merged by RobH:
[operations/puppet@production] setting labvirt102[12] to tftp load (not http)

https://gerrit.wikimedia.org/r/419594

IRC Update:

After chatting with @faidon it was determined this was failing because of the recent change to serve kernel images via http, which the labs ACLs don't allow.

So I pushed these back to tftp and all is good on the installer end.

Change 419609 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] setting new labvirts to spare role

https://gerrit.wikimedia.org/r/419609

Change 419609 merged by RobH:
[operations/puppet@production] setting new labvirts to spare role

https://gerrit.wikimedia.org/r/419609

RobH edited projects, added cloud-services-team; removed Patch-For-Review.
RobH updated the task description.

Ok, escalating this to @chasemp for completion. The systems are installed and calling into puppet. Their 1G ports are showing as eth2/3, with eth2 being the primary used interface. eth0/1 are the 10G ports. Irc discussion noted that will cause some refactoring of some puppet code to accommodate the change.

chasemp closed subtask Unknown Object (Task) as Resolved. Mar 15 2018, 2:19 PM

Ok, escalating this to @chasemp for completion. The systems are installed and calling into puppet. Their 1G ports are showing as eth2/3, with eth2 being the primary used interface. eth0/1 are the 10G ports. Irc discussion noted that will cause some refactoring of some puppet code to accommodate the change.

I resolved https://phabricator.wikimedia.org/T188297#4053178 (hope that's cool)

I don't believe eth3 is connected for labvirt1021:

lldpcli show neighbors only shows eth2 connected and icinga is reporting no carrier on eth3. FWIW eth3 here should be connected and configured as a trunk switch (match eth1 for existing labvirts).

[edit interfaces interface-range vlan-private1-b-eqiad]
-    member ge-4/0/34;
[edit interfaces interface-range cloud-instance-ports]
     member ge-8/0/13 { ... }
+    member ge-4/0/34;
+    member ge-8/0/23;

commit comment "T183937 labvirt102[12] instance trunk"

@ayounsi fyi

@RobH just a ping that we are talking about this in our weekly, are you going to have time to check into where to go from here? easy money says maybe we should just connect up the 10G interfaces to image these for a short term thing.

Summary of current state from @chasemp:

  • Imaging as Jessie works, but our OpenStack deploy is not ready for mixed Trusty/Jessie
  • Trusty will only image from eth0
  • eth0 on the cards in these boxes is 10G and unconnected
  • We can't figure out how to disable the 10G ports so that the installer sees the connected 1G port as eth0

Is there any way we can help? Do you have logs or more information about the "trusty will only image from eth0" that we could perhaps help troubleshoot together?

@RobH figured out what he believed is the eth0 issue described. Unless a screenshot was captured I don't think there are logs, but the message he pasted in irc from console was something very literal like "failed as eth0 is not connected". It was my understanding that this has been seen in the past on Trusty and was a somewhat forgotten but known old issue. Trying to image with Trusty reproduces it. I thought @RobH was going to try to circle back on this and see if the issue can be overcome easily, but I was away on vacation and then we have both been busy. I figured T187373 was the priority of the two pending cloud hardware issues from a dcops perspective, so I wasn't too worried.

So there is an issue where trusty expects the os to be on eth0, and it's on eth3. However, after discussion in IRC, @ayounsi pointed out the new switch in this rack is 10G.

So labvirt1021 needs to have its two network ports (os vlan and instance vlan) set up on the new 10G switch, and then have its connections moved over to those ports (via DAC cables, removing the cat5 cables on eth3/4).

ge-4/0/18 - description labvirt1021:eth0 - vlan-cloud-hosts1-b-eqiad
xe-4/0/35 - description labvirt1021:eth1 - cloud-instance-ports

So there is an issue where trusty expects the os to be on eth0, and it's on eth3. However, after discussion in IRC, @ayounsi pointed out the new switch in this rack is 10G.

Even if this is not actually pertinent here (given 10G), it may be for the next install. How does it expect that? There are no eth0 references in that part of our puppet tree. More information please! :)

During the install, I saw it display trying to setup eth0 for dhcp, and failing. It wouldn't fail over to the later interfaces and try them. I've already asked Chris to fix labvirt1021, but I'll re-paste the error on labvirt1022 later today.

@Cmjohnson: as pinged in IRC, please go to labvirt1021 and remove the cat5 (production) connections, and plug the two 10G connections into the new 10G switch in that rack:

xe-4/0/18 - description labvirt1021:eth0 - vlan-cloud-hosts1-b-eqiad
xe-4/0/35 - description labvirt1021:eth1 - cloud-instance-ports

Those two ports have been setup by @ayounsi, please connect DAC cables from labvirt1021 to those ports and assign back to me, thanks!

Change 423751 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] labvirt1021 mac address update

https://gerrit.wikimedia.org/r/423751

Change 423751 merged by RobH:
[operations/puppet@production] labvirt1021 mac address update

https://gerrit.wikimedia.org/r/423751

Ok, labvirt1021, which was moved to 10G and uses its eth0 for primary OS install, installed fine. I went ahead and documented the issue as shown on labvirt1022, which was left with its 1G eth3 interface as primary.

labvirt1022 installer loads fine via tftp, the issue only starts inside the installer

Apr  3 18:22:24 install1002 dhcpd: DHCPDISCOVER from d0:94:66:07:8f:0e via 10.64.20.2
Apr  3 18:22:24 install1002 dhcpd: DHCPOFFER on 10.64.20.41 to d0:94:66:07:8f:0e via 10.64.20.2
Apr  3 18:22:24 install1002 dhcpd: DHCPDISCOVER from d0:94:66:07:8f:0e via 10.64.20.3
Apr  3 18:22:24 install1002 dhcpd: DHCPOFFER on 10.64.20.41 to d0:94:66:07:8f:0e via 10.64.20.3
Apr  3 18:22:28 install1002 dhcpd: DHCPREQUEST for 10.64.20.41 (208.80.154.22) from d0:94:66:07:8f:0e via 10.64.20.2
Apr  3 18:22:28 install1002 dhcpd: DHCPACK on 10.64.20.41 to d0:94:66:07:8f:0e via 10.64.20.2
Apr  3 18:22:28 install1002 dhcpd: DHCPREQUEST for 10.64.20.41 (208.80.154.22) from d0:94:66:07:8f:0e via 10.64.20.3
Apr  3 18:22:28 install1002 dhcpd: DHCPACK on 10.64.20.41 to d0:94:66:07:8f:0e via 10.64.20.3
Apr  3 18:22:28 install1002 atftpd[6415]: Serving trusty-installer/ubuntu-installer/amd64/pxelinux.0 to 10.64.20.41:2070
Apr  3 18:22:28 install1002 atftpd[6415]: Serving trusty-installer/ubuntu-installer/amd64/pxelinux.0 to 10.64.20.41:2071
Apr  3 18:22:28 install1002 atftpd[6415]: Serving trusty-installer/pxelinux.cfg/ttyS1-115200 to 10.64.20.41:49152
Apr  3 18:22:28 install1002 atftpd[6415]: Serving trusty-installer/pxelinux.cfg/boot.txt to 10.64.20.41:49153
Apr  3 18:22:38 install1002 atftpd[6415]: Serving trusty-installer/ubuntu-installer/amd64/linux to 10.64.20.41:49154
Apr  3 18:22:39 install1002 atftpd[6415]: Serving trusty-installer/ubuntu-installer/amd64/initrd.gz to 10.64.20.41:49155

Installer loads fine via PXE on port 3 of the onboard/daughter NIC. The first 2 ports are 10G, the second 2 are 1G. We have labvirt1022 connected via 1G on eth3.

When the installer loads up, after the tftp load and during the actual configuring of the network, it fails on the "Configuring the network with DHCP" step:

┌────────────────────┤ [!!] Configure the network ├─────────────────────┐
│                                                                       │
│                   Network autoconfiguration failed                    │
│ Your network is probably not using the DHCP protocol. Alternatively,  │
│ the DHCP server may be slow or some network hardware is not working   │
│ properly.                                                             │
│                                                                       │
│                              <Continue>                               │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘

The issue is that Ubuntu expects to use the lowest-numbered ethernet port as its OS install port. When we went ahead and moved labvirt1021 to a 10G rack, and wired eth0 (10G) up first, it installed just fine. Before that move, it hit the same errors that labvirt1022 hits with the eth3/1G port in use.

Unfortunately, I cannot mount the installer logs for casual perusal, since it has no network and expects eth0. Instead I've created a pastebin of the syslog from the installer. You can see that the installer sees eth0-3, and cycles through them in detection, but doesn't seem to use anything but eth0 for dhcp discovery.

This doesn't seem to happen in debian, only ubuntu. However, labvirts use trusty.

Installer syslog: P6929
Installer hardware-summary: P6930

With that documented, and the known fix being to use eth0 on 10G, I advise we simply use 10G eth0 in a newly installed 10G switch in row B.

So @ayounsi found this: https://help.ubuntu.com/community/Installation/Netboot#Multiple_Network_Interface_Note

this seems to describe our issue. However, I'm uncertain it's worth hacking around it when we can just put in a 10G spot that is free. @faidon advised to move ahead on this install, but that was before we had a potential solution.

I'm going to assume we move ahead on 10G, since there is space free in 10G racks (and Chris already started to move it into b7-eqiad).

@ayounsi is updating the network port config, and then I'll update install server module with new mac and install labvirt1022.

ayounsi@asw2-b-eqiad# show | compare 
[edit interfaces interface-range vlan-cloud-hosts1-b-eqiad]
     member xe-2/0/24 { ... }
+    member xe-7/0/16;
[edit interfaces interface-range cloud-instance-ports]
     member xe-4/0/35 { ... }
+    member xe-7/0/17;
[edit interfaces]
+   xe-7/0/16 {
+       description labvirt1022:eth0;
+   }
+   xe-7/0/17 {
+       description labvirt1022:eth1;
+   }

So @ayounsi found this: https://help.ubuntu.com/community/Installation/Netboot#Multiple_Network_Interface_Note

this seems to describe our issue. However, I'm uncertain it's worth hacking around it when we can just put in a 10G spot that is free. @faidon advised to move ahead on this install, but that was before we had a potential solution.

That's not the right fix, no -- we set netcfg/choose_interface to auto across distros, so as not to hardcode interface names. Normally that uses the first interface that has a link, but that codepath is broken on Ubuntu (see below). With PXE there is another alternative, namely to use pxelinux's ipappend setting to append BOOTIF=$MAC (where $MAC is the MAC address of the interface that PXE booted from) to the kernel command-line. d-i's netcfg is capable of reading BOOTIF, so that should just work in all cases too; netcfg/choose_interface=auto is an even more generic option though.
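For reference, pxelinux's `IPAPPEND 2` passes the boot interface as `BOOTIF=01-xx-xx-xx-xx-xx-xx`: a `01` hardware-type prefix for Ethernet followed by the MAC in lower-case hex, hyphen-separated. The sketch below shows how such a value maps back to the usual colon-separated MAC. The function name `bootif_to_mac` is hypothetical, and this is an illustration only, not netcfg's actual parsing code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Convert a BOOTIF value such as "01-18-66-da-bb-31-76" into
 * "18:66:da:bb:31:76". Returns 0 on success, -1 on malformed input
 * or if the output buffer is smaller than 18 bytes.
 * Hypothetical helper for illustration. */
int bootif_to_mac(const char *bootif, char *out, size_t outlen)
{
    size_t i;

    assert(bootif != NULL && out != NULL);

    if (outlen < 18)
        return -1;
    /* Expect exactly 20 chars: "01-" plus six hex pairs joined by '-'. */
    if (strlen(bootif) != 20 || strncmp(bootif, "01-", 3) != 0)
        return -1;

    bootif += 3;                          /* skip the hardware-type prefix */
    for (i = 0; i < 17; i++) {
        char c = bootif[i];
        if (i % 3 == 2) {                 /* every third char is a separator */
            if (c != '-')
                return -1;
            out[i] = ':';
        } else {
            if (!isxdigit((unsigned char)c))
                return -1;
            out[i] = (char)tolower((unsigned char)c);
        }
    }
    out[17] = '\0';
    return 0;
}
```

With the MAC that labvirt1021 PXE-booted from in the log above, `bootif_to_mac("01-18-66-da-bb-31-76", buf, sizeof buf)` fills `buf` with `18:66:da:bb:31:76`, which can then be matched against the interface list regardless of whether the kernel names it eth0 or eth3.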

I investigated this a little bit out of curiosity and found that netcfg/choose_interface=auto + link detection is broken in Ubuntu's debian-installer, starting with netcfg 1.111ubuntu1 (first released with trusty?) and up until current Ubuntu (1.142ubuntu6, which bionic has). The offending code seems to be an attempted fix for LP#848072 with a description of "Flush all addresses and routes before configuring interfaces". The code reads:


kill_wpa_supplicant();

/* Reset all interfaces first */
num_ifaces = get_all_ifs(1, &ifaces);
if (num_ifaces > 0) {
    while (*ifaces) {
        di_debug("Flushing addresses and routes on interface: %s\n", *ifaces);

        /* Flush all IPv4 addresses */
        snprintf(buf, sizeof(buf), "ip -f inet addr flush dev %s", *ifaces);
        rv |= di_exec_shell_log(buf);

        /* Flush all IPv6 addresses */
        snprintf(buf, sizeof(buf), "ip -f inet6 addr flush dev %s", *ifaces);
        rv |= di_exec_shell_log(buf);

        /* Flush all IPv4 routes */
        snprintf(buf, sizeof(buf), "ip -f inet route flush dev %s", *ifaces);
        rv |= di_exec_shell_log(buf);

        /* Flush all IPv6 routes */
        snprintf(buf, sizeof(buf), "ip -f inet6 route flush dev %s", *ifaces);
        rv |= di_exec_shell_log(buf);

        ifaces++;
    }
}

/* Choose a default by looking for link */
if (num_ifaces > 1) {
    while (*ifaces) {
        struct netcfg_interface link_interface;

The if (num_ifaces > 0) { … } block is Ubuntu's modification; in the process of flushing, it increments ifaces until *ifaces is NULL, but then never resets it back to the start of the list, so when the code enters the "Choose a default by looking for link" block, *ifaces is still NULL and the while (*ifaces) loop never executes.
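The pattern can be reproduced in isolation. The following is a minimal sketch, not the netcfg source: `struct iface` and both function names are hypothetical stand-ins, and the link check is reduced to a flag. It shows the exhausted-cursor bug next to one possible fix that keeps the head pointer and iterates with a separate cursor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an interface and its link state. */
struct iface {
    const char *name;
    int has_link;
};

/* Buggy shape: the first loop walks the list pointer itself to the NULL
 * terminator, so the second loop starts from an exhausted pointer. */
const char *choose_iface_buggy(const struct iface **ifaces)
{
    assert(ifaces != NULL);

    while (*ifaces)              /* "flush" pass: visits every interface */
        ifaces++;

    while (*ifaces) {            /* never entered: *ifaces is already NULL */
        if ((*ifaces)->has_link)
            return (*ifaces)->name;
        ifaces++;
    }
    return NULL;                 /* no interface chosen by link detection */
}

/* Fixed shape: leave ifaces untouched and restart the cursor per pass. */
const char *choose_iface_fixed(const struct iface **ifaces)
{
    const struct iface **cur;

    assert(ifaces != NULL);

    for (cur = ifaces; *cur; cur++)
        ;                        /* "flush" pass */

    for (cur = ifaces; *cur; cur++) {
        if ((*cur)->has_link)
            return (*cur)->name;
    }
    return NULL;
}
```

Given a NULL-terminated list of eth0 through eth3 where only eth3 has link (the labvirt1022 situation), the buggy variant returns NULL while the fixed variant returns the eth3 entry.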

I'd report it, but Ubuntu forked off and never submitted that patch (or any other of their patches) upstream but kept rebasing them against latest Debian, so… not much incentive for it. Plus, thankfully, we're moving away from Ubuntu, so no impact from not doing so :)

Change 424016 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] updating labvirt1022 mac address

https://gerrit.wikimedia.org/r/424016

Change 424016 merged by RobH:
[operations/puppet@production] updating labvirt1022 mac address

https://gerrit.wikimedia.org/r/424016

RobH updated the task description.
RobH removed a project: Patch-For-Review.

Ok, these are ready for service implementation. Handing off to @chasemp. labvirt1021 has puppet signed but won't run (kernel version issue for some puppet packages for use in cloud), and labvirt1022 is installed but no puppet cert sign yet.

Both these systems are now puppetized and ready for testing.