
setup/deploy server analytics1003/WMF4541
Closed, Resolved · Public · 8 Estimated Story Points

Description

This task will track the setup and deployment of server analytics1003 / WMF4541 for analytics as a new hive/oozie server.

  • update mgmt dns entries (asset tag entries exist) and add in hostname entries
  • test all bios and drac settings
  • production dns entries (analytics vlan)
  • switch port description, enable, and vlan
  • install_server updates
  • install OS - trusty
  • accept/sign puppet/salt
  • service implementation (hand off to @Ottomata for this stage.)

Event Timeline

RobH created this task.

@Ottomata:

Before we can allocate and setup this system, we need to confirm with you that this doesn't have a 'cluster' type name. Our naming conventions page shows 'analytics' for hive systems, but I was under the impression that analytics is renaming systems from the name analytics? (This could be incorrect, hence my asking on this task.) So if this is a misc system name, we'll call it phosphorus. If it is a cluster name, let me know what cluster name to use.

Also, we need to know if we should install Jessie or Trusty.

Please advise and assign this task back to me, thanks!

Hm, there's no effort to rename analytics* systems that I'm aware of. This node will run a few different Analytics Cluster related services, so I think we should stick with the analytics* naming. I know you don't like to reuse hostnames, but it's been years since we had an analytics1003. Can we reuse this name? We'll probably get new Hadoop worker nodes next FY, and it'd be nice to keep adding at the end of our numbers for workers so that they are all grouped together consistently.

Please install with Trusty. Thanks!

RobH renamed this task from setup/deploy server phosphorus/WMF4541 to setup/deploy server analytics1003/WMF4541. Mar 24 2016, 5:12 PM
RobH updated the task description.

When attempting to netboot this, it doesn't seem to hit carbon.

So I'm not entirely certain what is going on. I think it may be something on the network level, since I can document the following:

Interface       Admin Link Description
ge-4/0/18       up    up   analytics1003

robh@asw-c-eqiad> show vlans 
Name           Tag     Interfaces
analytics1-c-eqiad 1022   
                       ae1.0*, ae2.0*, ge-2/0/19.0*, ge-2/0/20.0*, ge-2/0/21.0*, ge-2/0/22.0*, ge-3/0/12.0*, ge-3/0/13.0*, ge-3/0/14.0*, ge-4/0/16.0*, ge-4/0/18.0*, ge-4/0/19.0*, ge-7/0/1.0*, ge-7/0/2.0*, ge-7/0/6.0*, ge-7/0/11.0*, ge-7/0/15.0*, ge-7/0/16.0*

So the network port is enabled with the proper vlan set. However, when I run

show ethernet-switching table

on the switch stack, there is no entry for F0:1F:AF:E7:EF:41, the mac of the primary NIC on analytics1003. It should show up, as during the PXE boot on the serial console, I can see it attempting the PXE boot with that MAC address.
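The MAC check described above can be scripted against a saved copy of the table output. This is a minimal sketch: the helper names are hypothetical, the "learned" row in the sample is invented for illustration, and it assumes MACs appear in colon- or dash-separated hex, in either case.

```python
import re

# Matches aa:bb:cc:dd:ee:ff or AA-BB-CC-DD-EE-FF.
MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b")


def normalize_mac(mac):
    """Lower-case a MAC and use ':' as the separator."""
    return mac.lower().replace("-", ":")


def macs_in_table(table_text):
    """Return every MAC address found in the pasted table output."""
    return {normalize_mac(m) for m in MAC_RE.findall(table_text)}


def mac_learned(table_text, mac):
    """True if the switch output contains the given MAC."""
    return normalize_mac(mac) in macs_in_table(table_text)


# Hypothetical sample: one learned entry, written the way a switch might print it.
SAMPLE = """
  VLAN               MAC address       Type     Age Interfaces
  analytics1-c-eqiad f0:1f:af:e7:ef:41 Learn      0 ge-4/0/18.0
"""

print(mac_learned(SAMPLE, "F0:1F:AF:E7:EF:41"))
print(mac_learned("", "F0:1F:AF:E7:EF:41"))
```

Normalizing both sides avoids a false "not found" when the console shows the MAC in upper case and the switch prints it in lower case.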

Before escalating to our network administrators, the only other thing I can think of is to have @Cmjohnson swap out the patch cable and double-check my work in the above task.

The only other possibility is that the port labeled with this server's asset tag isn't actually the correct port, so an on-site verification that analytics1003 is plugged into ge-4/0/18 is also needed.

Ok, this is odd:

analytics1-c-eqiad 1022

ae1.0*, ae2.0*, ge-2/0/19.0*, ge-2/0/20.0*, ge-2/0/21.0*, ge-2/0/22.0*, ge-3/0/12.0*, ge-3/0/13.0*, ge-3/0/14.0*, ge-4/0/16.0*, ge-4/0/18.0*, ge-4/0/19.0*, ge-7/0/1.0*,
ge-7/0/2.0*, ge-7/0/6.0*, ge-7/0/11.0*, ge-7/0/15.0*, ge-7/0/16.0*

and then

public1-c-eqiad 1003

ae1.0*, ae2.0*, ge-2/0/45.0*, ge-2/0/46.0*, ge-2/0/47.0*, ge-4/0/13.0*, ge-4/0/14.0, ge-4/0/17.0, ge-4/0/18.0*, ge-4/0/22.0*, ge-4/0/23.0*, ge-4/0/24.0*, ge-4/0/25.0*,
ge-5/0/32.0*, ge-6/0/45.0*, ge-6/0/46.0*, ge-6/0/47.0*, ge-7/0/7.0*, ge-7/0/8.0*, ge-7/0/19.0*, ge-7/0/23.0, ge-7/0/24.0*, ge-7/0/26.0*, ge-7/0/32.0, xe-8/0/23.0*, xe-8/0/24.0*,
xe-8/0/25.0*, xe-8/0/26.0*, xe-8/0/27.0*, xe-8/0/28.0*

So it's in both vlans, but I've removed it from public and committed, and yet it's still there?

I checked the config for an interface-range or regex entry that would explain why the port still shows in the public vlan, but I cannot locate anything.

interface-range vlan-analytics1-c-eqiad {
    member "ge-7/0/[1-2]";
    member ge-7/0/11;
    member ge-4/0/16;
    member ge-4/0/19;
    member "ge-7/0/[15-16]";
    member ge-7/0/6;
    member ge-4/0/18;
    member-range ge-2/0/19 to ge-2/0/22;
    member-range ge-3/0/12 to ge-3/0/14;
    unit 0 {
        family ethernet-switching {
            vlan {
                members analytics1-c-eqiad;
            }
        }
    }
}

interface-range vlan-public1-c-eqiad {
    member ge-7/0/19;
    member ge-4/0/13;
    member ge-7/0/24;
    member ge-4/0/17;
    member ge-5/0/32;
    member ge-4/0/25;               
    member ge-7/0/26;
    unit 0 {
        family ethernet-switching {
            vlan {
                members public1-c-eqiad;
            }
        }
    }
}

I'll note my switch config change was in two parts. First I added the port description, enabled the port, and added it to the analytics vlan. Then, once I rolled that live and it failed to PXE boot, I saw it was still in the public vlan. Junos used to warn about this, but it didn't this time. I went ahead and ran delete interface-range vlan... member... and it showed as removed (-) in the show | compare output before committing.

Yet it now still shows in both vlans in the output above.

Also odd:

default

ge-1/0/0.0, ge-1/0/1.0, ge-1/0/2.0, ge-1/0/3.0, ge-1/0/4.0, ge-1/0/5.0, ge-1/0/6.0, ge-1/0/7.0, ge-1/0/8.0, ge-1/0/9.0, ge-1/0/10.0, ge-1/0/11.0, ge-1/0/12.0, ge-1/0/13.0,
ge-1/0/14.0, ge-1/0/15.0, ge-1/0/16.0, ge-1/0/17.0, ge-1/0/18.0, ge-1/0/19.0, ge-1/0/20.0, ge-1/0/21.0, ge-1/0/22.0, ge-1/0/23.0, ge-1/0/24.0, ge-1/0/25.0, ge-1/0/26.0,
ge-1/0/27.0, ge-1/0/28.0, ge-1/0/29.0, ge-1/0/30.0, ge-1/0/31.0, ge-1/0/32.0, ge-1/0/33.0, ge-1/0/34.0, ge-1/0/35.0, ge-1/0/36.0, ge-1/0/37.0, ge-1/0/38.0, ge-1/0/39.0,
ge-1/0/40.0, ge-1/0/41.0, ge-1/0/42.0, ge-1/0/43.0, ge-1/0/44.0, ge-1/0/45.0, ge-1/0/46.0, ge-1/0/47.0, ge-2/0/0.0, ge-2/0/1.0, ge-2/0/8.0, ge-2/0/25.0, ge-2/0/26.0,
ge-2/0/27.0, ge-2/0/28.0, ge-2/0/29.0, ge-2/0/30.0, ge-2/0/31.0, ge-2/0/32.0, ge-2/0/33.0, ge-2/0/34.0, ge-2/0/35.0, ge-2/0/36.0, ge-2/0/37.0, ge-2/0/38.0, ge-2/0/39.0,
ge-2/0/40.0, ge-2/0/41.0, ge-2/0/42.0, ge-2/0/43.0, ge-2/0/44.0, ge-3/0/15.0, ge-3/0/18.0, ge-3/0/19.0, ge-3/0/20.0, ge-3/0/21.0, ge-3/0/22.0, ge-3/0/23.0, ge-3/0/24.0,
ge-3/0/25.0, ge-3/0/26.0, ge-3/0/27.0, ge-3/0/28.0, ge-3/0/29.0, ge-3/0/30.0, ge-3/0/31.0, ge-3/0/32.0, ge-3/0/33.0, ge-3/0/34.0, ge-3/0/35.0, ge-3/0/36.0, ge-3/0/37.0,
ge-3/0/38.0, ge-3/0/39.0, ge-3/0/40.0, ge-3/0/41.0, ge-3/0/42.0, ge-3/0/43.0, ge-3/0/44.0, ge-3/0/45.0, ge-3/0/46.0, ge-3/0/47.0, ge-4/0/10.0, ge-4/0/31.0, ge-4/0/32.0,
ge-4/0/33.0, ge-4/0/34.0, ge-4/0/35.0, ge-4/0/36.0, ge-4/0/37.0, ge-4/0/38.0, ge-4/0/39.0, ge-4/0/40.0, ge-4/0/41.0, ge-4/0/42.0, ge-4/0/43.0, ge-4/0/44.0, ge-4/0/45.0,
ge-4/0/46.0, ge-4/0/47.0, ge-5/0/40.0, ge-5/0/41.0, ge-5/0/42.0, ge-5/0/43.0, ge-5/0/44.0, ge-5/0/45.0, ge-5/0/46.0, ge-5/0/47.0, ge-6/0/40.0, ge-6/0/41.0, ge-6/0/42.0,
ge-6/0/43.0, ge-6/0/44.0, ge-7/0/35.0, ge-7/0/36.0, ge-7/0/37.0, ge-7/0/38.0, ge-7/0/39.0, ge-7/0/40.0, ge-7/0/41.0, ge-7/0/42.0, ge-7/0/43.0, ge-7/0/44.0, ge-7/0/45.0,
ge-7/0/47.0

So default doesn't show ge-4/0/18 in there, but the config does:

interface-range vlan-default {
    member ge-4/0/16;
    member ge-4/0/18;
    member ge-4/0/19;
    member ge-4/0/20;
}

Shouldn't that auto-remove when it's put into any other vlan?
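Hunting through the config by eye for every interface-range that covers a port is error-prone, so here is a minimal sketch of doing it over a saved copy of the config. It is an assumption-laden helper, not a Junos tool: the function names are hypothetical, it only understands plain member lines, a single "[lo-hi]" bracket, and member-range within one FPC/PIC, and the sample config is abridged from the stanzas pasted in this task.

```python
import re


def expand_member(expr):
    """Expand one 'member' value; handles a single [lo-hi] bracket only."""
    expr = expr.strip('"')
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", expr)
    if not m:
        return {expr}
    prefix, lo, hi, suffix = m.groups()
    return {f"{prefix}{n}{suffix}" for n in range(int(lo), int(hi) + 1)}


def expand_member_range(first, last):
    """Expand 'member-range A to B' when only the port number differs."""
    prefix, lo = first.rsplit("/", 1)
    last_prefix, hi = last.rsplit("/", 1)
    if prefix != last_prefix:
        raise ValueError("sketch only handles ranges within one FPC/PIC")
    return {f"{prefix}/{n}" for n in range(int(lo), int(hi) + 1)}


def ranges_containing(config_text, port):
    """Return the sorted names of interface-range stanzas that cover port."""
    members = {}
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        m = re.match(r"interface-range (\S+) \{", line)
        if m:
            current = m.group(1)
            members[current] = set()
        elif current is None:
            continue
        elif (m := re.match(r"member-range (\S+) to (\S+);", line)):
            members[current] |= expand_member_range(m.group(1), m.group(2))
        elif (m := re.match(r"member (\S+);", line)):
            members[current] |= expand_member(m.group(1))
    return sorted(name for name, ports in members.items() if port in ports)


# Abridged from the stanzas quoted in this task.
CONFIG = """
interface-range vlan-analytics1-c-eqiad {
    member "ge-7/0/[1-2]";
    member ge-4/0/18;
    member-range ge-2/0/19 to ge-2/0/22;
    unit 0 {
        family ethernet-switching {
            vlan {
                members analytics1-c-eqiad;
            }
        }
    }
}
interface-range vlan-public1-c-eqiad {
    member ge-7/0/19;
    member ge-4/0/17;
}
interface-range vlan-default {
    member ge-4/0/16;
    member ge-4/0/18;
}
"""

print(ranges_containing(CONFIG, "ge-4/0/18"))
# ['vlan-analytics1-c-eqiad', 'vlan-default']
```

Run against the full interfaces stanza rather than just the vlan-* ranges, this kind of check lists every range that touches the port, including ones not named after a vlan.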

At this point I'd love to roll back all my changes, but I have to go back multiple revisions. It seems the network admin pass has changed since I last used it, so I'll need someone with it to either update me, or roll back the 3 revisions on asw-c-eqiad that I made today.

I'm not entirely certain I didn't somehow cause the odd issues above, so rolling back and retrying seems the best solution.

I'm pretty sure @BBlack, @mark, and @faidon have the network admin pass and can either re-share it with me (I expect it was rotated as a matter of course?) or roll back my changes.

I'd rather try to re-do all my steps carefully before escalating the issue upwards.

Ok, Brandon rolled back all my changes (since I only have rollback 0). I'm reattempting them more carefully in a single changeset.

[edit interfaces interface-range vlan-analytics1-c-eqiad]
     member ge-7/0/6 { ... }
+    member ge-4/0/18;
[edit interfaces interface-range vlan-public1-c-eqiad]
-    member ge-4/0/18;
[edit interfaces ge-4/0/18]
-   description wmf4541;
+   description analytics1003;
-   disable;
+   enable;

Even with this now rolled live and everything looking correct, I still don't get a lease from dhcp, nor do I see it hit the dhcp server at all to request one. However, every attempt results in:

Mar 25 16:19:05 carbon kernel: [36443755.023769] iptables-dropped: IN=eth0 OUT= MAC=78:2b:cb:09:0e:a0:5c:5e:ab:3d:87:c1:08:00 SRC=1.52.230.15 DST=208.80.154.10 LEN=52 TOS=0x00 PREC=0x00 TTL=51 ID=60824 DF PROTO=TCP SPT=52920 DPT=23 WINDOW=14600 RES=0x00 SYN URGP=0
Mar 25 16:19:08 carbon kernel: [36443758.022526] iptables-dropped: IN=eth0 OUT= MAC=78:2b:cb:09:0e:a0:5c:5e:ab:3d:87:c1:08:00 SRC=1.52.230.15 DST=208.80.154.10 LEN=52 TOS=0x00 PREC=0x00 TTL=51 ID=60825 DF PROTO=TCP SPT=52920 DPT=23 WINDOW=14600 RES=0x00 SYN URGP=0

The odd part is those don't look like they should be tftp calls (based on port), but they only trigger in the log when I attempt to pxe boot. However, the src ip keeps changing, so the iptables drop may be a red herring/false positive.
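The port observation can be made concrete by splitting the log line into its KEY=VALUE fields. A minimal sketch, with hypothetical helper names; the only outside facts it relies on are that DHCP uses UDP ports 67/68 and TFTP uses UDP port 69. The sample line is the first one quoted above.

```python
# UDP ports a PXE boot would be expected to use.
PXE_UDP_PORTS = {67: "dhcp-server", 68: "dhcp-client", 69: "tftp"}


def parse_dropped(line):
    """Split an 'iptables-dropped:' kernel log line into a dict of fields.

    Bare flags such as DF or SYN are stored with the value True.
    """
    _, _, payload = line.partition("iptables-dropped:")
    fields = {}
    for token in payload.split():
        key, sep, value = token.partition("=")
        fields[key] = value if sep else True
    return fields


def is_pxe_related(fields):
    """True only for UDP packets aimed at a DHCP or TFTP port."""
    return (
        fields.get("PROTO") == "UDP"
        and int(fields.get("DPT", 0)) in PXE_UDP_PORTS
    )


LINE = (
    "Mar 25 16:19:05 carbon kernel: [36443755.023769] iptables-dropped: "
    "IN=eth0 OUT= MAC=78:2b:cb:09:0e:a0:5c:5e:ab:3d:87:c1:08:00 "
    "SRC=1.52.230.15 DST=208.80.154.10 LEN=52 TOS=0x00 PREC=0x00 TTL=51 "
    "ID=60824 DF PROTO=TCP SPT=52920 DPT=23 WINDOW=14600 RES=0x00 SYN URGP=0"
)

fields = parse_dropped(LINE)
print(fields["PROTO"], fields["DPT"], is_pxe_related(fields))
# TCP 23 False
```

The dropped packet is a TCP SYN to port 23 (telnet), which is neither a DHCP nor a TFTP request.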

So, despite having confirmed this was likely background noise with @Dzahn yesterday, its continued occurrence made me re-investigate. The iptables drops have nothing to do with my problem of PXE on analytics1003.

For some reason, that vlan doesn't seem to hit the PXE systems right now? Additionally, @Cmjohnson swapped out the patch cable to ensure it wasn't the issue. I've rolled the switch config live as noted above and it looks good.

At this point, I think I need a network admin to confirm that all the proper routing/dhcp rules are still in place for analytics1-c-eqiad to hit carbon/tftp/dhcpd.

RobH removed RobH as the assignee of this task. Apr 6 2016, 3:26 PM
RobH added a project: netops.

Yes, I think we need a network admin to investigate the dhcp ability of the analytics vlan to carbon, as I cannot seem to get the requests through.

The port was also on the labs-instance-ports interface-range, which set the port-mode to trunk (and also added labs-instances1-eqiad to the VLAN set). Since we have no VLAN set on the NIC's bios (which is good), this would account for PXE failing.

I removed it from that interface-range, please make another attempt to reinstall :)

Ok, multiple attempts have still resulted in no joy (no dhcp request hitting carbon.)

The system was also showing in the config in the default vlan stanza. Faidon removed it, and it still fails to hit carbon.

OK, I debugged this some more, and this was caused by a Juniper bug, one that I vaguely recall experiencing before. The issue was that when I removed the interface from the interface-range that applied that labs-instances VLAN, it didn't actually modify the interface. This was evident from this:

faidon@asw-c-eqiad> show ethernet-switching table interface ge-4/0/18    
Ethernet-switching table: 0 unicast entries
  VLAN                  MAC address       Type         Age Interfaces
  analytics1-c-eqiad    *                 Flood          - All-members
  labs-instances1-eqiad *                 Flood          - All-members

{master:1}
faidon@asw-c-eqiad> edit 
Entering configuration mode

{master:1}[edit]
faidon@asw-c-eqiad# edit interfaces ge-4/0/18 

{master:1}[edit interfaces ge-4/0/18]
faidon@asw-c-eqiad# show | display inheritance 
description analytics1003;
enable;
##
## 'no-traps' was expanded from interface-range 'access-ports'
##
no-traps;
##
## '9192' was inherited from group 'access-port'
##
mtu 9192;
##
## '0' was expanded from interface-range 'vlan-analytics1-c-eqiad'
##
unit 0 {
    ##
    ## 'ethernet-switching' was expanded from interface-range 'vlan-analytics1-c-eqiad'
    ##
    family ethernet-switching {
        ##
        ## 'access' was inherited from group 'access-port'
        ##
        port-mode access;
        ##
        ## 'vlan' was expanded from interface-range 'vlan-analytics1-c-eqiad'
        ##
        vlan {
            ##
            ## 'analytics1-c-eqiad' was expanded from interface-range 'vlan-analytics1-c-eqiad'
            ##
            members analytics1-c-eqiad;
        }
    }
}
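The mismatch is between what the switch is actually doing (two VLANs flooding on the port) and what the committed config says (one VLAN). A minimal sketch of that comparison over the two pasted outputs, with hypothetical function names; it assumes the table has a header row beginning with "VLAN" and that configured VLANs appear as "members NAME;" lines. The samples are abridged from the output above.

```python
import re


def operational_vlans(table_text):
    """VLAN names from the rows that follow the 'VLAN ...' header line."""
    vlans = set()
    in_rows = False
    for line in table_text.splitlines():
        tokens = line.split()
        if not tokens:
            in_rows = False  # a blank line ends the table
            continue
        if tokens[0] == "VLAN":
            in_rows = True
            continue
        if in_rows:
            vlans.add(tokens[0])
    return vlans


def configured_vlans(config_text):
    """VLAN names from 'members NAME;' lines; '##' comment lines are ignored."""
    return set(re.findall(r"^\s*members\s+(\S+);", config_text, re.M))


TABLE = """\
Ethernet-switching table: 0 unicast entries
  VLAN                  MAC address       Type         Age Interfaces
  analytics1-c-eqiad    *                 Flood          - All-members
  labs-instances1-eqiad *                 Flood          - All-members
"""

CONFIG = """\
unit 0 {
    family ethernet-switching {
        port-mode access;
        vlan {
            ##
            ## 'analytics1-c-eqiad' was expanded from interface-range 'vlan-analytics1-c-eqiad'
            ##
            members analytics1-c-eqiad;
        }
    }
}
"""

stale = operational_vlans(TABLE) - configured_vlans(CONFIG)
print(stale)
# {'labs-instances1-eqiad'}
```

Any VLAN left in the difference is one the port is still a member of even though the configuration no longer asks for it.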

I manually set port-mode and VLANs on the port:

{master:1}[edit interfaces ge-4/0/18]
faidon@asw-c-eqiad# set unit 0 family ethernet-switching port-mode access 

{master:1}[edit interfaces ge-4/0/18]
faidon@asw-c-eqiad# set unit 0 family ethernet-switching vlan members analytics1-c-eqiad

...and DHCP immediately worked. I will revert this change once the installer finishes and we have Linux up and running, as now that the configuration got "unstuck", it shouldn't apply this again.

The network config was reverted and it still works, so it should be final now. However, the installer failed at the last step with:

┌┤ [!!] Install the GRUB boot loader on a hard disk ├┐
│                                                    │
│         Unable to install GRUB in /dev/sdc         │
│ Executing 'grub-install /dev/sdc' failed.          │
│                                                    │
│ This is a fatal error.                             │
│                                                    │
│     <Go Back>                       <Continue>     │
│                                                    │
└────────────────────────────────────────────────────┘

The logs say:

Apr 14 16:21:26 grub-installer: info: Installing grub on '/dev/sdc /dev/sdd'
Apr 14 16:21:26 grub-installer: info: grub-install does not support --no-floppy
Apr 14 16:21:26 grub-installer: info: Running chroot /target grub-install  --force "/dev/sdc"
Apr 14 16:21:26 grub-installer: Installing for i386-pc platform.
Apr 14 16:21:34 grub-installer: grub-install: warning: this GPT partition label contains no BIOS Boot Partition; embedding won't be possible.
Apr 14 16:21:34 grub-installer: grub-install: error: embedding is not possible, but this is required for RAID and LVM install.
Apr 14 16:21:34 grub-installer: error: Running 'grub-install  --force "/dev/sdc"' failed.

…and /dev/sdc is indeed a 3TB disk so GPT only. I think this needs a different partman configuration, but I'll leave @RobH/@Ottomata to it :)
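The "GPT only" point follows from simple arithmetic: an MBR partition table stores sector addresses in 32 bits, so with 512-byte sectors it cannot describe more than 2 TiB. A quick sketch of that calculation, assuming 512-byte logical sectors and a marketing-style 3 TB (3 × 10^12 bytes); the function name is made up for illustration.

```python
SECTOR_BYTES = 512          # assumed logical sector size
MBR_MAX_SECTORS = 2 ** 32   # MBR uses 32-bit sector addresses
MBR_LIMIT_BYTES = SECTOR_BYTES * MBR_MAX_SECTORS


def needs_gpt(disk_bytes, sector_bytes=SECTOR_BYTES):
    """True if the disk is too large to be fully addressed by an MBR label."""
    return disk_bytes > sector_bytes * MBR_MAX_SECTORS


print(MBR_LIMIT_BYTES)           # 2199023255552, i.e. 2 TiB (about 2.2 TB)
print(needs_gpt(3 * 10 ** 12))   # True
```

With a GPT label and BIOS (i386-pc) boot, grub needs a small BIOS Boot Partition to embed its core image, which is what the grub-install warning in the log is complaining about.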

Will work on this today/tomorrow.

Ottomata set the point value for this task to 8. Apr 19 2016, 5:11 PM

Change 284249 had a related patch set uploaded (by Ottomata):
Add analytics1003 in netboot.cfg and site.pp

https://gerrit.wikimedia.org/r/284249

Change 284249 merged by Ottomata:
Add analytics1003 in netboot.cfg and site.pp

https://gerrit.wikimedia.org/r/284249

Change 284272 had a related patch set uploaded (by Ottomata):
Include analytics_cluster::client and analytics_cluster::database::meta roles on analytics1003

https://gerrit.wikimedia.org/r/284272

Change 284272 merged by Ottomata:
Include analytics_cluster::client and analytics_cluster::database::meta roles on analytics1003

https://gerrit.wikimedia.org/r/284272

Change 284276 had a related patch set uploaded (by Ottomata):
analytics1015 -> analytics1003 migration

https://gerrit.wikimedia.org/r/284276

Ok, we are ready to proceed. Plan here: https://etherpad.wikimedia.org/p/analytics-meta

  1. stop camus early (elukey & joal)
  2. verify no oozie and hive jobs are running
  3. stop puppet on analytics1015 and analytics1003
  4. stop oozie, hive, mysql on analytics1015.
  5. copy /var/lib/mysql from analytics1015 -> analytics1003.
  6. start mysql on analytics1003, make sure all is well.
  7. merge https://gerrit.wikimedia.org/r/#/c/284276/, run puppet on analytics1003.
  8. make sure hive and oozie look good from analytics1003
  9. make sure mysql backup works from analytics1003 -> analytics1002.
  10. reenable camus, run puppet on stat1002, stat1004, analytics1027.
  11. Make sure camus works fine and oozie jobs resume.

Change 284276 merged by Ottomata:
analytics1015 -> analytics1003 migration

https://gerrit.wikimedia.org/r/284276