
rack/setup/install an-worker10[78-95].eqiad.wmnet
Closed, Resolved (Public)

Description

This task will track the racking, setup, and installation of 18 new an-worker (Hadoop) nodes to be added to the existing Hadoop cluster.

Racking Proposal: A2=2 servers, A4=1 server, A7=2 servers, B2=2 servers, B4=1 server, B7=2 servers, C2=1 server, C4=2 servers, C7=1 server, D2=2 servers, D7=2 servers

This proposal was approved via IRC chat with @Cmjohnson, @RobH, and @elukey on 2018-10-16.
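The racking proposal above can be sanity-checked against the 18 nodes named in the description. The following is only a minimal sketch: the function names `parse_proposal` and `per_row` are made up for illustration, and the proposal string is copied from this task rather than read from Netbox.

```python
import re

# Racking proposal copied from the task description (not read from Netbox).
PROPOSAL = (
    "A2=2 servers, A4=1 server, A7=2 servers, B2=2 servers, B4=1 server, "
    "B7=2 servers, C2 =1, C4=2, C7=1, D2=2, D7=2"
)


def parse_proposal(text):
    """Return {rack: server_count} for entries such as 'A2=2 servers' or 'C2 =1'."""
    pairs = re.findall(r"([A-D]\d+)\s*=\s*(\d+)", text)
    return {rack: int(count) for rack, count in pairs}


def per_row(racks):
    """Sum the server counts per row (the first letter of the rack name)."""
    rows = {}
    for rack, count in racks.items():
        rows[rack[0]] = rows.get(rack[0], 0) + count
    return rows


racks = parse_proposal(PROPOSAL)
print(sum(racks.values()))  # 18
print(per_row(racks))       # {'A': 5, 'B': 5, 'C': 4, 'D': 4}
```

The total matches the 18 nodes in the description, with rows A and B taking five servers each and rows C and D four each.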

an-worker1078-1095

  • - receive in system on procurement task T204177
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - handoff for service implementation

Event Timeline

RobH created this task. Oct 16 2018, 4:45 PM
RobH triaged this task as Normal priority.
Cmjohnson moved this task from Backlog to Racking Tasks on the ops-eqiad board. Oct 18 2018, 4:14 PM
fdans moved this task from Incoming to Radar on the Analytics board. Oct 18 2018, 5:03 PM

Change 469656 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding dns entries for an-worker10[78-96]

https://gerrit.wikimedia.org/r/469656

Change 469656 abandoned by Cmjohnson:
Adding dns entries for an-worker10[78-96]

https://gerrit.wikimedia.org/r/469656

Change 469664 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding dns entries an-worker10[78-96]

https://gerrit.wikimedia.org/r/469664

Cmjohnson updated the task description. Oct 25 2018, 6:13 PM

Change 469664 merged by Cmjohnson:
[operations/dns@master] Adding dns entries an-worker10[78-95]

https://gerrit.wikimedia.org/r/469664

Cmjohnson updated the task description. Oct 31 2018, 3:42 PM
Cmjohnson updated the task description. Nov 7 2018, 3:10 PM
Cmjohnson updated the task description.
Cmjohnson reassigned this task from Cmjohnson to RobH. Nov 13 2018, 7:44 PM

RobH, can you do the installs, please?

elukey claimed this task. Nov 14 2018, 6:36 AM

Change 473359 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add an-worker1078-95 basic settings

https://gerrit.wikimedia.org/r/473359

elukey moved this task from Backlog to In Progress on the User-Elukey board. Nov 14 2018, 7:36 AM

Change 473359 merged by Elukey:
[operations/puppet@production] Add an-worker1078-95 basic settings

https://gerrit.wikimedia.org/r/473359

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1078.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201811140746_elukey_100297.log.

Change 473387 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set a different mac address for an-worker1078's DHCP

https://gerrit.wikimedia.org/r/473387

Change 473387 merged by Elukey:
[operations/puppet@production] Set a different mac address for an-worker1078's DHCP

https://gerrit.wikimedia.org/r/473387

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1078.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201811140943_elukey_141516.log.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1078.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201811141008_elukey_149572.log.

Completed auto-reimage of hosts:

['an-worker1078.eqiad.wmnet']

and were ALL successful.

@Cmjohnson @RobH

I tried this morning to configure the an-workers with https://gerrit.wikimedia.org/r/#/c/473359/, but then I realized that the integrated NICs are not used, since those hosts have 10G interfaces. I then followed up with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/473359/, using the MAC address grabbed directly from System Setup.

To make an-worker1078's reimage work I also had to:

  1. set Serial Communication to: On with console redirection via COM2 (was Auto) - I didn't test extensively whether Auto is a problem, because this was one of the changes that I tried before finding the real issue. If needed I'll switch back to Auto, but the docs need to be updated.
  2. set the Legacy Boot Protocol option of the 10G NIC in the BIOS to PXE, as indicated in the docs.

an-worker1078 is done, but now I am not able to proceed with PXE boots since it seems that a lot of hosts have link down status:

elukey@asw2-a-eqiad> show interfaces descriptions | match an-worker
Interface       Admin Link Description
------------------------------------------------
xe-2/0/11       up    up   an-worker1078
xe-2/0/14       up    down an-worker1079
xe-4/0/11       up    down an-worker1080
xe-7/0/32       up    down an-worker1081
xe-7/0/33       up    down an-worker1081

elukey@asw2-b-eqiad> show interfaces descriptions | match an-worker
xe-2/0/10       up    down an-worker1083
xe-2/0/11       up    down an-worker1084
xe-4/0/4        up    down an-worker1085
xe-7/0/22       up    down an-worker1086
xe-7/0/23       up    down an-worker1087

elukey@asw2-c-eqiad> show interfaces descriptions | match an-worker
xe-2/0/29       up    down an-worker1088
xe-4/0/19       up    down an-worker1090
xe-4/0/20       up    down an-worker1089
xe-7/0/23       up    down an-worker1091

elukey@asw2-d-eqiad> show interfaces descriptions | match an-worker
xe-2/0/9        up    down an-worker1092
xe-2/0/10       up    down an-worker1093
xe-7/0/2        down  down an-worker1094      <-------------- | these have also admin down
xe-7/0/3        down  down an-worker1095      <-------------- | status

Can you check?
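The check done by eye on the switch output above could be scripted. This is a rough sketch that assumes the `show interfaces descriptions | match an-worker` output has been saved as plain text; `down_ports` is a hypothetical helper, not part of any Wikimedia tooling, and the sample is the asw2-d-eqiad output pasted above.

```python
# Sample copied from the asw2-d-eqiad output in this comment.
SAMPLE = """\
xe-2/0/9        up    down an-worker1092
xe-2/0/10       up    down an-worker1093
xe-7/0/2        down  down an-worker1094      <-------------- | these have also admin down
xe-7/0/3        down  down an-worker1095      <-------------- | status
"""


def down_ports(output):
    """Return (interface, admin_state, description) for every port whose link is down."""
    ports = []
    for line in output.splitlines():
        fields = line.split()
        # Skip prompts, headers, separators and anything that is not an xe- port line.
        if len(fields) < 4 or not fields[0].startswith("xe-"):
            continue
        iface, admin, link, desc = fields[:4]
        if link == "down":
            ports.append((iface, admin, desc))
    return ports


for iface, admin, desc in down_ports(SAMPLE):
    note = " (admin down too)" if admin == "down" else ""
    print(f"{desc}: {iface} link down{note}")
```

Separating the admin state from the link state makes it easy to tell ports that are merely not linked from ports that are also administratively disabled, as an-worker1094 and an-worker1095 are here.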

elukey reassigned this task from elukey to Cmjohnson. Nov 15 2018, 8:27 AM

Change 474273 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Update an-worker1080's DHCP MAC address (10G interface)

https://gerrit.wikimedia.org/r/474273

Change 474273 merged by Elukey:
[operations/puppet@production] Update an-worker1080's DHCP MAC address (10G interface)

https://gerrit.wikimedia.org/r/474273

elukey added a comment (edited). Nov 16 2018, 3:36 PM

So I changed the MAC address of an-worker1080 to the 10G interface listed in System Setup as "connected", and forced a PXE boot with F12 (via serial console). I see a blank screen with a stuck cursor (so no d-i popping up), and the following on the switch:

elukey@asw2-a-eqiad> show interfaces descriptions |match an-worker
xe-2/0/11       up    up   an-worker1078
xe-2/0/14       up    down an-worker1079
xe-4/0/11       up    up   an-worker1080         <==============
xe-7/0/32       up    down an-worker1081
xe-7/0/33       up    down an-worker1081

Nov 16 15:28:43  asw2-a-eqiad l2cpd[2036]: LLDP_NEIGHBOR_UP: A neighbor has come up for interface xe-4/0/11. Now, this interface has 1 neighbor/s .
Nov 16 15:29:27  asw2-a-eqiad fpc4 UPDN msg to kernel for ifd:xe-4/0/11, flag:2, speed: 10000000000, duplex:2
Nov 16 15:29:27  asw2-a-eqiad mcsnoopd[99426]: EVENT <UpDown> xe-4/0/11.16386 index 600 <Broadcast Multicast> address #<0> <04:06:04:03:05:06>
Nov 16 15:29:27  asw2-a-eqiad mcsnoopd[99426]: EVENT <UpDown> xe-4/0/11 index 1015 <Broadcast Multicast> address #<0> <04:06:04:03:05:06>
Nov 16 15:29:27  asw2-a-eqiad rpd[2042]: EVENT <UpDown> xe-4/0/11.16386 index 600 <Broadcast Multicast> address #0 4c.16.fc.fb.3d.8e
Nov 16 15:29:27  asw2-a-eqiad rpd[2042]: EVENT <UpDown> xe-4/0/11 index 1015 <Broadcast Multicast> address #0 4c.16.fc.fb.3d.8e
Nov 16 15:29:27  asw2-a-eqiad mib2d[99429]: SNMP_TRAP_LINK_DOWN: ifIndex 930, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-4/0/11
Nov 16 15:29:28  asw2-a-eqiad fpc4 UPDN msg to kernel for ifd:xe-4/0/11, flag:1, speed: 10000000000, duplex:2
Nov 16 15:29:28  asw2-a-eqiad l2cpd[2036]: LLDP_NEIGHBOR_DOWN: A neighbor of interface xe-4/0/11 has gone down. Now, this interface has 0 neighbor/s.
Nov 16 15:29:28  asw2-a-eqiad mcsnoopd[99426]: EVENT <UpDown> xe-4/0/11.16386 index 600 <Up Broadcast Multicast> address #<0> <04:06:04:03:05:06>
Nov 16 15:29:28  asw2-a-eqiad mcsnoopd[99426]: EVENT <UpDown> xe-4/0/11 index 1015 <Up Broadcast Multicast> address #<0> <04:06:04:03:05:06>
Nov 16 15:29:28  asw2-a-eqiad rpd[2042]: EVENT <UpDown> xe-4/0/11.16386 index 600 <Up Broadcast Multicast> address #0 4c.16.fc.fb.3d.8e
Nov 16 15:29:28  asw2-a-eqiad rpd[2042]: EVENT <UpDown> xe-4/0/11 index 1015 <Up Broadcast Multicast> address #0 4c.16.fc.fb.3d.8e
Nov 16 15:29:28  asw2-a-eqiad l2cpd[2036]: LLDP_NEIGHBOR_UP: A neighbor has come up for interface xe-4/0/11. Now, this interface has 1 neighbor/s .
Nov 16 15:29:28  asw2-a-eqiad mib2d[99429]: SNMP_TRAP_LINK_UP: ifIndex 930, ifAdminStatus up(1), ifOperStatus up(1), ifName xe-4/0/11
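To see the flapping at a glance, the link traps can be pulled out of the switch log above. This is only a sketch: the regular expression is written against the two `SNMP_TRAP_LINK_*` lines pasted above and may not cover other Junos log formats, and `link_events` is a hypothetical helper name.

```python
import re

# Pattern written against the mib2d lines pasted above; other formats may differ.
TRAP = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d)\s+(?P<host>\S+) mib2d\[\d+\]: "
    r"SNMP_TRAP_LINK_(?P<state>UP|DOWN): .*ifName (?P<ifname>\S+)$"
)

LOG = """\
Nov 16 15:29:27  asw2-a-eqiad mib2d[99429]: SNMP_TRAP_LINK_DOWN: ifIndex 930, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-4/0/11
Nov 16 15:29:28  asw2-a-eqiad l2cpd[2036]: LLDP_NEIGHBOR_UP: A neighbor has come up for interface xe-4/0/11. Now, this interface has 1 neighbor/s .
Nov 16 15:29:28  asw2-a-eqiad mib2d[99429]: SNMP_TRAP_LINK_UP: ifIndex 930, ifAdminStatus up(1), ifOperStatus up(1), ifName xe-4/0/11
"""


def link_events(log_text):
    """Return (timestamp, interface, state) for each SNMP link trap line."""
    events = []
    for line in log_text.splitlines():
        match = TRAP.match(line.strip())
        if match:
            events.append((match["ts"], match["ifname"], match["state"]))
    return events


for ts, ifname, state in link_events(LOG):
    print(ts, ifname, state)
# Nov 16 15:29:27 xe-4/0/11 DOWN
# Nov 16 15:29:28 xe-4/0/11 UP
```

A down/up pair one second apart on the same port is the flap visible in the log excerpt.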

But on install1002 there is no trace of a DHCP request. So the up/down state is probably due to the host state (not cabling?), but I cannot explain the missing DHCP request (since it worked perfectly fine for an-worker1079).

I also tried to explicitly disable the integrated NIC's boot option (and allow only the one from the 10G NIC), but it didn't work (blank screen with a stuck cursor instead of d-i, no DHCP request on install1002).

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1079.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201811161710_elukey_130936.log.

elukey added a comment (edited). Nov 16 2018, 5:24 PM

Chris solved the mystery!

17:22  <cmjohnson1> elukey: I believe it's a vlan issue. I don't think anything but maybe 1 of the servers in row A is in a vlan at the moment. the vlan was not available when I was adding the switch ports.  updating now
17:25  <cmjohnson1> elukey success!  that was the issue

Completed auto-reimage of hosts:

['an-worker1079.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1080.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201811190729_elukey_12231.log.

Completed auto-reimage of hosts:

['an-worker1080.eqiad.wmnet']

and were ALL successful.

elukey renamed this task from rack/setup/install an-worker10[78-96].eqiad.wmnet to rack/setup/install an-worker10[78-95].eqiad.wmnet. Nov 19 2018, 8:27 AM
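The rename above changes how many hosts the bracket range covers. The sketch below expands both spellings to make the difference concrete; `expand_hosts` is a hypothetical helper written for this illustration and is not the syntax parser used by any Wikimedia tool.

```python
import re


def expand_hosts(pattern, domain="eqiad.wmnet"):
    """Expand 'an-worker10[78-95]' into a list of fully qualified host names."""
    match = re.fullmatch(r"(?P<prefix>[\w-]+?)\[(?P<start>\d+)-(?P<end>\d+)\]", pattern)
    if match is None:
        raise ValueError(f"unsupported host pattern: {pattern!r}")
    prefix = match["prefix"]
    start, end = match["start"], match["end"]
    width = len(start)  # keep any zero padding used in the range bounds
    return [
        f"{prefix}{number:0{width}d}.{domain}"
        for number in range(int(start), int(end) + 1)
    ]


new_range = expand_hosts("an-worker10[78-95]")
old_range = expand_hosts("an-worker10[78-96]")
print(len(new_range), new_range[0], new_range[-1])
# 18 an-worker1078.eqiad.wmnet an-worker1095.eqiad.wmnet
print(len(old_range))  # 19
```

The [78-95] form yields the 18 hosts named in the description, while [78-96] would name 19.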

Change 474635 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set the DHCP settings for the an-worker nodes to their 10G NIC mac

https://gerrit.wikimedia.org/r/474635

Change 474635 merged by Elukey:
[operations/puppet@production] Set 10G mac addresses of an-worker nodes in DHCP config

https://gerrit.wikimedia.org/r/474635

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1081.eqiad.wmnet', 'an-worker1082.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201811190857_elukey_38321.log.

Completed auto-reimage of hosts:

['an-worker1082.eqiad.wmnet']

Of which those FAILED:

['an-worker1082.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1082.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201811191002_elukey_57880.log.

Completed auto-reimage of hosts:

['an-worker1082.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1083.eqiad.wmnet', 'an-worker1084.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201811191027_elukey_67173.log.

elukey added a comment (edited). Nov 19 2018, 10:44 AM

@Cmjohnson the Debian OS install is in progress, but I think that an-worker109[45] have their network ports disabled. Can you check whenever you have time?

elukey@asw2-d-eqiad> show interfaces descriptions | match an-worker
xe-2/0/9        up    up   an-worker1092
xe-2/0/10       up    up   an-worker1093
xe-7/0/2        down  down an-worker1094         <------------
xe-7/0/3        down  down an-worker1095         <------------

Addendum: it seems that an-worker1086/91 show the same hanging behavior while PXE booting. Could it be VLAN related as well?

Completed auto-reimage of hosts:

['an-worker1084.eqiad.wmnet']

Of which those FAILED:

['an-worker1084.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1085.eqiad.wmnet', 'an-worker1086.eqiad.wmnet', 'an-worker1087.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201811191128_elukey_86765.log.

Completed auto-reimage of hosts:

['an-worker1086.eqiad.wmnet']

Of which those FAILED:

['an-worker1086.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1087.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201811191256_elukey_113297.log.

Completed auto-reimage of hosts:

['an-worker1087.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1088.eqiad.wmnet', 'an-worker1089.eqiad.wmnet', 'an-worker1090.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201811191309_elukey_119791.log.

Completed auto-reimage of hosts:

['an-worker1089.eqiad.wmnet']

Of which those FAILED:

['an-worker1089.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1084.eqiad.wmnet', 'an-worker1086.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201811191345_elukey_130887.log.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1091.eqiad.wmnet', 'an-worker1092.eqiad.wmnet', 'an-worker1093.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201811191455_elukey_150235.log.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1089.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201811191517_elukey_157215.log.

Completed auto-reimage of hosts:

['an-worker1089.eqiad.wmnet']

Of which those FAILED:

['an-worker1089.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1090.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201811191546_elukey_165718.log.

Completed auto-reimage of hosts:

['an-worker1090.eqiad.wmnet']

and were ALL successful.

Change 474724 had a related patch set uploaded (by Volans; owner: Volans):
[operations/dns@master] Fix an-worker1089 management PTRs

https://gerrit.wikimedia.org/r/474724

Change 474724 merged by Elukey:
[operations/dns@master] Fix an-worker1089 management PTRs

https://gerrit.wikimedia.org/r/474724

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1094.eqiad.wmnet', 'an-worker1095.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201811201016_elukey_248035.log.

@Cmjohnson the Debian OS install is in progress, but I think that an-worker109[45] have their network ports disabled. Can you check whenever you have time?

elukey@asw2-d-eqiad> show interfaces descriptions | match an-worker
xe-2/0/9        up    up   an-worker1092
xe-2/0/10       up    up   an-worker1093
xe-7/0/2        down  down an-worker1094         <------------
xe-7/0/3        down  down an-worker1095         <------------

Addendum: it seems that an-worker1086/91 show the same hanging behavior while PXE booting. Could it be VLAN related as well?

Alex fixed the issue: the interfaces were still listed in the "disabled" interface-range.

Completed auto-reimage of hosts:

['an-worker1094.eqiad.wmnet', 'an-worker1095.eqiad.wmnet']

and were ALL successful.

elukey closed this task as Resolved. Nov 20 2018, 11:28 AM