setup/install an-coord1001/wmf7621
Closed, ResolvedPublic

Description

This task will track the setup and installation of one new analytics machine, approved on T198685.

an-coord1001/wmf7621:

  • bios/drac/serial setup/testing
  • mgmt dns entries added for both asset tag and hostname
  • network port setup (description, enable, vlan)
  • production dns entries added
  • operations/puppet update (install_server at minimum, other files if possible)
  • OS installation
  • puppet accept/initial run
  • handoff for service implementation

Event Timeline

RobH triaged this task as Medium priority. Sep 20 2018, 3:47 PM
RobH created this task.
This comment was removed by RobH.
RobH renamed this task from setup/install new analytics servers (hostname needed) to setup/install new analytics servers (hostname needed) wmf7621. Sep 20 2018, 3:53 PM
RobH updated the task description.
RobH renamed this task from setup/install new analytics servers (hostname needed) wmf7621 to setup/install analytics-coord1001/wmf7621. Sep 20 2018, 3:57 PM
RobH updated the task description.
RobH updated the task description.
RobH updated the task description.
RobH renamed this task from setup/install analytics-coord1001/wmf7621 to setup/install an-coord1001/wmf7621. Sep 20 2018, 8:57 PM
robh@asw2-b-eqiad# show | compare 
[edit interfaces interface-range disabled]
-    member ge-1/0/7;
[edit interfaces interface-range vlan-private1-b-eqiad]
-    member ge-1/0/7;
[edit interfaces interface-range vlan-analytics1-b-eqiad]
     member ge-8/0/21 { ... }
+    member ge-1/0/7;
[edit interfaces ge-1/0/7]
-   description wmf7621-DISABLED;
+   description an-coord1001;

Change 461771 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] an-coord1001 dns updates

https://gerrit.wikimedia.org/r/461771

Change 461771 merged by RobH:
[operations/dns@master] an-coord1001 dns updates

https://gerrit.wikimedia.org/r/461771

RobH added a subscriber: Cmjohnson.

@Cmjohnson:

It appears this system has had mgmt dns added and is racked, but its mgmt interface is not online. I've attempted to ping it directly from neodymium with no result, whereas other mgmt interfaces respond to ping:

robh@neodymium:~$ ping bast1002.mgmt.eqiad.wmnet
PING bast1002.mgmt.eqiad.wmnet (10.65.7.78) 56(84) bytes of data.
64 bytes from wmf4749.mgmt.eqiad.wmnet (10.65.7.78): icmp_seq=1 ttl=62 time=0.860 ms
64 bytes from wmf4749.mgmt.eqiad.wmnet (10.65.7.78): icmp_seq=2 ttl=62 time=0.838 ms
^C
--- bast1002.mgmt.eqiad.wmnet ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.838/0.849/0.860/0.011 ms
robh@neodymium:~$ ping wmf7621.mgmt.eqiad.wmnet
PING wmf7621.mgmt.eqiad.wmnet (10.65.1.12) 56(84) bytes of data.

no returns.

Can you check the mgmt interface and ensure it is both plugged in and properly configured? Also please go ahead and check all bios/drac settings and then you can assign back to me for installation.
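The reachability check in the ping output above can also be scripted. This is a minimal sketch, assuming the iputils-style summary line shown for bast1002; the function name `ping_reachable` is hypothetical and not part of any existing tooling.

```python
import re

# Matches the iputils summary line, e.g.
# "2 packets transmitted, 2 received, 0% packet loss, time 1001ms"
PING_SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) received")


def ping_reachable(output):
    """Return True if any reply was received, False if none was,
    and None if the output contains no summary line at all."""
    match = PING_SUMMARY.search(output)
    if match is None:
        return None
    return int(match.group(2)) > 0


# Summary line copied from the bast1002 ping output above.
good = "2 packets transmitted, 2 received, 0% packet loss, time 1001ms"
# The wmf7621 paste only contains the PING header, so no summary is available.
no_summary = "PING wmf7621.mgmt.eqiad.wmnet (10.65.1.12) 56(84) bytes of data."

print(ping_reachable(good))
print(ping_reachable(no_summary))
```

Returning `None` for a missing summary keeps "no replies" distinct from "ping was interrupted before it printed statistics".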

Any chance that this work can be done before the end of next week? If so I'll plan some maintenance time for Hadoop :)

Cmjohnson updated the task description.

@elukey everything looks good on our end; I was able to access the server:

root@an-coord1001.mgmt.eqiad.wmnet's password:
/admin1->

Change 463297 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add an-coord1001 to puppet

https://gerrit.wikimedia.org/r/463297

Change 463297 merged by Elukey:
[operations/puppet@production] Add an-coord1001 to puppet

https://gerrit.wikimedia.org/r/463297

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

an-coord1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201810010828_elukey_12971_an-coord1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-coord1001.eqiad.wmnet']

Of which those FAILED:

['an-coord1001.eqiad.wmnet']

So attempting to install the OS leads to:

iDRAC Settings:
CBL0009: Backplane 1 connector A0 is not connected.
CBL0009: Backplane 1 connector B0 is not connected.

UEFI0116: One or more boot drivers have reported issue(s).
Check the Driver Health Menu in Boot Manager for details.


Available Actions:
F1 to Continue and Retry Boot Order
F2 for System Setup (BIOS)
F10 for LifeCycle Controller
- Enable/Configure iDRAC
- Update or Backup/Restore Server Firmware
- Help Install an Operating System
F11 for Boot Manager

DHCP seems to work from install1002's logs, but if I check the boot options I don't see any HD mentioned. @Cmjohnson any suggestions?

This comment was removed by RobH.

The disks are now being seen by the controller. This server was the spare we borrowed a cable from to work on cloudvirt1023; I reconnected the cable and now the disks are showing up. I also set the BIOS boot order to boot from disks first.

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

an-coord1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201810030714_elukey_21698_an-coord1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-coord1001.eqiad.wmnet']

Of which those FAILED:

['an-coord1001.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

an-coord1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201810030715_elukey_21771_an-coord1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-coord1001.eqiad.wmnet']

Of which those FAILED:

['an-coord1001.eqiad.wmnet']

After hitting F10 by mistake, the host got stuck several times at:

Unified Server Configurator does not support console redirection

After some reboots:

UEFI0019: Lifecycle Controller (LC) is unable to complete a requested task or
function and prevented the boot process from completing on multiple attempts.
LC is in Recovery Mode.
Repair Lifecycle Controller firmware using the Lifecycle Controller Dell Update
Package (DUP) or Lifecycle Controller Repair Package via iDRAC. For more
information, see Lifecycle Controller User's Guide.

Tried to follow https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_RN10#Unified_Server_Configurator_does_not_support_console_redirection but didn't manage to hit the ctrl+e.

From System Setup I can see that D0:94:66:5F:75:BC (set in puppet) shows Link Status Connected, while the other NICs are disconnected, so I should have set the right NIC in the dhcp settings in puppet. I also checked on install1002 and I can see DHCP OFFERs/ACKs happening.

NIC configuration is set to PXE for Legacy Boot protocol.

No idea what's happening, there might be some magic to do that I don't know. @RobH @Cmjohnson any idea?

If I leave the PXE boot running (even if it seems stuck in a blank screen) I end up in:

Loading Linux 4.9.0-8-amd64 ...
Loading initial ramdisk ...

But it doesn't go further than this.

From the install1002 point of view:

Oct  3 12:38:12 install1002 dhcpd: DHCPOFFER on 10.64.21.104 to d0:94:66:5f:75:bc via 10.64.21.2
Oct  3 12:38:12 install1002 dhcpd: DHCPOFFER on 10.64.21.104 to d0:94:66:5f:75:bc via 10.64.21.3
Oct  3 12:38:16 install1002 dhcpd: DHCPREQUEST for 10.64.21.104 (208.80.154.22) from d0:94:66:5f:75:bc via 10.64.21.3
Oct  3 12:38:16 install1002 dhcpd: DHCPACK on 10.64.21.104 to d0:94:66:5f:75:bc via 10.64.21.3
Oct  3 12:38:16 install1002 dhcpd: DHCPREQUEST for 10.64.21.104 (208.80.154.22) from d0:94:66:5f:75:bc via 10.64.21.2
Oct  3 12:38:16 install1002 dhcpd: DHCPACK on 10.64.21.104 to d0:94:66:5f:75:bc via 10.64.21.2
Oct  3 12:38:16 install1002 atftpd[507]: Serving lpxelinux.0 to 10.64.21.104:2070
Oct  3 12:38:16 install1002 atftpd[507]: Serving lpxelinux.0 to 10.64.21.104:2071
Oct  3 12:38:52 install1002 dhcpd: DHCPOFFER on 10.64.21.104 to d0:94:66:5f:75:bc via 10.64.21.3
Oct  3 12:38:52 install1002 dhcpd: DHCPOFFER on 10.64.21.104 to d0:94:66:5f:75:bc via 10.64.21.2
Oct  3 12:38:52 install1002 dhcpd: DHCPREQUEST for 10.64.21.104 (208.80.154.22) from d0:94:66:5f:75:bc via 10.64.21.3
Oct  3 12:38:52 install1002 dhcpd: DHCPACK on 10.64.21.104 to d0:94:66:5f:75:bc via 10.64.21.3
Oct  3 12:38:52 install1002 dhcpd: DHCPREQUEST for 10.64.21.104 (208.80.154.22) from d0:94:66:5f:75:bc via 10.64.21.2
Oct  3 12:38:52 install1002 dhcpd: DHCPACK on 10.64.21.104 to d0:94:66:5f:75:bc via 10.64.21.2
root@install1002:/home/elukey# grep 10.64.21.104 /var/log/nginx/access.log | grep 12:3
10.64.21.104 - - [03/Oct/2018:12:38:16 +0000] "GET /tftpboot/stretch-installer/ldlinux.c32 HTTP/1.0" 200 116552 "-" "Syslinux/6.03"
10.64.21.104 - - [03/Oct/2018:12:38:16 +0000] "GET /tftpboot/stretch-installer/pxelinux.cfg/ttyS1-115200 HTTP/1.0" 200 479 "-" "Syslinux/6.03"
10.64.21.104 - - [03/Oct/2018:12:38:16 +0000] "GET /tftpboot/stretch-installer/pxelinux.cfg/boot.txt HTTP/1.0" 200 49 "-" "Syslinux/6.03"
10.64.21.104 - - [03/Oct/2018:12:38:26 +0000] "GET /tftpboot/stretch-installer/debian-installer/amd64/linux HTTP/1.0" 200 4224800 "-" "Syslinux/6.03"
10.64.21.104 - - [03/Oct/2018:12:38:28 +0000] "GET /tftpboot/stretch-installer/debian-installer/amd64/initrd.gz HTTP/1.0" 200 62215609 "-" "Syslinux/6.03"
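To confirm from logs like the ones above that the DHCP exchange completed for a given MAC, the dhcpd lines can be tallied per message type. This is a minimal sketch: the regular expression is inferred only from the lines pasted in this task, and `summarize_dhcp` and `lease_completed` are hypothetical helper names.

```python
import re
from collections import Counter

# Line layout inferred from the dhcpd syslog entries pasted above (an assumption,
# not a full grammar of ISC dhcpd log output).
DHCP_LINE = re.compile(
    r"dhcpd: (?P<msg>DHCP[A-Z]+) (?:on|for) (?P<ip>\d+\.\d+\.\d+\.\d+)"
    r".*?(?:to|from) (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})"
)


def summarize_dhcp(lines, mac):
    """Count DHCP message types logged for one MAC address (case-insensitive)."""
    counts = Counter()
    for line in lines:
        match = DHCP_LINE.search(line)
        if match and match.group("mac") == mac.lower():
            counts[match.group("msg")] += 1
    return counts


def lease_completed(counts):
    """Treat the exchange as complete once at least one DHCPACK was logged."""
    return counts["DHCPACK"] > 0


# A few lines copied from the install1002 log above.
sample = [
    "Oct  3 12:38:12 install1002 dhcpd: DHCPOFFER on 10.64.21.104 to d0:94:66:5f:75:bc via 10.64.21.2",
    "Oct  3 12:38:16 install1002 dhcpd: DHCPREQUEST for 10.64.21.104 (208.80.154.22) from d0:94:66:5f:75:bc via 10.64.21.3",
    "Oct  3 12:38:16 install1002 dhcpd: DHCPACK on 10.64.21.104 to d0:94:66:5f:75:bc via 10.64.21.3",
    "Oct  3 12:38:16 install1002 atftpd[507]: Serving lpxelinux.0 to 10.64.21.104:2070",
]

counts = summarize_dhcp(sample, "D0:94:66:5F:75:BC")
print(dict(counts), lease_completed(counts))
```

In this task the DHCP and HTTP fetches all succeeded, which is what pointed away from the network and toward the console/BIOS configuration.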

Got some help from Faidon: one BIOS setting for the serial console wasn't correct (I've now set Serial Port Address to Serial Device1=COM1,Serial Device2=COM2). Now I can PXE boot correctly!

Regarding the Lifecycle Controller recovery mode reported earlier: I went into BIOS settings -> iDRAC and re-enabled the Lifecycle Controller. All good now (it is no longer in recovery mode).

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

an-coord1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201810031346_elukey_4191_an-coord1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-coord1001.eqiad.wmnet']

and were ALL successful.

elukey updated the task description.

Assigning to Rob to see if anything needs to be done from the DC ops side before closing.