
reinstall bast4001 with jessie
Closed, ResolvedPublic

Description

bast4001 is still on precise, reinstall it with jessie

Event Timeline

Dzahn updated the task description.
Dzahn raised the priority of this task from to Needs Triage.
Dzahn claimed this task.
Dzahn added a project: Operations.

Change 264330 had a related patch set uploaded (by Dzahn):
install_server: switch bast4001 to jessie

https://gerrit.wikimedia.org/r/264330

Change 264330 merged by Dzahn:
install_server: switch bast4001 to jessie

https://gerrit.wikimedia.org/r/264330

eh, yea :) This is also the TFTP server, so reinstalling it is kind of tricky because it needs itself for the install.

Is it possible to point it at one of the other DC's tftp servers?

Dzahn triaged this task as Normal priority. Jan 22 2016, 8:27 PM

@Krenair yea, I asked about it, and Mark and Rob told me it works, just very slowly.

Change 275139 had a related patch set uploaded (by Dzahn):
ganglia: don't try to use upstart on jessie

https://gerrit.wikimedia.org/r/275139

Change 275139 merged by Dzahn:
ganglia: don't try to use upstart on jessie

https://gerrit.wikimedia.org/r/275139

Change 275146 had a related patch set uploaded (by Dzahn):
ganglia: add unit file for systemd on jessie

https://gerrit.wikimedia.org/r/275146

Change 275146 merged by Dzahn:
ganglia: add unit file template for systemd

https://gerrit.wikimedia.org/r/275146

This should now be unblocked since ganglia::monitor::aggregator now works with systemd.

Dzahn set Security to None.
Dzahn added a subtask: Restricted Task. Mar 18 2016, 4:33 PM
Dzahn raised the priority of this task from Normal to High. Mar 31 2016, 1:02 AM
Dzahn removed Dzahn as the assignee of this task. Apr 1 2016, 11:01 PM

Just giving it to the pool while I'm on vacation. If anyone wants to take it, please go for it. Otherwise I'll take it back when I get back.

ganglia-aggregator works on jessie now (see bast3001) so that should not be a problem

Change 283359 had a related patch set uploaded (by Dzahn):
dhcp: let ulsfo public subnet use carbon as TFTP

https://gerrit.wikimedia.org/r/283359

Change 283359 merged by Dzahn:
dhcp: let ulsfo public subnet use carbon as TFTP

https://gerrit.wikimedia.org/r/283359

Change 283361 had a related patch set uploaded (by Dzahn):
network: remove bast4001 SLAAC IPs

https://gerrit.wikimedia.org/r/283361

Dzahn added a comment. Edited Apr 15 2016, 11:16 PM

on carbon, /var/log/syslog, we can see how DHCP works:

198.35.26.5 is bast4001

Apr 15 22:33:51 carbon dhcpd: DHCPDISCOVER from 90:b1:1c:4d:42:49 via 198.35.26.2
Apr 15 22:33:51 carbon dhcpd: DHCPOFFER on 198.35.26.5 to 90:b1:1c:4d:42:49 via 198.35.26.2
Apr 15 22:33:51 carbon dhcpd: DHCPDISCOVER from 90:b1:1c:4d:42:49 via 198.35.26.3
Apr 15 22:33:51 carbon dhcpd: DHCPOFFER on 198.35.26.5 to 90:b1:1c:4d:42:49 via 198.35.26.3
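
As a rough aid for reading excerpts like the one above, here is a minimal Python sketch that pulls the event type, offered IP, client MAC and relay address out of dhcpd syslog lines. The regular expression and the function name `parse_dhcpd_line` are assumptions made for illustration; they only cover the two message types shown here, not every dhcpd log format.

```python
import re

# Hypothetical pattern covering only the DHCPDISCOVER / DHCPOFFER lines quoted above.
DHCPD_LINE = re.compile(
    r"dhcpd: (?P<event>DHCPDISCOVER|DHCPOFFER)"
    r"(?: on (?P<offered_ip>\S+))?"        # only present on DHCPOFFER
    r" (?:from|to) (?P<mac>[0-9a-f:]{17})"
    r" via (?P<relay>\S+)"
)


def parse_dhcpd_line(line):
    """Return a dict of fields for a matching line, or None otherwise."""
    match = DHCPD_LINE.search(line)
    return match.groupdict() if match else None


sample = [
    "Apr 15 22:33:51 carbon dhcpd: DHCPDISCOVER from 90:b1:1c:4d:42:49 via 198.35.26.2",
    "Apr 15 22:33:51 carbon dhcpd: DHCPOFFER on 198.35.26.5 to 90:b1:1c:4d:42:49 via 198.35.26.3",
]

events = [parse_dhcpd_line(line) for line in sample]
offers = [e for e in events if e and e["event"] == "DHCPOFFER"]

for offer in offers:
    print(offer["mac"], "was offered", offer["offered_ip"], "via", offer["relay"])
```

Seeing an offer for 198.35.26.5 go out to the expected MAC suggests the DHCP side on carbon was behaving, which is consistent with the problem being further along in the PXE process.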

on bast4001.mgmt, racadm getsysinfo confirms the MAC:

Embedded NIC MAC Addresses:
NIC.Embedded.1-1-1      Ethernet                = 90:B1:1C:4D:42:49
NIC.Embedded.2-1-1      Ethernet                = 90:B1:1C:4D:42:4A

the output on the bast4001 console:

Scanning for devices.  Please wait, this may take several minutes...


Broadcom UNDI PXE-2.1 v15.4.2
Copyright (C) 2000-2012 Broadcom Corporation
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.

CLIENT MAC ADDR: 90 B1 1C 4D 42 49  GUID: 44454C4C 5900 104C 8048 C3C04F445831
CLIENT IP: 198.35.26.5  MASK: 255.255.255.240  DHCP IP: 208.80.154.10
GATEWAY IP: 198.35.26.1
      
PXELINUX 6.03 PXE 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al
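
The same MAC shows up in three different notations above (dhcpd syslog, racadm, and the PXE ROM banner). A small Python sketch for comparing them mechanically might look like this; `normalize_mac` is a hypothetical helper written for this example, not part of any existing tool.

```python
def normalize_mac(text):
    """Reduce a MAC address in any common notation to lowercase colon-separated form."""
    hex_digits = "".join(c for c in text.lower() if c in "0123456789abcdef")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {text!r}")
    return ":".join(hex_digits[i:i + 2] for i in range(0, 12, 2))


# The three spellings as they appear in the comment above.
syslog_mac = "90:b1:1c:4d:42:49"   # dhcpd on carbon
racadm_mac = "90:B1:1C:4D:42:49"   # racadm getsysinfo, NIC.Embedded.1-1-1
pxe_mac = "90 B1 1C 4D 42 49"      # Broadcom PXE ROM banner

normalized = {normalize_mac(m) for m in (syslog_mac, racadm_mac, pxe_mac)}
print("all sources agree" if len(normalized) == 1 else "MAC mismatch")
```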

on the tftp server:
@carbon:~# tcpdump -n dst host 198.35.26.5
or
tcpdump -vv -n udp dst portrange 69

..nothing..

Faidon made this change https://gerrit.wikimedia.org/r/#/c/283627/1

which meant that install2001, instead of carbon, would now be used as the install server for bast4001

looking on install2001, I can now see that:

Apr 15 23:26:00 install2001 atftpd[1205]: Serving jessie-installer/ldlinux.c32 to 198.35.26.5:49152
Apr 15 23:26:08 install2001 atftpd[1205]: timeout: retrying...

faidon closed this task as Resolved. Apr 26 2016, 2:34 AM
faidon added a subscriber: faidon.

So, I noticed that bast4001's PXE ROM was hanging at either the DHCP or the TFTP step at random (the progress character stopped moving and the network interactions, DHCP or TFTP respectively, halted). This was very similar to the behavior I had briefly encountered earlier with hooft (which I found suspicious).

I fiddled quite a bit with it, which included turning off the BIOS console redirection setting and upgrading the iDRAC/BIOS and NIC firmware in case the PXE ROM or BIOS was somehow buggy, and then gave up for the evening. A few hours later I returned with a new theory to test: the PXE ROM was getting "overwhelmed" by all of the traffic thrown at the box's IP, which was mostly UDP Ganglia traffic on multiple ports, as bast4001 (and hooft!) is the Ganglia aggregator for the PoP.

I configured asw-ulsfo to drop the traffic as such:

set firewall family ethernet-switching filter temp-bast4001 term ganglia from destination-port 12651
set firewall family ethernet-switching filter temp-bast4001 term ganglia from destination-port 12669
set firewall family ethernet-switching filter temp-bast4001 term ganglia from destination-port 12670
set firewall family ethernet-switching filter temp-bast4001 term ganglia from destination-port 12671
set firewall family ethernet-switching filter temp-bast4001 term ganglia from destination-port 12677
set firewall family ethernet-switching filter temp-bast4001 term ganglia from destination-port 12690
set firewall family ethernet-switching filter temp-bast4001 term ganglia from destination-port 12700
set firewall family ethernet-switching filter temp-bast4001 term ganglia then discard
set firewall family ethernet-switching filter temp-bast4001 term default then accept
set interfaces ge-1/0/16 unit 0 family ethernet-switching filter output temp-bast4001
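
For anyone needing a similar temporary filter on another host, the repetitive lines above could be generated from a port list. This is only a sketch: the function name `build_filter_commands` is made up for this example, and the port list, filter name and interface are simply the values from the config above.

```python
# Ganglia UDP ports taken from the filter above.
GANGLIA_PORTS = [12651, 12669, 12670, 12671, 12677, 12690, 12700]


def build_filter_commands(filter_name, ports, interface):
    """Build Junos 'set' lines that discard the given destination ports on one switch port."""
    base = f"set firewall family ethernet-switching filter {filter_name}"
    commands = [f"{base} term ganglia from destination-port {port}" for port in ports]
    commands.append(f"{base} term ganglia then discard")
    # Everything that is not Ganglia traffic is still accepted.
    commands.append(f"{base} term default then accept")
    # Apply on egress towards the server, so the traffic never reaches the PXE ROM.
    commands.append(
        f"set interfaces {interface} unit 0 family ethernet-switching filter output {filter_name}"
    )
    return commands


commands = build_filter_commands("temp-bast4001", GANGLIA_PORTS, "ge-1/0/16")
print("\n".join(commands))
```

The script only prints text; nothing is pushed to the switch, and the output should be reviewed before being pasted into a configuration session.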

…and ta-da! PXE worked and (some time and commits later, for unrelated issues), bast4001 is now reinstalled with jessie.

Change 283361 abandoned by Faidon Liambotis:
network: remove bast4001 SLAAC IPs

Reason:
Just saw this, after I already merged I02d461d611de150bbe5ba6467d0fd333a5bff0bb :(

https://gerrit.wikimedia.org/r/283361

Dzahn reassigned this task from Dzahn to faidon.

Thank you very much Faidon! Yes, that was indeed just like with hooft. The difference is that in esams I could work around it by simply picking another server that was idle and installing that one from the existing install server. In ulsfo this option didn't exist, and I could not find the cause and really ran out of ideas. I'm glad the bastions are done now :)