bast4001 is still on precise, reinstall it with jessie
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Dzahn | T123525 reduce amount of remaining Ubuntu 12.04 (precise) systems in production
Resolved | | faidon | T123674 reinstall bast4001 with jessie
Duplicate | | None | T96842 Switch ganglia aggregator init stuff to systemd on jessie
Resolved | | Dzahn | T124197 Port Ganglia aggregator setup to systemd
Restricted Task | | |
Event Timeline
Change 264330 had a related patch set uploaded (by Dzahn):
install_server: switch bast4001 to jessie
Eh, yeah :) This is also the TFTP server, so reinstalling it is kind of tricky because it needs itself for the install.
Change 275139 had a related patch set uploaded (by Dzahn):
ganglia: don't try to use upstart on jessie
Change 275146 had a related patch set uploaded (by Dzahn):
ganglia: add unit file for systemd on jessie
This should now be unblocked since ganglia::monitor::aggregator now works with systemd.
Just giving it to the pool while I'm on vacation. If anyone wants to take it, please go for it. Otherwise I'll take it back when I get back.
ganglia-aggregator works on jessie now (see bast3001) so that should not be a problem
Change 283359 had a related patch set uploaded (by Dzahn):
dhcp: let ulsfo public subnet use carbon as TFTP
Change 283361 had a related patch set uploaded (by Dzahn):
network: remove bast4001 SLAAC IPs
on carbon, /var/log/syslog, we can see how DHCP works:
198.35.26.5 is bast4001
Apr 15 22:33:51 carbon dhcpd: DHCPDISCOVER from 90:b1:1c:4d:42:49 via 198.35.26.2
Apr 15 22:33:51 carbon dhcpd: DHCPOFFER on 198.35.26.5 to 90:b1:1c:4d:42:49 via 198.35.26.2
Apr 15 22:33:51 carbon dhcpd: DHCPDISCOVER from 90:b1:1c:4d:42:49 via 198.35.26.3
Apr 15 22:33:51 carbon dhcpd: DHCPOFFER on 198.35.26.5 to 90:b1:1c:4d:42:49 via 198.35.26.3
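To check at a glance that every DHCPDISCOVER got a matching DHCPOFFER, the syslog excerpt above could be summarised per client MAC. This is only a minimal sketch: it assumes the dhcpd message format is exactly as quoted here, and the function name `summarize_dhcp` is hypothetical.

```python
import re
from collections import defaultdict

# Matches the two dhcpd message shapes quoted in this task:
#   "dhcpd: DHCPDISCOVER from <mac> via <relay>"
#   "dhcpd: DHCPOFFER on <ip> to <mac> via <relay>"
DHCP_RE = re.compile(
    r"dhcpd: (?P<kind>DHCPDISCOVER|DHCPOFFER)"
    r"(?: on (?P<ip>\S+))?"
    r" (?:from|to) (?P<mac>[0-9a-fA-F:]{17})"
    r" via (?P<relay>\S+)"
)


def summarize_dhcp(text):
    """Count DISCOVER/OFFER messages per MAC and collect offered IPs and relays."""
    summary = defaultdict(
        lambda: {"DHCPDISCOVER": 0, "DHCPOFFER": 0, "ips": set(), "relays": set()}
    )
    for match in DHCP_RE.finditer(text):
        entry = summary[match["mac"].lower()]
        entry[match["kind"]] += 1
        entry["relays"].add(match["relay"])
        if match["ip"]:
            entry["ips"].add(match["ip"])
    return dict(summary)


if __name__ == "__main__":
    # Assumed log location, taken from the comment above; adjust as needed.
    with open("/var/log/syslog") as handle:
        for mac, info in summarize_dhcp(handle.read()).items():
            print(mac, info)
```

Equal DISCOVER and OFFER counts for the MAC would suggest that the DHCP side is answering and that the problem lies further along in the boot sequence.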
on bast4001.mgmt, racadm getsysinfo confirms the MAC:
Embedded NIC MAC Addresses:
NIC.Embedded.1-1-1 Ethernet = 90:B1:1C:4D:42:49
NIC.Embedded.2-1-1 Ethernet = 90:B1:1C:4D:42:4A
the output on the bast4001 console:
Scanning for devices. Please wait, this may take several minutes...
Broadcom UNDI PXE-2.1 v15.4.2
Copyright (C) 2000-2012 Broadcom Corporation
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.
CLIENT MAC ADDR: 90 B1 1C 4D 42 49 GUID: 44454C4C 5900 104C 8048 C3C04F445831
CLIENT IP: 198.35.26.5 MASK: 255.255.255.240 DHCP IP: 208.80.154.10
GATEWAY IP: 198.35.26.1
PXELINUX 6.03 PXE 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al
on the tftp server:
@carbon:~# tcpdump -n dst host 198.35.26.5
or
tcpdump -vv -n udp dst portrange 69
..nothing..
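An empty capture only shows that nothing was sent towards bast4001. One way to check independently whether a TFTP server answers at all is to send it a single read request from another host. The sketch below uses only the Python standard library and the RFC 1350 packet layout; the function names and the file name in the usage note are illustrative assumptions, not something used in this task.

```python
import socket
import struct

OP_RRQ, OP_DATA, OP_ERROR = 1, 3, 5  # TFTP opcodes from RFC 1350


def build_rrq(filename, mode="octet"):
    """Build a TFTP read request: 2-byte opcode, filename, NUL, mode, NUL."""
    return (
        struct.pack("!H", OP_RRQ)
        + filename.encode("ascii") + b"\0"
        + mode.encode("ascii") + b"\0"
    )


def probe_tftp(server, filename, timeout=3.0):
    """Send one RRQ to UDP port 69 and classify the first reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_rrq(filename), (server, 69))
            reply, _ = sock.recvfrom(516)  # 4-byte header + 512-byte block
        except socket.timeout:
            return "timeout"
        except OSError:
            # For example an ICMP port-unreachable surfaced by the kernel.
            return "unreachable"
    if len(reply) < 2:
        return "unexpected"
    opcode = struct.unpack("!H", reply[:2])[0]
    return {OP_DATA: "data", OP_ERROR: "error"}.get(opcode, "unexpected")
```

For example, `probe_tftp("carbon.example", "pxelinux.0")` (both arguments are placeholders) would return `"data"` if the server started sending the file and `"timeout"` if no reply arrived within three seconds.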
Faidon made this change: https://gerrit.wikimedia.org/r/#/c/283627/1
which meant that install2001, instead of carbon, would now be used as the install server for bast4001.
looking on install2001, i can now see that:
Apr 15 23:26:00 install2001 atftpd[1205]: Serving jessie-installer/ldlinux.c32 to 198.35.26.5:49152
Apr 15 23:26:08 install2001 atftpd[1205]: timeout: retrying...
So, I noticed that bast4001's PXE ROM was hanging at either the DHCP or the TFTP step at random (the progress character stopped moving and the network interactions, DHCP or TFTP respectively, halted). This was very similar to the behavior I had briefly encountered before with hooft (which I found suspicious).
I fiddled quite a bit with it: I turned off the BIOS console redirection setting and upgraded the iDRAC/BIOS and NIC firmware in case the PXE ROM or BIOS was somehow buggy, then gave up for the evening. A few hours later I returned with a new theory to test: the PXE ROM was getting "overwhelmed" by all of the traffic thrown at the box's IP, which was mostly UDP Ganglia traffic on multiple ports, as bast4001 (and hooft!) is the Ganglia aggregator for the PoP.
I configured asw-ulsfo to drop the traffic as such:
set firewall family ethernet-switching filter temp-bast4001 term ganglia from destination-port 12651
set firewall family ethernet-switching filter temp-bast4001 term ganglia from destination-port 12669
set firewall family ethernet-switching filter temp-bast4001 term ganglia from destination-port 12670
set firewall family ethernet-switching filter temp-bast4001 term ganglia from destination-port 12671
set firewall family ethernet-switching filter temp-bast4001 term ganglia from destination-port 12677
set firewall family ethernet-switching filter temp-bast4001 term ganglia from destination-port 12690
set firewall family ethernet-switching filter temp-bast4001 term ganglia from destination-port 12700
set firewall family ethernet-switching filter temp-bast4001 term ganglia then discard
set firewall family ethernet-switching filter temp-bast4001 term default then accept
set interfaces ge-1/0/16 unit 0 family ethernet-switching filter output temp-bast4001
…and ta-da! PXE worked and (some time and commits later, for unrelated issues), bast4001 is now reinstalled with jessie.
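Since the temporary filter above is the same statement repeated once per Ganglia port, it could be generated rather than typed by hand, for instance when the same workaround is needed for another aggregator. The following is a rough sketch that only reproduces the commands quoted in this task; the helper name `ganglia_drop_filter` is hypothetical, and the output should be reviewed before being pasted into a switch.

```python
def ganglia_drop_filter(filter_name, interface, ports):
    """Return Junos 'set' commands that discard traffic to the given
    destination ports and apply the filter to the interface's output."""
    prefix = f"set firewall family ethernet-switching filter {filter_name}"
    # One match condition per port, all within the same 'ganglia' term.
    commands = [
        f"{prefix} term ganglia from destination-port {port}" for port in ports
    ]
    commands.append(f"{prefix} term ganglia then discard")
    # Everything that is not matched above is still forwarded.
    commands.append(f"{prefix} term default then accept")
    commands.append(
        f"set interfaces {interface} unit 0 family ethernet-switching "
        f"filter output {filter_name}"
    )
    return commands


if __name__ == "__main__":
    # Port list, filter name and interface as used in this task.
    ganglia_ports = [12651, 12669, 12670, 12671, 12677, 12690, 12700]
    for line in ganglia_drop_filter("temp-bast4001", "ge-1/0/16", ganglia_ports):
        print(line)
```

Keeping the port list in one place also makes it easier to see which aggregator ports are covered.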
Change 283361 abandoned by Faidon Liambotis:
network: remove bast4001 SLAAC IPs
Reason:
Just saw this, after I already merged I02d461d611de150bbe5ba6467d0fd333a5bff0bb :(
Thank you very much, Faidon! Yes, that was indeed just like with hooft, except that in esams I could work around it by simply picking another server that was idle and installing that from the existing install server. In ulsfo there wasn't this option, though, and I could not find the cause and really ran out of ideas. I'm glad the bastions are done now :)