
Broken network connection on ganeti2001 after reboot
Closed, Resolved · Public

Description

After a reboot for a kernel update, ganeti2001 has no network connectivity. The interface shows as up when I log in over the serial console. Maybe a cable broke during the reboot?

An mtr to the server is failing like this:

1. ae1-100.cr2-esams.wikimedia.org                                                                                             0.0%    71    0.4   0.7   0.2  15.3   1.9
2. xe-4-1-3.cr2-eqiad.wikimedia.org                                                                                            0.0%    71   83.4  83.6  83.3  90.9   0.9
3. xe-5-0-1.cr2-codfw.wikimedia.org                                                                                           90.0%    71  2118. 2118. 2117. 2120.   1.2
4. ???

(I initially assumed this was a regression caused by a change between Linux 4.9.168 and 4.9.189, but the problem also occurs after rebooting into the old kernel.)

Event Timeline

Restricted Application added a subscriber: Aklapper. Thu, Sep 26, 8:57 AM
Papaul triaged this task as High priority. Thu, Sep 26, 1:29 PM

@MoritzMuehlenhoff I checked the cable and the switch side; both look good. The problem has to be at another level.

papaul@asw-b-codfw> show interfaces ge-1/0/7 descriptions 
Interface       Admin Link Description
ge-1/0/7        up    up   ganeti2001

Maybe the NIC on the server broke? Are there some self-tests/diagnostics for that on the hardware side?

I don't think this is hardware related.

root@ganeti2001:/etc/network# ifup private
Error: argument "private" is wrong: dev is invalid

Found it. I had to comment out the following line in /etc/network/interfaces:

pre-up /sbin/ip token set ::10:192:16:125 dev private

It makes sense that this fails: the pre-up command tries to set the token on an interface that does not exist yet.
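The ordering problem can be sketched as a guard that sets the token only when the device is already present. This is a minimal illustration and not the change that was eventually merged. The helper names `iface_exists` and `set_token_if_present` and the `SYSFS_NET` override are hypothetical; only the `ip token set` invocation comes from the line quoted above.

```shell
#!/bin/sh
# Hypothetical helper: set the IPv6 token only if the device already exists.
# SYSFS_NET is an assumption added here so the check can be pointed at
# another directory; by default it is the kernel's list of network devices.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

iface_exists() {
    # Each network device known to the kernel has an entry under /sys/class/net.
    [ -e "$SYSFS_NET/$1" ]
}

set_token_if_present() {
    iface="$1"
    token="$2"
    if iface_exists "$iface"; then
        # Same command as the original pre-up line.
        /sbin/ip token set "$token" dev "$iface"
    else
        # Do not fail: a failing pre-up command makes ifup abort.
        echo "skipping token: $iface does not exist yet" >&2
        return 0
    fi
}
```

A simpler way to tolerate the failure, in the spirit of an "ignore errors" approach, would be to append `|| true` to the pre-up command. Either variant only hides the error: the token is still not set if the device is missing at that point.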

akosiaris lowered the priority of this task from High to Normal. Thu, Sep 26, 3:14 PM

Changing priority to normal since the host is now up and running, but we have a chicken-and-egg problem to solve here.

@akosiaris can you take over the task then?

akosiaris added a subscriber: Papaul.

Sure.

Change 539462 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] add_ip6_mapped: Ignore errors if ip token set fails

https://gerrit.wikimedia.org/r/539462

Change 539523 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Disable ip6 mapped addresses for ganeti hosts

https://gerrit.wikimedia.org/r/539523

Change 539462 abandoned by Alexandros Kosiaris:
add_ip6_mapped: Ignore errors if ip token set fails

Reason:
In favor of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/539523

https://gerrit.wikimedia.org/r/539462

Change 539523 merged by Alexandros Kosiaris:
[operations/puppet@production] Disable ip6 mapped addresses for ganeti hosts

https://gerrit.wikimedia.org/r/539523

Mentioned in SAL (#wikimedia-operations) [2019-09-27T13:09:06Z] <akosiaris> reboot ganeti2001 T233906

akosiaris closed this task as Resolved. Fri, Sep 27, 1:14 PM

We've sidestepped the problem for now by disabling ip6 mapped addresses for ganeti hosts. This solves the chicken-and-egg problem, although we should arguably find a better way to configure IPv4 and IPv6 addresses on our hosts instead of relying on tricks like setting the token. I'll resolve this for now.
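For readers unfamiliar with the token trick, the following sketch shows how a token of the form seen in this task can be derived from an IPv4 address. It assumes the host's IPv4 address is 10.192.16.125, which is inferred from the token `::10:192:16:125` and not stated in the task. The function name `ipv4_to_token` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: build an IPv6 interface token from a dotted IPv4 address
# by reusing the decimal octets as the last four groups of the address.
ipv4_to_token() {
    # Replace the dots with colons and prefix "::".
    printf '::%s\n' "$(printf '%s' "$1" | tr '.' ':')"
}

# Assumed address, inferred from the token in this task.
ipv4_to_token 10.192.16.125
```

The octets are copied as digits and not converted, so IPv6 reads each group as hexadecimal. The result is easy to recognize by eye, but it is not a numeric encoding of the IPv4 address.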