Fix ethernet startup race on HP LVS w/ jessie
Closed, Resolved · Public

Description

A couple of things to dig into here:

  1. The ethtool rxring settings don't apply correctly via puppet on the first jessie nodes. This is probably more about the new HP hardware than about jessie. In the initial puppet run the command succeeds for eth0 and eth3 but fails for eth1 and eth2; running it manually afterwards for the failed ones works. Puppet only runs this when defining the /etc/network/interfaces (/e/n/i) setting, so it never tries again. The problem seems to persist post-reboot when applied via /e/n/i as well, and notably eth[03]'s /e/n/i settings are in a different order than eth[12]'s.
  2. Supposedly we can stop disabling GRO on the LVS nodes with jessie due to the newer kernel, but this needs investigation/testing. There could be similar things with other related settings.
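
To see which interfaces actually picked up the rxring setting, something like the following could be used. This is only a sketch: it assumes the usual `ethtool -g` output layout (a "Pre-set maximums" section followed by "Current hardware settings"), and the sample text and its numbers are made up for illustration rather than taken from the LVS hosts. The function names are hypothetical.

```python
import re

# Illustrative `ethtool -g` output; the values are invented, not from the LVS hosts.
SAMPLE = """\
Ring parameters for eth1:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             256
RX Mini:        0
RX Jumbo:       0
TX:             256
"""


def parse_rx_ring(text):
    """Return (maximum, current) RX ring sizes parsed from `ethtool -g` output."""
    section = None
    sizes = {}
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Pre-set maximums"):
            section = "max"
        elif stripped.startswith("Current hardware settings"):
            section = "current"
        else:
            # Matches "RX: 4096" but not "RX Mini:" or "RX Jumbo:".
            match = re.match(r"RX:\s*(\d+)$", stripped)
            if match and section:
                sizes[section] = int(match.group(1))
    return sizes.get("max"), sizes.get("current")


def rx_ring_needs_fix(text, wanted):
    """True when the current RX ring size differs from the wanted value."""
    _, current = parse_rx_ring(text)
    return current != wanted
```

On a real host the text would come from something like `subprocess.run(["ethtool", "-g", iface], capture_output=True, text=True).stdout`, run once per interface.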

Event Timeline

BBlack raised the priority of this task from to Medium.
BBlack updated the task description.
BBlack added a project: Traffic.
BBlack subscribed.
Restricted Application added a subscriber: Aklapper.

Change 237667 had a related patch set uploaded (by BBlack):
Don't disable LRO/GRO on jessie LVS hosts

https://gerrit.wikimedia.org/r/237667

Change 237667 merged by BBlack:
Don't disable LRO/GRO on jessie LVS hosts

https://gerrit.wikimedia.org/r/237667

GRO and LRO seem fine. Still facing an issue with both the rxring parameters and the interface-rps parameters. They can both be applied successfully post-boot manually, but the statements for them in /e/n/i (which look correct) consistently only work for eth0 and eth3, and don't work for eth1 or eth2. There's probably some kind of race condition going on here at the time of startup (or puppetization, which acted similarly...).

Digging a little further in syslogs, apparently it is a race. systemd ends up trying to configure eth[12] first and they fail the RSS IRQ pattern check, and thus interface bringup fails due to a failed up-command. Probably because the driver hasn't fully configured the interface yet at that point.

So, basically this is a race centered around bnx2x->udev->systemd event notifications and /e/n/i up-commands that set hardware parameters. It's probably going to be tricky to sort out. For now, the simplest workaround is to execute ifup eth1; ifup eth2 on these (HP Jessie LVS) machines from root's commandline post-startup, which fixes everything. I've done that on lvs200[1-6] for now so that we're not blocking further traffic turnup. Will move this ticket to block the deploy of the similar new lvs10xx LVS machines, and we can deal with it properly at that time.

BBlack renamed this task from Re-investigate eth params on jessie LVS nodes to Fix ethernet startup race on HP LVS w/ jessie. Sep 14 2015, 3:13 PM
BBlack set Security to None.

These kinds of tickets come to mind when the additional price of supporting multiple hardware vendors is discussed.

So, @faidon pointed out that this would probably fix itself with s/^auto eth/allow-hotplug eth/ in /etc/network/interfaces. The eth0 entry there is already allow-hotplug rather than auto, but the eth1-3 and various VLAN sub-interfaces are currently auto.
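
For illustration, the substitution can be tried against a copy of the file contents before touching a real host. This is a minimal sketch: the stanzas below are a hypothetical fragment (including the made-up VLAN sub-interface name eth1.100), not the actual LVS configuration, and `use_hotplug` is just an illustrative name.

```python
import re

# Hypothetical /etc/network/interfaces fragment, for illustration only.
SAMPLE = """\
auto lo
iface lo inet loopback

allow-hotplug eth0
iface eth0 inet static

auto eth1
iface eth1 inet manual

auto eth1.100
iface eth1.100 inet manual
"""


def use_hotplug(config):
    """Apply s/^auto eth/allow-hotplug eth/ to every line of the config text."""
    # re.MULTILINE makes ^ match at the start of each line, like sed does.
    return re.sub(r"^auto eth", "allow-hotplug eth", config, flags=re.MULTILINE)
```

Note that the pattern also rewrites VLAN sub-interface lines, since they start with "auto eth" as well, while "auto lo" and the existing allow-hotplug eth0 line are left alone.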

I've tested this theory manually on lvs1007 with puppet disabled and it seems to fix the issue on a fresh reboot, so now I just need to figure out puppetizing it.

Change 256734 had a related patch set uploaded (by BBlack):
interface: use allow-hotplug for ::manual and ::tagged

https://gerrit.wikimedia.org/r/256734

Change 256734 merged by BBlack:
interface: use allow-hotplug for ::manual and ::tagged

https://gerrit.wikimedia.org/r/256734

BBlack claimed this task.

The allow-hotplug change is deployed to all lvs* now, and I'm assuming this is resolved unless it recurs. The change may or may not be fully appropriate for a reboot of lvs100[123] which are currently on precise, but the very next step here is reinstalling those to jessie anyways...

Change 256842 had a related patch set uploaded (by BBlack):
interface::tagged - do not use hotplug

https://gerrit.wikimedia.org/r/256842

Change 256842 merged by BBlack:
interface::tagged - do not use hotplug

https://gerrit.wikimedia.org/r/256842