install2001 hardware troubles
Closed, Resolved · Public

Description

We've been getting the occasional alert from Smokeping about install2001, with the graph showing various amounts of jitter and packet loss (up to 5% at times). I've since added achernar and bast2001 to Smokeping; bast2001 is in the same rack as install2001 (A5) and, in fact, on neighboring ports. Neither achernar nor bast2001 exhibits this behavior; everything looks perfect there. Compare install2001's graphs with bast2001's.

Therefore, the problem looks localized to install2001. Neither the host nor the switch reports any CRC/FCS errors or the like, and apart from a few RX discards on the install2001 side, everything else looks normal.
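
For reference, loss and jitter figures like the ones quoted above can be derived from a batch of probe results. The following is a minimal sketch, assuming each probe is recorded as a round-trip time in milliseconds, or `None` when no reply came back. The function name, the input format, and the jitter definition (mean absolute difference between consecutive replies) are illustrative assumptions, not a description of how Smokeping computes its graphs.

```python
from typing import Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize one batch of ping probes.

    rtts_ms holds one entry per probe sent: the round-trip time in
    milliseconds, or None if the probe was lost (hypothetical format).
    """
    sent = len(rtts_ms)
    replies = [r for r in rtts_ms if r is not None]
    lost = sent - len(replies)
    loss_pct = 100.0 * lost / sent if sent else 0.0

    # Jitter here is the mean absolute difference between consecutive
    # replies; this is one common definition, chosen for illustration.
    diffs = [abs(b - a) for a, b in zip(replies, replies[1:])]
    jitter_ms = sum(diffs) / len(diffs) if diffs else 0.0

    return {
        "sent": sent,
        "lost": lost,
        "loss_pct": loss_pct,
        "jitter_ms": jitter_ms,
    }
```

With 20 probes per round, a single lost probe already shows up as 5% loss, which is why small absolute numbers of drops are visible on the graph.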


Update 2016-06-14: we started getting a *lot* more packet loss, up to 20%. At the same time, CPU graphs showed 100% utilization (system). I logged in and found multiple acpi_pad threads consuming all of the machine's CPU. I rmmod'ed the acpi_pad and mei modules, but the packet loss still seems to be there, along with other behavior I'd call odd (general slowness?).
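
After an `rmmod`, it can be useful to confirm that the modules are really gone. A minimal sketch, assuming the text of `/proc/modules` is passed in as a string (each line begins with the module name); the helper names are hypothetical and not part of any existing tooling.

```python
def loaded_modules(proc_modules_text: str) -> set:
    """Return the module names listed in the text of /proc/modules.

    Each line starts with the module name, followed by whitespace-separated
    fields such as size and use count.
    """
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def still_loaded(proc_modules_text: str, wanted=("acpi_pad", "mei")) -> list:
    """Return, sorted, which of the wanted modules are still loaded."""
    return sorted(loaded_modules(proc_modules_text) & set(wanted))
```

On a Linux host this could be called as `still_loaded(open("/proc/modules").read())`; an empty list means neither module is loaded.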

This looks like either kernel or hardware trouble at this point; I'd bet the latter. Please investigate, thanks!

faidon created this task. Jun 11 2016, 11:06 PM
Restricted Application added a project: Operations. Jun 11 2016, 11:06 PM
Restricted Application added subscribers: Zppix, Southparkfan, Aklapper.
faidon renamed this task from "Replace install2001's Ethernet cable" to "install2001 hardware troubles". Jun 14 2016, 1:39 AM
faidon updated the task description.

Mentioned in SAL [2016-06-14T13:28:27Z] <paravoid> rebooting install2001, T137647

Papaul added a subscriber: Papaul. Jun 14 2016, 4:06 PM

Physical observation: everything looks good on the server. All LEDs are green, there is no sign of the server overheating, and no errors are reported in the log. The next step is to run a full hardware diagnostic. Is it possible to turn the server off? Thanks

Yes, that would be fine, please do!

(note that I rebooted the server earlier as well, to rule out the possibility it was a software issue)

The hardware diagnostic shows no HW problem.
I first checked the BIOS settings: "System Profile" was set to "Performance per Watt DAPC" but is supposed to be "Performance per Watt OS". I changed it to the right setting.
@faidon Could this be related to the issue?

@faidon the server is up. I will leave the task open until the end of the week to see whether the BIOS change fixed the problem.

The CPU issue has been alleviated, but the packet loss issue remains, at the previous levels of 0.5-5%. It's unlikely, but this might be an entirely different issue. @Papaul, could you change the Ethernet cable just in case?

@faidon cable replacement complete

Volans added a subscriber: Volans. Edited · Aug 23 2016, 8:31 AM

The CPU issue has been back since 2016-08-22 17:18 UTC.
I've run rmmod acpi_pad and the CPU usage is back to normal, but this surely needs more investigation.

It happened to mira too: it had high load and was failing random Icinga checks. I've run rmmod acpi_pad and it is now back to normal.

It started at 2016-11-13 11:40 UTC according to Grafana graphs.
If needed, feel free to move it to a separate task.

Dzahn added a subscriber: Dzahn. Dec 14 2016, 7:58 PM

fwiw, install2001 looks alright as of today, CPU usage is very low, and the smokeping graph for bast2001 is a flat line

https://smokeping.wikimedia.org/?target=codfw.Hosts.bast2001

unsure what we want to do next on this ticket

Mentioned in SAL (#wikimedia-operations) [2017-03-07T18:52:50Z] <volans> rmmod acpi_pad on baham, was using 100% CPU T137647

Dzahn closed this task as Resolved. Apr 17 2017, 11:48 PM
Dzahn claimed this task.

Closing this subtask, since we know from the other similar tasks and the parent ticket that the fix is always `rmmod acpi_pad` and the affected hardware is always the Dell R320. A change has been merged to blacklist this module, so that it is not loaded again in the future on this hardware type.
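
The merged change itself is not shown in this task. As a rough illustration of the idea only, the sketch below decides whether to emit a modprobe blacklist entry based on the machine's DMI product name. The `"PowerEdge R320"` string, the function name, and the idea of reading the name from `/sys/class/dmi/id/product_name` are assumptions made for the example; only the `blacklist acpi_pad` line follows the standard modprobe.d syntax.

```python
from typing import Optional

# Assumed DMI product string for the Dell R320 (illustrative, not taken
# from the actual change referenced in this task).
AFFECTED_PRODUCTS = {"PowerEdge R320"}


def blacklist_conf(product_name: str) -> Optional[str]:
    """Return modprobe.d content blacklisting acpi_pad on affected hardware.

    product_name is expected to be the DMI product name, e.g. the contents
    of /sys/class/dmi/id/product_name. Returns None for other hardware.
    """
    if product_name.strip() in AFFECTED_PRODUCTS:
        return "blacklist acpi_pad\n"
    return None
```

The returned text would be written to a file under `/etc/modprobe.d/` (file name of your choosing). Note that a `blacklist` entry stops the module from being loaded automatically via its aliases; it does not prevent an explicit `modprobe acpi_pad`.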