T162850 CPU throttling on DELL PowerEdge R320

	Subject	Repo	Branch	Lines +/-
	site/dns: tmp remove acamar from resolv.conf overrides	operations/puppet	production	+3 -3
	base::kernel: mod blacklist for Dell R320, blacklist acpi_pad	operations/puppet	production	+18 -0

Status	Assigned	Task
Resolved	Dzahn	T162850 CPU throttling on DELL PowerEdge R320
Resolved	fgiunchedi	T123924 acpi_pad runaway processes on praseodymium
Resolved	Eevans	T141675 xenon.eqiad.wmnet: very high cpu utilization
Resolved	Dzahn	T137647 install2001 hardware troubles
Resolved	Papaul	T110202 re-seat power cord for Nembus
Resolved	Dzahn	T159870 baham (ns1) CPU-related issues
Resolved	Dzahn	T163158 acpi_pad consuming 100% CPU on tin
Open	None	T163220 Add Icinga check for CPU frequency on Dell R320
Resolved	Cmjohnson	T147905 investigate lead hardware issue

BBlack created this task. Apr 12 2017, 11:36 PM

Restricted Application added a subscriber: Aklapper. Apr 12 2017, 11:36 PM

BBlack triaged this task as High priority. Apr 12 2017, 11:36 PM

modules/base/files/kernel/blacklist-wmf.conf is probably the place to try disabling this first, FWIW.

I wanted to suggest importing the kmod Puppet module (https://gerrit.wikimedia.org/r/#/c/348009/) to get kmod::blacklist, but as you point out we already have the blacklist file above, so I abandoned that change.

Dzahn added subtasks: T123924: acpi_pad runaway processes on praseodymium, T141675: xenon.eqiad.wmnet: very high cpu utilization, T137647: install2001 hardware troubles, T110202: re-seat power cord for Nembus, T159870: baham (ns1) CPU-related issues. Apr 12 2017, 11:44 PM

Is there a common pattern of affected distros/kernels/server models? I'd prefer if we first try to pinpoint this further before blacklisting it globally to avoid unforeseen side effects.

In T162850#3179294, @MoritzMuehlenhoff wrote:

Is there a common pattern of affected distros/kernels/server models?

host	ticket	distro	kernel	server model
praseodymium	T123924	jessie	4.9.0-0.bpo.2-amd64	Dell PowerEdge R320
xenon	T141675	jessie	4.9.0-0.bpo.2-amd64	Dell PowerEdge R320
install2001	T137647	jessie	unknown	Dell PowerEdge R320
nembus	T110202	unknown	unknown (T122100)	Dell PowerEdge R320
baham	T159870	jessie	4.4.0-3-amd64	Dell PowerEdge R320

https://access.redhat.com/solutions/508303 , https://serverfault.com/questions/561190/unusual-load-average-for-an-idle-workstation

@MoritzMuehlenhoff ^ Yes, they are all Dell PowerEdge R320. Re kernel versions: those hosts might have been upgraded after the incidents described in the tickets.

Change 348197 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] base::kernel: add mod blacklist specific to R320, blacklist acpi_pad

https://gerrit.wikimedia.org/r/348197

gerritbot added a project: Patch-For-Review. Apr 14 2017, 3:40 AM

tin is also affected: T163158: acpi_pad consuming 100% CPU on tin. Its kernel version appears to be 4.4.0-3-amd64.

Dzahn closed subtask T163158: acpi_pad consuming 100% CPU on tin as Resolved. Apr 17 2017, 10:52 PM

I ran rmmod acpi_pad on tin and it fixed the issue right away. tin is a Dell PowerEdge R320, which is the one thing all affected servers have had in common so far, further supporting the theory in the Gerrit change above.

Dzahn updated the task description. Apr 17 2017, 11:15 PM

Dzahn updated the task description.

Change 348197 merged by Dzahn:
[operations/puppet@production] base::kernel: mod blacklist for Dell R320, blacklist acpi_pad

https://gerrit.wikimedia.org/r/348197

Mentioned in SAL (#wikimedia-operations) [2017-04-17T23:33:23Z] <mutante> running puppet via cumin on all 16 Dell PowerEdge R320, adding blacklist file for acpi_pad kernel module. 15/16 success, all but tin (T162850)

Mentioned in SAL (#wikimedia-operations) [2017-04-17T23:37:25Z] <mutante> runnin rmmod acpi_pad on the 16 R320 via cumin, since blacklisting in puppet does not actively remove, confirmed unloaded. (16/16) success ratio (>= 100.0% threshold) for command: 'lsmod|grep -c acpi_pad ||:' (T162850)
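
The SAL entry above notes that blacklisting through puppet does not unload a module that is already loaded, so the two states have to be checked separately. A minimal Python sketch of both checks, as an illustration only: the function names are hypothetical, and it assumes the usual modprobe.d `blacklist <module>` syntax and the /proc/modules layout in which the module name is the first field on each line.

```python
def is_blacklisted(conf_text: str, module: str) -> bool:
    """True if modprobe.d-style text contains an active 'blacklist <module>' line."""
    for line in conf_text.splitlines():
        # Drop anything after '#', then compare the first two words.
        fields = line.split("#", 1)[0].split()
        if fields[:2] == ["blacklist", module]:
            return True
    return False


def is_loaded(proc_modules_text: str, module: str) -> bool:
    """True if the module is listed in /proc/modules-style text."""
    return any(
        line.split()[0] == module
        for line in proc_modules_text.splitlines()
        if line.strip()
    )
```

On a real host the inputs would come from the blacklist file under /etc/modprobe.d/ and from /proc/modules. A host can be blacklisted and still loaded until rmmod or a reboot, which is the situation the second cumin run addressed.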

Now this should either never happen again... or it would affect more than just R320's. But we have not seen a case of that so far.

So.. a bit unsure about ticket status, so for now lowering priority from High to Normal.

Dzahn closed subtask T137647: install2001 hardware troubles as Resolved. Apr 17 2017, 11:48 PM

BBlack closed subtask T159870: baham (ns1) CPU-related issues as Resolved. Apr 18 2017, 2:30 AM

MoritzMuehlenhoff reopened subtask T163158: acpi_pad consuming 100% CPU on tin as Open. Apr 18 2017, 2:55 PM

fgiunchedi closed subtask T163158: acpi_pad consuming 100% CPU on tin as Resolved. Apr 18 2017, 5:21 PM

All subtasks are resolved. acpi_pad has been unloaded and blacklisted on all Dell R320 machines.

I suggest we close it and watch whether it ever happens again. If it does not, the issue was limited to this hardware type, which we probably won't get again; we keep using the existing 16 servers as they are and are done here.

If it happens again of course we'd reopen this and it'd be very interesting what other platform is affected (which we could add to the blacklist).
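
Watching for a recurrence could be automated. The sketch below is not the check from T163220, which is not shown in this task; it is a rough Python illustration that assumes the cpufreq sysfs files scaling_cur_freq and cpuinfo_max_freq are present, and its thresholds (50% and 25% of the maximum frequency) are made up. An idle CPU under a power-saving governor also reports a low frequency, so a real check would have to account for that.

```python
from pathlib import Path

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(cur_khz, max_khz, warn=0.5, crit=0.25):
    """Map one CPU's current/maximum frequency ratio to an exit code.

    The warn/crit ratios are illustrative assumptions, not tuned values.
    """
    ratio = cur_khz / max_khz
    if ratio < crit:
        return CRITICAL, f"CRITICAL: CPU at {ratio:.0%} of max frequency"
    if ratio < warn:
        return WARNING, f"WARNING: CPU at {ratio:.0%} of max frequency"
    return OK, f"OK: CPU at {ratio:.0%} of max frequency"


def check_host(cpu_root=Path("/sys/devices/system/cpu")):
    """Return the worst (code, message) across all CPUs exposing cpufreq."""
    results = []
    for cpufreq in sorted(cpu_root.glob("cpu[0-9]*/cpufreq")):
        cur = int((cpufreq / "scaling_cur_freq").read_text())
        top = int((cpufreq / "cpuinfo_max_freq").read_text())
        results.append(classify(cur, top))
    if not results:
        return UNKNOWN, "UNKNOWN: no cpufreq data found"
    # Higher exit code means worse state among OK/WARNING/CRITICAL.
    return max(results, key=lambda r: r[0])
```

A plugin wrapper would call check_host(), print the message and exit with the returned code.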

Dzahn closed this task as Resolved. Apr 18 2017, 7:36 PM

Dzahn claimed this task.

Dzahn added a subtask: T163220: Add Icinga check for CPU frequency on Dell R320. Apr 18 2017, 7:41 PM

Dzahn closed subtask T163220: Add Icinga check for CPU frequency on Dell R320 as Resolved. Apr 20 2017, 1:45 AM

Dzahn added a subtask: T147905: investigate lead hardware issue. Apr 27 2017, 10:54 PM

Dzahn mentioned this in T147905: investigate lead hardware issue. Apr 27 2017, 10:58 PM

faidon mentioned this in T164675: labservices1002 slow puppet runs and IO issues. May 8 2017, 5:07 PM

acamar hit this again on Sunday, in spite of the (working) acpi_pad blacklist. A simple reboot seems to have cleared it. The next-best advice (based on that old Dell info) would be to blacklist mei. I've rmmod'd it on acamar for now to see if it causes additional issues before we try blacklisting it on all of them.

tin hit this today. I tried rmmod mei_me and rmmod mei as suggested above, but that didn't fix the problem live. It probably needs a reboot, but I'm not rebooting it right now (see below).

Another suggestion I've found is to do a cold reboot to reset some sensors; see this DELL Support thread for the T320 and this ServerFault thread.
From the DELL Support thread it is not fully clear whether the latest BIOS (2.4.2) helped solve the issue on the T320 (it seems to have the same version and release date as the one for the R320) or whether it was just the cold reboot.

I suggest performing a cold reboot on tin. If it cannot wait until Tuesday midday UTC (Monday is a bank holiday in the US), we can do a standard reboot now and try the cold reboot later, or when the next host shows the issue.

Volans renamed this task from acpi_pad issues to CPU throttling on DELL PowerEdge R320. May 27 2017, 9:53 AM

Mentioned in SAL (#wikimedia-operations) [2017-06-19T09:11:32Z] <paravoid> upgrading achernar's BIOS from 1.2.4 to 2.4.2 hoping it will address recurring CPU throttling issue (T162850)

This happened again today with achernar. This is clearly a hardware/firmware issue; however, all of the servers listed in the task description (including achernar) run an outdated BIOS version of 1.2.4, with the exception of heze which runs (the also outdated) 1.3.5.

The latest available is 2.4.2, from Apr/Oct 2015. I downloaded the Linux binary and executed it on achernar, then rebooted. Let's give it a few days and then we can proceed with upgrading and rebooting the rest of these hosts.
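
To decide which hosts still need the upgrade, the installed versions can be compared against 2.4.2. A small Python sketch, assuming BIOS versions are plain dotted integers; the function names are hypothetical, and the sample data is limited to versions mentioned in this task.

```python
def parse_version(version: str) -> tuple:
    """Turn '1.2.4' into (1, 2, 4) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def outdated(installed: dict, target: str) -> list:
    """Return the sorted hostnames whose BIOS version is older than target."""
    wanted = parse_version(target)
    return sorted(
        host for host, version in installed.items()
        if parse_version(version) < wanted
    )


# Versions as reported in this task; other hosts are omitted on purpose.
INSTALLED = {"achernar": "1.2.4", "heze": "1.3.5"}
```

Comparing tuples rather than strings matters once a component reaches two digits: as text, "1.10.0" sorts before "1.9.9".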

Mentioned in SAL (#wikimedia-operations) [2017-08-21T21:26:56Z] <mutante> bast2001 - running Dell BIOS firmware upgrade (T162850)

Mentioned in SAL (#wikimedia-operations) [2017-08-21T21:54:01Z] <mutante> cerium - installing Dell BIOS upgrade (T162850)

Dzahn updated the task description. Aug 21 2017, 10:01 PM

Dzahn updated the task description.

done now: cerium, xenon, praseodymium (cassandra test cluster)

removed machines from list that are already gone: subra, suhail, mira

Dzahn updated the task description. Aug 21 2017, 10:33 PM

Dzahn updated the task description. Aug 21 2017, 10:35 PM

Dzahn updated the task description.

RobH updated the task description. Aug 21 2017, 11:01 PM

Dzahn updated the task description. Aug 21 2017, 11:06 PM

RobH updated the task description. Aug 22 2017, 6:50 PM

Mentioned in SAL (#wikimedia-operations) [2017-09-08T20:10:17Z] <mutante> heze (bacula storage) - installing BIOS upgrade (T162850)

Dzahn updated the task description. Sep 8 2017, 8:24 PM

Since we are decommissioning logstash100[1-3] fairly soon, it's probably not worth investing any time in them...

Dzahn updated the task description. Sep 13 2017, 12:30 AM

Dzahn updated the task description.

Thanks @Gehel! So only acamar (DNS recursor) and baham (ns1.wm.org auth DNS) are left. @BBlack are there special precautions to be taken before rebooting these? For the recursor, the wikitech page "service restarts" seems to say it's fine as long as it's just one at a time; it does not even mention depooling. But I see that on neodymium I could depool it with confctl. I assume the answer is "just do it quick", but I also heard on IRC that there were some issues around (recursor) availability and Node.js services and that it would be good to ask first. And how about baham (ns1)?

The last time we rebooted DNS recursors eventbus bailed, but that specific bug has been fixed since then. The high level ticket is T171498. So quick depool should be fine, but keep an eye on Icinga and pybal logs.

Change 378188 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/dns: tmp remove acamar from resolv.conf overrides

https://gerrit.wikimedia.org/r/378188

Change 378188 merged by Dzahn:
[operations/puppet@production] site/dns: tmp remove acamar from resolv.conf overrides

https://gerrit.wikimedia.org/r/378188

Mentioned in SAL (#wikimedia-operations) [2017-09-14T23:09:58Z] <mutante> acamar - done with upgrade - rebooting - it's depooled and removed from resolv.conf - T162850

14:53 < bblack> 1) Explicitly depool acamar from codfw recdns (you can confirm it in logs and ipvsadm -Ln output on lvs2002, should be the active LVS for it)
14:54 < bblack> 2) Take acamar's IP out of the special-cased resolv.conf overrides (git grep for 208.80.153.12 in site.pp)
14:55 < bblack> (and make sure that puppet change has applied on the relevant hosts (achernar + lvs200x)

^ I did these things before rebooting acamar and am pasting them here as docs.
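
Step 1 above asks for confirmation in the ipvsadm -Ln output that acamar is no longer a backend. A minimal Python sketch of that confirmation, assuming the usual `ipvsadm -Ln` layout in which each real server is listed on a line starting with `->`. The function name is hypothetical, and the sample output is made up: 192.0.2.1 is a placeholder virtual address, and only 208.80.153.12 comes from this task.

```python
import ipaddress


def real_servers(ipvsadm_output: str) -> set:
    """Collect the real-server addresses listed on '->' lines."""
    servers = set()
    for line in ipvsadm_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0] != "->":
            continue
        # Split off the port; strip brackets used around IPv6 addresses.
        host = fields[1].rsplit(":", 1)[0].strip("[]")
        try:
            servers.add(str(ipaddress.ip_address(host)))
        except ValueError:
            continue  # the '-> RemoteAddress:Port' header row
    return servers


# Hypothetical sample output, for illustration only.
SAMPLE = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
UDP  192.0.2.1:53 wrr
  -> 208.80.153.12:53             Route   10     0          0
"""
```

After a successful depool, `"208.80.153.12" in real_servers(output)` should be False for output captured on the active LVS.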

Acamar is done. It's pooled again, the resolv.conf change is reverted. I saw no issues, no failed services/units after reboot.

Only a single host is left to upgrade and reboot, baham. But this is a bit more complicated to coordinate.

Cmjohnson closed subtask T147905: investigate lead hardware issue as Resolved. Feb 2 2018, 6:01 PM

Mentioned in SAL (#wikimedia-operations) [2018-08-09T01:39:24Z] <mutante> baham - installing BIOS upgrade (2.4.2 for Dell R320) - server is on role(spare) and the last that did not get the upgrade on T162850

Dzahn updated the task description. Aug 9 2018, 1:44 AM

Dzahn removed a project: Patch-For-Review.

baham is done: Installed version: 2.4.2

This resolves the ticket.

only 6 R320s left today: acamar.wikimedia.org,achernar.wikimedia.org,baham.wikimedia.org,bast2001.wikimedia.org,heze.codfw.wmnet,labservices1002.wikimedia.org

jbond reopened subtask T163220: Add Icinga check for CPU frequency on Dell R320 as Open. Mar 24 2023, 5:49 PM
