CPU throttling on DELL PowerEdge R320
Open, Normal, Public

Description

Past incidents in phab: T123924 T137647 T159870 (also observed on acamar today, no separate task for that)

acpi_pad continues to occasionally cause problems. They are usually fixable by rmmod'ing the module, with no other apparent ill effects in the general case. Dell has some older documentation that talks about blacklisting acpi_pad and a few other modules here: http://en.community.dell.com/techcenter/b/techcenter/archive/2012/08/27/ubuntu-on-dell-12g-poweredge-servers . I don't think we should blindly follow that advice, as it was written in the context of both older kernels and older servers, but I think at this point it might be reasonable for us to universally blacklist this module in base (or disable it on the kernel cmdline, if that's what's necessary to stop it from loading).
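For illustration only, here is a minimal Python sketch of the two options mentioned above: a modprobe.d-style blacklist entry and a kernel command-line option. The helper names are hypothetical and this is not the content of any existing puppet file. Note that a `blacklist` directive only stops automatic loading by alias; it does not unload a module that is already loaded.

```python
def render_blacklist(modules):
    """Return modprobe.d-style text with one 'blacklist' directive per module."""
    lines = ["# Illustrative content only, not the real blacklist-wmf.conf"]
    lines += [f"blacklist {name}" for name in modules]
    return "\n".join(lines) + "\n"


def render_cmdline_option(modules):
    """Return the equivalent kernel command-line option (comma-separated list)."""
    return "modprobe.blacklist=" + ",".join(modules)


print(render_blacklist(["acpi_pad"]), end="")
print(render_cmdline_option(["acpi_pad"]))
```

An already-loaded module would still need an rmmod or a reboot after either change.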


List of Dell PowerEdge R320 hosts: (cumin/facter: 'F:productname ~ "^PowerEdge R320"')

(check box if firmware has been upgraded)

  • acamar.wikimedia.org
  • achernar.wikimedia.org
  • baham.wikimedia.org
  • bast2001.wikimedia.org
  • cerium.eqiad.wmnet
  • heze.codfw.wmnet
  • labservices1002.wikimedia.org
  • logstash[1001-1003].eqiad.wmnet (declined, decom)
  • mira.codfw.wmnet (decom)
  • praseodymium.eqiad.wmnet
  • subra.codfw.wmnet (decom)
  • suhail.codfw.wmnet (decom)
  • tin.eqiad.wmnet
  • xenon.eqiad.wmnet
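The list above uses a bracket range for the logstash hosts. As a rough sketch, assuming only a single numeric range per name (this is not a reimplementation of cumin's host selection syntax, and `expand` is a hypothetical helper), such an entry can be expanded like this:

```python
import re

# Matches names such as "logstash[1001-1003].eqiad.wmnet": one numeric range only.
_RANGE = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<start>\d+)-(?P<end>\d+)\](?P<suffix>[^\[\]]*)$"
)


def expand(pattern):
    """Expand a single bracketed numeric range; return other names unchanged."""
    match = _RANGE.match(pattern)
    if match is None:
        return [pattern]
    prefix, suffix = match.group("prefix"), match.group("suffix")
    start, end = match.group("start"), match.group("end")
    width = len(start)  # keep any zero padding from the start value
    return [f"{prefix}{n:0{width}d}{suffix}" for n in range(int(start), int(end) + 1)]


for host in expand("logstash[1001-1003].eqiad.wmnet"):
    print(host)
```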
BBlack created this task. Apr 12 2017, 11:36 PM
BBlack triaged this task as High priority. Apr 12 2017, 11:36 PM

modules/base/files/kernel/blacklist-wmf.conf is probably the place to try disabling this first, FWIW.

Dzahn added a subscriber: Dzahn. Apr 12 2017, 11:43 PM

I wanted to suggest importing the puppet module kmod (https://gerrit.wikimedia.org/r/#/c/348009/) to get kmod::blacklist, but as you point out we already have the blacklist file above, so I abandoned that change again.

Is there a common pattern of affected distros/kernels/server models? I'd prefer if we first try to pinpoint this further before blacklisting it globally to avoid unforeseen side effects.

Dzahn added a comment. Apr 13 2017, 5:23 PM

Is there a common pattern of affected distros/kernels/server models?

host         | ticket  | distro  | kernel              | server model
praseodymium | T123924 | jessie  | 4.9.0-0.bpo.2-amd64 | Dell PowerEdge R320
xenon        | T141675 | jessie  | 4.9.0-0.bpo.2-amd64 | Dell PowerEdge R320
install2001  | T137647 | jessie  | unknown             | Dell PowerEdge R320
nembus       | T110202 | unknown | unknown (T122100)   | Dell PowerEdge R320
baham        | T159870 | jessie  | 4.4.0-3-amd64       | Dell PowerEdge R320

https://access.redhat.com/solutions/508303 , https://serverfault.com/questions/561190/unusual-load-average-for-an-idle-workstation

Dzahn added a comment. Edited Apr 13 2017, 5:24 PM

@MoritzMuehlenhoff ^ Yes, it looks like they are all Dell PowerEdge R320. Re: kernel versions, those might have been upgraded AFTER the incidents on the tickets.
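As a small sketch of the "common pattern" check, the incident rows from the comment above can be tallied per column. The data is copied from that list; the tuple layout and helper name are just illustrative choices.

```python
from collections import Counter

# (host, ticket, distro, kernel, server model), as listed in the comment above.
INCIDENTS = [
    ("praseodymium", "T123924", "jessie", "4.9.0-0.bpo.2-amd64", "Dell PowerEdge R320"),
    ("xenon", "T141675", "jessie", "4.9.0-0.bpo.2-amd64", "Dell PowerEdge R320"),
    ("install2001", "T137647", "jessie", "unknown", "Dell PowerEdge R320"),
    ("nembus", "T110202", "unknown", "unknown (T122100)", "Dell PowerEdge R320"),
    ("baham", "T159870", "jessie", "4.4.0-3-amd64", "Dell PowerEdge R320"),
]

DISTRO, KERNEL, MODEL = 2, 3, 4


def tally(rows, column):
    """Count how often each value appears in the given column."""
    return Counter(row[column] for row in rows)


models = tally(INCIDENTS, MODEL)
known_kernels = {k for k in tally(INCIDENTS, KERNEL) if not k.startswith("unknown")}

print("server models:", dict(models))
print("known kernels:", sorted(known_kernels))
```

The server model is the only column with a single value across all rows, while the known kernels differ, which matches the observation above (keeping in mind the kernels may have been upgraded after the incidents).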

Change 348197 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] base::kernel: add mod blacklist specific to R320, blacklist acpi_pad

https://gerrit.wikimedia.org/r/348197

Catrope added a subscriber: Catrope.

tin is also affected: T163158: acpi_pad consuming 100% CPU on tin. Its kernel version appears to be 4.4.0-3-amd64.

I ran rmmod acpi_pad on tin and it fixed the issue right away. tin is a Dell PowerEdge R320 which is the pattern all affected servers had in common so far, further confirming the theory in the gerrit link above.

Dzahn updated the task description. (Show Details) Apr 17 2017, 11:15 PM

Change 348197 merged by Dzahn:
[operations/puppet@production] base::kernel: mod blacklist for Dell R320, blacklist acpi_pad

https://gerrit.wikimedia.org/r/348197

Mentioned in SAL (#wikimedia-operations) [2017-04-17T23:33:23Z] <mutante> running puppet via cumin on all 16 Dell PowerEdge R320, adding blacklist file for acpi_pad kernel module. 15/16 success, all but tin (T162850)

Mentioned in SAL (#wikimedia-operations) [2017-04-17T23:37:25Z] <mutante> runnin rmmod acpi_pad on the 16 R320 via cumin, since blacklisting in puppet does not actively remove, confirmed unloaded. (16/16) success ratio (>= 100.0% threshold) for command: 'lsmod|grep -c acpi_pad ||:' (T162850)
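As a sketch of the "confirmed unloaded" check: the loaded-module list can also be read from /proc/modules and matched on the exact module name, which avoids substring matches that a plain grep could produce. The sample text below is made up for illustration; the function names are hypothetical.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names from /proc/modules-style text.

    Each non-empty line starts with the module name, followed by size,
    reference count, dependencies, state and address.
    """
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def is_loaded(module, proc_modules_text):
    """True only if the exact module name is present (no substring matching)."""
    return module in loaded_modules(proc_modules_text)


# Made-up sample in the /proc/modules format, for illustration only.
SAMPLE = (
    "acpi_pad 24576 0 - Live 0x0000000000000000\n"
    "mei_me 32768 0 - Live 0x0000000000000000\n"
)

print("acpi_pad loaded:", is_loaded("acpi_pad", SAMPLE))
print("mei loaded:", is_loaded("mei", SAMPLE))
```

On a real host the text would come from reading /proc/modules instead of the sample string.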

Dzahn lowered the priority of this task from High to Normal. Apr 17 2017, 11:42 PM

Now this should either never happen again... or it would affect more than just R320's. But we have not seen a case of that so far.

So.. a bit unsure about ticket status, so for now lowering priority from High to Normal.

Dzahn added a comment. Apr 18 2017, 7:35 PM

All subtasks are resolved. acpi_pad has been unloaded and blacklisted on all Dell R320 machines.

I suggest we try closing it and watch whether it ever happens again. If it does not, the issue was limited to this hardware type; we probably won't buy that model again, so we keep using the existing 16 servers as they are and we are done here.

If it happens again of course we'd reopen this and it'd be very interesting what other platform is affected (which we could add to the blacklist).

Dzahn closed this task as Resolved. Apr 18 2017, 7:36 PM
Dzahn claimed this task.

acamar hit this again on Sunday, in spite of the (working) acpi_pad blacklist. A simple reboot seems to have cleared it. The next-best advice (based on that old Dell info) would be to blacklist mei. I've rmmod'd it on acamar for now to see if it causes additional issues before we try blacklisting it on all hosts.

Volans reopened this task as Open. May 27 2017, 9:51 AM
Volans added a subscriber: Volans.

tin hit this today. I've tried rmmod mei_me and rmmod mei as suggested above, but that didn't fix the problem live. It probably needs a reboot, but I'm not rebooting it right now (see below).

Another suggestion I've found is to do a cold reboot to reset some sensors; see this Dell Support thread for the T320 and this ServerFault thread.
From the Dell Support thread it is not fully clear whether the latest BIOS (2.4.2) helped to solve the issue on the T320 (it seems to have the same version and release date as the one for the R320) or whether it was just the cold reboot.

I suggest performing a cold reboot on tin. If it cannot wait until Tuesday midday UTC (Monday is a bank holiday in the US), we can do a standard reboot now and then try the cold reboot at a later time, or when the next host shows the issue again.

Volans renamed this task from acpi_pad issues to CPU throttling on DELL PowerEdge R320. May 27 2017, 9:53 AM

Mentioned in SAL (#wikimedia-operations) [2017-06-19T09:11:32Z] <paravoid> upgrading achernar's BIOS from 1.2.4 to 2.4.2 hoping it will address recurring CPU throttling issue (T162850)

faidon added a subscriber: faidon. Jun 19 2017, 9:26 AM

This happened again today with achernar. This is clearly a hardware/firmware issue; however, all of the servers listed in the task description (including achernar) run an outdated BIOS version of 1.2.4, with the exception of heze which runs (the also outdated) 1.3.5.

The latest available is 2.4.2, from Apr/Oct 2015. I downloaded the Linux binary and executed it on achernar, then rebooted. Let's give it a few days and then we can proceed with upgrading and rebooting the rest of these hosts.
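A minimal sketch of the version check behind this plan: BIOS versions need to be compared numerically rather than as strings. The versions for achernar and heze and the latest release are taken from the comment above; the helper names are hypothetical.

```python
LATEST_BIOS = "2.4.2"

# Versions reported in the comment above.
REPORTED = {"achernar": "1.2.4", "heze": "1.3.5"}


def parse_version(text):
    """Turn a dotted version such as '1.2.4' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def outdated(hosts, latest=LATEST_BIOS):
    """Return the sorted host names whose BIOS version is older than `latest`."""
    latest_key = parse_version(latest)
    return sorted(host for host, version in hosts.items()
                  if parse_version(version) < latest_key)


print(outdated(REPORTED))
```

Comparing tuples of integers avoids the string-comparison trap where "1.10.0" would sort before "1.9.9".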

Mentioned in SAL (#wikimedia-operations) [2017-08-21T21:26:56Z] <mutante> bast2001 - running Dell BIOS firmware upgrade (T162850)

Mentioned in SAL (#wikimedia-operations) [2017-08-21T21:54:01Z] <mutante> cerium - installing Dell BIOS upgrade (T162850)

Dzahn updated the task description. (Show Details) Aug 21 2017, 10:01 PM
Dzahn updated the task description. (Show Details) Aug 21 2017, 10:30 PM

Done now: cerium, xenon, praseodymium (cassandra test cluster)

Removed machines from the list that are already gone: subra, suhail, mira

Dzahn updated the task description. (Show Details) Aug 21 2017, 10:33 PM
Dzahn updated the task description. (Show Details) Aug 21 2017, 10:35 PM
RobH updated the task description. (Show Details) Aug 21 2017, 11:01 PM
Dzahn updated the task description. (Show Details) Aug 21 2017, 11:06 PM
RobH updated the task description. (Show Details) Aug 22 2017, 6:50 PM

Mentioned in SAL (#wikimedia-operations) [2017-09-08T20:10:17Z] <mutante> heze (bacula storage) - installing BIOS upgrade (T162850)

Dzahn updated the task description. (Show Details) Sep 8 2017, 8:24 PM
Gehel added a subscriber: Gehel. Sep 9 2017, 7:05 PM

Since we are decommissioning logstash100[1-3] fairly soon, it's probably not worth investing any time in them...

Dzahn updated the task description. (Show Details) Sep 13 2017, 12:30 AM
Dzahn added a comment. Sep 13 2017, 2:37 AM

Thanks @Gehel! So only acamar (DNS recursor) and baham (ns1.wm.org auth DNS) are left. @BBlack are there special precautions to be taken before rebooting these? For the recursor, the wikitech page "service restarts" seems to say it's just fine as long as it's just one at a time; it does not even mention depooling. But I see on neodymium I could depool it with confctl. I assume the answer is "just do it quick", but I also heard on IRC that there were some issues around (recursor) availability and Node.js services and that it would be good to ask first. And how about baham (ns1)?

The last time we rebooted DNS recursors eventbus bailed, but that specific bug has been fixed since then. The high level ticket is T171498. So quick depool should be fine, but keep an eye on Icinga and pybal logs.

Change 378188 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/dns: tmp remove acamar from resolv.conf overrides

https://gerrit.wikimedia.org/r/378188

Change 378188 merged by Dzahn:
[operations/puppet@production] site/dns: tmp remove acamar from resolv.conf overrides

https://gerrit.wikimedia.org/r/378188

Mentioned in SAL (#wikimedia-operations) [2017-09-14T23:09:58Z] <mutante> acamar - done with upgrade - rebooting - it's depooled and removed from resolv.conf - T162850

14:53 < bblack> 1) Explicitly depool acamar from codfw recdns (you can confirm it in logs and ipvsadm -Ln output on lvs2002, should be the active LVS for it)
14:54 < bblack> 2) Take acamar's IP out of the special-cased resolv.conf overrides (git grep for 208.80.153.12 in site.pp)
14:55 < bblack> (and make sure that puppet change has applied on the relevant hosts (achernar + lvs200x)

^ I did these things before rebooting acamar and am pasting them here as documentation.

Dzahn updated the task description. (Show Details) Sep 14 2017, 11:23 PM

Acamar is done. It's pooled again, the resolv.conf change is reverted. I saw no issues, no failed services/units after reboot.

Only a single host is left to upgrade and reboot, baham. But this is a bit more complicated to coordinate.