acpi_pad consuming 100% CPU on tin
Closed, Resolved (Public)

Description

5 acpi_pad processes are each consuming 100% CPU on tin. Between that and puppet running, tin is nearly unresponsive, and everything runs very slowly. It's almost impossible to deploy MediaWiki code right now because of this issue. Just tab-completing a directory name takes 1-2 seconds.

Event Timeline

Apparently this previously happened on tin's sister host mira as well: T137647#2791091

greg added a subscriber: greg. (Apr 17 2017, 10:35 PM)

Mentioned in SAL (#wikimedia-operations) [2017-04-17T22:40:33Z] <mutante> tin - rmmod acpi_pad (T163158)

Dzahn claimed this task. (Apr 17 2017, 10:41 PM)

Mentioned in SAL (#wikimedia-operations) [2017-04-17T22:42:19Z] <mutante> tin - load average going down, acpi_pad processes gone, cpu usage low again (T163158)

This change, currently in code review, should prevent this from happening again: https://gerrit.wikimedia.org/r/#/c/348197/

Dzahn closed this task as Resolved. (Apr 17 2017, 10:52 PM)

Closing this one, as tin is back to normal with the short-term fix.

As a follow-up, the change above is already in review and linked to the parent task (formerly known as a "tracking task") for acpi_pad issues on multiple hosts.

I have also removed the module and blacklisted it on all 16 R320 servers, so this should not happen again. See the parent task for more details.
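
For reference, removing and blacklisting a kernel module by hand could look roughly like the sketch below. This is an illustration only, not the change that was actually rolled out (see the Gerrit change and parent task for that). The function names `blacklist_module` and `unload_if_loaded` and the file name `blacklist-<module>.conf` are assumptions made for this example; the `blacklist` directive in `/etc/modprobe.d/` and `rmmod` are standard.

```shell
#!/bin/sh
# Hypothetical helper: write a modprobe blacklist entry for a module.
# $1 = module name, $2 = config directory (defaults to /etc/modprobe.d)
blacklist_module() {
    mod="$1"
    dir="${2:-/etc/modprobe.d}"
    printf 'blacklist %s\n' "$mod" > "$dir/blacklist-$mod.conf"
}

# Hypothetical helper: unload the module only if it is currently loaded.
# Needs root; /proc/modules lists one loaded module per line, name first.
unload_if_loaded() {
    if grep -q "^$1 " /proc/modules 2>/dev/null; then
        rmmod "$1"
    fi
}
```

Run as root, `blacklist_module acpi_pad && unload_if_loaded acpi_pad` would cover both steps. Note that a `blacklist` entry only stops alias-based automatic loading; an explicit `modprobe acpi_pad` would still load the module.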

MoritzMuehlenhoff reopened this task as Open. (Apr 18 2017, 2:55 PM)

The "Improperly owned -0:0- files in /srv/mediawiki-staging" Icinga check was failing on tin because the check timed out before it could complete. It turns out tin is currently running at only approx. 200 MHz (the value below is in kHz):

root@tin:~# cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
176046
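
To check all cores at once rather than only cpu0, something like the following sketch could be used. It assumes the standard cpufreq sysfs layout and that the values are reported in kHz; the function names `khz_to_mhz` and `report_cpu_mhz` are made up for this example, and the optional base-directory argument exists only to make the function easy to try against a test tree.

```shell
#!/bin/sh
# cpufreq sysfs values are in kHz; convert to whole MHz (integer division).
khz_to_mhz() {
    echo $(( $1 / 1000 ))
}

# Print the current frequency of every CPU that exposes cpuinfo_cur_freq.
# $1 = optional base directory (defaults to /sys/devices/system/cpu)
report_cpu_mhz() {
    base="${1:-/sys/devices/system/cpu}"
    for f in "$base"/cpu[0-9]*/cpufreq/cpuinfo_cur_freq; do
        # The file is usually readable by root only; skip what we cannot read.
        [ -r "$f" ] || continue
        cpu=${f#"$base"/}   # strip the base directory prefix
        cpu=${cpu%%/*}      # keep only the "cpuN" component
        echo "$cpu: $(khz_to_mhz "$(cat "$f")") MHz"
    done
}
```

Calling `report_cpu_mhz` as root prints one line per core; for the reading above, `khz_to_mhz 176046` gives 176.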

This is probably not caused by removing acpi_pad with rmmod, but is rather a different aspect of the same bug on those R320 models.

Mentioned in SAL (#wikimedia-operations) [2017-04-18T16:12:27Z] <godog> reboot tin to fix cpu mhz issue and check bios settings - T163158

fgiunchedi closed this task as Resolved. (Apr 18 2017, 5:21 PM)
fgiunchedi added a subscriber: fgiunchedi.

tin has been rebooted. I've enabled HT (Hyper-Threading) and set the performance profile to "performance per watt (OS)". See also the Icinga task for alerting on this, and the parent task.

The "Improperly owned -0:0- files in /srv/mediawiki-staging" Icinga check was failing on tin, caused by a timeout of completing the check in time.

The check now has a timeout value, fwiw (https://gerrit.wikimedia.org/r/#/c/348667/), but it wasn't really needed since the check no longer takes that long.
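
As a rough illustration of what an ownership check with a time limit can look like, here is a minimal sketch. It is not the actual Icinga plugin or the Gerrit change above: the function name `check_owned`, its arguments, and the output wording are assumptions. It relies on `timeout` from GNU coreutils (which exits with 124 when the limit is hit), on `find -quit`, and on the usual Nagios/Icinga exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/sh
# Sketch of an ownership check with a time limit; not the real plugin.
# $1 = directory, $2 = time limit in seconds (default 10),
# $3/$4 = uid/gid to flag (default 0/0, i.e. root:root)
check_owned() {
    dir="$1"; limit="${2:-10}"; uid="${3:-0}"; gid="${4:-0}"
    # -print -quit stops at the first match, so a hit returns quickly.
    hit=$(timeout "$limit" find "$dir" -uid "$uid" -gid "$gid" -print -quit 2>/dev/null)
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "UNKNOWN: check timed out after ${limit}s"
        return 3
    elif [ -n "$hit" ]; then
        echo "CRITICAL: found $hit owned by $uid:$gid"
        return 2
    elif [ "$rc" -ne 0 ]; then
        echo "UNKNOWN: find exited with status $rc"
        return 3
    fi
    echo "OK: no files owned by $uid:$gid in $dir"
    return 0
}
```

For example, `check_owned /srv/mediawiki-staging 10` would report UNKNOWN instead of hanging if the scan took longer than 10 seconds, which is the failure mode seen while the CPU was throttled.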