Five acpi_pad processes are each consuming 100% CPU on tin. Between that and Puppet running, tin is nearly unresponsive and everything runs very slowly: just tab-completing a directory name takes 1-2 seconds. Deploying MediaWiki code is almost impossible right now because of this issue.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | Dzahn | T162850 CPU throttling on DELL PowerEdge R320 |
| Resolved | | Dzahn | T163158 acpi_pad consuming 100% CPU on tin |
Event Timeline
Apparently this previously happened on tin's sister host mira as well: T137647#2791091
Mentioned in SAL (#wikimedia-operations) [2017-04-17T22:40:33Z] <mutante> tin - rmmod acpi_pad (T163158)
Mentioned in SAL (#wikimedia-operations) [2017-04-17T22:42:19Z] <mutante> tin - load average going down, acpi_pad processes gone, cpu usage low again (T163158)
This change, currently in code review, should prevent this from happening again: https://gerrit.wikimedia.org/r/#/c/348197/
Closing this one, as tin is back to normal with the short-term fix.
As a follow-up, the change above is already in review and linked to the parent task (formerly known as "tracking task") for acpi_pad issues on multiple hosts.
I have also removed the module and blacklisted it on all 16 R320 servers now, so this should not happen again. See the parent task for more details.
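To verify the state described above on a given host, a minimal sketch follows. It assumes the blacklist was applied as a standard `blacklist acpi_pad` directive in a modprobe configuration file; the thread does not say how the blacklisting was actually rolled out. The helper names `loaded_modules` and `is_blacklisted` are hypothetical, not part of any existing tool.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text.

    Each line of /proc/modules starts with the module name, followed by
    size, reference count, and other fields separated by whitespace.
    """
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def is_blacklisted(modprobe_conf_text, module):
    """Return True if the text contains a 'blacklist <module>' directive.

    Assumption: the blacklist uses the plain modprobe.d 'blacklist' keyword.
    modprobe treats '-' and '_' in module names as equivalent, so both
    spellings are normalized before comparing.
    """
    wanted = module.replace("-", "_")
    for line in modprobe_conf_text.splitlines():
        # Drop trailing comments, then split into whitespace-separated words.
        fields = line.split("#", 1)[0].split()
        if len(fields) == 2 and fields[0] == "blacklist":
            if fields[1].replace("-", "_") == wanted:
                return True
    return False
```

On a live host, one would pass the contents of `/proc/modules` to `loaded_modules` and the contents of each file under `/etc/modprobe.d/` to `is_blacklisted`. Parsing text rather than opening files inside the helpers keeps them easy to check without root access.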
The "Improperly owned -0:0- files in /srv/mediawiki-staging" Icinga check was failing on tin because the check timed out before it could complete. It turns out tin is currently running at only about 200 MHz:
root@tin:~# cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
176046
This is probably not caused by removing acpi_pad with rmmod, but is a different aspect of the same bug on those R320 models.
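The value in `cpuinfo_cur_freq` is reported in kHz, so 176046 corresponds to roughly 176 MHz. A small sketch of how such a reading could be compared against the CPU's maximum frequency is shown below. The 50% threshold and the function names are assumptions chosen for illustration, not something taken from the Icinga check mentioned in this task.

```python
def khz_to_mhz(khz):
    """Convert a cpufreq sysfs value (kHz) to MHz."""
    return khz / 1000.0


def is_throttled(cur_khz, max_khz, min_ratio=0.5):
    """Return True if the current frequency is below min_ratio of the maximum.

    Assumption: min_ratio=0.5 is an arbitrary illustrative threshold.
    """
    if max_khz <= 0:
        raise ValueError("max_khz must be positive")
    return cur_khz < max_khz * min_ratio


def read_khz(path):
    """Read a single integer (kHz) from a cpufreq sysfs file."""
    with open(path) as fh:
        return int(fh.read().strip())
```

On a host, `read_khz` could be pointed at `/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq` and the sibling `cpuinfo_max_freq` file, where the cpufreq driver exposes them. With the reading above and a hypothetical maximum of 2200000 kHz, `is_throttled(176046, 2200000)` reports the CPU as throttled.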
Mentioned in SAL (#wikimedia-operations) [2017-04-18T16:12:27Z] <godog> reboot tin to fix cpu mhz issue and check bios settings - T163158
tin has been rebooted. I've enabled HT (Hyper-Threading) and set the performance profile to "performance per watt (OS)". See also the Icinga task for alerting on this, and the parent task.
For what it's worth, the check now has a timeout value, although it wasn't really needed since the check no longer takes that long: https://gerrit.wikimedia.org/r/#/c/348667/