Page MenuHomePhabricator

xenon.eqiad.wmnet: very high cpu utilization
Closed, ResolvedPublic

Description

xenon.eqiad.wmnet became unresponsive this afternoon, and was found to be under extremely high CPU utilization. The culprit was a number of kernel acpi_pad processes.

1USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
2[ ... ]
3root 881 96.5 0.0 0 0 ? R 22:23 74:43 [acpi_pad/0]
4root 882 96.5 0.0 0 0 ? R 22:23 74:41 [acpi_pad/1]
5root 889 96.5 0.0 0 0 ? R 22:23 74:40 [acpi_pad/2]
6root 890 96.5 0.0 0 0 ? R 22:23 74:41 [acpi_pad/3]
7root 895 96.5 0.0 0 0 ? R 22:23 74:39 [acpi_pad/4]
8root 896 96.5 0.0 0 0 ? R 22:23 74:38 [acpi_pad/5]
9root 897 96.5 0.0 0 0 ? R 22:23 74:37 [acpi_pad/6]
10root 898 96.5 0.0 0 0 ? R 22:23 74:37 [acpi_pad/7]
11root 900 96.5 0.0 0 0 ? R 22:23 74:36 [acpi_pad/8]
12root 901 96.5 0.0 0 0 ? R 22:23 74:36 [acpi_pad/9]
13root 902 96.5 0.0 0 0 ? R 22:23 74:36 [acpi_pad/10]
14[ ... ]

After a reboot, the machine came back up in the same state, 11 acpi_pad processes saturating the CPU.

Google turned up no shortage of hits for this, most of which seemed to indicate this could happen after disabling hyperthreading (which doesn't seem to be applicable here). Several recommended removing the acpi_pad kernel module, which I did (ephemerally, ala rmmod acpi_pad), which seems to have done the trick (the load is now normal).

See: T123924: acpi_pad runaway processes on praseodymium

Event Timeline

Eevans triaged this task as High priority.Jul 30 2016, 12:15 AM

yea, HT is enabled on xenon..

it seems to start here, when RT throttling gets activated

2680 Jul 29 22:23:40 xenon kernel: [10997327.180547] sched: RT throttling activated
2681 Jul 29 22:24:36 xenon java[10061]: 2016-07-29 22:24:36,174 [DefaultQuartzScheduler_Worker-6] INFO o.w.c.metrics.service.StatsReporter - Writing internal stats
2682 Jul 29 22:24:36 xenon java[10061]: 2016-07-29 22:24:36,964 [DefaultQuartzScheduler_Worker-5] ERROR o.w.c.metrics.service.Collector - Error executing timed task
2683 Jul 29 22:24:36 xenon java[10061]: org.wikimedia.cassandra.metrics.service.TimedTaskException: Timeout of 60 seconds exceeded

as requested i have turned "logical processor" off and on again in BIOS, like on T123924#1941098

as requested i have turned "logical processor" off and on again in BIOS, like on T123924#1941098

I cannot believe that actually worked.

@Dzahn, @fgiunchedi I'm wondering: Should we go ahead and blacklist the acpi_pad module (as @fgiunchedi suggested in T123924: acpi_pad runaway processes on praseodymium), or close this issue (and T123924) and pretend it never happened?

Probably something in between. I would say close it now and if it never happens again that is it. But if it happens again reopen and continue with the blacklisting.

Eevans claimed this task.

Probably something in between. I would say close it now and if it never happens again that is it. But if it happens again reopen and continue with the blacklisting.

Works for me