xenon.eqiad.wmnet: very high cpu utilization
Closed, Resolved · Public

Assigned To

Authored By

	Eevans
	Jul 30 2016, 12:15 AM

Description

xenon.eqiad.wmnet became unresponsive this afternoon, and was found to be under extremely high CPU utilization. The culprit was a number of kernel acpi_pad processes.

P3608 (An Untitled Masterwork)

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
[ ... ]
root 881 96.5 0.0 0 0 ? R 22:23 74:43 [acpi_pad/0]
root 882 96.5 0.0 0 0 ? R 22:23 74:41 [acpi_pad/1]
root 889 96.5 0.0 0 0 ? R 22:23 74:40 [acpi_pad/2]
root 890 96.5 0.0 0 0 ? R 22:23 74:41 [acpi_pad/3]
root 895 96.5 0.0 0 0 ? R 22:23 74:39 [acpi_pad/4]
root 896 96.5 0.0 0 0 ? R 22:23 74:38 [acpi_pad/5]
root 897 96.5 0.0 0 0 ? R 22:23 74:37 [acpi_pad/6]
root 898 96.5 0.0 0 0 ? R 22:23 74:37 [acpi_pad/7]
root 900 96.5 0.0 0 0 ? R 22:23 74:36 [acpi_pad/8]
root 901 96.5 0.0 0 0 ? R 22:23 74:36 [acpi_pad/9]
root 902 96.5 0.0 0 0 ? R 22:23 74:36 [acpi_pad/10]
[ ... ]

After a reboot, the machine came back up in the same state: 11 acpi_pad processes saturating the CPU.
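For anyone wanting to check for this state without eyeballing the process list, here is a minimal sketch in Python. It assumes `ps aux`-style text as input (11 whitespace-separated columns, as in the paste above); the function name `find_runaway_threads` and the 90% threshold are illustrative choices, not anything used on xenon.

```python
def find_runaway_threads(ps_output, name="acpi_pad", cpu_threshold=90.0):
    """Return (pid, cpu, command) for kernel threads of `name` at or above the threshold.

    Assumes `ps aux`-style columns:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    """
    hits = []
    for line in ps_output.splitlines():
        # Split into at most 11 fields so COMMAND keeps any embedded spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] == "USER":
            continue  # header, elision marker, or malformed line
        try:
            pid = int(fields[1])
            cpu = float(fields[2])
        except ValueError:
            continue
        command = fields[10]
        # Kernel threads are shown in brackets, e.g. [acpi_pad/0].
        if command.startswith("[" + name + "/") and cpu >= cpu_threshold:
            hits.append((pid, cpu, command))
    return hits


SAMPLE = """USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 881 96.5 0.0 0 0 ? R 22:23 74:43 [acpi_pad/0]
root 882 96.5 0.0 0 0 ? R 22:23 74:41 [acpi_pad/1]
root 1 0.0 0.0 0 0 ? Ss Jul01 0:05 /sbin/init
"""

if __name__ == "__main__":
    for pid, cpu, command in find_runaway_threads(SAMPLE):
        print(pid, cpu, command)
```

The sample rows are copied from the paste, apart from the `/sbin/init` row, which is made up to show a non-matching process.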

Google turned up no shortage of hits for this, most of which seemed to indicate that this can happen after disabling hyperthreading (which doesn't seem to be applicable here). Several recommended removing the acpi_pad kernel module, which I did (ephemerally, à la rmmod acpi_pad), and that seems to have done the trick (the load is now normal).

See: T123924: acpi_pad runaway processes on praseodymium

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Dzahn	T162850 CPU throttling on DELL PowerEdge R320
		Resolved		Eevans	T141675 xenon.eqiad.wmnet: very high cpu utilization

Event Timeline

Eevans created this task. Jul 30 2016, 12:15 AM

Restricted Application added a subscriber: Aklapper. Jul 30 2016, 12:15 AM

Eevans triaged this task as High priority. Jul 30 2016, 12:15 AM

yea, HT is enabled on xenon..

it seems to start here, when RT throttling gets activated

Jul 29 22:23:40 xenon kernel: [10997327.180547] sched: RT throttling activated
Jul 29 22:24:36 xenon java[10061]: 2016-07-29 22:24:36,174 [DefaultQuartzScheduler_Worker-6] INFO o.w.c.metrics.service.StatsReporter - Writing internal stats
Jul 29 22:24:36 xenon java[10061]: 2016-07-29 22:24:36,964 [DefaultQuartzScheduler_Worker-5] ERROR o.w.c.metrics.service.Collector - Error executing timed task
Jul 29 22:24:36 xenon java[10061]: org.wikimedia.cassandra.metrics.service.TimedTaskException: Timeout of 60 seconds exceeded
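The "RT throttling activated" line is stamped 22:23:40, which lines up with the 22:23 START time of the acpi_pad threads in the ps output. A rough sketch for pulling that marker out of a syslog excerpt follows; the regular expression is fitted to the line format shown here and is an assumption about other hosts' log formats, and `first_rt_throttle` is a hypothetical helper name.

```python
import re

# Matches lines like:
#   Jul 29 22:23:40 xenon kernel: [10997327.180547] sched: RT throttling activated
# Uses search() rather than match() so a leading line-number prefix is tolerated.
RT_THROTTLE = re.compile(
    r"(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (\S+) kernel: "
    r"\[\s*([\d.]+)\] sched: RT throttling activated"
)


def first_rt_throttle(log_lines):
    """Return (timestamp, host, kernel_uptime_seconds) of the first match, or None."""
    for line in log_lines:
        m = RT_THROTTLE.search(line)
        if m:
            return m.group(1), m.group(2), float(m.group(3))
    return None


if __name__ == "__main__":
    lines = [
        "Jul 29 22:23:40 xenon kernel: [10997327.180547] sched: RT throttling activated",
        "Jul 29 22:24:36 xenon java[10061]: org.wikimedia.cassandra.metrics.service."
        "TimedTaskException: Timeout of 60 seconds exceeded",
    ]
    print(first_rt_throttle(lines))
```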

Eevans updated the task description. Jul 30 2016, 12:56 AM

as requested i have turned "logical processor" off and on again in BIOS, like on T123924#1941098

Eevans updated the task description. Aug 1 2016, 6:54 PM

In T141675#2512799, @Dzahn wrote:

as requested i have turned "logical processor" off and on again in BIOS, like on T123924#1941098

I cannot believe that actually worked.

@Dzahn, @fgiunchedi I'm wondering: Should we go ahead and blacklist the acpi_pad module (as @fgiunchedi suggested in T123924: acpi_pad runaway processes on praseodymium), or close this issue (and T123924) and pretend it never happened?
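If the blacklisting route is taken, it would normally be a `blacklist acpi_pad` line in a file under /etc/modprobe.d/. As a sketch only, the Python below checks whether modprobe.d-style text contains such a line; `is_blacklisted` is a hypothetical helper, and the file name in the sample comment is a placeholder, not an existing config on these hosts. Note that, as far as I understand modprobe.d, `blacklist` only suppresses alias-based autoloading, so an explicit `modprobe acpi_pad` would still load the module.

```python
def is_blacklisted(conf_text, module="acpi_pad"):
    """Return True if modprobe.d-style text has a `blacklist <module>` line.

    Dashes and underscores are treated as equivalent, as modprobe does
    for module names.
    """
    wanted = module.replace("-", "_")
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "blacklist":
            if parts[1].replace("-", "_") == wanted:
                return True
    return False


# Placeholder content for a hypothetical /etc/modprobe.d/blacklist-acpi_pad.conf
SAMPLE_CONF = """# avoid runaway acpi_pad kernel threads
blacklist acpi_pad
"""

if __name__ == "__main__":
    print(is_blacklisted(SAMPLE_CONF))
```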

Eevans moved this task from Backlog to Blocked on the Cassandra board. Aug 8 2016, 3:53 PM

Probably something in between. I would say close it now and if it never happens again that is it. But if it happens again reopen and continue with the blacklisting.

In T141675#2534202, @Dzahn wrote:

Probably something in between. I would say close it now and if it never happens again that is it. But if it happens again reopen and continue with the blacklisting.

Works for me

Dzahn added a parent task: T162850: CPU throttling on DELL PowerEdge R320. Apr 12 2017, 11:44 PM

Dzahn mentioned this in T162850: CPU throttling on DELL PowerEdge R320. Apr 13 2017, 5:23 PM
