
CPU stuck on at least a couple of instances
Closed, Resolved, Public

Description

Some instances are suffering from CPU lockups / slowness that makes them rather unresponsive.

An example is deployment-parsoid05, which runs on virt1007 (Ganglia view). The server has spikes of I/O wait (WIO) but otherwise shows low CPU usage.

The dmesg output (T97421#1244842) shows the following error: BUG: soft lockup - CPU#1 stuck for 23s! [nodejs:27358]
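
To find out how often such messages occur and which processes they involve, the kernel log can be scanned for the soft lockup line. The following is only a minimal sketch: the function name `find_soft_lockups` and the sample input are illustrative assumptions, and the pattern is modelled solely on the single message quoted above, so other kernel versions may format the line differently.

```python
import re

# Pattern modelled on the message quoted in this task, e.g.
# "BUG: soft lockup - CPU#1 stuck for 23s! [nodejs:27358]"
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+?):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text):
    """Return one dict per soft lockup message found in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = SOFT_LOCKUP.search(line)
        if match:
            hits.append({
                "cpu": int(match.group("cpu")),
                "seconds": int(match.group("secs")),
                "command": match.group("comm"),
                "pid": int(match.group("pid")),
            })
    return hits


if __name__ == "__main__":
    # Hypothetical input: in practice this would be the saved output of dmesg.
    sample = "BUG: soft lockup - CPU#1 stuck for 23s! [nodejs:27358]"
    for hit in find_soft_lockups(sample):
        print(hit)
```

Feeding it the saved output of `dmesg` from an affected instance would list the CPU, duration, and process for each lockup reported.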

@Petrb reported a similar issue on huggle-d2 which runs on labvirt1005 (ganglia view).

Event Timeline

hashar raised the priority of this task from to Needs Triage.
hashar updated the task description.
hashar added projects: Cloud-VPS, Cloud-Services.
hashar added subscribers: hashar, Petrb.

This is what I see in dmesg:

[Wed Apr 29 08:40:45 2015] intel_rapl: RAPL domain package detection failed
[Wed Apr 29 08:40:45 2015] intel_rapl: RAPL domain core detection failed
[Wed Apr 29 08:40:45 2015] intel_rapl: RAPL domain uncore detection failed
[Wed Apr 29 08:40:45 2015] intel_rapl: RAPL domain dram detection failed
...
[Wed Apr 29 10:05:43 2015] hrtimer: interrupt took 11107333 ns

I recommend checking CPU utilization at the KVM level and making sure that all CPU virtualization features are enabled in the BIOS, etc.
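
As a rough first check on the virtualization host, one could look for the hardware virtualization flags in /proc/cpuinfo (`vmx` for Intel VT-x, `svm` for AMD-V). This is a sketch under assumptions: the helper name `virtualization_flags` is made up for illustration, it only applies to x86 Linux, and a flag being listed shows that the CPU advertises the feature, not that every related BIOS option is enabled.

```python
def virtualization_flags(cpuinfo_text):
    """Return the hardware virtualization flags (vmx, svm) listed in cpuinfo_text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Each logical CPU has its own "flags" line; merge them all.
        if sep and key.strip() == "flags":
            found |= {"vmx", "svm"} & set(value.split())
    return found


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            flags = virtualization_flags(fh.read())
    except OSError:
        # /proc/cpuinfo is not available (for example, not a Linux system).
        flags = set()
    print("virtualization flags:", sorted(flags) if flags else "none found")
```

Run inside a guest, this normally reports nothing unless nested virtualization is exposed, so it is only meaningful on the host itself.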

hashar set Security to None.
hashar updated the task description.

From Coren:

b26e5c79-7190-431c-9fc9-e12bf05c0cd6   deployment-parsoid05   labvirt1005   ACTIVE

deployment-parsoid05 is back up and working lightning fast.

hashar claimed this task.

The deployment-parsoid05 instance has been migrated off labvirt1005, which is suffering from memory issues (T97521). It is now on labvirt1001 and running fine as far as I can tell.

I guess huggle-d2 will be migrated as part of T97521: labvirt1005 memory errors so there is no point in keeping this task open.