
labvirt1006 super busy right now
Closed, Resolved · Public


this needs some attention and/or rebalancing

Event Timeline

Andrew created this task.May 19 2017, 1:13 PM
Restricted Application added a project: Cloud-Services.
Restricted Application added a subscriber: Aklapper.

Graphs over 24 hours:

[Graph: CPU % ×2, 1-day moving average]

Load graph shows it is at roughly 40 load.

Assuming the server has 24 real CPUs, a load of ~40 means roughly 1.7 runnable tasks per core, so it is potentially a bit overcrowded :]

I moved two tools instances off of 1006. No obvious change in CPU metrics so far.

Mentioned in SAL (#wikimedia-labs) [2017-05-31T14:07:54Z] <andrewbogott> migrating tools-exec-1409 to labvirt1009 to reduce CPU load on labvirt1006 (T165753)

Andrew closed this task as Resolved.May 31 2017, 3:26 PM

I moved one more away -- CPU usage is high now but not so high that I'm worried.

Slightly less loaded indeed. Thank you for the rebalance.

hashar reopened this task as Open.Jun 12 2017, 9:30 PM

labvirt1006 still seems heavily loaded. Especially the disk I/O seems very high based on Grafana (6-month view).

Andrew, would you mind checking whether a process could be using too much disk I/O? Maybe it is just a single instance acting strangely; otherwise I am tempted to say the host itself has issues.
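One hedged way to spot a noisy guest from the host side is to rank processes (each libvirt guest shows up as a `qemu` process) by cumulative disk I/O via `/proc`. This is just a sketch assuming a Linux host; `iotop` or `pidstat -d` would give the same answer with rates instead of lifetime totals:

```python
import os

def top_io_processes(n=5):
    """Rank readable processes by cumulative read+write bytes.

    Reads /proc/<pid>/io (Linux-only; other users' entries need root,
    unreadable or vanished processes are simply skipped).
    """
    stats = []
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open(f'/proc/{pid}/io') as f:
                io = dict(line.split(': ') for line in f.read().splitlines())
            with open(f'/proc/{pid}/comm') as f:
                name = f.read().strip()
            total = int(io['read_bytes']) + int(io['write_bytes'])
            stats.append((total, pid, name))
        except (OSError, ValueError, KeyError):
            continue  # no permission, process exited, or unparsable entry
    return sorted(stats, reverse=True)[:n]

for total, pid, name in top_io_processes():
    print(f'{total:>14}  {pid:>7}  {name}')
```

On a labvirt host the top entry would be the qemu process of the offending instance; `ps -p <pid> -o cmd=` then reveals which VM it is.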

The prime offender here is deployment-ms-be04.deployment-prep.eqiad.wmflabs, which is doing some kind of giant Swift operation. I don't know if this is on purpose or in error... hoping @fgiunchedi can chime in.

So that would be T160990 (high I/O on Swift instances). deployment-ms-be03.deployment-prep.eqiad.wmflabs should cause the same issue on whatever labvirt it is running on. Would you mind checking that other labvirt?

I am going to tune the Swift configuration and follow up on T160990.
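For context, the kind of tuning meant here usually means throttling Swift's background replication and auditing daemons, which otherwise scan the whole object store continuously. A hypothetical excerpt of an `object-server.conf` override (parameter names are from upstream Swift; the values are illustrative, not what was actually deployed):

```ini
[object-replicator]
# run full replication passes less frequently (seconds)
interval = 1800
concurrency = 1

[object-auditor]
# rate-limit the auditor's continuous full-disk scans
files_per_second = 5
bytes_per_second = 1000000
```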

ms-be03 is on labvirt1001. It's not the biggest CPU user on that host, but it /is/ the second biggest.

Andrew triaged this task as Normal priority.Jun 13 2017, 8:09 PM
hashar closed this task as Resolved.Jun 27 2017, 8:55 PM

Seems load went down on June 21st in the afternoon (UTC), which is when the lab* hosts were rebooted and instances were possibly reshuffled around.

That shows up nicely on a 7-day graph:

Thank you @Andrew