this needs some attention and/or rebalancing
Description
Related Objects
- Mentioned In
  - T161118: Investigate instances with high "steal" CPU
  - T160990: deployment-ms-be03.deployment-prep and deployment-ms-be04.deployment-prep have high load / system CPU
- Mentioned Here
  - T160990: deployment-ms-be03.deployment-prep and deployment-ms-be04.deployment-prep have high load / system CPU
Event Timeline
Graphs over 24 hours:
[Graphs: CPU % x 2, 1 day moving average]
Load graph shows it is at roughly 40 load.
Assuming the server has 24 real CPUs, it is potentially a bit overcrowded :]
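For reference, the arithmetic behind that remark can be sketched as below. The 24-CPU count is the assumption stated above and 40 is the approximate load read off the graph; the helper name `load_per_cpu` is purely illustrative and not part of any tooling used here.

```python
def load_per_cpu(load_average: float, cpu_count: int) -> float:
    """Return the load average normalised by the number of CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_average / cpu_count


# Approximate figures from the comment above: load ~40, assumed 24 real CPUs.
ratio = load_per_cpu(40.0, 24)
print(f"{ratio:.2f} runnable tasks per CPU")
```

A ratio above 1 means that, on average, more tasks are runnable (or waiting on I/O) than there are CPUs to serve them.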
Mentioned in SAL (#wikimedia-labs) [2017-05-31T14:07:54Z] <andrewbogott> migrating tools-exec-1409 to labvirt1009 to reduce CPU load on labvirt1006 (T165753)
labvirt1006 still seems heavily loaded. In particular, the disk I/O seems very high based on Grafana (6-month view).
Andrew, would you mind checking whether a process could be using too much disk I/O? Maybe it is just a single instance acting strangely; otherwise I am tempted to say the host itself has issues.
The prime offender here is deployment-ms-be04.deployment-prep.eqiad.wmflabs, which is doing some kind of giant Swift operation. I don't know if this is on purpose or in error... hoping @fgiunchedi can chime in.
ms-be03 is on labvirt1001. It's not the biggest CPU user on that host, but it /is/ the second biggest.
Seems load went down on June 21st in the afternoon (UTC), which is when the lab* hosts were rebooted and instances were possibly reshuffled.
That shows up nicely on a 7-day graph:
Thank you @Andrew
