
Labvirt1001 has insanely slow IO
Closed, Resolved (Public)

Description

The VMs running on labvirt1001 seem mostly OK. And yet, actual behavior on the box itself is strange. Creating new VMs is very slow, and puppet runs take 10x longer than on equivalent labvirts.

Actual atop stats for disk IO look fine, so something interesting is going on :(

Event Timeline

Might well be related to T161006, which suggests the scheduler prioritizes mostly based on RAM usage. So we end up with Nodepool instances spawning mostly on the same host, which most probably overloads the CPUs.

@hashar, it has nothing to do with load. There are no VMs running on labvirt1001 and it still has the problem.

The one symptom I'm fixating on is puppet runs. A puppet run on labvirt1001 takes 811.99 seconds.

The same run on labvirt1002 (which is actually doing useful things) takes 37.36 seconds.

Mentioned in SAL (#wikimedia-operations) [2017-03-22T21:18:47Z] <andrewbogott> rebooting labvirt1001 because it is being terrible. https://phabricator.wikimedia.org/T159835

What about running puppet agent -tv --debug --verbose to see what it is taking so long on?

@Paladox what kind of things should I be looking for when running puppet agent -tv --debug --verbose?

And post-reboot it's fast again, dammit.

@yuvipanda hi. For example, running it on gerrit-test3 returns P5113.

So maybe it will tell us what bit it gets stuck on the longest.

It will go through bit by bit.

@Paladox thank you. Do you know how to get timing information out of it?

This comment was removed by Paladox.

Or use the reports: "If you have reports=true in your puppet.conf on the agent, you can see the time spent on each resource type. Reports are stored on the agent in /var/lib/puppet/reports."

Actually, the command is puppet agent -tv --debug --verbose --evaltrace -td
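To pull timing out of an --evaltrace run, something like the following rough Python sketch could rank resources by evaluation time. It assumes the agent prints lines of the form "Info: <resource>: Evaluated in N.NN seconds" and that colour codes are disabled (for example with --color=false); the function name slowest_resources is made up for illustration, so check the pattern against real output before trusting it.

```python
import re

# Assumed shape of an --evaltrace line (verify against your Puppet version):
#   Info: /Stage[main]/Some::Class/File[/some/path]: Evaluated in 0.12 seconds
EVAL_RE = re.compile(
    r"^(?:Info: )?(?P<resource>.+): Evaluated in (?P<secs>\d+(?:\.\d+)?) seconds$"
)


def slowest_resources(lines, top=10):
    """Return up to `top` (seconds, resource) pairs, slowest first."""
    timings = []
    for line in lines:
        match = EVAL_RE.match(line.strip())
        if match:
            timings.append((float(match.group("secs")), match.group("resource")))
    # Tuples sort by seconds first, so reverse order puts the slowest on top.
    timings.sort(reverse=True)
    return timings[:top]


# Hypothetical usage, with the agent output saved to a file first:
#   puppet agent -t --evaltrace --color=false > /tmp/puppet-run.log 2>&1
#   with open("/tmp/puppet-run.log") as fh:
#       for secs, resource in slowest_resources(fh):
#           print(f"{secs:8.2f}s  {resource}")
```

Comparing the top entries from labvirt1001 against the same list from labvirt1002 should show whether the extra time is spread evenly or concentrated in a few resources.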

The current state of this is: I rebooted labvirt1001 and it got better. I've migrated a handful of tools exec nodes back to labvirt1001 and I'm going to keep an eye on it for a few weeks. If everything is still good come mid-April, we can just shrug and repool it.

Change 347887 had a related patch set uploaded (by Andrew Bogott):
[operations/puppet@production] Repool labvirt1001.

https://gerrit.wikimedia.org/r/347887

Change 347887 merged by Andrew Bogott:
[operations/puppet@production] Repool labvirt1001

https://gerrit.wikimedia.org/r/347887

Repooled, seems fine.