
Labvirt1001 has insanely slow IO
Closed, Resolved (Public)

Description

The VMs running on labvirt1001 seem mostly OK. And yet, actual behavior on the box itself is strange. Creating new VMs is very slow, and puppet runs take 10x longer than on equivalent labvirts.

Actual atop stats for disk IO look fine, so something interesting is going on :(

Details

Related Gerrit Patches:
operations/puppet (production): Repool labvirt1001

Event Timeline

Andrew created this task. Mar 7 2017, 3:23 PM
Andrew added projects: Operations, ops-eqiad.
chasemp triaged this task as High priority. Mar 17 2017, 12:36 PM
hashar added a subscriber: hashar. Mar 22 2017, 10:00 AM

Might well be related to T161006, which suggests the scheduler prioritizes mostly based on RAM usage. So we end up with Nodepool instances spawning mostly on the same host, which most probably overloads the CPUs.

@hashar, it has nothing to do with load. There are no VMs running on labvirt1001 and it still has the problem.

The one symptom I'm fixating on is puppet runs. A puppet run on labvirt1001 takes 811.99 seconds.

The same run on labvirt1002 (which is actually doing useful things) takes 37.36 seconds.

Mentioned in SAL (#wikimedia-operations) [2017-03-22T21:18:47Z] <andrewbogott> rebooting labvirt1001 because it is being terrible. https://phabricator.wikimedia.org/T159835

What about running puppet agent -tv --debug --verbose to see what it is spending so long on?

@Paladox what kind of things should I be looking for when running puppet agent -tv --debug --verbose?

And post-reboot it's fast again dammit

@yuvipanda hi, for example, running it on gerrit-test3 returns P5113

So maybe it will tell us what bit it gets stuck on the longest.

It will go through bit by bit.

@Paladox thank you. Do you know how to get timing information out of it?

This comment was removed by Paladox.

Or use reports: "If you have reports=true in your puppet.conf on the agent, you can see the time spent on each resource type. Reports are stored on the agent in /var/lib/puppet/reports."

Actually, the command is puppet agent -tv --debug --verbose --evaltrace -td
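
To get timing information out of that output, one option is to capture it to a file and rank resources by evaluation time. The following is a minimal sketch, not something that was run on labvirt1001. It assumes that --evaltrace emits lines shaped like "Info: /Stage[main]/Foo/File[/x]: Evaluated in 0.12 seconds"; that line format, the function name slowest_resources, and the file name in the usage note are assumptions for illustration, so adjust the regex if your Puppet version prints something different.

```python
import re
from collections import defaultdict

# Assumed evaltrace line shape (verify against your own output):
#   Info: /Stage[main]/Foo/File[/x]: Evaluated in 0.12 seconds
EVAL_RE = re.compile(
    r"(?P<resource>/Stage\[.+): Evaluated in (?P<secs>\d+(?:\.\d+)?) seconds"
)


def slowest_resources(log_text, top=10):
    """Return up to `top` (resource, seconds) pairs, slowest first."""
    totals = defaultdict(float)
    for line in log_text.splitlines():
        match = EVAL_RE.search(line)
        if match:
            # Sum in case the same resource is reported more than once.
            totals[match.group("resource")] += float(match.group("secs"))
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]
```

For example, after saving the agent output to a file (puppet-run.log is a hypothetical name), calling slowest_resources(open("puppet-run.log").read()) returns the ten resources that took longest, which should show which part of the run gets stuck.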

The current state of this is: I rebooted labvirt1001 and it got better. I've migrated a handful of tools exec nodes back to labvirt1001 and I'm going to keep an eye on it for a few weeks. If everything is still good come mid-April, we can just shrug and repool it.

bd808 moved this task from Triage to OpenStack on the Cloud-Services board. Mar 26 2017, 9:02 PM

Change 347887 had a related patch set uploaded (by Andrew Bogott):
[operations/puppet@production] Repool labvirt1001.

https://gerrit.wikimedia.org/r/347887

Change 347887 merged by Andrew Bogott:
[operations/puppet@production] Repool labvirt1001

https://gerrit.wikimedia.org/r/347887

Andrew closed this task as Resolved. Apr 13 2017, 6:41 PM

Repooled, seems fine.