Labvirt1001 has insanely slow IO
Closed, Resolved (Public)
Assigned To

Authored By

	Andrew
	Mar 7 2017, 3:23 PM

Description

The VMs running on labvirt1001 seem mostly OK. And yet, actual behavior on the box itself is strange. Creating new VMs is very slow, and puppet runs take 10x longer than on equivalent labvirts.

Actual atop stats for disk IO look fine, so something interesting is going on :(

Details

	Subject	Repo	Branch	Lines +/-
	Repool labvirt1001	operations/puppet	production	+1 -0

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Andrew	T159721 labvirt1001 and 1002 cannot launch new VMs
		Resolved		Andrew	T159835 Labvirt1001 has insanely slow IO

Event Timeline

Andrew created this task. Mar 7 2017, 3:23 PM

Andrew added projects: SRE, ops-eqiad.

chasemp triaged this task as High priority. Mar 17 2017, 12:36 PM

Might well be related to T161006, which suggests that the scheduler prioritizes mostly based on RAM usage. So we end up with Nodepool instances spawning mostly on the same host, which most probably overloads the CPUs.

@hashar, it has nothing to do with load. There are no VMs running on labvirt1001 and it still has the problem.

The one symptom I'm fixating on is puppet runs. A puppet run on labvirt1001 takes 811.99 seconds.

The same run on labvirt1002 (which is actually doing useful things) takes 37.36 seconds.

Mentioned in SAL (#wikimedia-operations) [2017-03-22T21:18:47Z] <andrewbogott> rebooting labvirt1001 because it is being terrible. https://phabricator.wikimedia.org/T159835

What about running puppet agent -tv --debug --verbose to see what it is taking so long on?

@Paladox what kind of things should I be looking for when running puppet agent -tv --debug --verbose?

And post-reboot it's fast again dammit

@yuvipanda hi, for example, running it on gerrit-test3 returns P5113.

So maybe it will tell us which step it gets stuck on the longest.

It goes through the run step by step.

@Paladox thank you. Do you know how to get timing information out of it?

@yuvipanda puppet agent -tv --debug --verbose --evaltrace

https://ask.puppet.com/question/2755/howto-trace-execution-time-of-components-of-agent-run/

Or use the reports: "If you have reports=true in your puppet.conf on the agent, you can see the time spent on each resource type. Reports are stored on the agent in /var/lib/puppet/reports."

Actually, the command is puppet agent -tv --debug --verbose --evaltrace -td
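For anyone who wants to rank the slow steps rather than eyeball the output, here is a rough sketch in Python. It assumes the --evaltrace output contains lines shaped like "Info: /Stage[main]/...: Evaluated in 0.02 seconds"; that line shape may differ between Puppet versions, and slowest_resources is a made-up helper name, not part of Puppet.

```python
import re

# Assumed line shape from `puppet agent ... --evaltrace` (may vary by version):
#   Info: /Stage[main]/Some/Resource[title]: Evaluated in 0.02 seconds
EVAL_RE = re.compile(
    r"(?P<resource>/Stage\[.*?): Evaluated in (?P<secs>\d+(?:\.\d+)?) seconds"
)


def slowest_resources(lines, top=10):
    """Return the `top` slowest (seconds, resource) pairs, slowest first.

    Hypothetical helper: it only looks at lines matching EVAL_RE and
    silently skips everything else (debug noise, notices, and so on).
    """
    timings = []
    for line in lines:
        match = EVAL_RE.search(line)
        if match:
            timings.append((float(match.group("secs")), match.group("resource")))
    timings.sort(reverse=True)
    return timings[:top]
```

One way to use it would be to save the agent output to a file, pass the open file object to slowest_resources, and print the pairs it returns. Comparing the top entries from labvirt1001 with those from labvirt1002 should show whether the extra time is spread evenly or concentrated in a few resources.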

The current state of this is: I rebooted labvirt1001 and it got better. I've migrated a handful of tools exec nodes back to labvirt1001 and I'm going to keep an eye on it for a few weeks. If everything is still good come mid-April, we can just shrug and repool it.

bd808 moved this task from Triage to OpenStack on the Cloud-Services board. Mar 26 2017, 9:02 PM

Andrew mentioned this in T159721: labvirt1001 and 1002 cannot launch new VMs. Mar 28 2017, 2:36 PM

Change 347887 had a related patch set uploaded (by Andrew Bogott):
[operations/puppet@production] Repool labvirt1001.

https://gerrit.wikimedia.org/r/347887

gerritbot added a project: Patch-For-Review. Apr 12 2017, 5:28 PM

Change 347887 merged by Andrew Bogott:
[operations/puppet@production] Repool labvirt1001

https://gerrit.wikimedia.org/r/347887

Repooled, seems fine.

Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:37 PM