
labvirt1015 crashes
Closed, Resolved · Public

Description

labvirt1015 has now crashed twice. The first time it happened I rebooted it from mgmt, and the console showed an endless stream of gibberish; after a second reboot it came up again and appeared healthy, but it died again a few days later.

Timeline

  • 2017-07-24: Opened ticket to track the crash issue after 2 crashes
  • 2017-08-14: "drained flea power" & cleared system log
  • 2017-08-16: Crashed
  • 2017-08-30: Crashed
  • 2017-09-11: RMA for CPU replacement
  • 2017-09-12: CPU in slot 2 replaced
  • 2017-09-29: Moved 9 VMs to host
  • 2017-10-01: Crashed
  • 2017-10-18: Swapped CPU1 and CPU2
  • 2017-10-20: Stress test, crashed
  • 2017-10-23: RMA for CPU + mainboard replacement
  • 2017-10-25: Mainboard declined; CPU approved
  • 2017-11-03: CPU replaced
  • 2017-11-03: Stress test; no crash
  • 2017-11-05: Stress test; no crash
  • 2017-11-07: Stress test; no crash
  • 2017-11-08: Stress test; no crash

Event Timeline

Restricted Application added a subscriber: Aklapper.

(I should note that there's no data of interest on that box -- reimaging is just fine)

This could be a h/w issue. The h/w system event log shows this:

Record: 1
Date/Time: 04/28/2017 19:37:34
Source: system
Severity: Ok

Description: Log cleared.

Record: 2
Date/Time: 07/20/2017 19:09:56
Source: system
Severity: Critical

Description: CPU 1 machine check error detected.

Record: 3
Date/Time: 07/20/2017 19:09:56
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.

Record: 4
Date/Time: 07/20/2017 19:09:56
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.

Record: 5
Date/Time: 07/20/2017 19:09:56
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.

Record: 6
Date/Time: 07/20/2017 19:09:56
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.

Record: 7
Date/Time: 07/20/2017 19:09:56
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.

Record: 8
Date/Time: 07/20/2017 19:09:56
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.

Record: 9
Date/Time: 07/20/2017 19:09:56
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.

Record: 10
Date/Time: 07/20/2017 19:09:56
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.

Record: 11
Date/Time: 07/20/2017 19:09:56
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.

Record: 12
Date/Time: 07/20/2017 19:09:56
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.

Record: 13
Date/Time: 07/20/2017 19:09:57
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.

Record: 14
Date/Time: 07/20/2017 19:09:57
Source: system
Severity: Ok

Description: An OEM diagnostic event occurred.

Record: 15
Date/Time: 07/20/2017 19:11:38
Source: system
Severity: Ok

Description: A problem was detected related to the previous server boot.

Record: 16
Date/Time: 07/20/2017 19:11:38
Source: system
Severity: Critical

Description: CPU 1 machine check error detected.

Record: 17
Date/Time: 07/20/2017 19:11:38
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.

Thank you, Chris! This is new hardware and we can live without it... can we leave this in your hands to follow up with Dell? Is there any additional info you need?

@Andrew Today, I drained flea power (first step in all Dell troubleshooting) and cleared the system log. Let's let it go and see if you have any more problems.
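
For reference, clearing the iDRAC system event log can be done with the standard Dell racadm utility; a minimal sketch, assuming racadm is available on the host or via the iDRAC (not taken from this ticket):

racadm getsel   # show the current system event log entries
racadm clrsel   # clear the system event log ("cleared the system log" above)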

@Cmjohnson seems like a definite hardware failure to me man, we haven't even put this back in service. Next steps?

@chasemp can you share some logs? I need to take this back to Dell.

Syslog jumps from Aug 15 15:53:01 to Aug 21 18:40:37 with no indication of trouble:

Aug 15 15:53:01 labvirt1015 CRON[136739]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Aug 21 18:40:37 labvirt1015 rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="1678" x-info="http://www.rsyslog.com"] start

There's rather a lot in mcelog, though, which I would otherwise expect to be empty:

For example:

CPU 0 BANK 18 TSC b270de9d648a 
RIP !INEXACT! 10:ffffffff8146c0e8
MISC c0fe2010821cc086 ADDR 3f62282b00 
TIME 1502812434 Tue Aug 15 15:53:54 2017
MCG status:RIPV MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS be200000000b110a MCGSTATUS 5
MCGCAP 7000816 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 79
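
For anyone retracing this, the mcelog entries above can be pulled from the host; a minimal sketch, assuming the default Ubuntu mcelog daemon configuration (which logs to /var/log/mcelog):

sudo tail -n 100 /var/log/mcelog   # recent machine check records written by the daemon
sudo mcelog --client               # or query the running daemon directly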

The h/w log shows:

Record: 16
Date/Time: 08/15/2017 15:55:29
Source: system
Severity: Critical

Description: CPU 1 machine check error detected.

Swapping CPU1 with CPU2 to see if the error follows.

So far the error, at least in the h/w log on the server, has not returned... keeping this open to monitor.

It's down again, icinga says since 2017-08-30 16:14:18

It appears to be the CPU. Creating a task with Dell to replace it.

Record: 16
Date/Time: 08/30/2017 16:13:51
Source: system
Severity: Critical

Description: CPU 2 machine check error detected.

You have successfully submitted request SR953656459.

The CPU in slot 2 has been replaced and racadm log cleared. Please let me know if additional problems pop up.

Return shipping info for the old part:
USPS 9202 3946 5301 2436 6349 52
FEDEX 9611918 2393026 73384976

Resolving this; if you have any further issues please reopen the task.

bd808 moved this task from Done to Needs discussion on the cloud-services-team (Kanban) board.
bd808 subscribed.

@Andrew moved 9 VMs to this host on 2017-09-29. On 2017-10-01 we found it non-responsive to ssh and with this output on the management console:

Screen Shot 2017-10-01 at 2.26.59 PM.png (454×1 px, 16 KB)

We are power cycling and hoping we can get it running long enough to evacuate the VMs.
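
A minimal sketch of the out-of-band power cycle and console access via ipmitool; the .mgmt hostname is an assumption for illustration, not taken from this ticket:

ipmitool -I lanplus -H labvirt1015.mgmt.eqiad.wmnet -U root -a chassis power cycle   # hard power cycle (-a prompts for the password)
ipmitool -I lanplus -H labvirt1015.mgmt.eqiad.wmnet -U root -a sol activate          # watch the serial console during boot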

Console logging on boot:

labvirt1015 login: [   48.163451] kvm [3714]: vcpu0 unhandled rdmsr: 0x611
[   48.169003] kvm [3714]: vcpu0 unhandled rdmsr: 0x639
[   48.174567] kvm [3714]: vcpu0 unhandled rdmsr: 0x641
[   48.180121] kvm [3714]: vcpu0 unhandled rdmsr: 0x619
[   48.446780] kvm [3787]: vcpu0 unhandled rdmsr: 0x611
[   48.452379] kvm [3787]: vcpu0 unhandled rdmsr: 0x639
[   48.457930] kvm [3787]: vcpu0 unhandled rdmsr: 0x641
[   48.463467] kvm [3787]: vcpu0 unhandled rdmsr: 0x619
[   48.494249] kvm [3787]: vcpu0 unhandled rdmsr: 0x611
[   48.499798] kvm [3787]: vcpu0 unhandled rdmsr: 0x639

Note to self: fix cold-migrate to handle already shut down instances

The last syslog entry before the reboot was at Oct 1 01:21:01. It was down for many hours and didn't page because I downtimed it during the hardware replacement and didn't clear the downtime before putting it back into service :( There's nothing in the syslog or kernel log to indicate distress.

Here's the latest mcelog. Without timestamps it's hard to correlate this to the failures but still seems bad.

Those times are around the same time that the instances went down.

My first notification was:

[02:22:35] <icinga2-wm> PROBLEM - puppet on gerrit-test3.git.eqiad.wmflabs is CRITICAL: connect to address 10.68.22.16 port 5666: No route to hostconnect to host gerrit-test3.git.eqiad.wmflabs port 5666: No route to host PROBLEM - check users on gerrit-test3.git.eqiad.wmflabs is CRITICAL: connect to address 10.68.22.16 port 5666: No route to hostconnect to host gerrit-test3.git.eqiad.wmflabs port 5666: No route to host

(BST time).

Final status of the etherpad we were using to coordinate migrations off labvirt1015, for posterity:

https://phabricator.wikimedia.org/T177164
https://phabricator.wikimedia.org/T171473

Remains:

andrewserver <==== deleted


In progress:


Done:
tools-clushmaster-01
search-jessie
gerrit-test3
phab-01
puppet-phabricator
integration-slave-jessie-1003
integration-slave-jessie-1004
wdqs-deploy

The CPU failed again over the weekend:

Record: 2
Date/Time: 10/01/2017 01:21:53
Source: system
Severity: Critical
Description: CPU 2 machine check error detected.

@Cmjohnson what's our next step here? Do we have enough info to request additional replacement parts from Dell? This poor box has been a bit of a dud since it was racked. We can probably figure out how to rig up a load test on it to further stress the CPUs & backplane if we need to document more failures.

@bd808 I swapped the CPUs to see if the error follows the CPU. The replacement that I put in there was refurbished, so there is a possibility it was bad. I cleared the racadm log. Let's monitor and see if the error persists.

We probably need to find a way to put load on this system rather than just letting it sit mostly idle and wondering if an error will reoccur.
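
As a starting point, a minimal load-generation sketch using stress-ng on the host itself; the package name and flags are standard stress-ng, but the run length is an arbitrary choice, not taken from this ticket (the tests below ended up driving the load from test VMs via cumin instead):

sudo apt-get install stress-ng
# load all CPUs, two I/O workers, and one 1 GiB memory worker for four hours
stress-ng --cpu 0 --io 2 --vm 1 --vm-bytes 1G --timeout 4h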

@Andrew scheduled 20 instances on this server; I think 4 came up and the rest failed.

2017-10-18 18:13:49.714 2530 ERROR nova.compute.manager     macs=jsonutils.to_primitive(macs))
2017-10-18 18:13:49.714 2530 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
2017-10-18 18:13:49.714 2530 ERROR nova.compute.manager     retry=self.retry)
2017-10-18 18:13:49.714 2530 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 90, in _send
2017-10-18 18:13:49.714 2530 ERROR nova.compute.manager     timeout=timeout, retry=retry)
2017-10-18 18:13:49.714 2530 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 462, in send
2017-10-18 18:13:49.714 2530 ERROR nova.compute.manager     retry=retry)
2017-10-18 18:13:49.714 2530 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 451, in _send
2017-10-18 18:13:49.714 2530 ERROR nova.compute.manager     result = self._waiter.wait(msg_id, timeout)
2017-10-18 18:13:49.714 2530 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 348, in wait
2017-10-18 18:13:49.714 2530 ERROR nova.compute.manager     message = self.waiters.get(msg_id, timeout=timeout)
2017-10-18 18:13:49.714 2530 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 253, in get
2017-10-18 18:13:49.714 2530 ERROR nova.compute.manager     'to message ID %s' % msg_id)
2017-10-18 18:13:49.714 2530 ERROR nova.compute.manager MessagingTimeout: Timed out waiting for a reply to message ID 6306f0571368461da272b44d0eb25055
2017-10-18 18:13:49.714 2530 ERROR nova.compute.manager
2017-10-18 18:13:49.720 2530 ERROR nova.compute.manager [req-4d357a2b-1c9b-48fe-8285-bba22564ead2 novaadmin testlabs - - -] [instance: ead4c5fd-61b1-4975-931b-058ddcab839c] Instance failed to spawn
2017-10-18 18:13:49.720 2530 ERROR nova.compute.manager [instance: ead4c5fd-61b1-4975-931b-058ddcab839c] Traceback (most recent call last):
2017-10-18 18:13:49.720 2530 ERROR nova.compute.manager [instance: ead4c5fd-61b1-4975-931b-058ddcab839c]   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2156, in _build_resources
2017-10-18 18:13:49.720 2530 ERROR nova.compute.manager [instance: ead4c5fd-61b1-4975-931b-058ddcab839c]     yield resources
2017-10-18 18:13:49.720 2530 ERROR nova.compute.manager [instance: ead4c5fd-61b1-4975-931b-058ddcab839c]   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2009, in _build_and_run_instance
2017-10-18 18:13:49.720 2530 ERROR nova.compute.manager [instance: ead4c5fd-61b1-4975-931b-058ddcab839c]     block_device_info=block_device_info)
2017-10-18 18:13:49.720 2530 ERROR nova.compute.manager [instance: ead4c5fd-61b1-4975-931b-058ddcab839c]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 2531, in spawn
2017-10-18 18:13:49.720 2530 ERROR nova.compute.manager [instance: ead4c5fd-61b1-4975-931b-058ddcab839c]     write_to_disk=True)
2017-10-18 18:13:49.720 2530 ERROR nova.compute.manager [instance: ead4c5fd-61b1-4975-931b-058ddcab839c]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 4413, in _get_guest_xml
2017-10-18 18:13:49.720 2530 ERROR nova.compute.manager [instance: ead4c5fd-61b1-4975-931b-058ddcab839c]     network_info_str = str(network_info)
2017-10-18 18:13:49.720 2530 ERROR nova.compute.manager [instance: ead4c5fd-61b1-4975-931b-058ddcab839c]   File "/usr/lib/python2.7/dist-packages/nova/network/model.py", line 517, in __str__
2017-10-18 18:13:49.720 2530 ERROR nova.compute.manager [instance: ead4c5fd-61b1-4975-931b-058ddcab839c]     return self._sync_wrapper(fn, *args, **kwargs)
2017-10-18 18:13:49.720 2530 ERROR nova.compute.manager [instance: ead4c5fd-61b1-4975-931b-058ddcab839c]   File "/usr/lib/python2.7/dist-packages/nova/network/model.py", line 500, in _sync_wrapper
2017-10-18 18:13:49.720 2530 ERROR nova.compute.manager [instance: ead4c5fd-61b1-4975-931b-058ddcab839c]     self.wait()
2017-10-18 18:13:49.720 2530 ERROR nova.compute.manager [instance: ead4c5fd-61b1-4975-931b-058ddcab839c]   File "/usr/lib/python2.7/dist-packages/nova/network/model.py", line 532, in wait
2017-10-18 18:13:49.720 2530 ERROR nova.compute.manager [instance: ead4c5fd-61b1-4975-931b-058ddcab839c]     self[:] = self._gt.wait()

and at the moment they are failing to honor delete/removal and are really screwed up generally. This server is hosed.

I think the VM creation failure was a (mostly? completely?) unrelated issue. I've rescheduled some actually running VMs there, and will see how they do.

Still no sign of failure in the h/w log... it took a while last time.

I installed 20 VMs and ran stress-ng on each of them, like this:

andrew@labpuppetmaster1001:~$ sudo cumin "name:labvirt1015stresstest*" "stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G &"

The system locked up more-or-less immediately.

Yeah, I see that they have failed again. I don't know why... I tried swapping in another spare from the decommissioned ms-be servers and it lights up but still shows failed. The R720s have a history of killing off a PSU and every subsequent PSU you replace it with. These are out of warranty and probably should be replaced.

I don't quite follow this -- this server is more-or-less brand new, how can its components be out of warranty?

@Andrew, I wrote that in the wrong ticket.

@Cmjohnson, can we RMA this back to oblivion yet? :D

@chasemp, no, unfortunately it does not work that way. A new CPU and motherboard have been requested through Dell. I believe that will fix the issue. The CPU they sent the first time was refurbished, so it's possible it was bad.

Dell ticket request SR955632952.

OK, thanks @Cmjohnson. We'll hang tight for the new board.

Dell declined the new system board. We are getting another CPU, since that is the part that seems to be broken.

Timeline?

The CPU was replaced and the iDRAC log cleared.

!log simulate load for labvirt1015 'sudo cumin "name:labvirt1015stresstest*" "stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G &"'

I started our 20 test instances here and issued the same command to generate load and it died pretty much immediately.

labcontrol1001:~# nova list --all-tenants | grep labvirt1015stresstest
| c35f555e-0d56-464b-a080-9bc5733de6ef | labvirt1015stresstest-1            | testlabs                    | ACTIVE  | -          | Running     | public=10.68.16.56                                  |
| 2c0d2316-d7cb-4bbc-ab8e-6b67d93899fa | labvirt1015stresstest-10           | testlabs                    | ACTIVE  | -          | Running     | public=10.68.21.246                                 |
| af57375c-2347-46a9-a8eb-b3cac2a47a59 | labvirt1015stresstest-11           | testlabs                    | ACTIVE  | -          | Running     | public=10.68.23.123                                 |
| 98a6e1a6-6b26-49d5-aade-1442e4577a41 | labvirt1015stresstest-12           | testlabs                    | ACTIVE  | -          | Running     | public=10.68.16.16                                  |
| 73909456-ae13-4be6-8b13-a1f9bb5fd4e0 | labvirt1015stresstest-13           | testlabs                    | ACTIVE  | -          | Running     | public=10.68.19.90                                  |
| 398dc6e8-1950-482c-903e-a938e0a00a5f | labvirt1015stresstest-14           | testlabs                    | ACTIVE  | -          | Running     | public=10.68.16.63                                  |
| 476967ca-7ffb-446c-a57e-7338131db919 | labvirt1015stresstest-15           | testlabs                    | ACTIVE  | -          | Running     | public=10.68.21.165                                 |
| 4c9e8d46-010b-4245-a9a7-388e74c4ae6e | labvirt1015stresstest-16           | testlabs                    | ACTIVE  | -          | Running     | public=10.68.19.111                                 |
| 37dd5fbb-0280-4707-9af9-6f04c297986d | labvirt1015stresstest-17           | testlabs                    | ACTIVE  | -          | Running     | public=10.68.20.182                                 |
| e4dea18c-b31f-4157-bc35-7ce74c6cf16c | labvirt1015stresstest-18           | testlabs                    | ACTIVE  | -          | Running     | public=10.68.18.251                                 |
| ef57b6e3-6cc2-4cbc-a48f-1ae1ed3f173a | labvirt1015stresstest-19           | testlabs                    | ACTIVE  | -          | Running     | public=10.68.18.197                                 |
| a86fbeeb-9e98-4316-bafe-43db41216da9 | labvirt1015stresstest-2            | testlabs                    | ACTIVE  | -          | Running     | public=10.68.22.38                                  |
| 37cbaf64-adfd-4b72-a834-20d7a934d2a2 | labvirt1015stresstest-20           | testlabs                    | ACTIVE  | -          | Running     | public=10.68.19.12                                  |
| f15c5cf4-92f4-4e50-bb74-0660671f33ef | labvirt1015stresstest-3            | testlabs                    | ACTIVE  | -          | Running     | public=10.68.21.85                                  |
| 326cf493-a040-4c2d-bffe-a40b5f6edbca | labvirt1015stresstest-4            | testlabs                    | ACTIVE  | -          | Running     | public=10.68.22.97                                  |
| 94d43e1c-404e-49b9-8452-6bd6309e75d0 | labvirt1015stresstest-5            | testlabs                    | ACTIVE  | -          | Running     | public=10.68.23.149                                 |
| c586bc66-2397-4eed-aa4a-5ce27ac2e9b1 | labvirt1015stresstest-6            | testlabs                    | ACTIVE  | -          | Running     | public=10.68.21.144                                 |
| 21d1a846-dec1-4062-8be2-8c6f488c8cf2 | labvirt1015stresstest-7            | testlabs                    | ACTIVE  | -          | Running     | public=10.68.22.6                                   |
| bd8739e3-5d05-42ec-90d4-c6be7efddf3f | labvirt1015stresstest-8            | testlabs                    | ACTIVE  | -          | Running     | public=10.68.20.58                                  |
| 7ff7a6f2-0524-4fe7-93a7-102af7c5cdf2 | labvirt1015stresstest-9            | testlabs                    | ACTIVE  | -          | Running     | public=10.68.17.76                                  |

sudo cumin "name:labvirt1015stresstest*" "stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G &"

I have to jump into an interview and will try to follow up later to grab any logs

EDIT: SSH is unavailable; @madhuvishy was kind enough to reboot and take a look at the logs.

Screen Shot 2017-11-02 at 3.53.37 PM.png (606×917 px, 54 KB)

It gets turned back on and then dies under load. This is now feeling like: how can we get this manufacturer to replace their equipment?

@chasemp FYI, if you add the project to the cumin query it is immediate (as compared to going over all projects), and the OpenStack API already does a regex, so the prefix is enough without any special character. To summarize, project:testlabs name:labvirt1015stresstest should do 😉.
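
Concretely, the suggested form of the load command from the earlier rounds would be (same stress-ng invocation, just with the project filter added):

sudo cumin "project:testlabs name:labvirt1015stresstest" 'stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G &'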

@chasemp please try again, I replaced the broken CPU.

Mentioned in SAL (#wikimedia-cloud) [2017-11-03T16:46:40Z] <bd808> Running stress-ng test on labvirt1015stresstest* vms for T171473

How to run a stress test:

$ ssh labcontrol1001.wikimedia.org
$ source <(sudo cat ~root/novaenv.sh)
$ nova list --tenant=testlabs | grep labvirt1015stresstest | awk '{print $2}' | xargs nova start
Request to start server c35f555e-0d56-464b-a080-9bc5733de6ef has been accepted.
Request to start server 2c0d2316-d7cb-4bbc-ab8e-6b67d93899fa has been accepted.
Request to start server af57375c-2347-46a9-a8eb-b3cac2a47a59 has been accepted.
Request to start server 98a6e1a6-6b26-49d5-aade-1442e4577a41 has been accepted.
Request to start server 73909456-ae13-4be6-8b13-a1f9bb5fd4e0 has been accepted.
Request to start server 398dc6e8-1950-482c-903e-a938e0a00a5f has been accepted.
Request to start server 476967ca-7ffb-446c-a57e-7338131db919 has been accepted.
Request to start server 4c9e8d46-010b-4245-a9a7-388e74c4ae6e has been accepted.
Request to start server 37dd5fbb-0280-4707-9af9-6f04c297986d has been accepted.
Request to start server e4dea18c-b31f-4157-bc35-7ce74c6cf16c has been accepted.
Request to start server ef57b6e3-6cc2-4cbc-a48f-1ae1ed3f173a has been accepted.
Request to start server a86fbeeb-9e98-4316-bafe-43db41216da9 has been accepted.
Request to start server 37cbaf64-adfd-4b72-a834-20d7a934d2a2 has been accepted.
Request to start server f15c5cf4-92f4-4e50-bb74-0660671f33ef has been accepted.
Request to start server 326cf493-a040-4c2d-bffe-a40b5f6edbca has been accepted.
Request to start server 94d43e1c-404e-49b9-8452-6bd6309e75d0 has been accepted.
Request to start server c586bc66-2397-4eed-aa4a-5ce27ac2e9b1 has been accepted.
Request to start server 21d1a846-dec1-4062-8be2-8c6f488c8cf2 has been accepted.
Request to start server bd8739e3-5d05-42ec-90d4-c6be7efddf3f has been accepted.
Request to start server 7ff7a6f2-0524-4fe7-93a7-102af7c5cdf2 has been accepted.
$ nova list --tenant=testlabs | grep labvirt1015stresstest
| c35f555e-0d56-464b-a080-9bc5733de6ef | labvirt1015stresstest-1  | testlabs  | ACTIVE | -          | Running     | public=10.68.16.56  |
| 2c0d2316-d7cb-4bbc-ab8e-6b67d93899fa | labvirt1015stresstest-10 | testlabs  | ACTIVE | -          | Running     | public=10.68.21.246 |
| af57375c-2347-46a9-a8eb-b3cac2a47a59 | labvirt1015stresstest-11 | testlabs  | ACTIVE | -          | Running     | public=10.68.23.123 |
| 98a6e1a6-6b26-49d5-aade-1442e4577a41 | labvirt1015stresstest-12 | testlabs  | ACTIVE | -          | Running     | public=10.68.16.16  |
| 73909456-ae13-4be6-8b13-a1f9bb5fd4e0 | labvirt1015stresstest-13 | testlabs  | ACTIVE | -          | Running     | public=10.68.19.90  |
| 398dc6e8-1950-482c-903e-a938e0a00a5f | labvirt1015stresstest-14 | testlabs  | ACTIVE | -          | Running     | public=10.68.16.63  |
| 476967ca-7ffb-446c-a57e-7338131db919 | labvirt1015stresstest-15 | testlabs  | ACTIVE | -          | Running     | public=10.68.21.165 |
| 4c9e8d46-010b-4245-a9a7-388e74c4ae6e | labvirt1015stresstest-16 | testlabs  | ACTIVE | -          | Running     | public=10.68.19.111 |
| 37dd5fbb-0280-4707-9af9-6f04c297986d | labvirt1015stresstest-17 | testlabs  | ACTIVE | -          | Running     | public=10.68.20.182 |
| e4dea18c-b31f-4157-bc35-7ce74c6cf16c | labvirt1015stresstest-18 | testlabs  | ACTIVE | -          | Running     | public=10.68.18.251 |
| ef57b6e3-6cc2-4cbc-a48f-1ae1ed3f173a | labvirt1015stresstest-19 | testlabs  | ACTIVE | -          | Running     | public=10.68.18.197 |
| a86fbeeb-9e98-4316-bafe-43db41216da9 | labvirt1015stresstest-2  | testlabs  | ACTIVE | -          | Running     | public=10.68.22.38  |
| 37cbaf64-adfd-4b72-a834-20d7a934d2a2 | labvirt1015stresstest-20 | testlabs  | ACTIVE | -          | Running     | public=10.68.19.12  |
| f15c5cf4-92f4-4e50-bb74-0660671f33ef | labvirt1015stresstest-3  | testlabs  | ACTIVE | -          | Running     | public=10.68.21.85  |
| 326cf493-a040-4c2d-bffe-a40b5f6edbca | labvirt1015stresstest-4  | testlabs  | ACTIVE | -          | Running     | public=10.68.22.97  |
| 94d43e1c-404e-49b9-8452-6bd6309e75d0 | labvirt1015stresstest-5  | testlabs  | ACTIVE | -          | Running     | public=10.68.23.149 |
| c586bc66-2397-4eed-aa4a-5ce27ac2e9b1 | labvirt1015stresstest-6  | testlabs  | ACTIVE | -          | Running     | public=10.68.21.144 |
| 21d1a846-dec1-4062-8be2-8c6f488c8cf2 | labvirt1015stresstest-7  | testlabs  | ACTIVE | -          | Running     | public=10.68.22.6   |
| bd8739e3-5d05-42ec-90d4-c6be7efddf3f | labvirt1015stresstest-8  | testlabs  | ACTIVE | -          | Running     | public=10.68.20.58  |
| 7ff7a6f2-0524-4fe7-93a7-102af7c5cdf2 | labvirt1015stresstest-9  | testlabs  | ACTIVE | -          | Running     | public=10.68.17.76  |
$ ssh labs-puppetmaster.wikimedia.org
$ sudo cumin "project:testlabs name:labvirt1015stresstest" 'stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G &'
20 hosts will be targeted:
labvirt1015stresstest-[1-20].testlabs.eqiad.wmflabs
Confirm to continue [y/n]? y
PASS:  |                              |   0% (0/20) [00:00<?, ?hosts/s]
FAIL:  |                              |   0% (0/20) [00:00<?, ?hosts/s]
$ ssh labvirt1015.eqiad.wmnet
$ watch w
Every 2.0s: w                                           Fri Nov  3 16:51:37 2017

 16:51:37 up 49 min,  1 user,  load average: 32.86, 24.19, 11.59
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
bd808    pts/2    bast1001.wikimed 16:39   33.00s  0.08s  0.00s watch w

Mentioned in SAL (#wikimedia-cloud) [2017-11-05T23:57:04Z] <bd808> Running stress-ng test on labvirt1015stresstest* vms for T171473

Mentioned in SAL (#wikimedia-cloud) [2017-11-07T16:16:13Z] <bd808> Running 3rd round of stress tests on labvirt1015 (T171473)

Mentioned in SAL (#wikimedia-cloud) [2017-11-08T20:51:09Z] <bd808> Running stress-ng test on labvirt1015stresstest* vms for T171473

still seems up and even responsive!

Mentioned in SAL (#wikimedia-operations) [2017-11-14T19:09:27Z] <chasemp> for i in $(OS_TENANT_NAME=testlabs openstack server list | grep stress | awk '{print $2}'); do echo $i; OS_TENANT_NAME=testlabs openstack server delete $i; sleep 30; done T171473

After 4 rounds I am going to purge the stress-test instances and see about cycling this into some real-world load.
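
A minimal sketch of how real-world load could be pinned to this host while monitoring it, using nova's zone:host scheduling syntax; the flavor, image, and instance name here are placeholders, not taken from this ticket:

nova boot --flavor m1.medium --image debian-jessie \
    --availability-zone nova:labvirt1015 canary-on-labvirt1015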

Andrew added a parent task: Unknown Object (Task). · Jul 23 2019, 3:13 PM