labvirt1008 rebooted / system was overheated
Closed, ResolvedPublic
Assigned To

Authored By

	MoritzMuehlenhoff
	Feb 14 2018, 8:00 AM

Description

This morning at 7:20 UTC labvirt1008 rebooted. Hardware log shows that the system overheated:

number=8
severity=Critical
date=02/14/2018
time=07:10
description=Critical Temperature Threshold Exceeded (Temperature Sensor 21, Location System, Temperature 127C)

number=09
severity=Caution
date=02/14/2018
time=07:10
description=System Overheating (Temperature Sensor 21, Location System, Temperature 127C)

number=10
severity=Critical
date=02/14/2018
time=07:11
description=Automatic Operating System Shutdown Initiated Due to Overheat Condition

number=11
severity=Caution
date=02/14/2018
time=07:20
description=POST Error: 1792-Slot X Drive Array - Valid Data Found in Cache Module. Data will automatically be written to drive array.
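For anyone who wants to triage these entries programmatically, here is a minimal sketch that splits the blank-line-separated `key=value` blocks into records and filters the Critical ones. The function names (`parse_entries`, `critical_entries`) and the `when` field are hypothetical, and the layout is assumed to match the excerpt above; this is not a tool used in this task.

```python
import re
from datetime import datetime

# Three entries copied from the hardware log above.
SAMPLE_LOG = """\
number=8
severity=Critical
date=02/14/2018
time=07:10
description=Critical Temperature Threshold Exceeded (Temperature Sensor 21, Location System, Temperature 127C)

number=09
severity=Caution
date=02/14/2018
time=07:10
description=System Overheating (Temperature Sensor 21, Location System, Temperature 127C)

number=10
severity=Critical
date=02/14/2018
time=07:11
description=Automatic Operating System Shutdown Initiated Due to Overheat Condition
"""


def parse_entries(text):
    """Split blank-line-separated key=value blocks into dicts (hypothetical helper)."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        entry = {}
        for line in block.splitlines():
            # Split on the first "=" only, so descriptions may contain "=".
            key, _, value = line.partition("=")
            entry[key.strip()] = value.strip()
        # Combine the separate date and time fields into one timestamp.
        entry["when"] = datetime.strptime(
            entry["date"] + " " + entry["time"], "%m/%d/%Y %H:%M"
        )
        entries.append(entry)
    return entries


def critical_entries(entries):
    """Keep only the entries whose severity is Critical."""
    return [e for e in entries if e["severity"] == "Critical"]


if __name__ == "__main__":
    for e in critical_entries(parse_entries(SAMPLE_LOG)):
        print(e["when"].isoformat(), e["description"])
```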

Not sure about the best mitigation; maybe a fan died or it needs thermal paste?

https://lists.wikimedia.org/pipermail/cloud-announce/2018-February/000023.html
https://wikitech.wikimedia.org/wiki/Incident_documentation/20180214-labvirt1008-failure

Details

Subject	Repo	Branch	Lines +/-
openstack: nova active kvm check to alert multiple groups	operations/puppet	production	+5 -4
openstack: make nova compute kvm monitoring optional	operations/puppet	production	+21 -16
openstack: monitor kvm processes on labvirts	operations/puppet	production	+17 -9
openstack: fix overlapping descriptions for nova-compute check	operations/puppet	production	+2 -2
openstack: alert on down/unreachable nova-compute early	operations/puppet	production	+27 -0
icinga: create a wmcs contact group for some aggressive alerting	operations/puppet	production	+5 -0

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• Cmjohnson	T187292 labvirt1008 rebooted / system was overheated
		Invalid		Andrew	T187317 Evacuate relevant instances off of labvirt1008

Event Timeline

MoritzMuehlenhoff created this task. Feb 14 2018, 8:00 AM

Restricted Application added a subscriber: Aklapper. Feb 14 2018, 8:00 AM

Thanks @moritz

Luckily the Toolforge instances here are a mix we could afford to have down. @Andrew let's sync up on this?

Paladox subscribed. Feb 14 2018, 11:46 AM

nova list --host labvirt1008 --all-tenants | awk '{print $4,$6}' | grep -v 'Name Tenant' | tr " " .

accounts-appserver4.account-creation-assistance
accounts-mwoauth.account-creation-assistance
bastion-02.bastion
bastion-restricted-02.bastion
bf-wmpageview.butterfly
chat-bots.mobile
ci-jessie-wikimedia-965167.contintcloud
ci-jessie-wikimedia-965171.contintcloud
ci-jessie-wikimedia-965176.contintcloud
ci-jessie-wikimedia-965182.contintcloud
ci-jessie-wikimedia-965183.contintcloud
ci-jessie-wikimedia-965184.contintcloud
ci-jessie-wikimedia-965185.contintcloud
client.nonfreewiki
commonsarchive-production.commonsarchive
cxserver2.language
dashboardchat.globaleducation
deployment-changeprop.deployment-prep
deployment-elastic05.deployment-prep
deployment-ircd.deployment-prep
deployment-mathoid.deployment-prep
deployment-sca02.deployment-prep
drmf2016.math
huggle-pg.huggle
incubator-web.incubator
integration-slave-jessie-1001.integration
integration-slave-jessie-1002.integration
k8s-bastion.chasetestproject
language-mleb-master.language
ldfclient.wikidata-query
math-ru.math
mwaas-k8-node-02.scrumbugz
mwoffliner1.mwoffliner
mwv-apt-01.mwv-apt
newsletter-test.newsletter
ores-lb-02.ores
ores-worker-04.ores
overpass-wiki.maps
puppetmaster-keith.puppet
reflex2.design
rel.search
stack.reading-web-staging
tools-docker-builder-05.tools
tools-exec-1413.tools
tools-exec-1442.tools
tools-webgrid-lighttpd-1427.tools
tools-webgrid-lighttpd-1428.tools
torproxy.security-tools
udpmx-01.ircd
video-redis.video
wikidataconcepts.wikidataconcepts
wikiedu-dashboard-staging.globaleducation
wikilabels-experiment.wikilabels
wikilabels-staging-01.wikilabels
wikimetrics-staging.wikimetrics
wikimetrics-test.wikimetrics
wmde-wikidiff2-patched.wikidiff2-wmde-dev
zk1-1.analytics
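To get a quick sense of which projects are affected, the listing above can be grouped by project. This is a rough sketch, assuming each line is in `instance.project` form and that project names contain no dots; `instances_by_project` is a hypothetical helper and the sample is only a few lines copied from the list.

```python
from collections import Counter

# A few lines copied from the listing above.
SAMPLE = """\
bastion-02.bastion
bastion-restricted-02.bastion
tools-exec-1413.tools
tools-exec-1442.tools
tools-webgrid-lighttpd-1427.tools
ores-lb-02.ores
"""


def instances_by_project(lines):
    """Count instances per project, taking the text after the last dot as the project."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or "." not in line:
            continue  # skip blank or malformed lines
        _name, project = line.rsplit(".", 1)
        counts[project] += 1
    return counts


if __name__ == "__main__":
    for project, n in instances_by_project(SAMPLE.splitlines()).most_common():
        print(f"{project}: {n}")
```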

• chasemp updated the task description. Feb 14 2018, 12:58 PM

• chasemp added a project: Cloud-VPS.

• chasemp updated the task description. Feb 14 2018, 1:11 PM

aborrero subscribed. Feb 14 2018, 1:16 PM

• chasemp triaged this task as High priority. Feb 14 2018, 1:40 PM

• Mholloway subscribed. Feb 14 2018, 2:09 PM

define service {
# --PUPPET_NAME-- labvirt1008 disk_space
	active_checks_enabled          1
	check_command                  nrpe_check!check_disk_space!10
	check_freshness                0
	check_interval                 1
	check_period                   24x7
	contact_groups                 admins,sms,admins
	host_name                      labvirt1008
	is_volatile                    0
	max_check_attempts             3
	notification_interval          240
	notification_options           c,r,f
	notification_period            24x7
	notifications_enabled          1
	passive_checks_enabled         1
	retry_interval                 1
	service_description            Disk space
	servicegroups                  labvirt_eqiad

}

The sms group was never notified, to my knowledge.
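As a reading aid for the service definition above, here is a small sketch that parses the directives and checks which problem states the `notification_options` letters cover. In Nagios-style service objects the letters are w (warning), u (unknown), c (critical), r (recovery), f (flapping) and s (scheduled downtime), so `c,r,f` does not include UNKNOWN. This is a simplified model with hypothetical helper names; it ignores contact-level options, time periods and host-level notifications, and it is not a statement about why the sms group was not paged.

```python
# The service definition pasted above (tabs replaced by spaces).
SAMPLE = """\
define service {
# --PUPPET_NAME-- labvirt1008 disk_space
    active_checks_enabled          1
    check_command                  nrpe_check!check_disk_space!10
    contact_groups                 admins,sms,admins
    host_name                      labvirt1008
    max_check_attempts             3
    notification_options           c,r,f
    notifications_enabled          1
    service_description            Disk space
}
"""

# Problem states and the option letter that enables notifications for each.
STATE_TO_OPTION = {"WARNING": "w", "UNKNOWN": "u", "CRITICAL": "c"}


def parse_service_block(text):
    """Return the directives of a single 'define service' block as a dict."""
    directives = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("define") or line == "}":
            continue
        parts = line.split(None, 1)  # directive name, then the rest as the value
        if len(parts) == 2:
            directives[parts[0]] = parts[1].strip()
    return directives


def would_notify(directives, state):
    """Simplified check: are notifications on, and is this state's letter listed?"""
    options = {o.strip() for o in directives.get("notification_options", "").split(",")}
    enabled = directives.get("notifications_enabled") == "1"
    return enabled and STATE_TO_OPTION[state] in options


def contact_groups(directives):
    """Contact groups in order, with duplicates dropped."""
    groups = [g.strip() for g in directives.get("contact_groups", "").split(",") if g.strip()]
    return list(dict.fromkeys(groups))


if __name__ == "__main__":
    d = parse_service_block(SAMPLE)
    print(contact_groups(d))
    for state in STATE_TO_OPTION:
        print(state, would_notify(d, state))
```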

labvirt1008 N/A HOST UP 2018-02-14 07:20:48 irc notify-host-by-irc PING OK - Packet loss = 0%, RTA = 0.20 ms
labvirt1008 N/A HOST DOWN 2018-02-14 07:16:48 irc notify-host-by-irc PING CRITICAL - Packet loss = 100%

February 14, 2018 07:00	

Service Ok[2018-02-14 07:22:39] SERVICE ALERT: labvirt1008;puppet last run;OK;HARD;1;OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
Service Ok[2018-02-14 07:21:28] SERVICE ALERT: labvirt1008;Disk space;OK;HARD;1;DISK OK
Service Ok[2018-02-14 07:21:28] SERVICE ALERT: labvirt1008;nova-compute process;OK;HARD;1;PROCS OK: 1 process with regex args '^/usr/bin/pytho[n] /usr/bin/nova-compute'
Service Ok[2018-02-14 07:21:08] SERVICE ALERT: labvirt1008;dhclient process;OK;HARD;1;PROCS OK: 0 processes with command name 'dhclient'
Service Ok[2018-02-14 07:20:58] SERVICE ALERT: labvirt1008;configured eth;OK;HARD;1;OK - interfaces up
Host Up[2018-02-14 07:20:48] HOST ALERT: labvirt1008;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.20 ms
Service Ok[2018-02-14 07:20:48] SERVICE ALERT: labvirt1008;kvm ssl cert;OK;HARD;1;Cert /etc/ssl/localcerts/labvirt-star.eqiad.wmnet.crt will not expire for at least 30 days.
Service Ok[2018-02-14 07:20:48] SERVICE ALERT: labvirt1008;SSH;OK;HARD;1;SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0)
Service Ok[2018-02-14 07:20:39] SERVICE ALERT: labvirt1008;DPKG;OK;HARD;1;All packages OK
Service Unknown[2018-02-14 07:17:49] SERVICE ALERT: labvirt1008;puppet last run;UNKNOWN;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Host Down[2018-02-14 07:16:48] HOST ALERT: labvirt1008;DOWN;HARD;2;PING CRITICAL - Packet loss = 100%
Service Unknown[2018-02-14 07:16:18] SERVICE ALERT: labvirt1008;dhclient process;UNKNOWN;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Service Unknown[2018-02-14 07:16:08] SERVICE ALERT: labvirt1008;configured eth;UNKNOWN;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Service Unknown[2018-02-14 07:15:49] SERVICE ALERT: labvirt1008;kvm ssl cert;UNKNOWN;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Service Critical[2018-02-14 07:15:49] SERVICE ALERT: labvirt1008;SSH;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds
Service Unknown[2018-02-14 07:15:38] SERVICE ALERT: labvirt1008;DPKG;UNKNOWN;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Service Unknown[2018-02-14 07:15:28] SERVICE ALERT: labvirt1008;Disk space;UNKNOWN;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Service Unknown[2018-02-14 07:15:28] SERVICE ALERT: labvirt1008;nova-compute process;UNKNOWN;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Host Down[2018-02-14 07:15:28] HOST ALERT: labvirt1008;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
Service Unknown[2018-02-14 07:15:18] SERVICE ALERT: labvirt1008;dhclient process;UNKNOWN;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
Service Unknown[2018-02-14 07:15:08] SERVICE ALERT: labvirt1008;configured eth;UNKNOWN;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
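The window Icinga saw can be pulled out of these alert lines. Below is a minimal sketch, assuming the `[timestamp] HOST ALERT: host;state;type;attempt;output` and `[timestamp] SERVICE ALERT: host;service;state;type;attempt;output` layouts shown above; `parse_alert` and `host_outage` are hypothetical names. Note that it measures the time between the HARD DOWN and the following UP as observed by monitoring, which is shorter than the real outage (the hardware log shows the shutdown starting at 07:11).

```python
import re
from datetime import datetime

# Lines copied from the event log above (newest first, as displayed).
SAMPLE = """\
Host Up[2018-02-14 07:20:48] HOST ALERT: labvirt1008;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.20 ms
Service Critical[2018-02-14 07:15:49] SERVICE ALERT: labvirt1008;SSH;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds
Host Down[2018-02-14 07:16:48] HOST ALERT: labvirt1008;DOWN;HARD;2;PING CRITICAL - Packet loss = 100%
Host Down[2018-02-14 07:15:28] HOST ALERT: labvirt1008;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
"""

ALERT_RE = re.compile(r"\[(?P<ts>[^\]]+)\] (?P<kind>HOST|SERVICE) ALERT: (?P<rest>.*)")


def parse_alert(line):
    """Parse one alert line into a dict, or return None if it does not match."""
    m = ALERT_RE.search(line)
    if m is None:
        return None
    fields = m.group("rest").split(";")
    if m.group("kind") == "HOST":
        host, state, state_type = fields[0], fields[1], fields[2]
        service = None
    else:
        host, service, state, state_type = fields[:4]
    return {
        "when": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S"),
        "host": host,
        "service": service,
        "state": state,
        "state_type": state_type,
    }


def host_outage(alerts):
    """Time between the first HARD host DOWN and the next HARD host UP, or None."""
    events = sorted(
        (a for a in alerts if a["service"] is None and a["state_type"] == "HARD"),
        key=lambda a: a["when"],
    )
    down = next((a for a in events if a["state"] == "DOWN"), None)
    if down is None:
        return None
    up = next((a for a in events if a["state"] == "UP" and a["when"] > down["when"]), None)
    return None if up is None else up["when"] - down["when"]


if __name__ == "__main__":
    parsed = [a for a in map(parse_alert, SAMPLE.splitlines()) if a is not None]
    print(host_outage(parsed))
```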

Screen Shot 2018-02-14 at 8.38.45 AM.png (345×952 px, 36 KB)

Screen Shot 2018-02-14 at 8.39.00 AM.png (562×925 px, 176 KB)

I've migrated two VMs off of this host:

integration-slave-jessie-1001.integration
integration-slave-jessie-1002.integration

Chris is currently applying thermal paste to 1008.

1008 is back up and I'm restarting all the hosted VMs.

https://phabricator.wikimedia.org/P6697

Prometheus seems to have temperature readings:

https://grafana-admin.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=labvirt1008&var-datasource=eqiad%20prometheus%2Fops

In T187292#3972144, @Andrew wrote:

I've migrated two VMs off of this host:

integration-slave-jessie-1001.integration
integration-slave-jessie-1002.integration

Thanks, I can confirm they are back up. We had two other similar instances (1003/1004), so that specific pool just got cut in half, which was barely noticeable during the European day. Still, thanks for migrating them!

You can get rid of the ci-jessie-wikimedia-XXXX.contintcloud instances entirely if they are still around. That is for Nodepool.

Change 410525 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] icinga: create a wmcs contact group for some aggressive alerting

https://gerrit.wikimedia.org/r/410525

Change 410525 merged by Rush:
[operations/puppet@production] icinga: create a wmcs contact group for some aggressive alerting

https://gerrit.wikimedia.org/r/410525

Change 410532 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: alert on down/unreachable nova-compute early

https://gerrit.wikimedia.org/r/410532

Change 410532 merged by Rush:
[operations/puppet@production] openstack: alert on down/unreachable nova-compute early

https://gerrit.wikimedia.org/r/410532

Change 410540 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: monitor kvm processes on labvirts

https://gerrit.wikimedia.org/r/410540

• chasemp closed subtask T187317: Evacuate relevant instances off of labvirt1008 as Invalid. Feb 14 2018, 7:58 PM

Change 410551 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: fix overlapping descriptions for nova-compute check

https://gerrit.wikimedia.org/r/410551

Change 410551 merged by Rush:
[operations/puppet@production] openstack: fix overlapping descriptions for nova-compute check

https://gerrit.wikimedia.org/r/410551

Change 410540 merged by Rush:
[operations/puppet@production] openstack: monitor kvm processes on labvirts

https://gerrit.wikimedia.org/r/410540

• Cmjohnson moved this task from Backlog to Blocked on the ops-eqiad board. Feb 16 2018, 3:49 PM

Change 413452 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: make nova compute kvm monitoring optional

https://gerrit.wikimedia.org/r/413452

Change 413452 merged by Rush:
[operations/puppet@production] openstack: make nova compute kvm monitoring optional

https://gerrit.wikimedia.org/r/413452

Change 413468 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: nova active kvm check to alert multiple groups

https://gerrit.wikimedia.org/r/413468

Change 413468 merged by Rush:
[operations/puppet@production] openstack: nova active kvm check to alert multiple groups

https://gerrit.wikimedia.org/r/413468

closing for now

	F13741113: Screen Shot 2018-02-14 at 8.38.45 AM.png
	Feb 14 2018, 2:40 PM

	F13741114: Screen Shot 2018-02-14 at 8.39.00 AM.png
	Feb 14 2018, 2:40 PM
