
labvirt1008 rebooted / system was overheated
Closed, Resolved, Public

Description

This morning at 07:20 UTC, labvirt1008 rebooted. The hardware log shows that the system overheated:

number=8
severity=Critical
date=02/14/2018
time=07:10
description=Critical Temperature Threshold Exceeded (Temperature Sensor 21, Location System, Temperature 127C)

number=09
severity=Caution
date=02/14/2018
time=07:10
description=System Overheating (Temperature Sensor 21, Location System, Temperature 127C)

number=10
severity=Critical
date=02/14/2018
time=07:11
description=Automatic Operating System Shutdown Initiated Due to Overheat Condition

number=11
severity=Caution
date=02/14/2018
time=07:20
description=POST Error: 1792-Slot X Drive Array - Valid Data Found in Cache Module. Data will automatically be written to drive array.

Not sure about the best mitigation. Maybe a fan died, or it needs new thermal paste?
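For anyone who wants to pull the readings out of a hardware log like the one above, here is a minimal Python sketch. It assumes the log is available as plain text in the blank-line-separated `key=value` layout shown in this task; the names `HW_LOG`, `parse_records`, and `max_temperature_c` are made up for illustration, and only two of the four records are reproduced.

```python
import re
from datetime import datetime

# Two records copied from the task description (assumed plain-text export).
HW_LOG = """\
number=8
severity=Critical
date=02/14/2018
time=07:10
description=Critical Temperature Threshold Exceeded (Temperature Sensor 21, Location System, Temperature 127C)

number=10
severity=Critical
date=02/14/2018
time=07:11
description=Automatic Operating System Shutdown Initiated Due to Overheat Condition
"""


def parse_records(text):
    """Split blank-line-separated key=value records into dicts."""
    records = []
    for chunk in re.split(r"\n\s*\n", text.strip()):
        record = {}
        for line in chunk.splitlines():
            key, _, value = line.partition("=")
            record[key.strip()] = value.strip()
        # Combine the separate date and time fields into one timestamp.
        record["when"] = datetime.strptime(
            record["date"] + " " + record["time"], "%m/%d/%Y %H:%M"
        )
        records.append(record)
    return records


def max_temperature_c(records):
    """Return the highest 'Temperature <n>C' reading, or None if absent."""
    temps = [
        int(match.group(1))
        for record in records
        for match in re.finditer(r"Temperature (\d+)C", record["description"])
    ]
    return max(temps) if temps else None


records = parse_records(HW_LOG)
print(max_temperature_c(records))
print(records[-1]["when"] - records[0]["when"])
```

The second print gives the gap between the threshold event and the automatic shutdown as logged, which is only as precise as the log's one-minute timestamps.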


https://lists.wikimedia.org/pipermail/cloud-announce/2018-February/000023.html
https://wikitech.wikimedia.org/wiki/Incident_documentation/20180214-labvirt1008-failure

Event Timeline

Thanks @moritz

Luckily the Toolforge instances here are a mix we could afford to have down. @Andrew let's sync up on this?

nova list --host labvirt1008 --all-tenants | awk '{print $4,$6}' | grep -v 'Name Tenant' | tr " " .

accounts-appserver4.account-creation-assistance
accounts-mwoauth.account-creation-assistance
bastion-02.bastion
bastion-restricted-02.bastion
bf-wmpageview.butterfly
chat-bots.mobile
ci-jessie-wikimedia-965167.contintcloud
ci-jessie-wikimedia-965171.contintcloud
ci-jessie-wikimedia-965176.contintcloud
ci-jessie-wikimedia-965182.contintcloud
ci-jessie-wikimedia-965183.contintcloud
ci-jessie-wikimedia-965184.contintcloud
ci-jessie-wikimedia-965185.contintcloud
client.nonfreewiki
commonsarchive-production.commonsarchive
cxserver2.language
dashboardchat.globaleducation
deployment-changeprop.deployment-prep
deployment-elastic05.deployment-prep
deployment-ircd.deployment-prep
deployment-mathoid.deployment-prep
deployment-sca02.deployment-prep
drmf2016.math
huggle-pg.huggle
incubator-web.incubator
integration-slave-jessie-1001.integration
integration-slave-jessie-1002.integration
k8s-bastion.chasetestproject
language-mleb-master.language
ldfclient.wikidata-query
math-ru.math
mwaas-k8-node-02.scrumbugz
mwoffliner1.mwoffliner
mwv-apt-01.mwv-apt
newsletter-test.newsletter
ores-lb-02.ores
ores-worker-04.ores
overpass-wiki.maps
puppetmaster-keith.puppet
reflex2.design
rel.search
stack.reading-web-staging
tools-docker-builder-05.tools
tools-exec-1413.tools
tools-exec-1442.tools
tools-webgrid-lighttpd-1427.tools
tools-webgrid-lighttpd-1428.tools
torproxy.security-tools
udpmx-01.ircd
video-redis.video
wikidataconcepts.wikidataconcepts
wikiedu-dashboard-staging.globaleducation
wikilabels-experiment.wikilabels
wikilabels-staging-01.wikilabels
wikimetrics-staging.wikimetrics
wikimetrics-test.wikimetrics
wmde-wikidiff2-patched.wikidiff2-wmde-dev
zk1-1.analytics
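To see at a glance which projects are affected, the `name.project` list above can be grouped by project. This is a rough Python sketch, not something that was run for this task; it assumes project names contain no dots (true for every entry listed here), and `count_by_project` is a hypothetical helper. Only a few entries are reproduced.

```python
from collections import Counter

# A handful of entries copied from the list above.
entries = """\
ci-jessie-wikimedia-965167.contintcloud
ci-jessie-wikimedia-965171.contintcloud
tools-exec-1413.tools
tools-exec-1442.tools
tools-webgrid-lighttpd-1427.tools
bastion-02.bastion
""".splitlines()


def count_by_project(lines):
    """Count instances per project, splitting on the last dot."""
    counts = Counter()
    for line in lines:
        name, sep, project = line.strip().rpartition(".")
        if sep:  # skip lines that contain no dot at all
            counts[project] += 1
    return counts


counts = count_by_project(entries)
for project, n in counts.most_common():
    print(project, n)
```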

define service {
# --PUPPET_NAME-- labvirt1008 disk_space
	active_checks_enabled          1
	check_command                  nrpe_check!check_disk_space!10
	check_freshness                0
	check_interval                 1
	check_period                   24x7
	contact_groups                 admins,sms,admins
	host_name                      labvirt1008
	is_volatile                    0
	max_check_attempts             3
	notification_interval          240
	notification_options           c,r,f
	notification_period            24x7
	notifications_enabled          1
	passive_checks_enabled         1
	retry_interval                 1
	service_description            Disk space
	servicegroups                  labvirt_eqiad

}

To my knowledge, the sms group was never notified.
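A quick way to sanity-check a service definition like the one above is to parse it and look at who is listed and which states trigger notifications. The sketch below is an assumption-laden illustration, not a tool used on this host: `parse_service`, `unique_contact_groups`, and `notifies_on` are hypothetical helpers, and only a subset of the directives is reproduced.

```python
# Subset of the service definition pasted above.
SERVICE_DEF = """\
define service {
# --PUPPET_NAME-- labvirt1008 disk_space
    check_command                  nrpe_check!check_disk_space!10
    contact_groups                 admins,sms,admins
    host_name                      labvirt1008
    notification_options           c,r,f
    service_description            Disk space
}
"""


def parse_service(text):
    """Parse 'directive value' lines of one define block into a dict."""
    directives = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("define ") or line == "}":
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        directives[parts[0]] = parts[1] if len(parts) > 1 else ""
    return directives


def unique_contact_groups(directives):
    """Return contact groups in order, with duplicates dropped."""
    seen = []
    for group in directives.get("contact_groups", "").split(","):
        group = group.strip()
        if group and group not in seen:
            seen.append(group)
    return seen


def notifies_on(directives, letter):
    """True if the state letter (w, u, c, r, f) is in notification_options."""
    options = {o.strip() for o in directives.get("notification_options", "").split(",")}
    return letter in options


service = parse_service(SERVICE_DEF)
print(unique_contact_groups(service))
print(notifies_on(service, "u"))
```

The parse shows that `admins` is listed twice and that `notification_options` is `c,r,f`. In standard Nagios/Icinga semantics `u` is the UNKNOWN state, so it is not in that list. Whether that is related to the sms group not being notified is not established in this task.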

labvirt1008 N/A HOST UP 2018-02-14 07:20:48 irc notify-host-by-irc PING OK - Packet loss = 0%, RTA = 0.20 ms
labvirt1008 N/A HOST DOWN 2018-02-14 07:16:48 irc notify-host-by-irc PING CRITICAL - Packet loss = 100%

February 14, 2018 07:00

Service Ok [2018-02-14 07:22:39] SERVICE ALERT: labvirt1008;puppet last run;OK;HARD;1;OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
Service Ok [2018-02-14 07:21:28] SERVICE ALERT: labvirt1008;Disk space;OK;HARD;1;DISK OK
Service Ok [2018-02-14 07:21:28] SERVICE ALERT: labvirt1008;nova-compute process;OK;HARD;1;PROCS OK: 1 process with regex args '^/usr/bin/pytho[n] /usr/bin/nova-compute'
Service Ok [2018-02-14 07:21:08] SERVICE ALERT: labvirt1008;dhclient process;OK;HARD;1;PROCS OK: 0 processes with command name 'dhclient'
Service Ok [2018-02-14 07:20:58] SERVICE ALERT: labvirt1008;configured eth;OK;HARD;1;OK - interfaces up
Host Up [2018-02-14 07:20:48] HOST ALERT: labvirt1008;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.20 ms
Service Ok [2018-02-14 07:20:48] SERVICE ALERT: labvirt1008;kvm ssl cert;OK;HARD;1;Cert /etc/ssl/localcerts/labvirt-star.eqiad.wmnet.crt will not expire for at least 30 days.
Service Ok [2018-02-14 07:20:48] SERVICE ALERT: labvirt1008;SSH;OK;HARD;1;SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0)
Service Ok [2018-02-14 07:20:39] SERVICE ALERT: labvirt1008;DPKG;OK;HARD;1;All packages OK
Service Unknown [2018-02-14 07:17:49] SERVICE ALERT: labvirt1008;puppet last run;UNKNOWN;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Host Down [2018-02-14 07:16:48] HOST ALERT: labvirt1008;DOWN;HARD;2;PING CRITICAL - Packet loss = 100%
Service Unknown [2018-02-14 07:16:18] SERVICE ALERT: labvirt1008;dhclient process;UNKNOWN;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Service Unknown [2018-02-14 07:16:08] SERVICE ALERT: labvirt1008;configured eth;UNKNOWN;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Service Unknown [2018-02-14 07:15:49] SERVICE ALERT: labvirt1008;kvm ssl cert;UNKNOWN;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Service Critical [2018-02-14 07:15:49] SERVICE ALERT: labvirt1008;SSH;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds
Service Unknown [2018-02-14 07:15:38] SERVICE ALERT: labvirt1008;DPKG;UNKNOWN;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Service Unknown [2018-02-14 07:15:28] SERVICE ALERT: labvirt1008;Disk space;UNKNOWN;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Service Unknown [2018-02-14 07:15:28] SERVICE ALERT: labvirt1008;nova-compute process;UNKNOWN;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Host Down [2018-02-14 07:15:28] HOST ALERT: labvirt1008;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
Service Unknown [2018-02-14 07:15:18] SERVICE ALERT: labvirt1008;dhclient process;UNKNOWN;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
Service Unknown [2018-02-14 07:15:08] SERVICE ALERT: labvirt1008;configured eth;UNKNOWN;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
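The alert history above can be turned into an outage window as Icinga saw it. The following Python sketch assumes the `[timestamp] HOST ALERT: host;state;type;attempt;output` and `[timestamp] SERVICE ALERT: host;service;state;type;attempt;output` layouts visible in the paste; `parse_alerts` and `host_outage` are hypothetical helper names, and only four of the lines are reproduced.

```python
import re
from datetime import datetime

# Four lines copied from the history above. The regex ignores anything
# before the opening bracket, so the status label prefix does not matter.
HISTORY = """\
Host Up [2018-02-14 07:20:48] HOST ALERT: labvirt1008;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.20 ms
Host Down [2018-02-14 07:16:48] HOST ALERT: labvirt1008;DOWN;HARD;2;PING CRITICAL - Packet loss = 100%
Service Critical [2018-02-14 07:15:49] SERVICE ALERT: labvirt1008;SSH;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds
Host Down [2018-02-14 07:15:28] HOST ALERT: labvirt1008;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
"""

ALERT_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] "
    r"(?P<kind>HOST|SERVICE) ALERT: (?P<rest>.*)"
)


def parse_alerts(text):
    """Parse HOST/SERVICE ALERT lines into dicts; skip anything else."""
    alerts = []
    for line in text.splitlines():
        match = ALERT_RE.search(line)
        if not match:
            continue
        kind = match.group("kind")
        if kind == "HOST":
            host, state, state_type, attempt, output = match.group("rest").split(";", 4)
            service = None
        else:
            host, service, state, state_type, attempt, output = match.group("rest").split(";", 5)
        alerts.append({
            "when": datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S"),
            "kind": kind,
            "host": host,
            "service": service,
            "state": state,
            "state_type": state_type,
            "attempt": int(attempt),
            "output": output,
        })
    return alerts


def host_outage(alerts):
    """Time from the first host DOWN alert to the next host UP alert."""
    host_events = sorted(
        (a for a in alerts if a["kind"] == "HOST"), key=lambda a: a["when"]
    )
    down = next((a for a in host_events if a["state"] == "DOWN"), None)
    if down is None:
        return None
    up = next(
        (a for a in host_events if a["state"] == "UP" and a["when"] > down["when"]),
        None,
    )
    return None if up is None else up["when"] - down["when"]


alerts = parse_alerts(HISTORY)
print(host_outage(alerts))
```

This measures from the first failed ping (SOFT state) to the first successful one, so it reflects monitoring's view rather than the 07:11 shutdown recorded in the hardware log.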

I've migrated two VMs off of this host:

integration-slave-jessie-1001.integration
integration-slave-jessie-1002.integration

Chris is currently applying thermal paste to 1008.

1008 is back up and I'm restarting all the hosted VMs.

Thanks, I can confirm they are back up. We had two other similar instances (1003/1004), so that specific pool was only cut in half, which was barely noticeable during the European day. Still, thanks for migrating them!

You can get rid of the ci-jessie-wikimedia-XXXX.contintcloud instances entirely if they are still around. Those are for Nodepool.

Change 410525 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] icinga: create a wmcs contact group for some aggressive alerting

https://gerrit.wikimedia.org/r/410525

Change 410525 merged by Rush:
[operations/puppet@production] icinga: create a wmcs contact group for some aggressive alerting

https://gerrit.wikimedia.org/r/410525

Change 410532 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: alert on down/unreachable nova-compute early

https://gerrit.wikimedia.org/r/410532

Change 410532 merged by Rush:
[operations/puppet@production] openstack: alert on down/unreachable nova-compute early

https://gerrit.wikimedia.org/r/410532

Change 410540 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: monitor kvm processes on labvirts

https://gerrit.wikimedia.org/r/410540

Change 410551 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: fix overlapping descriptions for nova-compute check

https://gerrit.wikimedia.org/r/410551

Change 410551 merged by Rush:
[operations/puppet@production] openstack: fix overlapping descriptions for nova-compute check

https://gerrit.wikimedia.org/r/410551

Change 410540 merged by Rush:
[operations/puppet@production] openstack: monitor kvm processes on labvirts

https://gerrit.wikimedia.org/r/410540

Change 413452 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: make nova compute kvm monitoring optional

https://gerrit.wikimedia.org/r/413452

Change 413452 merged by Rush:
[operations/puppet@production] openstack: make nova compute kvm monitoring optional

https://gerrit.wikimedia.org/r/413452

Change 413468 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] openstack: nova active kvm check to alert multiple groups

https://gerrit.wikimedia.org/r/413468

Change 413468 merged by Rush:
[operations/puppet@production] openstack: nova active kvm check to alert multiple groups

https://gerrit.wikimedia.org/r/413468