
Monitor hardware thermal issues
Closed, Resolved (Public)

Description

Sometimes machines overheat due to hardware issues, rack airflow issues, etc. The kernel actually warns us with kern.log spam, and we can often poll the data from /sys/ as well (depending on hardware/generation). We should be monitoring and alerting on this stuff somehow, and resolving these issues as they come up instead of waiting for high temps to induce failures and/or performance throttling.

Quick and dirty commands I've used to audit cache boxes at times:

Count of recent Package Temp alerts from the running kernel (most machines will report zero - machines with thermal issues will often show tens of thousands):

grep -c "Package temp" /var/log/kern.log

Actual temperature (not available on some older machines; also, the proper limit varies by hardware/generation. For example, some classes of hardware seem to start generating the messages above when they cross 80C, others 85C):

cat /sys/class/thermal/thermal_zone0/temp
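As a rough illustration of how the /sys value above could be turned into a check, here is a minimal Python sketch. It assumes the file reports millidegrees Celsius (consistent with the values quoted in this task) and uses a hypothetical 85C threshold; the function names are made up for this example, and the right limit would have to be chosen per hardware class.

```python
from pathlib import Path
from typing import Optional

# Assumed alert threshold; the right value varies by hardware/generation.
THRESHOLD_C = 85.0


def millidegrees_to_celsius(raw: str) -> float:
    """Convert a sysfs reading such as '93000\n' to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_zone_celsius(zone: Path) -> Optional[float]:
    """Return the zone temperature, or None if the file is missing or unreadable."""
    try:
        return millidegrees_to_celsius((zone / "temp").read_text())
    except (OSError, ValueError):
        return None


def hot_zones(root: Path = Path("/sys/class/thermal"),
              threshold_c: float = THRESHOLD_C) -> dict[str, float]:
    """Map zone name to temperature for every zone at or above the threshold."""
    readings = {}
    for zone in sorted(root.glob("thermal_zone*")):
        temp = read_zone_celsius(zone)
        if temp is not None and temp >= threshold_c:
            readings[zone.name] = temp
    return readings


if __name__ == "__main__":
    for name, temp in hot_zones().items():
        print(f"{name}: {temp:.1f}C")
```

Hosts without any thermal_zone directory simply produce an empty result here, so a real check would probably want to distinguish "no data" from "all fine".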

There are a handful of cache machines (at esams, ulsfo, and eqiad) that look bad on temp data right now from a random audit, but rather than filing bugs for these individual issues again, I think we should really look at getting monitoring for the fleet (which will probably turn up a lot of cases...).

For reference, see past ticket for eqiad caches here: T103226 .

Details

Related Gerrit Patches:
operations/puppet : production - check_ipmi_temp: load ipmi_devintf
operations/puppet : production - check_ipmi_temp: turn off sel checking
operations/puppet : production - check_ipmi_temp: set check timeout to 60 seconds
operations/puppet : production - Re-enable temperature monitoring via NRPE
operations/puppet : production - base: blacklist acpi_power_meter
operations/puppet : production - Disable temperature monitoring via NRPE
operations/puppet : production - check_ipmi_temp: bump check/retry intervals and timeout
operations/puppet : production - monitoring/base: add temperature monitoring via NRPE
operations/puppet : production - monitoring/base: add nagios sudo privs for IPMI sensors
operations/puppet : production - prometheus: add hwmon collector to default set
operations/puppet : production - monitoring: add check_ipmi_sensor plugin
operations/puppet : production - base: install 'freeipmi', 'libipc-run-perl' on jessie

Event Timeline

BBlack created this task. Jan 29 2016, 12:21 PM
BBlack raised the priority of this task from to Needs Triage.
BBlack updated the task description.
BBlack added a subscriber: BBlack.
Restricted Application added subscribers: StudiesWorld, Aklapper. Jan 29 2016, 12:21 PM

For the record, salt on all hosts (which says it hit 1210 machines) gives this list for machines with 4-digit+ kern.log alerts presently:

{'analytics1032.eqiad.wmnet': '2362'}
{'analytics1038.eqiad.wmnet': '1750'}
{'analytics1039.eqiad.wmnet': '15088'}
{'cp1053.eqiad.wmnet': '30460'}
{'cp3031.esams.wmnet': '63282'}
{'cp3032.esams.wmnet': '5927'}
{'cp3049.esams.wmnet': '11476'}
{'cp4008.ulsfo.wmnet': '55630'}
{'cp4011.ulsfo.wmnet': '6926'}
{'cp4012.ulsfo.wmnet': '5396'}
{'labvirt1003.eqiad.wmnet': '1472'}
{'lvs4001.ulsfo.wmnet': '32577'}
{'lvs4002.ulsfo.wmnet': '26227'}

And then 497/1210 don't even have the file /sys/class/thermal/thermal_zone0/temp, so maybe on different hardware and/or kernels we need to poll it a different way. Of those that seem to have legit data there in the units I've seen before, these are the ones presently showing 85C+:

{'cp1053.eqiad.wmnet': '93000'}
{'cp1071.eqiad.wmnet': '86000'}
{'cp3030.esams.wmnet': '87000'}
{'cp3031.esams.wmnet': '92000'}
{'cp3033.esams.wmnet': '85000'}
{'cp3045.esams.wmnet': '86000'}
{'cp3046.esams.wmnet': '85000'}
{'cp3049.esams.wmnet': '88000'}
{'cp4008.ulsfo.wmnet': '93000'}
{'cp4010.ulsfo.wmnet': '87000'}
{'db1070.eqiad.wmnet': '85000'}
{'lvs4001.ulsfo.wmnet': '102000'}
{'lvs4002.ulsfo.wmnet': '101000'}
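
For anyone repeating this audit, a small Python sketch of the filtering step follows. It assumes the salt output has been saved as one single-entry dict per line, exactly as pasted above; the function names are hypothetical, and non-numeric values (such as error strings from hosts lacking the file) are skipped.

```python
import ast


def parse_salt_lines(text: str) -> dict[str, int]:
    """Parse lines like {'host': '93000'} into a host -> integer mapping."""
    results = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = ast.literal_eval(line)  # each line is a Python-style dict literal
        for host, value in entry.items():
            try:
                results[host] = int(value)
            except (TypeError, ValueError):
                continue  # skip hosts that returned something non-numeric
    return results


def hosts_at_or_above(results: dict[str, int], threshold: int) -> list[str]:
    """Return the sorted hosts whose value is at or above the threshold."""
    return sorted(host for host, value in results.items() if value >= threshold)


if __name__ == "__main__":
    sample = """
    {'cp1053.eqiad.wmnet': '93000'}
    {'db1070.eqiad.wmnet': '85000'}
    {'lvs4001.ulsfo.wmnet': '102000'}
    """
    # Threshold in millidegrees Celsius, matching the units of the sysfs file.
    print(hosts_at_or_above(parse_salt_lines(sample), 90000))
```

The same two functions work for the kern.log counts by passing 1000 as the threshold.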
Andrew triaged this task as High priority. Apr 14 2016, 9:00 PM

From a mail I sent to the ops list, slightly extended:

Icinga has a contrib plugin for IPMI checks:

https://www.thomas-krenn.com/en/wiki/IPMI_Sensor_Monitoring_Plugin

(v3 is Perl, v2 is bash)

That would be a single NRPE check per host covering all alarms, which sounds easy to deploy provided IPMI is available on all distros/servers.

There is also some Diamond collector based on IPMI:
https://github.com/BrightcoveOS/Diamond/wiki/collectors-IPMISensorCollector

But for the monitoring part we would have to add a bunch of check_graphite calls, which is a bit cumbersome to set up.

Both seem to recognize sensor thresholds being reached including intrusion.

Linux SNMP has some support for reporting metrics and reached thresholds by relying on lm_sensors, but we don't have SNMP on our Linux hosts afaik (https://www.opennms.org/wiki/Lm_Sensors_Monitoring_How-To).

Network gear should implement the appropriate sensor MIB already. JunOS definitely does.

Just to stab randomly at things: I installed the packages freeipmi and libipc-run-perl on cp1008 (test/prod cache host) and downloaded the check_ipmi_sensor script directly from https://github.com/thomas-krenn/check_ipmi_sensor_v3/blob/master/check_ipmi_sensor.

A basic test run there basically worked, although I have no idea about the legitimacy of the results, or whether we need to map out special settings (or sensor include/exclude flags?) for different hardware, etc:

root@cp1008:~# ./check_ipmi_sensor -H localhost

IPMI Status: Critical [Presence = Critical, Power Supply 2 Status = Critical, Power Supply 2 Status = Warning, System Board Intrusion = Critical, System Board Intrusion = Critical] | 'Ambient Temp'=18.00;8.00:42.00;3.00:47.00 'FAN MOD 1A RPM'=4440.00;;1920.00: 'FAN MOD 2A RPM'=4440.00;;1920.00: 'FAN MOD 3A RPM'=4440.00;;1920.00: 'FAN MOD 4A RPM'=4440.00;;1920.00: 'FAN MOD 5A RPM'=4440.00;;1920.00: 'FAN MOD 1B RPM'=3120.00;;1920.00: 'FAN MOD 2B RPM'=3120.00;;1920.00: 'FAN MOD 3B RPM'=3120.00;;1920.00: 'FAN MOD 4B RPM'=3120.00;;1920.00: 'FAN MOD 5B RPM'=3000.00;;1920.00: 'Current'=0.28 'Current'=0.08 'Voltage'=210.00 'Voltage'=210.00 'System Level'=98.00;~:917.00;~:966.00
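
To make the part after the `|` easier to read: it follows the usual Nagios performance-data layout of `'label'=value;warn;crit`, where warn and crit are ranges. Below is a minimal Python sketch that parses it and evaluates a range. The range rules follow my reading of the Nagios plugin guidelines (`10:` alerts below 10, `~:10` alerts above 10, `a:b` alerts outside a..b, leading `@` inverts); the names `Perf`, `parse_perfdata` and `alerts` are invented for this example, and units of measurement after the value are not handled.

```python
import re
from typing import NamedTuple


class Perf(NamedTuple):
    label: str
    value: float
    warn: str
    crit: str


# 'label'=value followed by zero or more ;-separated threshold fields.
_PERF_RE = re.compile(r"'([^']+)'=(-?[\d.]+)((?:;[^;\s]*)*)")


def parse_perfdata(plugin_output: str) -> list[Perf]:
    """Extract the quoted perfdata items that follow the '|' separator."""
    _, _, perf = plugin_output.partition("|")
    items = []
    for label, value, rest in _PERF_RE.findall(perf):
        fields = rest.split(";")[1:] + ["", ""]  # pad missing warn/crit
        items.append(Perf(label, float(value), fields[0], fields[1]))
    return items


def alerts(value: float, spec: str) -> bool:
    """Return True if value should alert for the given range spec."""
    if not spec:
        return False
    inverted = spec.startswith("@")
    if inverted:
        spec = spec[1:]
    if ":" in spec:
        low_s, high_s = spec.split(":", 1)
    else:
        low_s, high_s = "0", spec
    low = float("-inf") if low_s == "~" else float(low_s or 0)
    high = float(high_s) if high_s else float("inf")
    outside = value < low or value > high
    return not outside if inverted else outside


if __name__ == "__main__":
    line = ("IPMI Status: OK | 'Ambient Temp'=18.00;8.00:42.00;3.00:47.00 "
            "'FAN MOD 5B RPM'=3000.00;;1920.00: 'Current'=0.28")
    for item in parse_perfdata(line):
        print(item.label, item.value, alerts(item.value, item.crit))
```

Read this way, the `Ambient Temp` item above warns outside 8..42 and goes critical outside 3..47, while the fan items only go critical below 1920 RPM.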

A more-verbose run:

root@cp1008:~# ./check_ipmi_sensor -H localhost -vv
IPMI Status: Critical [Presence = Critical ('Entity Absent'), Power Supply 2 Status = Critical (Power Supply), Power Supply 2 Status = Warning (Power Supply), System Board Intrusion = Critical (Physical Security), System Board Intrusion = Critical (Physical Security)] | 'Ambient Temp'=18.00;8.00:42.00;3.00:47.00 'FAN MOD 1A RPM'=4440.00;;1920.00: 'FAN MOD 2A RPM'=4440.00;;1920.00: 'FAN MOD 3A RPM'=4440.00;;1920.00: 'FAN MOD 4A RPM'=4440.00;;1920.00: 'FAN MOD 5A RPM'=4440.00;;1920.00: 'FAN MOD 1B RPM'=3120.00;;1920.00: 'FAN MOD 2B RPM'=3120.00;;1920.00: 'FAN MOD 3B RPM'=3120.00;;1920.00: 'FAN MOD 4B RPM'=3120.00;;1920.00: 'FAN MOD 5B RPM'=3000.00;;1920.00: 'Current'=0.28 'Current'=0.08 'Voltage'=210.00 'Voltage'=212.00 'System Level'=98.00;~:917.00;~:966.00
Ambient Temp = 18.00 (Status: Nominal)
CMOS Battery = 'OK' (Status: Nominal)
VCORE PG = 'State Deasserted' (Status: Nominal)
1.5V PG = 'State Deasserted' (Status: Nominal)
1.8V PG = 'State Deasserted' (Status: Nominal)
3.3V PG = 'State Deasserted' (Status: Nominal)
5V PG = 'State Deasserted' (Status: Nominal)
HEATSINK PRES = 'Entity Present' (Status: Nominal)
iDRAC6 Ent PRES = 'Entity Present' (Status: Nominal)
USB CABLE PRES = 'Entity Present' (Status: Nominal)
STOR ADAPT PRES = 'Entity Present' (Status: Nominal)
RISER2 PRES = 'Entity Present' (Status: Nominal)
RISER1 PRES = 'Entity Present' (Status: Nominal)
0.75 VTT PG = 'State Deasserted' (Status: Nominal)
MEM PG = 'State Deasserted' (Status: Nominal)
0.9V PG = 'State Deasserted' (Status: Nominal)
VTT PG = 'State Deasserted' (Status: Nominal)
1.8 PLL PG = 'State Deasserted' (Status: Nominal)
8.0V PG = 'State Deasserted' (Status: Nominal)
1.1V PG = 'State Deasserted' (Status: Nominal)
1.0V LOM PG = 'State Deasserted' (Status: Nominal)
1.0V AUX PG = 'State Deasserted' (Status: Nominal)
1.05V PG = 'State Deasserted' (Status: Nominal)
FAN MOD 1A RPM = 4440.00 (Status: Nominal)
FAN MOD 2A RPM = 4440.00 (Status: Nominal)
FAN MOD 3A RPM = 4440.00 (Status: Nominal)
FAN MOD 4A RPM = 4440.00 (Status: Nominal)
FAN MOD 5A RPM = 4440.00 (Status: Nominal)
FAN MOD 1B RPM = 3120.00 (Status: Nominal)
FAN MOD 2B RPM = 3120.00 (Status: Nominal)
FAN MOD 3B RPM = 3120.00 (Status: Nominal)
FAN MOD 4B RPM = 3120.00 (Status: Nominal)
FAN MOD 5B RPM = 3000.00 (Status: Nominal)
Presence = 'Entity Present' (Status: Nominal)
Presence = 'Entity Absent' (Status: Critical)
Presence = 'Entity Present' (Status: Nominal)
Presence = 'Entity Present' (Status: Nominal)
Presence = 'Entity Present' (Status: Nominal)
Status = 'Processor Presence detected' (Status: Nominal)
Status = 'OK' (Status: Nominal)
Status = 'Presence detected' (Status: Nominal)
Status = 'Presence detected' (Status: Nominal)
Riser Config = 'Cable/Interconnect is connected' (Status: Nominal)
OS Watchdog = 'OK' (Status: Nominal)
Intrusion = 'OK' (Status: Nominal)
PS Redundancy = 'Fully Redundant' (Status: Nominal)
Fan Redundancy = 'Fully Redundant' (Status: Nominal)
Drive = 'Drive Presence' (Status: Nominal)
Cable SAS A = 'Cable/Interconnect is connected' (Status: Nominal)
Cable SAS B = 'Cable/Interconnect is connected' (Status: Nominal)
Current = 0.28 (Status: Nominal)
Current = 0.08 (Status: Nominal)
Voltage = 210.00 (Status: Nominal)
Voltage = 212.00 (Status: Nominal)
System Level = 98.00 (Status: Nominal)
Power Optimized = 'Good' (Status: Nominal)
vFlash = 'OK' (Status: Nominal)

Checked with @Cmjohnson and the failures reported above about the power supply are real. So +1 for the ipmi checker :)

14:38 < cmjohnson1> bblack: it's correct...the sys event log in racadm confirms
14:38 < cmjohnson1> Description: PS 2 Status: Power Supply sensor for PS 2, input lost was deasserte

Impressive. The nice thing is that it is a single check to add on each server, which limits the load on the Icinga server and the target hosts.

If we still have servers that are known to overheat regularly, it might be good to attempt to deploy the plugin on them as a proof of concept.

Dzahn added a subscriber: Dzahn. Sep 13 2016, 6:28 PM

The "freeipmi" package exists in jessie (and xenial) but not in precise or trusty. Suggesting to add it in base on all jessie machines though to start with this.

Change 310369 had a related patch set uploaded (by Dzahn):
base: install 'freeipmi', 'libipc-run-perl' on jessie

https://gerrit.wikimedia.org/r/310369

Change 310379 had a related patch set uploaded (by Dzahn):
monitoring: add check_ipmi_sensor plugin

https://gerrit.wikimedia.org/r/310379

Change 310383 had a related patch set uploaded (by Dzahn):
monitoring/base: add NRPE command to check temperature

https://gerrit.wikimedia.org/r/310383

Change 310369 merged by Dzahn:
base: install 'freeipmi', 'libipc-run-perl' on jessie

https://gerrit.wikimedia.org/r/310369

Change 310379 merged by Dzahn:
monitoring: add check_ipmi_sensor plugin

https://gerrit.wikimedia.org/r/310379

Dzahn added a comment. Sep 14 2016, 5:33 PM

The needed packages and the plugin script should now get installed on all jessie hosts. Waiting for that, then we can add the actual NRPE command or discuss which exact options we want. So far i would just set the sensor type to "Temperature".

elukey added a subscriber: elukey. Oct 19 2016, 12:52 PM
ema added a subscriber: ema. Feb 10 2017, 3:01 PM

@Dzahn, what's the status of this?

Dzahn added a comment. Feb 21 2017, 6:04 PM

check_ipmi_sensor has been installed across the fleet but doesn't work.

running it with options for temperature makes it exit with "CRIT" for _non_-temperature things

root@lead:~# /usr/local/lib/nagios/plugins/check_ipmi_sensor -T temperature
Sensor Type(s) temperature Status: Critical [Power Supply 1 Status = Critical, System Board PS Redundancy = Critical, Power Supply 1 Status = Warning, System Board Intrusion = Critical, System Board Intrusion = Critical] | 'Inlet Temp'=23.00;3.00:42.00;-7.00:47.00 'Temp'=51.00

https://gerrit.wikimedia.org/r/#/c/310383/

I gave it a try on contint1001.wikimedia.org and it alarmed out due to some entity not being present. That can be seen in very verbose output (-vv):

$ /usr/local/lib/nagios/plugins/check_ipmi_sensor -vv|grep -v 'Status: Nominal' 2>/dev/null
Presence = 'Entity Absent' (Status: Critical)
Presence = 'Entity Absent' (Status: Critical)

I know nothing about IPMI sensors, but the entity behind that Presence sensor is apparently absent. Luckily the monitoring script has a switch to ignore such sensors: --noentityabsent. And as a result:

$ /usr/local/lib/nagios/plugins/check_ipmi_sensor --noentityabsent
IPMI Status: OK
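
The `grep -v 'Status: Nominal'` filtering above can also be done structurally. A minimal Python sketch, assuming the `Name = reading (Status: State)` line format shown in the -vv output; the `ignore_entity_absent` flag only imitates the visible effect of --noentityabsent on this listing and is not how the plugin implements it.

```python
import re
from typing import NamedTuple


class SensorLine(NamedTuple):
    name: str
    reading: str
    status: str


# Matches lines such as: Presence = 'Entity Absent' (Status: Critical)
_LINE_RE = re.compile(
    r"(?P<name>.+?) = (?P<reading>.+) \(Status: (?P<status>[^)]+)\)"
)


def parse_verbose(output: str) -> list[SensorLine]:
    """Collect per-sensor lines, ignoring the summary line and anything else."""
    sensors = []
    for line in output.splitlines():
        match = _LINE_RE.fullmatch(line.strip())
        if match:
            reading = match["reading"].strip("'")
            sensors.append(SensorLine(match["name"], reading, match["status"]))
    return sensors


def non_nominal(sensors: list[SensorLine],
                ignore_entity_absent: bool = False) -> list[SensorLine]:
    """Return sensors whose status is not Nominal."""
    result = []
    for sensor in sensors:
        if sensor.status == "Nominal":
            continue
        if ignore_entity_absent and sensor.reading == "Entity Absent":
            continue
        result.append(sensor)
    return result


if __name__ == "__main__":
    sample = "\n".join([
        "Ambient Temp = 18.00 (Status: Nominal)",
        "Presence = 'Entity Absent' (Status: Critical)",
        "Presence = 'Entity Present' (Status: Nominal)",
    ])
    for sensor in non_nominal(parse_verbose(sample)):
        print(sensor)
```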
Volans added a subscriber: Volans. Mar 31 2017, 4:15 PM

Change 356163 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] prometheus: add hwmon collector to default set

https://gerrit.wikimedia.org/r/356163

ema added a comment. May 30 2017, 1:55 PM

check_ipmi_sensor has been installed across the fleet but doesn't work.
running it with options for temperature makes it exit with "CRIT" for _non_-temperature things
root@lead:~# /usr/local/lib/nagios/plugins/check_ipmi_sensor -T temperature
Sensor Type(s) temperature Status: Critical [Power Supply 1 Status = Critical, System Board PS Redundancy = Critical, Power Supply 1 Status = Warning, System Board Intrusion = Critical, System Board Intrusion = Critical] | 'Inlet Temp'=23.00;3.00:42.00;-7.00:47.00 'Temp'=51.00

The following seems to be doing the Right Thing:

root@cp1008:~# /usr/local/lib/nagios/plugins/check_ipmi_sensor --noentityabsent -T Temperature -ST Temperature
Sensor Type(s) Temperature Status: OK | 'Ambient Temp'=20.00;8.00:42.00;3.00:47.00

Change 356163 merged by Ema:
[operations/puppet@production] prometheus: add hwmon collector to default set

https://gerrit.wikimedia.org/r/356163

ema added a comment. May 30 2017, 3:02 PM

@Dzahn I've amended your patch by calling check_ipmi_sensor with -ST Temperature as mentioned above. I've also removed the if >= jessie guard since freeipmi seems to be installed on all machines at the moment. Do you think that's enough or is there anything else to work on before merging the check?

Dzahn added a comment. May 31 2017, 1:24 AM

@ema Thank you very much! Back when i checked i somehow could not find the working option :) This is great.

Just one thing, we should not be running it on VMs (ganeti or labs), unsurprisingly it will just fail there. (tried on rutherfordium->"ipmi_sdr_cache_create: internal IPMI error"). So i amended one more time to add if str2bool($facts['is_virtual']) == false { around it.

This is the same as existing in modules/profile/manifests/base.pp where we have this condition around class { '::ipmi::monitor': }

Change 310383 merged by Dzahn:
[operations/puppet@production] monitoring/base: add temperature monitoring via NRPE

https://gerrit.wikimedia.org/r/310383

Change 356324 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] monitoring/base: add nagios sudo privs for IPMI sensors

https://gerrit.wikimedia.org/r/356324

Change 356324 merged by Dzahn:
[operations/puppet@production] monitoring/base: add nagios sudo privs for IPMI sensors

https://gerrit.wikimedia.org/r/356324

Dzahn added a comment. (Edited) May 31 2017, 3:40 AM

Works now and is OK on > 1000 machines :)

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=IPMI+Temperature

I had to ACK a bunch that were on hosts that were in scheduled downtime, but since this service was new it wasn't covered by it.

As kind of expected with this number of hosts a few special cases stayed:

db2034: UNKNOWN internal IPMI error
cp3006/labsdb1001: UNKNOWN: caching SDR repository information..
labvirt1008: "Sensor Type(s) Temperature Status:"

and then these that look like legit warnings we detected now:

wtp1010, cp1049: _actual_ " System Board Inlet Temp = Critical,"
ms-be2028, aqs1004: actual " Critical [System Board 12 29-LOM = Critical]"

and that is basically it. all others are OK.

The thresholds are not set by us, but they can be seen when running the command with -vvv manually, see the example on

[wasat:~] $ /usr/local/lib/nagios/plugins/check_ipmi_sensor --noentityabsent -T Temperature -ST Temperature -vvv
------------- debug output for sel (-vvv is set): ------------
/usr/sbin/ipmi-sel was executed with the following parameters:
sudo /usr/sbin/ipmi-sel --output-event-state --interpret-oem-data --entity-sensor-names --sensor-types=Temperature
output of FreeIPMI:
------------- debug output for sensors (-vvv is set): ------------
script was executed with the following parameters:
/usr/local/lib/nagios/plugins/check_ipmi_sensor --noentityabsent -T Temperature -ST Temperature -vvv
check_ipmi_sensor version:
3.11
FreeIPMI version:
ipmi-sensors - 1.4.5
FreeIPMI was executed with the following parameters:
sudo /usr/sbin/ipmi-sensors -g Temperature --quiet-cache --sdr-cache-recreate --interpret-oem-data --output-sensor-state --ignore-not-available-sensors --output-sensor-thresholds
FreeIPMI return code: 0
output of FreeIPMI:
ID | Name | Type | State | Reading | Units | Lower NR | Lower C | Lower NC | Upper NC | Upper C | Upper NR | Event
2 | 01-Inlet Ambient | Temperature | Nominal | 23.00 | C | N/A | N/A | N/A | N/A | 42.00 | 46.00 | 'OK'
3 | 02-CPU 1 | Temperature | Nominal | 40.00 | C | N/A | N/A | N/A | N/A | 70.00 | N/A | 'OK'
4 | 03-CPU 2 | Temperature | Nominal | 42.00 | C | N/A | N/A | N/A | N/A | 70.00 | N/A | 'OK'
6 | 05-P1 DIMM 7-12 | Temperature | Nominal | 35.00 | C | N/A | N/A | N/A | N/A | 89.00 | N/A | 'OK'
8 | 07-P2 DIMM 7-12 | Temperature | Nominal | 33.00 | C | N/A | N/A | N/A | N/A | 89.00 | N/A | 'OK'
11 | 10-Chipset | Temperature | Nominal | 38.00 | C | N/A | N/A | N/A | N/A | 105.00 | N/A | 'OK'
12 | 11-PS 1 Inlet | Temperature | Nominal | 33.00 | C | N/A | N/A | N/A | N/A | N/A | N/A | 'OK'
13 | 12-PS 2 Inlet | Temperature | Nominal | 34.00 | C | N/A | N/A | N/A | N/A | N/A | N/A | 'OK'
14 | 13-VR P1 | Temperature | Nominal | 35.00 | C | N/A | N/A | N/A | N/A | 115.00 | 120.00 | 'OK'
15 | 14-VR P2 | Temperature | Nominal | 41.00 | C | N/A | N/A | N/A | N/A | 115.00 | 120.00 | 'OK'
16 | 15-VR P1 Mem | Temperature | Nominal | 29.00 | C | N/A | N/A | N/A | N/A | 115.00 | 120.00 | 'OK'
17 | 16-VR P1 Mem | Temperature | Nominal | 29.00 | C | N/A | N/A | N/A | N/A | 115.00 | 120.00 | 'OK'
18 | 17-VR P2 Mem | Temperature | Nominal | 34.00 | C | N/A | N/A | N/A | N/A | 115.00 | 120.00 | 'OK'
19 | 18-VR P2 Mem | Temperature | Nominal | 32.00 | C | N/A | N/A | N/A | N/A | 115.00 | 120.00 | 'OK'
20 | 19-PS 1 Internal | Temperature | Nominal | 40.00 | C | N/A | N/A | N/A | N/A | N/A | N/A | 'OK'
21 | 20-PS 2 Internal | Temperature | Nominal | 40.00 | C | N/A | N/A | N/A | N/A | N/A | N/A | 'OK'
28 | 27-Front Ambient | Temperature | Nominal | 24.00 | C | N/A | N/A | N/A | N/A | 65.00 | N/A | 'OK'
29 | 28-P/S 2 Zone | Temperature | Nominal | 36.00 | C | N/A | N/A | N/A | N/A | 75.00 | N/A | 'OK'
30 | 29-Battery Zone | Temperature | Nominal | 32.00 | C | N/A | N/A | N/A | N/A | 75.00 | 80.00 | 'OK'
31 | 30-iLO Zone | Temperature | Nominal | 34.00 | C | N/A | N/A | N/A | N/A | 90.00 | 95.00 | 'OK'
32 | 31-PCI 1 Zone | Temperature | Nominal | 28.00 | C | N/A | N/A | N/A | N/A | 70.00 | 75.00 | 'OK'
33 | 32-PCI 2 Zone | Temperature | Nominal | 30.00 | C | N/A | N/A | N/A | N/A | 70.00 | 75.00 | 'OK'
36 | 35-I/O Zone | Temperature | Nominal | 31.00 | C | N/A | N/A | N/A | N/A | 75.00 | 80.00 | 'OK'
38 | 37-Fuse | Temperature | Nominal | 34.00 | C | N/A | N/A | N/A | N/A | 100.00 | N/A | 'OK'

--------------------- end of debug output ---------------------
Sensor Type(s) Temperature Status: OK | '01-Inlet Ambient'=23.00;;~:42.00 '02-CPU 1'=40.00;;~:70.00 '03-CPU 2'=42.00;;~:70.00 '05-P1 DIMM 7-12'=35.00;;~:89.00 '07-P2 DIMM 7-12'=33.00;;~:89.00 '10-Chipset'=38.00;;~:105.00 '11-PS 1 Inlet'=33.00 '12-PS 2 Inlet'=34.00 '13-VR P1'=35.00;;~:115.00 '14-VR P2'=41.00;;~:115.00 '15-VR P1 Mem'=29.00;;~:115.00 '16-VR P1 Mem'=29.00;;~:115.00 '17-VR P2 Mem'=34.00;;~:115.00 '18-VR P2 Mem'=32.00;;~:115.00 '19-PS 1 Internal'=40.00 '20-PS 2 Internal'=40.00 '27-Front Ambient'=24.00;;~:65.00 '28-P/S 2 Zone'=36.00;;~:75.00 '29-Battery Zone'=32.00;;~:75.00 '30-iLO Zone'=34.00;;~:90.00 '31-PCI 1 Zone'=28.00;;~:70.00 '32-PCI 2 Zone'=30.00;;~:70.00 '35-I/O Zone'=31.00;;~:75.00 '37-Fuse'=34.00;;~:100.00
01-Inlet Ambient = 23.00 (Status: Nominal)
02-CPU 1 = 40.00 (Status: Nominal)
03-CPU 2 = 42.00 (Status: Nominal)
05-P1 DIMM 7-12 = 35.00 (Status: Nominal)
07-P2 DIMM 7-12 = 33.00 (Status: Nominal)
10-Chipset = 38.00 (Status: Nominal)
11-PS 1 Inlet = 33.00 (Status: Nominal)
12-PS 2 Inlet = 34.00 (Status: Nominal)
13-VR P1 = 35.00 (Status: Nominal)
14-VR P2 = 41.00 (Status: Nominal)
15-VR P1 Mem = 29.00 (Status: Nominal)
16-VR P1 Mem = 29.00 (Status: Nominal)
17-VR P2 Mem = 34.00 (Status: Nominal)
18-VR P2 Mem = 32.00 (Status: Nominal)
19-PS 1 Internal = 40.00 (Status: Nominal)
20-PS 2 Internal = 40.00 (Status: Nominal)
27-Front Ambient = 24.00 (Status: Nominal)
28-P/S 2 Zone = 36.00 (Status: Nominal)
29-Battery Zone = 32.00 (Status: Nominal)
30-iLO Zone = 34.00 (Status: Nominal)
31-PCI 1 Zone = 28.00 (Status: Nominal)
32-PCI 2 Zone = 30.00 (Status: Nominal)
35-I/O Zone = 31.00 (Status: Nominal)
37-Fuse = 34.00 (Status: Nominal)
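
Since the thresholds come from the BMC rather than from us, it can help to see how much headroom each sensor has. The Python sketch below parses the pipe-delimited FreeIPMI table from the debug output above and computes `Upper C` minus `Reading`. It assumes the column names shown in that output; the function names are hypothetical, and sensors reporting N/A are returned as None.

```python
from typing import Optional


def parse_sensor_table(table: str) -> list[dict[str, str]]:
    """Turn the 'ID | Name | ...' table into a list of column -> cell dicts."""
    lines = [line for line in table.splitlines() if "|" in line]
    if not lines:
        return []
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) == len(header):  # skip malformed rows
            rows.append(dict(zip(header, cells)))
    return rows


def headroom_to_upper_critical(row: dict[str, str]) -> Optional[float]:
    """Degrees left before the upper critical threshold, or None if unknown."""
    upper = row.get("Upper C", "N/A")
    reading = row.get("Reading", "N/A")
    if "N/A" in (upper, reading):
        return None
    return float(upper) - float(reading)


if __name__ == "__main__":
    sample = """\
ID | Name | Type | State | Reading | Units | Lower NR | Lower C | Lower NC | Upper NC | Upper C | Upper NR | Event
2 | 01-Inlet Ambient | Temperature | Nominal | 23.00 | C | N/A | N/A | N/A | N/A | 42.00 | 46.00 | 'OK'
12 | 11-PS 1 Inlet | Temperature | Nominal | 33.00 | C | N/A | N/A | N/A | N/A | N/A | N/A | 'OK'
"""
    for row in parse_sensor_table(sample):
        print(row["Name"], headroom_to_upper_critical(row))
```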

Dzahn added a comment. May 31 2017, 3:55 AM

special case:

cp1049 - "System Board 1 Inlet Temp = Critical"

ID | Name         | Type        | State    | Reading    | Units | Lower NR   | Lower C    | Lower NC   | Upper NC   | Upper C    | Upper NR   | Event
18 | Inlet Temp   | Temperature | Nominal  | 18.00      | C     | N/A        | -7.00      | 3.00       | 42.00      | 47.00      | N/A        | 'OK'
19 | Exhaust Temp | Temperature | Nominal  | 37.00      | C     | N/A        | 3.00       | 8.00       | 70.00      | 75.00      | N/A        | 'OK'

labsdb1003 - "ipmi_sel_parse: internal IPMI error" on second attempt

Change 356350 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] check_ipmi_temp: bump check/retry intervals and timeout

https://gerrit.wikimedia.org/r/356350

Change 356350 merged by Ema:
[operations/puppet@production] check_ipmi_temp: bump check/retry intervals and timeout

https://gerrit.wikimedia.org/r/356350

ema added a comment. May 31 2017, 10:41 AM

The check seems to be OK at the moment except for a few hosts where the BMC just doesn't seem to want to behave properly (eg: cp3006, db1020).

For the record, an interesting alternative approach would be querying the hwmon sysfs interface: https://www.kernel.org/doc/Documentation/hwmon/sysfs-interface. That's what the hwmon prometheus collector does, and I found this icinga check following that route: https://github.com/bzed/pkg-nagios-plugins-contrib/blob/master/check_lm_sensors/check_lm_sensors-4.1.1/check_lm_sensors
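
To give an idea of what reading hwmon directly looks like, here is a minimal Python sketch. It relies on the `tempN_input`, `tempN_crit`, `tempN_label` and `name` files described in the kernel document linked above, all in millidegrees Celsius where numeric. It is not the code of the prometheus collector or of check_lm_sensors; the function names are invented, and the layout under /sys/class/hwmon is assumed to be `hwmon*/temp*_input`, which may differ on older kernels.

```python
from pathlib import Path
from typing import Optional


def _read_text(path: Path) -> Optional[str]:
    """Return the stripped file contents, or None if the file is unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def _millidegrees(path: Path) -> Optional[float]:
    """Read a millidegree value and convert it to degrees Celsius."""
    raw = _read_text(path)
    try:
        return int(raw) / 1000.0 if raw is not None else None
    except ValueError:
        return None


def read_hwmon_temps(root: Path = Path("/sys/class/hwmon")) -> list[dict]:
    """Collect every temperature input together with its critical limit, if any."""
    readings = []
    for input_file in sorted(root.glob("hwmon*/temp*_input")):
        chip_dir = input_file.parent
        prefix = input_file.name[: -len("_input")]  # e.g. "temp1"
        celsius = _millidegrees(input_file)
        if celsius is None:
            continue
        readings.append({
            "chip": _read_text(chip_dir / "name"),
            "label": _read_text(chip_dir / f"{prefix}_label"),
            "celsius": celsius,
            "crit_celsius": _millidegrees(chip_dir / f"{prefix}_crit"),
        })
    return readings


def over_critical(readings: list[dict]) -> list[dict]:
    """Return readings at or above their own critical limit."""
    return [r for r in readings
            if r["crit_celsius"] is not None and r["celsius"] >= r["crit_celsius"]]


if __name__ == "__main__":
    for reading in read_hwmon_temps():
        print(reading)
```

One attraction of this route is that the limits come from the driver per sensor, so no per-hardware threshold table is needed; the downside is that not every chip exposes a `_crit` file.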

Change 356378 had a related patch set uploaded (by Faidon Liambotis; owner: Faidon Liambotis):
[operations/puppet@production] Disable temperature monitoring via NRPE

https://gerrit.wikimedia.org/r/356378

Change 356378 merged by Faidon Liambotis:
[operations/puppet@production] Disable temperature monitoring via NRPE

https://gerrit.wikimedia.org/r/356378

Change 356422 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: blacklist acpi_power_meter

https://gerrit.wikimedia.org/r/356422

Change 356422 merged by Filippo Giunchedi:
[operations/puppet@production] base: blacklist acpi_power_meter

https://gerrit.wikimedia.org/r/356422

Change 356567 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Re-enable temperature monitoring via NRPE

https://gerrit.wikimedia.org/r/356567

Change 356567 merged by Ema:
[operations/puppet@production] Re-enable temperature monitoring via NRPE

https://gerrit.wikimedia.org/r/356567

Change 357010 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] check_ipmi_temp: set check timeout to 60 seconds

https://gerrit.wikimedia.org/r/357010

Change 357010 merged by Ema:
[operations/puppet@production] check_ipmi_temp: set check timeout to 60 seconds

https://gerrit.wikimedia.org/r/357010

Change 357361 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] check_ipmi_temp: turn off sel checking

https://gerrit.wikimedia.org/r/357361

Change 357361 merged by Ema:
[operations/puppet@production] check_ipmi_temp: turn off sel checking

https://gerrit.wikimedia.org/r/357361

Change 357617 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] check_ipmi_temp: load ipmi_devintf on trusty

https://gerrit.wikimedia.org/r/357617

Change 357617 merged by Ema:
[operations/puppet@production] check_ipmi_temp: load ipmi_devintf

https://gerrit.wikimedia.org/r/357617

faidon lowered the priority of this task from High to Medium. Jul 10 2017, 12:57 PM
faidon removed a project: Patch-For-Review.

So the IPMI checks have been deployed for a while. Quite a few hosts had BMC issues (some of them are fixed), and it remains to be seen whether the IPMI checks are going to be reliable enough for our uses.

Regardless of that, I'm not sure if this is going to address the "Package Temp" alerts that @BBlack was aiming for when he filed this task. Shall we resolve this, or should we keep it for adding /sys/class/thermal checks too? Thoughts?

faidon moved this task from Inbox to Up next on the observability board. Jul 10 2017, 12:58 PM
faidon closed this task as Resolved. Jul 27 2017, 12:35 AM
faidon claimed this task.

So I thought about it a little bit and think we can resolve this after all. I don't know of any cases where temperatures are an issue that the current IPMI check doesn't catch. Writing yet another thermal check is more work for dubious gains at this point, and it also means that we'll be checking the same values twice, from two different places, and get unnecessarily spammed on failure.