Several hosts return "internal IPMI error" in the check_ipmi_temp check
Closed, Resolved · Public

Description

The icinga check check_ipmi_temp is failing on the following hosts with an internal IPMI error:

  • db2034
  • db2042
  • ms-be2014
  • ocg1002
  • sodium

ipmi_sdr_cache_create: internal IPMI error

Running the command in debug mode (/usr/sbin/ipmi-sensors -g Temperature --quiet-cache --sdr-cache-recreate --interpret-oem-data --output-sensor-state --ignore-not-available-sensors --output-sensor-thresholds --debug) I got the following output for db2034/db2042/ms-be2014/ocg1002:

=====================================================                                          
Get SDR Repository Info Request                                                                
=====================================================
KCS Header:
------------
[               0h] = lun[ 2b]
[               Ah] = net_fn[ 6b]
IPMI Command Data:
------------------
[              20h] = cmd[ 8b]
ipmi_sdr_cache_create: internal IPMI error

sodium, instead, fails in a slightly different way:

=====================================================                                          
Get SDR Repository Info Request                                                                
=====================================================
[              20h] = cmd[ 8b]
=====================================================
Get SDR Repository Info Response
=====================================================
[              20h] = cmd[ 8b]
[              D5h] = comp_code[ 8b]
ipmi_sdr_cache_create: internal IPMI error
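
For readers unfamiliar with the comp_code field in the sodium trace, here is a minimal sketch of a lookup for a few generic IPMI completion codes. The helper name ipmi_comp_code_desc is hypothetical (it is not part of FreeIPMI), the descriptions are paraphrased from the generic completion code table in the IPMI specification, and only a handful of codes are covered.

```shell
# Hypothetical helper: describe a generic IPMI completion code.
# Accepts forms such as "D5", "d5" or "D5h".
ipmi_comp_code_desc() {
    # Normalise to upper-case hex without the trailing "h".
    code=$(printf '%s' "$1" | tr 'a-f' 'A-F')
    code=${code%[hH]}
    case "$code" in
        00) echo "Command completed normally" ;;
        C0) echo "Node busy" ;;
        C3) echo "Timeout while processing command" ;;
        D0) echo "SDR repository in update mode" ;;
        D1) echo "Device in firmware update mode" ;;
        D2) echo "BMC initialization in progress" ;;
        D5) echo "Command not supported in present state" ;;
        FF) echo "Unspecified error" ;;
        *)  echo "Unknown or unlisted completion code" ;;
    esac
}
```

Calling ipmi_comp_code_desc D5h for the value seen on sodium prints "Command not supported in present state", which would be consistent with a BMC that is up but not in a usable state.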

Event Timeline

ema created this task. Jun 6 2017, 11:46 AM
Restricted Application removed a project: Patch-For-Review. Jun 6 2017, 11:46 AM
ema updated the task description. Jun 6 2017, 1:26 PM
jcrespo added a subscriber: jcrespo. Jun 6 2017, 7:13 PM

It could be related to T141756#3320207

ema updated the task description. Jun 7 2017, 8:38 AM
ema updated the task description.
ema added a subscriber: fgiunchedi. Jun 7 2017, 10:02 AM

@fgiunchedi just rebooted ms-be2014 for unrelated reasons and the reboot alone fixed the issue. Perhaps IPMI can end up in some weird state and that gets fixed upon reboot?

ema added a comment. Jun 7 2017, 11:25 AM

According to the FreeIPMI FAQ, /dev/ipmi0 should be present; otherwise races can cause frequent 'internal IPMI error' failures. On jessie systems /dev/ipmi0 is present, while on trusty it is not.

Loading the ipmi_devintf kernel module on trusty makes the file appear and the issue disappear (tested on ocg1002).
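
A minimal sketch of that check and fix, assuming a POSIX shell, root privileges and the default device path /dev/ipmi0. The function names ipmi_dev_status and ensure_ipmi_devintf are hypothetical and are not taken from the puppet change.

```shell
# Report whether an IPMI character device exists (default: /dev/ipmi0).
ipmi_dev_status() {
    dev="${1:-/dev/ipmi0}"
    if [ -c "$dev" ]; then
        echo "present"
    else
        echo "missing"
    fi
}

# Load ipmi_devintf only when the device node is missing (requires root).
# Defined but not invoked here; call it explicitly on the affected host.
ensure_ipmi_devintf() {
    if [ "$(ipmi_dev_status)" = "missing" ]; then
        modprobe ipmi_devintf
    fi
    # Report the state after the attempted load.
    ipmi_dev_status
}
```

Running ensure_ipmi_devintf as root on a host such as ocg1002 should print "present" once the module has created the device node.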

Change 357617 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] check_ipmi_temp: load ipmi_devintf on trusty

https://gerrit.wikimedia.org/r/357617

jcrespo added a comment. Edited Jun 7 2017, 2:23 PM

db2042 is jessie, BTW. But it has an old BIOS version.

ema added a comment. Jun 7 2017, 2:30 PM

> db2042 is jessie, BTW. But it has an old BIOS version.

Uh, nice catch. Indeed only some of the Jessie hosts have ipmi_devintf loaded (all those I've checked when I concluded it's a jessie vs. trusty thing of course, eh). Thanks.
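
To check a host directly rather than infer from the distribution, something like the following could be used. It is a sketch assuming the standard /proc/modules format (module name in the first column); module_loaded is a hypothetical helper name.

```shell
# Return success if the named kernel module appears in a /proc/modules-style file.
# Usage: module_loaded NAME [FILE]   (FILE defaults to /proc/modules)
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}
```

For example, `module_loaded ipmi_devintf || echo "ipmi_devintf not loaded"` flags hosts that still need the module, whether they run jessie or trusty.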

Change 357617 merged by Ema:
[operations/puppet@production] check_ipmi_temp: load ipmi_devintf

https://gerrit.wikimedia.org/r/357617

faidon renamed this task from internal IPMI error to Several hosts return "internal IPMI error" in the check_ipmi_temp check. Jul 10 2017, 12:54 PM
faidon closed this task as Resolved.
faidon claimed this task.

I just checked the list above one by one. All of them work now, with the exception of sodium, which has an entirely unresponsive iDRAC (cf. T169360). This can be resolved for now; sodium can be tracked as part of the other task.

ayounsi reopened this task as Open. Aug 1 2017, 5:04 PM
ayounsi added a subscriber: ayounsi.

db2040 is alerting as unknown as well:

root@db2040:~# ipmi-sensors
ID | Name | Type | Reading    | Units | Event
root@db2040:~# /usr/local/lib/nagios/plugins/check_ipmi_sensor --noentityabsent -T Temperature -ST Temperature --nosel
Sensor Type(s) Temperature Status: 
 FreeIPMI returned an empty header map (first line) FreeIPMI could not find any sensors for the given sensor type (option '-T').

Should we try restarting it or upgrading firmware?
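
The db2040 symptom above (a header line but no sensor rows) can be detected mechanically. The sketch below assumes the pipe-separated ipmi-sensors output format shown in this comment; count_sensor_rows is a hypothetical helper, not part of check_ipmi_sensor.

```shell
# Count sensor rows in ipmi-sensors output read from stdin.
# The first line is treated as the header; only later lines containing "|" count.
count_sensor_rows() {
    awk 'NR > 1 && /\|/ { n++ } END { print n + 0 }'
}
```

On an affected host, `ipmi-sensors | count_sensor_rows` would print 0, which distinguishes "BMC answered but returned no sensors" from an outright IPMI error.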

Change 376218 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db2040 for reboot and upgrade

https://gerrit.wikimedia.org/r/376218

Change 376218 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db2040 for reboot and upgrade

https://gerrit.wikimedia.org/r/376218

jcrespo closed this task as Resolved. Sep 6 2017, 1:57 PM

I think the reboot and/or upgrade fixed it (db2040).

jcrespo reopened this task as Open. Sep 6 2017, 2:41 PM

checking es1019

Change 376276 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool es1019 for maintenance

https://gerrit.wikimedia.org/r/376276

Change 376276 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool es1019 for maintenance

https://gerrit.wikimedia.org/r/376276

es1019 seems to have rebroken T155691. I have depooled it, but that will take days to become effective (because backups do not respect the configuration), so I have to wait to restart the server.

jcrespo reassigned this task from faidon to Cmjohnson. Sep 7 2017, 3:29 PM

Hi, @Cmjohnson

We definitely need a "drain flea power" on es1019: it does not reboot and is unresponsive to mgmt connect (which is not surprising given the context of this task).

Hey, Chris, did you see this^? This is not an emergency, but the service affected (External Store) is relatively important (all wikitext content) and it is running with reduced redundancy. Also, the longer it is offline, the longer it will take to recover (it requires up-to-date replication), so I want to be sure you are at least aware of the issue, if only to rule out the server being completely unable to boot (fried).

@jcrespo sorry I missed that Friday...I will take a look

@jcrespo es1019 is back up. It seemed to be stuck in some weird state: the fans were blowing but nothing else worked. Pulled the power and rebooted. The server h/w log does not show any issues.

@Cmjohnson As I said, this is not an emergency. I will make the server catch up, but after that I would like to do a more detailed check to see whether the IPMI is working fine or needs an upgrade or something else. This is an in-warranty server, and I want to make sure it doesn't create any future problems. This can wait for later.

I will update this task after I do some OS-level checks to see what its state is and what the follow-ups should be.

jcrespo closed this task as Resolved. Sep 11 2017, 5:20 PM
jcrespo reassigned this task from Cmjohnson to faidon.

IPMI seems responsive again, both the programmatic calls and the SSH interface. I will consider this resolved again.

Change 377307 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] MariaDB: Repool db1019 with low load after maintenance

https://gerrit.wikimedia.org/r/377307

Change 377307 merged by jenkins-bot:
[operations/mediawiki-config@master] MariaDB: Repool es1019 with low load after maintenance

https://gerrit.wikimedia.org/r/377307

ayounsi removed a subscriber: ayounsi. Jan 10 2019, 4:31 PM