ms-fe1014 hardware fault (may need new disk controller?)
Closed, Resolved (Public)

Description

ms-fe1014 was reporting EIO (Input/output error) on any attempt to log in, or indeed to reboot it via the cookbook:

mvernon@cumin2002:~$ sudo cookbook sre.hosts.reboot-single -r 'EIO on login' --depool ms-fe1014.eqiad.wmnet
Exception raised while initializing the Cookbook sre.hosts.reboot-single:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 205, in run
    runner = self.instance.get_runner(args)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reboot-single.py", line 37, in get_runner
    return RebootSingleHostRunner(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reboot-single.py", line 86, in __init__
    self.puppet.check_enabled()
  File "/usr/lib/python3/dist-packages/spicerack/puppet.py", line 177, in check_enabled
    disabled = self._get_disabled()[True]
  File "/usr/lib/python3/dist-packages/spicerack/puppet.py", line 365, in _get_disabled
    result = bool(int(output.message().decode().strip()))
ValueError: invalid literal for int() with base 10: 'bash: /etc/bash.bashrc: Input/output error\nxargs: echo: Input/output error\nxargs: echo: Input/output error\nxargs: echo: Input/output error\n0'
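
The ValueError above arises because the command output contained the shell's I/O error lines ahead of the expected `0`, so the whole string was handed to `int()`. As a minimal sketch only (this is not spicerack's actual code; `parse_flag` is a hypothetical helper invented for illustration), a stricter parser could reject such output with a clearer message instead of a bare ValueError:

```python
def parse_flag(raw: str) -> bool:
    """Interpret command output as a single 0/1 flag.

    Hypothetical helper, not part of spicerack. It refuses any output
    that is not exactly one line containing "0" or "1".
    """
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    if len(lines) != 1 or lines[0] not in ("0", "1"):
        raise RuntimeError(f"unexpected command output, host may be unhealthy: {raw!r}")
    return lines[0] == "1"


# Output as seen in the traceback: error lines followed by the real value.
noisy = (
    "bash: /etc/bash.bashrc: Input/output error\n"
    "xargs: echo: Input/output error\n"
    "xargs: echo: Input/output error\n"
    "xargs: echo: Input/output error\n"
    "0"
)

try:
    parse_flag(noisy)
except RuntimeError as exc:
    print(f"refused to parse: {exc}")
```

The sketch deliberately does not fall back to the last line (`0` here): on a host with a failing disk, that value cannot be trusted, so failing loudly seems the safer choice.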

I hard-rebooted it from the iDRAC, and it refuses to come up, with an alarming message about the state of the disk controller:

ms-fe1014_sad.png (1×1 px, 238 KB)

The host is depooled, so please feel free to work on this system at your earliest convenience without needing further input from me.

Event Timeline

MatthewVernon triaged this task as High priority.
MatthewVernon updated the task description.

@Papaul is this host likely to get some attention soon, please?

Upgraded the BIOS and iDRAC on the server. The server is back up. I will leave the task open for now to see whether we get the same error again.

Papaul claimed this task.

Checked the server again today and all looks good. I am closing this task; we can still re-open it if we see the same issue. Thanks.