an-worker1168 in a weird statue, possibly due to I/O errors
Closed, Resolved, Public

Assigned To

Authored By

	akosiaris
	Mar 21 2024, 7:51 AM

Description

Since 2024-03-10 19:46:40, an-worker1168 has been firing several different alerts:

PROCS CRITICAL: 0 processes with command name 'java', args 'org.apache.hadoop.yarn.server.nodemanager.NodeManager'

and

ERROR ferm input drop default policy not set, ferm might not have been started correctly

and

NRPE: Unable to read output

and

Failed to execute ['/usr/local/lib/nagios/plugins/get-raid-status-perccli']: KeyError 'System Overview'

and

PROCS CRITICAL: 0 processes with command name 'java', args 'org.apache.hadoop.hdfs.server.datanode.DataNode'
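The KeyError 'System Overview' alert above suggests the RAID check indexed a section that the controller output no longer contained. The following is a minimal sketch of a more defensive lookup, not the actual get-raid-status-perccli code: the `parsed` dictionary layout and the `health` field are assumptions made purely for illustration.

```python
# Standard Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def check_raid(parsed):
    """Map parsed controller output to (exit_code, message).

    `parsed` is a hypothetical dict built from the controller CLI output;
    the real plugin's data structure may differ.
    """
    # .get() returns None instead of raising KeyError when the section is absent,
    # e.g. because the controller stopped responding.
    overview = parsed.get("System Overview")
    if not overview:
        return UNKNOWN, "UNKNOWN: controller output has no 'System Overview' section"

    # "health" is an assumed field name, used here only for illustration.
    unhealthy = [ctl for ctl in overview if ctl.get("health") != "ok"]
    if unhealthy:
        return CRITICAL, f"CRITICAL: {len(unhealthy)} controller(s) not healthy"
    return OK, "OK: all controllers healthy"


if __name__ == "__main__":
    # An empty result, as might be seen when the controller is unreachable.
    print(check_raid({}))
    print(check_raid({"System Overview": [{"health": "ok"}]}))
```

With a lookup like this, an unreachable controller would surface as an UNKNOWN check result with a readable message rather than as a Python exception.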

I tried to ssh in and got

ssh an-worker1168.eqiad.wmnet 
-bash: [: : integer expression expected
-bash: /etc/profile.d/bash_completion.sh: Input/output error
-bash: /usr/bin/tput: Input/output error
-bash: /usr/bin/tput: Input/output error
-bash: /usr/bin/tput: Input/output error
-bash: /usr/bin/tput: Input/output error
-bash: /usr/lib/systemd/user-environment-generators/30-systemd-environment-d-generator: Input/output error
-bash: /usr/bin/dircolors: Input/output error
-bash: /etc/bash_completion: Input/output error
Connection to an-worker1168.eqiad.wmnet closed.

After console com2 is issued, the iDRAC console shows:

[918641.688106] EXT4-fs (dm-1): I/O error while writing superblock
[918641.694150] EXT4-fs error (device dm-1): __ext4_find_entry:1583: inode #787528: comm bash: reading directory lblock 0
[918641.704870] Buffer I/O error on dev dm-1, logical block 0, lost sync page write
[918641.712347] EXT4-fs (dm-1): I/O error while writing superblock
[918645.811770] blk_update_request: I/O error, dev sdl, sector 23035688 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[918645.822493] blk_update_request: I/O error, dev sdl, sector 23035688 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[918651.986146] blk_update_request: I/O error, dev sdl, sector 116504432 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[918651.996952] blk_update_request: I/O error, dev sdl, sector 116504432 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[918653.356753] blk_update_request: I/O error, dev sdl, sector 51154752 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[918653.367580] blk_update_request: I/O error, dev sdl, sector 114158864 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[918653.378778] blk_update_request: I/O error, dev sdl, sector 114158776 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[918653.389598] blk_update_request: I/O error, dev sdl, sector 114158776 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[918653.400460] blk_update_request: I/O error, dev sdl, sector 114158776 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0

in a loop
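To see at a glance which block device is failing in console output like the above, the blk_update_request lines can be tallied per device. This is a rough sketch: the regular expression is written against the exact line format pasted in this task and may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Matches kernel lines of the form seen in this task, e.g.
# "blk_update_request: I/O error, dev sdl, sector 23035688 op 0x0:(READ) ..."
BLK_ERROR = re.compile(
    r"blk_update_request: I/O error, dev (?P<dev>[^,\s]+), "
    r"sector (?P<sector>\d+) op \w+:\((?P<op>\w+)\)"
)


def summarise_io_errors(lines):
    """Return (error count per device, set of failing sectors per device)."""
    counts = Counter()
    sectors = {}
    for line in lines:
        match = BLK_ERROR.search(line)
        if match is None:
            continue  # not a block-layer I/O error line (e.g. EXT4 messages)
        dev = match.group("dev")
        counts[dev] += 1
        sectors.setdefault(dev, set()).add(int(match.group("sector")))
    return counts, sectors


# Sample lines copied from the console output in this task.
sample = [
    "[918641.712347] EXT4-fs (dm-1): I/O error while writing superblock",
    "[918645.811770] blk_update_request: I/O error, dev sdl, sector 23035688 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0",
    "[918645.822493] blk_update_request: I/O error, dev sdl, sector 23035688 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0",
    "[918651.986146] blk_update_request: I/O error, dev sdl, sector 116504432 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0",
]

counts, sectors = summarise_io_errors(sample)
for dev, n in counts.items():
    print(f"{dev}: {n} I/O errors across {len(sectors[dev])} distinct sectors")
```

In the output pasted here every block-layer error is on sdl, which fits a single failing disk or its path to the controller rather than a host-wide problem.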

Something has probably gone badly wrong with the underlying disk. Since these alerts have been firing unhandled for many days, I assume the host is not critical, so I am ACKing them in alerts.wikimedia.org with a pointer to this task.

Can someone from data-engineering investigate more please?

Related Objects

Mentioned Here: T348036: sre.hardware.upgrade-firmware cookbook: product slug parsing

Event Timeline

akosiaris created this task. Mar 21 2024, 7:51 AM

Restricted Application added a subscriber: Aklapper. Mar 21 2024, 7:51 AM

akosiaris updated the task description. Mar 21 2024, 7:54 AM

Related alerts in alerts.wikimedia.org have been silenced for 30 days (chosen arbitrarily) with a comment pointing to this task.

akosiaris renamed this task from "an-worker1168 in a weird statue, possiblye due to I/O errors" to "an-worker1168 in a weird statue, possibly due to I/O errors". Mar 21 2024, 11:12 AM

Gehel triaged this task as High priority. Mar 21 2024, 11:22 AM

Gehel edited projects, added Data-Platform-SRE (2024.03.04 - 2024.03.24); removed Data-Engineering.

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.03.04 - 2024.03.24) board. Mar 22 2024, 8:45 AM

Gehel edited projects, added Data-Platform-SRE (2024.03.25 - 2024.04.14); removed Data-Platform-SRE (2024.03.04 - 2024.03.24). Mar 22 2024, 8:47 AM

BTullis claimed this task. Mar 22 2024, 10:22 AM

I can see from the iDRAC that there seems to be an intermittent communication issue with the RAID controller in slot 1.

I'll try a cold boot first, to see if this clears it.

Icinga downtime and Alertmanager silence (ID=c18f8d41-c90b-45af-91e9-dfc6487a6424) set by btullis@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Investigating disk errors

an-worker1168.eqiad.wmnet

Also, I investigated the iDRAC firmware version, but the cookbook threw an error:

btullis@cumin1002:~$ sudo cookbook sre.hardware.upgrade-firmware an-worker1168.eqiad.wmnet
Acquired lock for key /spicerack/locks/cookbooks/sre.hardware.upgrade-firmware: {'concurrency': 20, 'created': '2024-03-22 10:36:22.535337', 'owner': 'btullis@cumin1002 [2462780]', 'ttl': 1800}
START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1168.eqiad.wmnet
Acquired lock for key /spicerack/locks/custom/sre.hardware.upgrade-firmware:an-worker1168: {'concurrency': 1, 'created': '2024-03-22 10:36:22.606393', 'owner': 'btullis@cumin1002 [2462780]', 'ttl': 3600}
Management Password: 
an-worker1168.eqiad.wmnet (Gen 15): starting
an-worker1168.eqiad.wmnet (IDRAC): update
an-worker1168.eqiad.wmnet (IDRAC): current version: 7.0.30.0
power: picking DellDriverCategory.IDRAC update file
Released lock for key /spicerack/locks/custom/sre.hardware.upgrade-firmware:an-worker1168: {'concurrency': 1, 'created': '2024-03-22 10:36:22.606393', 'owner': 'btullis@cumin1002 [2462780]', 'ttl': 3600}
Exception raised while executing cookbook sre.hardware.upgrade-firmware:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 250, in _run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 968, in run
    failures += self._run_host(hostname)
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 998, in _run_host
    self.update_idrac(redfish_host, netbox_host)
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 693, in update_idrac
    target_version, job_id = self._update(
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 596, in _update
    target_version, firmware_file = getattr(self, select_firmwarefile)(
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 557, in _cached_select_firmwarefile
    return self._select_firmwarefile(*args, **kargs)
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 538, in _select_firmwarefile
    return self.get_latest(product_slug, driver_type, driver_category)
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 270, in get_latest
    raise RuntimeError(f"unable to find any drivers for: {product_slug}\n"
RuntimeError: unable to find any drivers for: power
Please ensure that the slug is correct.
Released lock for key /spicerack/locks/cookbooks/sre.hardware.upgrade-firmware: {'concurrency': 20, 'created': '2024-03-22 10:36:22.535337', 'owner': 'btullis@cumin1002 [2462780]', 'ttl': 1800}
END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-worker1168.eqiad.wmnet

I will look into this or report it, depending on the outcome of this ticket.
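The failure mode in the traceback can be illustrated with a simplified stand-in. This is not the cookbook's real get_latest (which takes a product slug, driver type and driver category): the `latest_version` helper, the `catalog` dictionary, the slug and the version strings below are all hypothetical, and only the error text is copied from the log above.

```python
def latest_version(catalog, product_slug):
    """Return the highest dotted version listed for product_slug.

    `catalog` is a hypothetical mapping of product slug -> list of versions,
    standing in for whatever the real cookbook queries.
    """
    versions = catalog.get(product_slug)
    if not versions:
        # Same message as in the traceback above.
        raise RuntimeError(
            f"unable to find any drivers for: {product_slug}\n"
            "Please ensure that the slug is correct."
        )
    # Compare numerically per component so that 1.10.0 sorts after 1.9.0.
    return max(versions, key=lambda v: tuple(int(part) for part in v.split(".")))


# Hypothetical catalog entry; the slug and versions are illustrative only.
catalog = {"example-model-slug": ["1.9.0", "1.10.0"]}

print(latest_version(catalog, "example-model-slug"))

try:
    # A slug such as "power" has no entry, so the lookup fails as in the log.
    latest_version(catalog, "power")
except RuntimeError as exc:
    print(exc)
```

Under these assumptions the error is a symptom of the slug itself: a lookup keyed on "power" finds nothing regardless of which firmware files exist, which is consistent with the product slug parsing problem tracked in T348036.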

Mentioned in SAL (#wikimedia-analytics) [2024-03-22T10:44:07Z] <btullis> shut down an-worker1168 to investigate disk controller failure for T360594

Issued a cold boot command to the BMC.

ipmitool> bmc reset cold
Sent cold reset command to MC
ipmitool>

Now booting the host.

The host has now booted cleanly and has completed a puppet run. It seems to be OK. I added a note about the firmware upgrade failure to T348036: sre.hardware.upgrade-firmware cookbook: product slug parsing

	F43030320: image.png
	Mar 22 2024, 10:30 AM
