Maniphest T209029

cloudelastic1004: SMART/disk error
Closed, Resolved · Public

Authored By

	aborrero
	Nov 8 2018, 9:37 AM

Description

SMART reports that the /dev/sda disk on the cloudelastic1004.wikimedia.org server is failing:

aborrero@cloudelastic1004:~ $ sudo smartctl -H /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-7-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
201 Unknown_SSD_Attribute   0x0033   001   001   010    Pre-fail  Always   FAILING_NOW 163735568957

Disk may need replacement. Netbox link: https://netbox.wikimedia.org/dcim/devices/266/
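
The health check above could be scripted. The following is a minimal sketch, assuming the `smartctl -H` text layout shown in this task; the function name `parse_smart_health` and the keys of the returned dictionary are illustrative and not part of smartmontools.

```python
import re


def parse_smart_health(output: str) -> dict:
    """Extract the overall verdict and any failing attributes from `smartctl -H` text."""
    verdict = None
    failing = []
    for line in output.splitlines():
        match = re.search(r"self-assessment test result:\s*(\w+)", line)
        if match:
            verdict = match.group(1)  # "PASSED" or "FAILED"
            continue
        fields = line.split()
        # Attribute rows start with a numeric ID and carry 10 columns;
        # WHEN_FAILED (column 9) is "-" for attributes that never failed.
        if len(fields) >= 10 and fields[0].isdigit() and fields[8] != "-":
            failing.append({
                "id": int(fields[0]),
                "name": fields[1],
                "value": int(fields[3]),
                "thresh": int(fields[5]),
                "when_failed": fields[8],
            })
    return {"verdict": verdict, "failing": failing}


# Sample copied from the output pasted in this task.
SAMPLE = """\
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
201 Unknown_SSD_Attribute   0x0033   001   001   010    Pre-fail  Always   FAILING_NOW 163735568957
"""

if __name__ == "__main__":
    report = parse_smart_health(SAMPLE)
    print(report["verdict"], [a["name"] for a in report["failing"]])
```

In this sample, attribute 201 is reported because its normalized value (1) is below its threshold (10), which is why smartctl marks it FAILING_NOW.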

Event Timeline

aborrero created this task. · Nov 8 2018, 9:37 AM

Restricted Application added a subscriber: Aklapper. · Nov 8 2018, 9:37 AM

aborrero updated the task description. · Nov 8 2018, 9:38 AM

aborrero added a project: ops-eqiad.

Restricted Application added a project: SRE. · Nov 8 2018, 9:38 AM

colewhite triaged this task as Medium priority. · Nov 9 2018, 6:22 PM

RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board. · Dec 12 2018, 11:36 PM

A ticket has been opened with Dell:

You have successfully submitted request SR984761946.

The SSD is on-site. It's /dev/sda, so the disk will need to be failed before I can replace it. This server may need a reinstall if /dev/sda does not rebuild.
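
For reference, "failing" the disk could look like the sketch below. It assumes the host uses Linux software RAID (md), which this task does not state explicitly, and the array layout in `SAMPLE_MDSTAT` is hypothetical. The helper names are illustrative. The script only builds the `mdadm` command lines for review; it does not run them.

```python
import re


def arrays_using_disk(mdstat: str, disk: str) -> dict:
    """Map each md array name to its member partitions that live on `disk` (e.g. "sda")."""
    result = {}
    for line in mdstat.splitlines():
        match = re.match(r"^(md\d+)\s*:\s*(.*)$", line)
        if not match:
            continue
        array, rest = match.groups()
        # Members look like "sda1[0]" or "sdb1[1](F)"; keep only those on `disk`.
        members = re.findall(r"\b(" + re.escape(disk) + r"\d*)\[\d+\]", rest)
        if members:
            result[array] = members
    return result


def fail_commands(mdstat: str, disk: str) -> list:
    """Build (but do not run) mdadm commands that fail and remove each member on `disk`."""
    commands = []
    for array, members in sorted(arrays_using_disk(mdstat, disk).items()):
        for part in members:
            commands.append(
                f"mdadm /dev/{array} --fail /dev/{part} --remove /dev/{part}"
            )
    return commands


# Hypothetical /proc/mdstat content; the real layout of this host is not in the task.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      48794624 blocks super 1.2 [2/2] [UU]
md1 : active raid1 sdb2[1] sda2[0]
      975296 blocks super 1.2 [2/2] [UU]
unused devices: <none>
"""

if __name__ == "__main__":
    for command in fail_commands(SAMPLE_MDSTAT, "sda"):
        print(command)
```

On a real host the input would come from reading `/proc/mdstat`, and the printed commands would be checked by hand before being run as root.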

Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. · Jan 26 2019, 10:38 PM

@aborrero, can this server be re-installed? There is a risk that removing /dev/sda will kill the OS.

In T209029#4919182, @Cmjohnson wrote:

@aborrero, can this server be re-installed? There is a risk that removing /dev/sda will kill the OS.

I believe the server can be reinstalled right away.

Just checked:

cloudelastic1004 is a Unused spare system (spare::system)

The disk has been replaced, @aborrero the OS will need to be re-installed. Until then the raid is out of whack because I removed /dev/sda.

In T209029#4935906, @Cmjohnson wrote:

The disk has been replaced, @aborrero the OS will need to be re-installed. Until then the raid is out of whack because I removed /dev/sda.

Thanks, I will reimage it and discuss with the team how to proceed with this server.

WMCS needs discussion: what do we want to do with this server? Can it live with spare::system for now?

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

cloudelastic1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201902081304_aborrero_22060_cloudelastic1004_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudelastic1004.eqiad.wmnet']

Of which those FAILED:

['cloudelastic1004.eqiad.wmnet']

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

cloudelastic1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201902081305_aborrero_22235_cloudelastic1004_eqiad_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2019-02-08T13:05:55Z] <arturo> T209029 reimaging cloudelastic1004

Completed auto-reimage of hosts:

['cloudelastic1004.eqiad.wmnet']

Of which those FAILED:

['cloudelastic1004.eqiad.wmnet']

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

cloudelastic1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201902081334_aborrero_27096_cloudelastic1004_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cloudelastic1004.eqiad.wmnet']

Of which those FAILED:

['cloudelastic1004.eqiad.wmnet']

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

cloudelastic1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201902081335_aborrero_28639_cloudelastic1004_eqiad_wmnet.log.

I'm having trouble reimaging the server:

aborrero@cumin1001:~$ sudo -i wmf-auto-reimage-host -p T209029 --no-verify --no-downtime --no-reboot cloudelastic1004.eqiad.wmnet
13:35:30 | cloudelastic1004.eqiad.wmnet | REIMAGE START | To monitor the full log and cumin output:
sudo tail -F /var/log/wmf-auto-reimage/201902081335_aborrero_28639_cloudelastic1004_eqiad_wmnet.log
sudo tail -F /var/log/wmf-auto-reimage/201902081335_aborrero_28639_cloudelastic1004_eqiad_wmnet_cumin.out
IPMI Password: 
13:35:44 | cloudelastic1004.eqiad.wmnet | Removed from Puppet
13:35:44 | cloudelastic1004.eqiad.wmnet | WARNING: Unable to remove from Debmonitor, got: 404
13:35:44 | cloudelastic1004.eqiad.wmnet | Set Boot Device to pxe
13:35:45 | cloudelastic1004.eqiad.wmnet | Power cycling
13:35:45 | cloudelastic1004.eqiad.wmnet | Chassis Power Control: Cycle
13:39:45 | cloudelastic1004.eqiad.wmnet | Still waiting for reboot after 5.0 minutes
13:44:46 | cloudelastic1004.eqiad.wmnet | Still waiting for reboot after 10.0 minutes
13:49:46 | cloudelastic1004.eqiad.wmnet | Still waiting for reboot after 15.0 minutes
13:54:46 | cloudelastic1004.eqiad.wmnet | Still waiting for reboot after 20.0 minutes
13:59:47 | cloudelastic1004.eqiad.wmnet | Still waiting for reboot after 25.0 minutes
[...]
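
The stall in the output above can be measured from the log itself. This is a minimal sketch, assuming the `HH:MM:SS | host | message` line shape seen in this task and that all timestamps fall on the same day; the function names are illustrative and not part of wmf-auto-reimage.

```python
from datetime import datetime, timedelta


def parse_reimage_lines(text: str) -> list:
    """Return (time, host, message) tuples for lines shaped 'HH:MM:SS | host | message'."""
    events = []
    for line in text.splitlines():
        parts = [part.strip() for part in line.split(" | ", 2)]
        if len(parts) != 3:
            continue  # not a status line (e.g. a tail command or a prompt)
        try:
            stamp = datetime.strptime(parts[0], "%H:%M:%S")
        except ValueError:
            continue
        events.append((stamp, parts[1], parts[2]))
    return events


def reboot_wait(events: list):
    """Time from 'Power cycling' to the latest 'Still waiting for reboot' message."""
    start = next((t for t, _, msg in events if msg == "Power cycling"), None)
    waits = [t for t, _, msg in events if msg.startswith("Still waiting for reboot")]
    if start is None or not waits:
        return None
    return max(waits) - start


# Excerpt copied from the session pasted in this task.
LOG = """\
13:35:44 | cloudelastic1004.eqiad.wmnet | Set Boot Device to pxe
13:35:45 | cloudelastic1004.eqiad.wmnet | Power cycling
13:35:45 | cloudelastic1004.eqiad.wmnet | Chassis Power Control: Cycle
13:39:45 | cloudelastic1004.eqiad.wmnet | Still waiting for reboot after 5.0 minutes
13:59:47 | cloudelastic1004.eqiad.wmnet | Still waiting for reboot after 25.0 minutes
"""

if __name__ == "__main__":
    print(reboot_wait(parse_reimage_lines(LOG)))
```

A wait that keeps growing well past a normal install time, as it does here, suggests the script never saw the host come back under the name it was given.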

The Debian installer completes, but I can't log in: apparently the first Puppet run never finishes, so no login method works (neither ssh nor direct console access).

This should be cloudelastic1004.wikimedia.org, not cloudelastic1004.eqiad.wmnet.

Completed auto-reimage of hosts:

['cloudelastic1004.eqiad.wmnet']

Of which those FAILED:

['cloudelastic1004.eqiad.wmnet']

In T209029#4937948, @aborrero wrote:

WMCS needs discussion: what do we want to do with this server? Can it live with spare::system for now?

My understanding was that the Analytics team was anxiously waiting to get access to their hypervisors. Is this one of them?

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

cloudelastic1004.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/201902081618_aborrero_73970_cloudelastic1004_wikimedia_org.log.

Completed auto-reimage of hosts:

['cloudelastic1004.wikimedia.org']

Of which those FAILED:

['cloudelastic1004.wikimedia.org']

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

cloudelastic1004.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/201902081619_aborrero_74286_cloudelastic1004_wikimedia_org.log.

Completed auto-reimage of hosts:

['cloudelastic1004.wikimedia.org']

and were ALL successful.

Thanks @Cmjohnson and @MoritzMuehlenhoff, the server seems fine now:

aborrero@cloudelastic1004:~ $ sudo smartctl -H /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-8-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
