
mw1083's sda disk is dying
Closed, Resolved · Public

Description

Today mw1083's sda disk (the only disk) started spewing errors:

Oct 21 03:10:38 mw1083 kernel: [27802921.887805] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 21 03:10:38 mw1083 kernel: [27802921.894514] ata1.00: BMDMA stat 0x24
Oct 21 03:10:38 mw1083 kernel: [27802921.898347] ata1.00: failed command: READ DMA
Oct 21 03:10:38 mw1083 kernel: [27802921.903011] ata1.00: cmd c8/00:08:20:28:c1/00:00:00:00:00/ee tag 0 dma 4096 in
Oct 21 03:10:38 mw1083 kernel: [27802921.903011]          res 51/40:07:21:28:c1/00:00:00:00:00/ee Emask 0x9 (media error)
Oct 21 03:10:38 mw1083 kernel: [27802921.918511] ata1.00: status: { DRDY ERR }
Oct 21 03:10:38 mw1083 kernel: [27802921.922821] ata1.00: error: { UNC }
Oct 21 03:10:38 mw1083 kernel: [27802921.958959] ata1.00: configured for UDMA/133
Oct 21 03:10:38 mw1083 kernel: [27802921.958977] sd 0:0:0:0: [sda] Unhandled sense code
Oct 21 03:10:38 mw1083 kernel: [27802921.958980] sd 0:0:0:0: [sda]  
Oct 21 03:10:38 mw1083 kernel: [27802921.958983] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 21 03:10:38 mw1083 kernel: [27802921.958985] sd 0:0:0:0: [sda]  
Oct 21 03:10:38 mw1083 kernel: [27802921.958988] Sense Key : Medium Error [current] [descriptor]
Oct 21 03:10:38 mw1083 kernel: [27802921.958992] Descriptor sense data with sense descriptors (in hex):
Oct 21 03:10:38 mw1083 kernel: [27802921.958994]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Oct 21 03:10:38 mw1083 kernel: [27802921.959004]         0e c1 28 21 
Oct 21 03:10:38 mw1083 kernel: [27802921.959009] sd 0:0:0:0: [sda]  
Oct 21 03:10:38 mw1083 kernel: [27802921.959012] Add. Sense: Unrecovered read error - auto reallocate failed
Oct 21 03:10:38 mw1083 kernel: [27802921.959015] sd 0:0:0:0: [sda] CDB: 
Oct 21 03:10:38 mw1083 kernel: [27802921.959016] Read(10): 28 00 0e c1 28 20 00 00 08 00
Oct 21 03:10:38 mw1083 kernel: [27802921.959026] end_request: I/O error, dev sda, sector 247539745
Oct 21 03:10:38 mw1083 kernel: [27802921.965036] ata1: EH complete
Oct 21 03:10:38 mw1083 kernel: [27802921.965078] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7733743: comm hhvm: reading directory lblock 0

and then

[27802921.976174] Aborting journal on device sda1-8.
[27802921.981297] EXT4-fs (sda1): Remounting filesystem read-only
[27802921.981465] EXT4-fs error (device sda1): ext4_journal_check_start:56: Detected aborted journal
[27802921.981467] EXT4-fs (sda1): Remounting filesystem read-only

Let's replace the disk and reimage the box. It has already been depooled.
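
As a cross-check on the log above, the failing sector can be recovered from the raw bytes the kernel printed. The following is a minimal sketch, assuming the standard SCSI READ(10) CDB layout and descriptor-format sense data with a single information descriptor; the function names are illustrative and are not part of any tool used on this host.

```python
def decode_read10(cdb_hex: str) -> tuple[int, int]:
    """Return (start_lba, block_count) from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    start_lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    block_count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return start_lba, block_count


def decode_descriptor_sense(sense_hex: str) -> dict:
    """Decode descriptor-format sense data whose first descriptor is type 0x00."""
    sense = bytes.fromhex(sense_hex)
    if sense[0] & 0x7F != 0x72:
        raise ValueError("not current descriptor-format sense data")
    if sense[8] != 0x00:
        raise ValueError("first descriptor is not an information descriptor")
    return {
        "sense_key": sense[1] & 0x0F,   # 0x3 = Medium Error
        "asc": sense[2],                # 0x11 with ascq 0x04 matches the
        "ascq": sense[3],               # "auto reallocate failed" line in the log
        "failed_lba": int.from_bytes(sense[12:20], "big"),
    }


# Bytes copied from the kernel log in the task description.
start_lba, block_count = decode_read10("28 00 0e c1 28 20 00 00 08 00")
sense = decode_descriptor_sense(
    "72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 0e c1 28 21"
)

print(start_lba, block_count)   # 247539744 8
print(sense["failed_lba"])      # 247539745
```

The decoded values agree with the rest of the log: the request covered 8 sectors starting at 247539744 (8 x 512 bytes = the "dma 4096" in the ata1.00 line), and the unreadable sector 247539745 is the one reported by end_request.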

Details

Related Gerrit Patches:
operations/puppet : production : mw1083: add back to dsh group
operations/puppet : production : mw1083: depooled, remove from dsh

Event Timeline

akosiaris raised the priority of this task from to Needs Triage.
akosiaris updated the task description.
akosiaris added a project: ops-eqiad.
akosiaris added a subscriber: akosiaris.
Restricted Application added a project: acl*sre-team. Oct 21 2015, 3:42 PM
Restricted Application added a subscriber: Aklapper.
akosiaris triaged this task as Medium priority. Oct 21 2015, 3:42 PM
akosiaris set Security to None.
Restricted Application added a subscriber: Matanya. Oct 21 2015, 3:42 PM

Change 247859 had a related patch set uploaded (by Alexandros Kosiaris):
mw1083: depooled, remove from dsh

https://gerrit.wikimedia.org/r/247859

Change 247859 merged by Alexandros Kosiaris:
mw1083: depooled, remove from dsh

https://gerrit.wikimedia.org/r/247859

I will shut down this machine now so it does not query the MySQL servers with an outdated configuration.

The physical disk has been replaced. The machine requires a fresh install.

Change 251537 had a related patch set uploaded (by Dzahn):
mw1083: add back to dsh group

https://gerrit.wikimedia.org/r/251537

Change 251537 merged by Dzahn:
mw1083: add back to dsh group

https://gerrit.wikimedia.org/r/251537

Dzahn added a subscriber: Dzahn. Nov 11 2015, 1:44 AM

looks like it has been reinstalled but is not pooled yet

Dzahn closed this task as Resolved. Nov 11 2015, 1:49 AM
Dzahn claimed this task.

repooled in pybal. getting traffic again.