
mw1083's sda disk is dying
Closed, Resolved · Public

Description

Today mw1083's sda disk (the only disk) started spewing errors:

Oct 21 03:10:38 mw1083 kernel: [27802921.887805] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 21 03:10:38 mw1083 kernel: [27802921.894514] ata1.00: BMDMA stat 0x24
Oct 21 03:10:38 mw1083 kernel: [27802921.898347] ata1.00: failed command: READ DMA
Oct 21 03:10:38 mw1083 kernel: [27802921.903011] ata1.00: cmd c8/00:08:20:28:c1/00:00:00:00:00/ee tag 0 dma 4096 in
Oct 21 03:10:38 mw1083 kernel: [27802921.903011]          res 51/40:07:21:28:c1/00:00:00:00:00/ee Emask 0x9 (media error)
Oct 21 03:10:38 mw1083 kernel: [27802921.918511] ata1.00: status: { DRDY ERR }
Oct 21 03:10:38 mw1083 kernel: [27802921.922821] ata1.00: error: { UNC }
Oct 21 03:10:38 mw1083 kernel: [27802921.958959] ata1.00: configured for UDMA/133
Oct 21 03:10:38 mw1083 kernel: [27802921.958977] sd 0:0:0:0: [sda] Unhandled sense code
Oct 21 03:10:38 mw1083 kernel: [27802921.958980] sd 0:0:0:0: [sda]  
Oct 21 03:10:38 mw1083 kernel: [27802921.958983] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 21 03:10:38 mw1083 kernel: [27802921.958985] sd 0:0:0:0: [sda]  
Oct 21 03:10:38 mw1083 kernel: [27802921.958988] Sense Key : Medium Error [current] [descriptor]
Oct 21 03:10:38 mw1083 kernel: [27802921.958992] Descriptor sense data with sense descriptors (in hex):
Oct 21 03:10:38 mw1083 kernel: [27802921.958994]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Oct 21 03:10:38 mw1083 kernel: [27802921.959004]         0e c1 28 21 
Oct 21 03:10:38 mw1083 kernel: [27802921.959009] sd 0:0:0:0: [sda]  
Oct 21 03:10:38 mw1083 kernel: [27802921.959012] Add. Sense: Unrecovered read error - auto reallocate failed
Oct 21 03:10:38 mw1083 kernel: [27802921.959015] sd 0:0:0:0: [sda] CDB: 
Oct 21 03:10:38 mw1083 kernel: [27802921.959016] Read(10): 28 00 0e c1 28 20 00 00 08 00
Oct 21 03:10:38 mw1083 kernel: [27802921.959026] end_request: I/O error, dev sda, sector 247539745
Oct 21 03:10:38 mw1083 kernel: [27802921.965036] ata1: EH complete
Oct 21 03:10:38 mw1083 kernel: [27802921.965078] EXT4-fs error (device sda1): ext4_find_entry:1309: inode #7733743: comm hhvm: reading directory lblock 0

and then

[27802921.976174] Aborting journal on device sda1-8.
[27802921.981297] EXT4-fs (sda1): Remounting filesystem read-only
[27802921.981465] EXT4-fs error (device sda1): ext4_journal_check_start:56: Detected aborted journal
[27802921.981467] EXT4-fs (sda1): Remounting filesystem read-only
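
The first excerpt reports the failing sector in three places: the ATA result taskfile, the SCSI sense data, and the `end_request` line. The sketch below cross-checks them. It assumes the usual libata layout for the `res` line (status/error:nsect:lbal:lbam:lbah/.../device) and the standard READ(10) CDB layout (bytes 2-5 are the LBA, bytes 7-8 the transfer length); the helper names are illustrative only.

```python
def lba_from_bytes(hex_bytes: str) -> int:
    """Interpret space-separated hex bytes as a big-endian integer."""
    return int.from_bytes(bytes.fromhex(hex_bytes), "big")


def lba28_from_taskfile(lbal: int, lbam: int, lbah: int, device: int) -> int:
    """Assemble a 28-bit LBA; the top 4 bits live in the low nibble of device."""
    return ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal


# Last four bytes of the descriptor sense data: "0e c1 28 21"
sense_lba = lba_from_bytes("0e c1 28 21")

# From "res 51/40:07:21:28:c1/00:00:00:00:00/ee" (assumed field order, see above)
taskfile_lba = lba28_from_taskfile(lbal=0x21, lbam=0x28, lbah=0xC1, device=0xEE)

# From "Read(10): 28 00 0e c1 28 20 00 00 08 00"
cdb = bytes.fromhex("28 00 0e c1 28 20 00 00 08 00")
read_start = int.from_bytes(cdb[2:6], "big")   # first sector of the request
read_length = int.from_bytes(cdb[7:9], "big")  # number of sectors requested

# Position of the bad sector inside the 8-sector (4096-byte) read
offset_in_read = sense_lba - read_start

print(sense_lba, taskfile_lba, read_start, read_length, offset_in_read)
```

All three sources decode to sector 247539745, the second sector of the 8-sector read that starts at 247539744.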

Let's replace the disk and reimage the box. It has already been depooled.
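
For reference, a read-only remount like the one logged above can also be confirmed from the mount table. This is a minimal sketch, assuming input in the `/proc/mounts` format (device, mount point, filesystem type, comma-separated options). The function name and the sample text are hypothetical and were not captured from mw1083.

```python
from typing import List


def read_only_mounts(mounts_text: str) -> List[str]:
    """Return mount points whose option list contains the bare 'ro' flag."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        # Split on commas so that 'errors=remount-ro' is not mistaken for 'ro'.
        if "ro" in options.split(","):
            result.append(mount_point)
    return result


# Hypothetical sample in /proc/mounts format, for illustration only.
sample = (
    "/dev/sda1 / ext4 ro,relatime,errors=remount-ro 0 0\n"
    "proc /proc proc rw,nosuid,nodev,noexec 0 0\n"
    "/dev/sdb1 /srv ext4 rw,relatime,errors=remount-ro 0 0\n"
)
print(read_only_mounts(sample))
```

On a live host the same function could be fed the contents of `/proc/mounts`.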

Related Objects

Status    Subtype  Assigned  Task
Resolved           Joe
Resolved           Dzahn

Event Timeline

akosiaris raised the priority of this task from to Needs Triage.
akosiaris updated the task description.
akosiaris added a project: ops-eqiad.
akosiaris added a subscriber: akosiaris.
Restricted Application added a project: acl*sre-team. · Oct 21 2015, 3:42 PM
Restricted Application added a subscriber: Aklapper.
akosiaris triaged this task as Medium priority. · Oct 21 2015, 3:42 PM
akosiaris set Security to None.
Restricted Application added a subscriber: Matanya. · Oct 21 2015, 3:42 PM

Change 247859 had a related patch set uploaded (by Alexandros Kosiaris):
mw1083: depooled, remove from dsh

https://gerrit.wikimedia.org/r/247859

Change 247859 merged by Alexandros Kosiaris:
mw1083: depooled, remove from dsh

https://gerrit.wikimedia.org/r/247859

I will shut down this machine now so that it does not query the MySQL servers with an outdated configuration.

The physical disk has been replaced. The host requires a fresh install.

Change 251537 had a related patch set uploaded (by Dzahn):
mw1083: add back to dsh group

https://gerrit.wikimedia.org/r/251537

Change 251537 merged by Dzahn:
mw1083: add back to dsh group

https://gerrit.wikimedia.org/r/251537

Dzahn added a subscriber: Dzahn. · Nov 11 2015, 1:44 AM

looks like it has been reinstalled but is not pooled yet

Dzahn closed this task as Resolved. · Nov 11 2015, 1:49 AM
Dzahn claimed this task.

Repooled in pybal. It is getting traffic again.