
Disk #5 (count starts at #0) of db1111 has corrupted sectors
Closed, Resolved · Public

Description

root@db1111:/usr/local/sbin$ smartctl -a -d sat+megaraid,5 /dev/sda
...
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000e   130   130   039    Old_age   Always       -       535776880
  5 Reallocated_Sector_Ct   0x0033   099   099   001    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2081
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       29
 13 Read_Soft_Error_Rate    0x001e   090   088   000    Old_age   Always       -       136138114158192
170 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always       -       0
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       26
179 Used_Rsvd_Blk_Cnt_Tot   0x0033   100   100   010    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0032   100   100   000    Old_age   Always       -       17120
181 Program_Fail_Cnt_Total  0x003a   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x003a   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       24
195 Hardware_ECC_Recovered  0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1007
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
201 Unknown_SSD_Attribute   0x0033   100   100   010    Pre-fail  Always       -       193362999070
202 Unknown_SSD_Attribute   0x0027   100   100   000    Pre-fail  Always       -       0
225 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       7824
226 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       102400
227 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       0
228 Power-off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       360912880
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       7824
234 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       7824
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       34537
245 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       100
...

This is a new, just-procured host, so I would expect Current_Pending_Sector (currently 1007) to be 0. "Current_Pending_Sector value indicates how many of the hard disk's sectors can no longer be read and are waiting for re-mapping". Could you contact the hardware provider?
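For anyone who wants to run the same check across more disks, here is a minimal sketch that parses the attribute table from saved `smartctl -a` output and reports watched counters with a non-zero raw value. The function names and the choice of watched IDs (5, 197, 198) are assumptions made for illustration, not something used on this host; the sample rows are copied from the output above.

```python
# Sketch only: parse_smart_attributes / suspicious_attributes are hypothetical helpers.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   099   099   001    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1007
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""


def parse_smart_attributes(text):
    """Return {attribute_id: (name, raw_value)} for each attribute row in text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have 10 columns and start with a numeric ID;
        # the header row and any other lines are skipped.
        if len(fields) != 10 or not fields[0].isdigit():
            continue
        raw = fields[9].split()[0]
        if raw.isdigit():
            # Key by ID because names such as Unknown_Attribute repeat.
            attrs[int(fields[0])] = (fields[1], int(raw))
    return attrs


def suspicious_attributes(attrs, watched=(5, 197, 198)):
    """Return {name: raw_value} for watched attributes whose raw value is non-zero."""
    return {
        attrs[i][0]: attrs[i][1]
        for i in watched
        if i in attrs and attrs[i][1] > 0
    }


if __name__ == "__main__":
    print(suspicious_attributes(parse_smart_attributes(SAMPLE)))
```

Keying on the attribute ID rather than the name avoids collisions between the several `Unknown_Attribute` and `Unknown_SSD_Attribute` rows in this drive's table.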

Details

Related Gerrit Patches:
[operations/puppet@production] install_server: Allow reimage db1111

Event Timeline

jcrespo created this task. Feb 16 2018, 9:45 AM
Restricted Application added a project: Operations. Feb 16 2018, 9:45 AM
Restricted Application added a subscriber: Aklapper.
jcrespo added subscribers: Marostegui, Cmjohnson.
Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. Feb 16 2018, 3:47 PM

A ticket has been created with Dell.

You have successfully submitted request SR961176970.

MoritzMuehlenhoff triaged this task as Medium priority. Feb 26 2018, 11:25 AM

The SSD was replaced. @Marostegui, please confirm and resolve after the rebuild.

Return shipping information:
USPS 9202 3946 5301 2438 0714 10
FEDEX 9611918 2393026 74821432

@Cmjohnson, it looks like the storage crashed and the FS became read-only.
We are investigating why...

This is all we have from the HW logs:

/admin1/system1/logs1/log1-> show record3

	properties
		CreationTimestamp = 20180226161220.000000-360
		ElementName = System Event Log Entry
		RecordData = Drive 5 is removed from disk drive bay 1.
		RecordFormat = string Description
		RecordID = 2

Mentioned in SAL (#wikimedia-operations) [2018-02-26T16:33:43Z] <marostegui> Reboot db1111 storage crashed - T187526

Looks like the new disk has not been added automatically to the RAID.
I have been digging around the PERC menu, but it is too slow to be usable remotely, so @Cmjohnson may need to add the disk manually on site :(

Even then, the FS may not be in a good state; that will need to be checked once the server boots up normally.

Change 414958 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Allow reimage db1111

https://gerrit.wikimedia.org/r/414958

Change 414958 merged by Marostegui:
[operations/puppet@production] install_server: Allow reimage db1111

https://gerrit.wikimedia.org/r/414958

Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['db1111.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201802270745_marostegui_15652.log.

Completed auto-reimage of hosts:

['db1111.eqiad.wmnet']

Of which those FAILED:

['db1111.eqiad.wmnet']

Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['db1111.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201802270802_marostegui_19847.log.

Completed auto-reimage of hosts:

['db1111.eqiad.wmnet']

and were ALL successful.

Marostegui closed this task as Resolved. Feb 27 2018, 9:52 AM
Marostegui claimed this task.

The server was reimaged, all the data was transferred back from db1112, and it is now fully back online.