Page MenuHomePhabricator

Disk issue on db1028
Closed, ResolvedPublic

Description

There has been a write slowdown on db1028.eqiad.wmnet, as seen by extra replication lag. SMART alert related? We should investigate.

                Device Present
                ================
Virtual Drives    : 1 
  Degraded        : 0 
  Offline         : 0 
Physical Devices  : 14 
  Disks           : 12 
  Critical Disks  : 1 
  Failed Disks    : 0
PD: 1 Information
Enclosure Device ID: 32
Slot Number: 10
Drive's position: DiskGroup: 0, Span: 5, Arm: 1
Enclosure position: N/A
Device Id: 10
WWN: 5000C500324105C8
Sequence Number: 2
Media Error Count: 70715
Other Error Count: 17
Predictive Failure Count: 199
Last Predictive Failure Event Seq Number: 190747
PD Type: SAS

Raw Size: 279.396 GB [0x22ecb25c Sectors]
Non Coerced Size: 278.896 GB [0x22dcb25c Sectors]
Coerced Size: 278.875 GB [0x22dc0000 Sectors]
Sector Size:  0
Firmware state: Online, Spun Up
Device Firmware Level: ES64
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5000c500324105c9
SAS Address(1): 0x0
Connected Port Number: 0(path0) 
Inquiry Data: SEAGATE ST3300657SS     ES646SJ0FEYL            
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :33C (91.40 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Port-1 :
Port status: Active
Port's Linkspeed: Unknown 
Drive has flagged a S.M.A.R.T alert : Yes

Event Timeline

jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description. (Show Details)
jcrespo added projects: ops-eqiad, DBA.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Setting temporarily SET GLOBAL innodb_flush_log_at_trx_commit = 0; to help with replication.

Log is full of self-correcting notices of 32-10:

seqNum: 0x00036853
Time: Mon Jun 22 09:24:25 2015

Code: 0x0000006e
Class: 0
Locale: 0x02
Event Description: Corrected medium error during recovery on PD 0a(e0x20/s10) at 929fa60
Event Data:
===========
Device ID: 10
Enclosure Index: 32
Slot Number: 10
LBA: 153746016

The actual raid controller shows all virtual/physical disks as optimal/online.

Trying to poll the error log on the controller seems to result in blank output, or it simply takes a very long time to poll. Fear of slowing down the system even more gave me pause, so I'll share the command here so perhaps one of the folks handling the DBA load can better anticipate if it will cause an issue.

megacli -AdpEventLog -GetEvents -f events.log -aALL && cat events.log

@RobH This was detected after an unusual lag on replication.

As you can see on the events log I just generated:

db1028:/home/jynus/events.log

This has gone from 1 every seconds to 1 every 3 hours or more and lag is no longer an issue. I will keep this open for a week and/or leave a comment on icinga on the sql lag service, and close it if it gets confirmed as a non-issue.

Setting SET GLOBAL innodb_flush_log_at_trx_commit = 1; to go back to the status quo.

jcrespo moved this task from Triage to In progress on the DBA board.

Created again:

db1028:/home/jynus/events_again.log

There is definitely some issue there, but it is not affecting the normal performance, so I will set this to resolved.

This issue happened again, again the same offender (lots of log entries at 8:46 today):

Code: 0x0000005d
Class: 0
Locale: 0x02
Event Description: Patrol Read corrected medium error on PD 0a(e0x20/s10) at e4cbcea
Event Data:
===========
Device ID: 10
Enclosure Index: 32
Slot Number: 10
LBA: 239910122


seqNum: 0x00056865
Time: Sat Aug 29 08:46:54 2015

Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 0a(e0x20/s10) Path 5000c500324105c9, CDB: 2f 00 0e 4c bc eb 00 10 00 00, Sense: 3/11/00
Event Data:
===========
Device ID: 10
Enclosure Index: 32
Slot Number: 10
CDB Length: 10
CDB Data:
002f 0000 000e 004c 00bc 00eb 0000 0010 0000 0000 0000 0000 0000 0000 0000 0000 Sense Length: 18
Sense Data:
00f0 0000 0003 000e 004c 00bc 00eb 000a 0000 0000 0000 0000 0011 0000 0081 0080 0000 0097 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 

seqNum: 0x00056866
Time: Sat Aug 29 08:46:54 2015

Code: 0x0000005d
Class: 0
Locale: 0x02
Event Description: Patrol Read corrected medium error on PD 0a(e0x20/s10) at e4cbceb
Event Data:
===========
Device ID: 10
Enclosure Index: 32
Slot Number: 10
LBA: 239910123

This disk is not in a good state, and it makes it worse (lag) by not failing completely.

@Cmjohnson This one uses 279.396 GB disks, Can we replace the above disk with one of the spares?

swapped disk with one from a recently decom'd server....in rebuild state

nclosure Device ID: 32
Slot Number: 10
Drive's position: DiskGroup: 0, Span: 5, Arm: 1
Enclosure position: N/A
Device Id: 10
WWN: 5000C50028EA0C30
Sequence Number: 11
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS

Raw Size: 279.396 GB [0x22ecb25c Sectors]
Non Coerced Size: 278.896 GB [0x22dcb25c Sectors]
Coerced Size: 278.875 GB [0x22dc0000 Sectors]
Sector Size: 0
Firmware state: Rebuild
Device Firmware Level: ES64
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c50028ea0c31
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3300657SS ES643SJ3EVH9
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :23C (73.40 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: Unknown
Drive has flagged a S.M.A.R.T alert : No

Firmware state: Online, Spun Up

After the rebuild, no extra strange logs:

Time: Wed Sep  2 19:14:41 2015

Code: 0x00000051
Class: 0
Locale: 0x01
Event Description: State change on VD 00/0 from DEGRADED(2) to OPTIMAL(3)
Event Data:
===========
Target Id: 0
Previous state: 2
New state: 3


seqNum: 0x00058d29
Time: Wed Sep  2 19:14:41 2015

Code: 0x000000f9
Class: 0
Locale: 0x01
Event Description: VD 00/0 is now OPTIMAL
Event Data:
===========
Target Id: 0

And after the rebuild, lag had a maximum of 1 seconds. Repooling the node and resolving this task.

Change 235698 had a related patch set uploaded (by Jcrespo):
Repool db1028 after maintenance

https://gerrit.wikimedia.org/r/235698

Change 235698 merged by Jcrespo:
Repool db1028 after maintenance

https://gerrit.wikimedia.org/r/235698