db1082 MySQL crashed
Closed, ResolvedPublic

Assigned To

Authored By

	Marostegui
	Feb 15 2017, 2:32 PM

Description

db1082 paged for replication lag, but replication had actually stopped.
The cause was that MySQL crashed:

170215 13:56:41 [ERROR] InnoDB: Tried to read 16384 bytes at offset 28488318976. Was only able to read 0.
2017-02-15 13:56:41 7fdf53d9d700  InnoDB: Operating system error number 5 in a file operation.
InnoDB: Error number 5 means 'Input/output error'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/operating-system-error-codes.html
170215 13:56:41 [ERROR] InnoDB: File (unknown): 'read' returned OS error 105. Cannot continue operation
170215 13:56:56 mysqld_safe Number of processes running now: 0
170215 13:56:56 mysqld_safe mysqld restarted

This happened right after the following kernel errors:

Feb 15 13:56:41 db1082 kernel: [11085714.884288] sd 0:1:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 15 13:56:41 db1082 kernel: [11085714.884426] sd 0:1:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
Feb 15 13:56:41 db1082 kernel: [11085714.884571] sd 0:1:0:0: [sda] tag#0 Add. Sense: Information unit iuCRC error detected
Feb 15 13:56:41 db1082 kernel: [11085714.884574] sd 0:1:0:0: [sda] tag#0 CDB: Read(16) 88 00 00 00 00 01 13 cd da a0 00 00 00 20 00 00
Feb 15 13:56:41 db1082 kernel: [11085714.884722] blk_update_request: I/O error, dev sda, sector 4627225248

This looks storage related.
There are no logs on the iLO.
The server is depooled.
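The kernel and InnoDB messages above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the original investigation: it decodes the Read(16) CDB from the kernel log (bytes 2-9 are the logical block address, bytes 10-13 the transfer length) and compares the request size with the 16384-byte read that InnoDB reported. The 512-byte block size is an assumption, although it is consistent with the decoded LBA matching the sector number printed by blk_update_request.

```python
# Sketch: decode the SCSI Read(16) CDB from the kernel log and compare it
# with the InnoDB read that failed. All input values are copied from the logs.

SECTOR_BYTES = 512          # assumption: 512-byte logical blocks
INNODB_READ_BYTES = 16384   # read size reported in the InnoDB error
INNODB_OFFSET = 28488318976 # file offset reported in the InnoDB error

cdb = bytes.fromhex("88 00 00 00 00 01 13 cd da a0 00 00 00 20 00 00")

opcode = cdb[0]                             # 0x88 is READ(16)
lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length

request_bytes = blocks * SECTOR_BYTES
innodb_page = INNODB_OFFSET // INNODB_READ_BYTES

print(f"opcode=0x{opcode:02x} lba={lba} blocks={blocks} bytes={request_bytes}")
print(f"InnoDB page index within the file: {innodb_page}")
```

The decoded LBA is 4627225248, the same sector the kernel reported, and the request covers 32 blocks, i.e. 16384 bytes, which is exactly the size of the read InnoDB could not complete. This does not prove which file the sector belongs to, since the file-to-disk mapping is not in the logs, but it fits the failed disk read being the direct cause of the crash.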

The plan is to:

Reboot the server
Check whether everything is OK and whether any logs are generated

Details

Subject	Repo	Branch	Lines +/-
db-eqiad.php: Restore db1082 original load	operations/mediawiki-config	master	+3 -4
db-eqiad.php: Increase load db1082	operations/mediawiki-config	master	+1 -1
Repool db1082 with low load after crash	operations/mediawiki-config	master	+1 -1


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Marostegui	T145533 Investigate db1082 crash
		Resolved		Marostegui	T158188 db1082 MySQL crashed

Event Timeline

Marostegui created this task. Feb 15 2017, 2:32 PM

Restricted Application added a subscriber: Aklapper. Feb 15 2017, 2:32 PM

Mentioned in SAL (#wikimedia-operations) [2017-02-15T14:34:46Z] <marostegui> Stop MySQL and shutdown db1082 - T158188

The server rebooted fine. It showed the following in dmesg, which I do not fully understand:

[   32.823256] hpsa 0000:08:00.0: Acknowledging event: 0xc0000000 (HP SSD Smart Path configuration change)
[   32.862418] hpsa 0000:08:00.0: scsi 0:1:0:0: updated Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap+ En+ Exp=1

The disks and the RAID controller all report OK:

root@db1082:~# hpssacli controller slot=1 pd all show status

   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 800 GB): OK
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 800 GB): OK
   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, 800 GB): OK
   physicaldrive 1I:1:4 (port 1I:box 1:bay 4, 800 GB): OK
   physicaldrive 1I:1:5 (port 1I:box 1:bay 5, 800 GB): OK
   physicaldrive 1I:1:6 (port 1I:box 1:bay 6, 800 GB): OK
   physicaldrive 1I:1:7 (port 1I:box 1:bay 7, 800 GB): OK
   physicaldrive 1I:1:8 (port 1I:box 1:bay 8, 800 GB): OK
   physicaldrive 2I:2:1 (port 2I:box 2:bay 1, 800 GB): OK
   physicaldrive 2I:2:2 (port 2I:box 2:bay 2, 800 GB): OK

root@db1082:~# hpssacli controller all show status

Smart Array P840 in Slot 1
   Controller Status: OK
   Cache Status: Not Configured
   Battery/Capacitor Status: OK
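A check like the one above can be scripted so that any drive not reporting OK stands out. This is only a sketch under stated assumptions: the line format is taken from the hpssacli output pasted above, the helper name failed_drives is hypothetical, and the "Failed" status in the usage example is an illustrative input rather than something seen on db1082.

```python
import re

# Sample copied from the hpssacli output above (shortened to three drives).
SAMPLE = """
   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 800 GB): OK
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 800 GB): OK
   physicaldrive 2I:2:2 (port 2I:box 2:bay 2, 800 GB): OK
"""

# Captures the drive id and the status text after the closing "):".
LINE_RE = re.compile(r"^\s*physicaldrive\s+(\S+)\s+\(.*\):\s*(.+?)\s*$")


def failed_drives(text):
    """Return (drive_id, status) pairs for every drive whose status is not OK."""
    bad = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(2) != "OK":
            bad.append((match.group(1), match.group(2)))
    return bad


print(failed_drives(SAMPLE))

# Hypothetical input, not from db1082: a drive reporting a non-OK status.
print(failed_drives("   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, 800 GB): Failed"))
```

As this incident shows, an all-OK report from the controller does not rule out an earlier I/O error, so a check of this kind complements the kernel log rather than replacing it.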

I have started MySQL and, as that went fine, I have also started replication.

Change 338037 had a related patch set uploaded (by Jcrespo):
Repool db1082 with low load after crash

https://gerrit.wikimedia.org/r/338037

gerritbot added a project: Patch-For-Review. Feb 16 2017, 12:53 AM

Change 338037 merged by Jcrespo:
Repool db1082 with low load after crash

https://gerrit.wikimedia.org/r/338037

Change 338067 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Increase load db1082

https://gerrit.wikimedia.org/r/338067

Change 338067 merged by jenkins-bot:
db-eqiad.php: Increase load db1082

https://gerrit.wikimedia.org/r/338067

Mentioned in SAL (#wikimedia-operations) [2017-02-16T07:43:12Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Increase load db1082 - T158188 (duration: 00m 42s)

Change 338078 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Restore db1082 original load

https://gerrit.wikimedia.org/r/338078

Change 338078 merged by jenkins-bot:
db-eqiad.php: Restore db1082 original load

https://gerrit.wikimedia.org/r/338078

Mentioned in SAL (#wikimedia-operations) [2017-02-16T08:44:54Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Restore origina db1082 weight - T158188 (duration: 00m 41s)

I will close this ticket after restoring the original weight for this server. I have also added a parent task, which covers the first crash this server had back in September (T145533). This will make it easier to keep track of any future crashes (hopefully there will be none).

Marostegui closed this task as Resolved. Feb 16 2017, 8:52 AM

Marostegui claimed this task.

Marostegui mentioned this in T178460: db1082 storage crashed. Oct 18 2017, 9:48 AM

Marostegui mentioned this in T258336: db1082 crashed. Jul 20 2020, 11:37 AM
