
db1082 MySQL crashed
Closed, Resolved (Public)

Description

db1082 paged for replication lag, which turned out to be replication being stopped.
The reason was that MySQL crashed:

170215 13:56:41 [ERROR] InnoDB: Tried to read 16384 bytes at offset 28488318976. Was only able to read 0.
2017-02-15 13:56:41 7fdf53d9d700  InnoDB: Operating system error number 5 in a file operation.
InnoDB: Error number 5 means 'Input/output error'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/operating-system-error-codes.html
170215 13:56:41 [ERROR] InnoDB: File (unknown): 'read' returned OS error 105. Cannot continue operation
170215 13:56:56 mysqld_safe Number of processes running now: 0
170215 13:56:56 mysqld_safe mysqld restarted

This happened shortly after the kernel logged the following errors:

Feb 15 13:56:41 db1082 kernel: [11085714.884288] sd 0:1:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 15 13:56:41 db1082 kernel: [11085714.884426] sd 0:1:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
Feb 15 13:56:41 db1082 kernel: [11085714.884571] sd 0:1:0:0: [sda] tag#0 Add. Sense: Information unit iuCRC error detected
Feb 15 13:56:41 db1082 kernel: [11085714.884574] sd 0:1:0:0: [sda] tag#0 CDB: Read(16) 88 00 00 00 00 01 13 cd da a0 00 00 00 20 00 00
Feb 15 13:56:41 db1082 kernel: [11085714.884722] blk_update_request: I/O error, dev sda, sector 4627225248

This looks storage related.
There are no logs on the iLO.
The server is depooled.
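
As a cross-check of the two log excerpts above, the following minimal sketch decodes the `Read(16)` CDB from the kernel log and compares it with the numbers InnoDB reported. It assumes the standard READ(16) layout (opcode `0x88`, 8-byte LBA, 4-byte transfer length) and 512-byte logical blocks; the function and constant names are only illustrative.

```python
# Bytes copied from the kernel "CDB: Read(16)" line above.
CDB_HEX = "88 00 00 00 00 01 13 cd da a0 00 00 00 20 00 00"

# Assumption: 512-byte logical blocks (the unit the kernel uses for "sector").
BLOCK_SIZE = 512


def decode_read16(cdb_hex: str) -> tuple[int, int]:
    """Return (lba, block_count) from a SCSI READ(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    return lba, blocks


lba, blocks = decode_read16(CDB_HEX)
print(lba)                  # 4627225248, the sector in the blk_update_request line
print(blocks * BLOCK_SIZE)  # 16384, the read size InnoDB reported

# InnoDB offset from the error log, expressed as a 16 KiB page number.
page_no, remainder = divmod(28488318976, 16384)
print(page_no, remainder)   # 1738789 0
```

The failed request covers 32 blocks, i.e. 16384 bytes under the 512-byte assumption, which is consistent with the single page-aligned 16 KiB read that InnoDB could not complete. It does not by itself prove which file the sector belongs to.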

The plan is to:

  • Reboot the server
  • Check if everything is ok and/or logs are generated

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2017-02-15T14:34:46Z] <marostegui> Stop MySQL and shutdown db1082 - T158188

The server rebooted fine. It showed the following in dmesg, and I am not entirely sure what it means:

[   32.823256] hpsa 0000:08:00.0: Acknowledging event: 0xc0000000 (HP SSD Smart Path configuration change)
[   32.862418] hpsa 0000:08:00.0: scsi 0:1:0:0: updated Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap+ En+ Exp=1

The disks and the RAID controller all report OK:

root@db1082:~# hpssacli controller slot=1 pd all show status

   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 800 GB): OK
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 800 GB): OK
   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, 800 GB): OK
   physicaldrive 1I:1:4 (port 1I:box 1:bay 4, 800 GB): OK
   physicaldrive 1I:1:5 (port 1I:box 1:bay 5, 800 GB): OK
   physicaldrive 1I:1:6 (port 1I:box 1:bay 6, 800 GB): OK
   physicaldrive 1I:1:7 (port 1I:box 1:bay 7, 800 GB): OK
   physicaldrive 1I:1:8 (port 1I:box 1:bay 8, 800 GB): OK
   physicaldrive 2I:2:1 (port 2I:box 2:bay 1, 800 GB): OK
   physicaldrive 2I:2:2 (port 2I:box 2:bay 2, 800 GB): OK

root@db1082:~# hpssacli controller all show status

Smart Array P840 in Slot 1
   Controller Status: OK
   Cache Status: Not Configured
   Battery/Capacitor Status: OK
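
For repeated checks, the drive listing above could be verified mechanically. This is only a sketch: it assumes the output keeps the `physicaldrive <id> (...): <status>` shape shown above, and `drives_not_ok` is a hypothetical helper name, not part of any HP tooling.

```python
import re

# Two lines copied from the "pd all show status" output above.
SAMPLE = """
   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 800 GB): OK
   physicaldrive 2I:2:2 (port 2I:box 2:bay 2, 800 GB): OK
"""

# Assumption: every drive line looks like "physicaldrive <id> (...): <status>".
LINE_RE = re.compile(r"^\s*physicaldrive\s+(\S+)\s+\(.*\):\s*(.+?)\s*$")


def drives_not_ok(output: str) -> dict[str, str]:
    """Map drive id to status for every drive whose status is not 'OK'."""
    bad = {}
    for line in output.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(2) != "OK":
            bad[match.group(1)] = match.group(2)
    return bad


print(drives_not_ok(SAMPLE))  # {}
```

An empty result means every parsed drive reported OK; lines that do not match the assumed shape are ignored rather than flagged.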

I have started MySQL and, as that went fine, I have also started replication.

Change 338037 had a related patch set uploaded (by Jcrespo):
Repool db1082 with low load after crash

https://gerrit.wikimedia.org/r/338037

Change 338037 merged by Jcrespo:
Repool db1082 with low load after crash

https://gerrit.wikimedia.org/r/338037

Change 338067 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Increase load db1082

https://gerrit.wikimedia.org/r/338067

Change 338067 merged by jenkins-bot:
db-eqiad.php: Increase load db1082

https://gerrit.wikimedia.org/r/338067

Mentioned in SAL (#wikimedia-operations) [2017-02-16T07:43:12Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Increase load db1082 - T158188 (duration: 00m 42s)

Change 338078 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Restore db1082 original load

https://gerrit.wikimedia.org/r/338078

Change 338078 merged by jenkins-bot:
db-eqiad.php: Restore db1082 original load

https://gerrit.wikimedia.org/r/338078

Mentioned in SAL (#wikimedia-operations) [2017-02-16T08:44:54Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Restore origina db1082 weight - T158188 (duration: 00m 41s)

I will close this ticket after restoring the original weight for this server. I have also added a parent task, T145533, which covers the first crash this server had back in September. This way it will be easier to keep track of any future crashes (hopefully there will be none).

Marostegui claimed this task.