db2057 storage crashed
Closed, Resolved (Public)

Description

db2057 has crashed:

03:47 <+icinga-wm> PROBLEM - mysqld processes on db2057 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
03:47 <+icinga-wm> PROBLEM - Disk space on db2057 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error
03:47 <+icinga-wm> PROBLEM - Check systemd state on db2057 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
03:47 <+icinga-wm> PROBLEM - MariaDB disk space on db2057 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error
03:47 <+icinga-wm> PROBLEM - MariaDB Slave SQL: s3 on db2057 is CRITICAL: CRITICAL slave_sql_state could not connect
03:48 <+icinga-wm> PROBLEM - MariaDB Slave IO: s3 on db2057 is CRITICAL: CRITICAL slave_io_state could not connect

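The storage crashed. dmesg cannot be run on the host, and the hardware event log reports a drive array controller failure: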

root@db2057:~# dmesg
-bash: /bin/dmesg: Input/output error
/system1/log1/record5
  Targets
  Properties
    number=5
    severity=Critical
    date=12/19/2018
    time=03:20
    description=Drive Array Controller Failure (Slot 0)
  Verbs

Event Timeline

Marostegui triaged this task as Medium priority. Dec 19 2018, 6:16 AM
Marostegui moved this task from Triage to In progress on the DBA board.

Change 480698 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool db2057

https://gerrit.wikimedia.org/r/480698

Change 480699 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2057: Disable notifications

https://gerrit.wikimedia.org/r/480699

Change 480699 merged by Marostegui:
[operations/puppet@production] db2057: Disable notifications

https://gerrit.wikimedia.org/r/480699

Change 480698 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool db2057

https://gerrit.wikimedia.org/r/480698

Mentioned in SAL (#wikimedia-operations) [2018-12-19T06:27:14Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Depool db2057 - storage crashed T212275 (duration: 01m 08s)

Change 480700 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Allow reimage db2057

https://gerrit.wikimedia.org/r/480700

Change 480700 merged by Marostegui:
[operations/puppet@production] install_server: Allow reimage db2057

https://gerrit.wikimedia.org/r/480700

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db2057.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201812190655_marostegui_161799.log.

Change 480701 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool db2050

https://gerrit.wikimedia.org/r/480701

Change 480701 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool db2050

https://gerrit.wikimedia.org/r/480701

Mentioned in SAL (#wikimedia-operations) [2018-12-19T07:15:08Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Depool db2050 to clone db2057 T212275 (duration: 00m 52s)

Completed auto-reimage of hosts:

['db2057.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2018-12-19T07:22:33Z] <marostegui> Stop MySQL on db2050 to clone db2057 - T212275

db2057 has been reimaged and recloned.

Mentioned in SAL (#wikimedia-operations) [2018-12-19T11:25:59Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Repool db2050 after recloning db2057 T212275 (duration: 00m 52s)
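
The SAL timestamps above can be turned into a rough recovery timeline. The following is a minimal sketch, assuming only the four SAL entries quoted in this task; the names `SAL_EVENTS`, `parse_utc` and `elapsed_since_first` are hypothetical and exist only for illustration. It measures time elapsed since the first depool sync, not since the controller failure itself.

```python
from datetime import datetime

# Timestamps (UTC) and short labels copied from the SAL entries in this task.
# The labels are abbreviated by hand; nothing here is fetched from SAL.
SAL_EVENTS = [
    ("2018-12-19T06:27:14Z", "Depool db2057"),
    ("2018-12-19T07:15:08Z", "Depool db2050 to clone db2057"),
    ("2018-12-19T07:22:33Z", "Stop MySQL on db2050 to clone db2057"),
    ("2018-12-19T11:25:59Z", "Repool db2050 after recloning db2057"),
]


def parse_utc(stamp):
    """Parse a SAL-style timestamp such as 2018-12-19T06:27:14Z."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")


def elapsed_since_first(events):
    """Return (label, timedelta since the first event) for each event."""
    start = parse_utc(events[0][0])
    return [(label, parse_utc(stamp) - start) for stamp, label in events]


if __name__ == "__main__":
    for label, delta in elapsed_since_first(SAL_EVENTS):
        print(f"+{delta}  {label}")
```

Under these assumptions, the last entry gives the interval between the first depool sync and the repool of db2050.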