
db1247 crash or restart - 15:29 on 2025-05-07
Closed, ResolvedPublic

Description

db1247 crashed at ~15:29. As expected, it returned in read-only mode:

08:29:55 <+icinga-wm> PROBLEM - Host db1247 #page is DOWN: PING CRITICAL - Packet loss = 100%
08:32:13 <+icinga-wm> RECOVERY - Host db1247 #page is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
08:33:41 <+icinga-wm> PROBLEM - mysqld processes #page on db1247 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
08:33:41 <+icinga-wm> PROBLEM - MariaDB Replica IO: s4 #page on db1247 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
08:33:48 <+icinga-wm> PROBLEM - MariaDB read only s4 on db1247 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
08:34:07 <+icinga-wm> PROBLEM - MariaDB Replica SQL: s4 #page on db1247 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica

It was depooled by @CDanis at 15:32.

Event Timeline

Icinga downtime and Alertmanager silence (ID=e3d5979e-507b-425a-a630-61f52b855331) set by swfrench@cumin2002 for 2 days, 0:00:00 on 1 host(s) and their services with reason: Host has crashed - T393612

db1247.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-05-07T15:42:59Z] <swfrench@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1247.eqiad.wmnet with reason: Host has crashed - T393612

May 07 15:28:19 db1247 systemd-logind[1119]: Power key pressed short.

Scott_French renamed this task from "db1247 crash - 15:29 on 2025-05-07" to "db1247 crash or restart - 15:29 on 2025-05-07". May 7 2025, 3:43 PM

FYI, the downtime I've applied is only 2 days, on the suspicion that the host is fine (e.g., it just needs a clean bill of health before being returned to service). This may need to be extended if that's not the case.

2025-05-07 16:26:48  SYS1005  The server power action is initiated because the management controller initiated a power-down operation.
2025-05-07 16:26:43  RAC1195  User root via IP 10.64.48.98 requested state / configuration change to Power Control using GUI.
2025-05-07 16:26:43  RAC0702  Requested system powercycle.
2025-05-07 16:23:35  USR0030  Successfully logged in using root, from 10.64.48.98 and GUI.

Icinga downtime and Alertmanager silence (ID=7dfebe9e-7bcf-4f08-a0b7-e78fb3399da3) set by swfrench@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Host has crashed - T393612

db1247.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-05-08T21:35:01Z] <swfrench@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1247.eqiad.wmnet with reason: Host has crashed - T393612

I've extended the downtime to 7 days (from now), as it's unlikely this host will be returned to service before the original one would have expired tomorrow.

Change #1143992 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1247: Disable notifications

https://gerrit.wikimedia.org/r/1143992

Change #1143992 merged by Marostegui:

[operations/puppet@production] db1247: Disable notifications

https://gerrit.wikimedia.org/r/1143992

FCeratto-WMF triaged this task as High priority.

Started cloning db1238.eqiad.wmnet to db1247.eqiad.wmnet - fceratto@cumin1002

Completed depool of db1238 - Depool db1238.eqiad.wmnet to then clone it to db1247.eqiad.wmnet - fceratto@cumin1002

Started cloning db1238.eqiad.wmnet to db1247.eqiad.wmnet - fceratto@cumin1002

Start pool of db1238 gradually with 4 steps - Pool db1238.eqiad.wmnet in after cloning - fceratto@cumin1002

Completed pool of db1238 gradually with 4 steps - Pool db1238.eqiad.wmnet in after cloning - fceratto@cumin1002

Finished cloning db1238.eqiad.wmnet to db1247.eqiad.wmnet - fceratto@cumin1002

Icinga downtime and Alertmanager silence (ID=d1a7fcac-b844-48c3-a182-fc858c38b901) set by fceratto@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: To be set up in a few days

db1247.eqiad.wmnet

db1247 looks healthy: nothing strange on Grafana, no significant errors in dmesg, uptime 12 days, Icinga is green.
Removing the downtime in Icinga.

Change #1148291 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] db1247.yaml: Enabling notifications after cloning

https://gerrit.wikimedia.org/r/1148291

Change #1148291 merged by Federico Ceratto:

[operations/puppet@production] db1247.yaml: Enabling notifications after cloning

https://gerrit.wikimedia.org/r/1148291

CR puppet-merged. Pooling in.

The notifications still show as disabled, preventing pool-in:

alert1002:~$ /usr/local/bin/icinga-status -j "db1247"
{"db1247": {"name": "db1247", "state": "UP", "optimal": true, "downtimed": false, "notifications_enabled": false, "failed_services": []}}

They need a puppet run on the host, and then a puppet run on the Icinga host.
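The gating check above can be sketched as a small parser of the `icinga-status -j` output. The JSON shape is taken from the snippet above; treating a host as poolable only when it is UP, not downtimed, with notifications enabled and no failed services is an illustrative assumption, not the actual pool-in policy:

```python
import json

def poolable(icinga_json: str, host: str) -> bool:
    """Return True if the host looks ready to pool, based on icinga-status JSON.

    The criteria (UP, not downtimed, notifications enabled, no failed
    services) are an assumption for illustration only.
    """
    status = json.loads(icinga_json)[host]
    return (
        status["state"] == "UP"
        and not status["downtimed"]
        and status["notifications_enabled"]
        and not status["failed_services"]
    )

# The output captured above: notifications still disabled, so pool-in is blocked.
raw = ('{"db1247": {"name": "db1247", "state": "UP", "optimal": true, '
       '"downtimed": false, "notifications_enabled": false, '
       '"failed_services": []}}')
print(poolable(raw, "db1247"))  # → False
```

Once both puppet runs have propagated the notification change, the same check flips to True and pooling can proceed.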

Start pool of db1247* gradually with 4 steps - Pooling in after cloning - fceratto@cumin1002

Completed pool of db1247* gradually with 4 steps - Pooling in after cloning - fceratto@cumin1002

Pooling in completed.