Page MenuHomePhabricator

db2197 rebooted itself
Closed, ResolvedPublic

Description

[06:45:15]  <+icinga-wm> PROBLEM - mysqld processes on db2197 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[06:45:19]  <+icinga-wm> PROBLEM - MariaDB Replica IO: x1 on db2197 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:45:19]  <+icinga-wm> PROBLEM - MariaDB Replica SQL: s2 on db2197 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:45:19]  <+icinga-wm> PROBLEM - MariaDB Replica IO: s2 on db2197 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:45:19]  <+icinga-wm> PROBLEM - MariaDB Replica IO: s6 on db2197 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:45:19]  <+icinga-wm> PROBLEM - MariaDB Replica SQL: s6 on db2197 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:45:20]  <+icinga-wm> PROBLEM - MariaDB Replica SQL: x1 on db2197 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:45:27]  <+icinga-wm> PROBLEM - MariaDB read only x1 on db2197 is CRITICAL: Could not connect to localhost:3320 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[06:45:27]  <+icinga-wm> PROBLEM - MariaDB read only s6 on db2197 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[06:45:27]  <+icinga-wm> PROBLEM - MariaDB read only s2 on db2197 is CRITICAL: Could not connect to localhost:3312 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
root@db2197:~# w
 05:01:03 up 12:12,  1 user,  load average: 0.11, 0.10, 0.09
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
marosteg pts/0    208.80.153.110   05:01    3.00s  0.08s  0.04s sshd: marostegui [priv]

A quick look at getsel show nothing on the logs.

Related Objects

Event Timeline

Icinga downtime and Alertmanager silence (ID=6a8f12d5-f08e-4ee3-8f41-beb71f97524a) set by marostegui@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Long schema change

db2197.codfw.wmnet

The downtime message above is wrong, it was downtimed cause it crashed :)

Yeah, That I think is sorta my fault. Papaul was asking if he can reboot it for some maintenance late in the day your time. I checked and saw it was a backup source and gave the greenlight. I assume this is done?

I didn't bring anything back up as I wasn't aware of what was going on

Ah, my bad.My brain assumed when the host gets rebooted, service comes online automatically. I keep forgetting that's the not the case with mariadb. Is that okay if I start mariadb? or do you want to check something beforehand?

My apologies.

If it was stopped correctly it should be fine to start again and resume replication too

@Ladsgroup Can you please also fix and clean up the backups that failed because of it?

https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-sre/20240621.txt

image.png (912×2 px, 251 KB)

(the 3 instances need recovering from backups as they were shutdown uncleanly) After checking they may have been stopped cleanly, but may need 100% confirmation on instance start.

They were stopped cleanly:

-- Boot d724c2c02dac4c488327bf49465f6aa1 --
Jun 21 16:41:29 db2197 systemd[1]: mariadb@s6.service: Consumed 1w 2d 3h 2min 56.123s CPU time.
Jun 21 16:41:29 db2197 systemd[1]: Stopped mariadb@s6.service - mariadb database server.
Jun 21 16:41:29 db2197 systemd[1]: mariadb@s6.service: Deactivated successfully.
Jun 21 16:41:28 db2197 mysqld[773472]: 2024-06-21 16:41:28 0 [Note] /opt/wmf-mariadb106/bin/mysqld: Shutdown complete
Jun 21 16:41:28 db2197 mysqld[773472]: 2024-06-21 16:41:28 0 [Note] InnoDB: Shutdown completed; log sequence number <redacted>; transaction id <redacted>
Jun 21 16:41:28 db2197 mysqld[773472]: 2024-06-21 16:41:28 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
Jun 21 16:40:58 db2197 mysqld[773472]: 2024-06-21 16:40:58 0 [Note] InnoDB: Buffer pool(s) dump completed at 240621 16:40:58
Jun 21 16:40:57 db2197 mysqld[773472]: 2024-06-21 16:40:57 0 [Note] InnoDB: Restricted to 2676960 pages due to innodb_buf_pool_dump_pct=25
Jun 21 16:40:57 db2197 mysqld[773472]: 2024-06-21 16:40:57 0 [Note] InnoDB: Dumping buffer pool(s) to /srv/sqldata.s6/ib_buffer_pool
Jun 21 16:40:57 db2197 mysqld[773472]: 2024-06-21 16:40:57 0 [Note] InnoDB: Starting shutdown...
Jun 21 16:40:57 db2197 mysqld[773472]: 2024-06-21 16:40:57 0 [Note] InnoDB: FTS optimize thread exiting.
Jun 21 16:40:57 db2197 mysqld[773472]: 2024-06-21 16:40:57 7083681 [Note] Slave I/O thread exiting, read up to log 'db2129-bin.004766', position 541213342; GTID position 171974883-171974883-1921892293,0-1803591>
Jun 21 16:40:57 db2197 mysqld[773472]: 2024-06-21 16:40:57 7083682 [Note] Slave SQL thread exiting, replication stopped in log 'db2129-bin.004766' at position 541213342; GTID position '0-180359184-3070029963,17>
Jun 21 16:40:57 db2197 mysqld[773472]: 2024-06-21 16:40:57 7083682 [Note] Error reading relay log event: slave SQL thread was killed
Jun 21 16:40:57 db2197 mysqld[773472]: 2024-06-21 16:40:57 0 [Note] Event Scheduler: Purging the queue. 4 events
Jun 21 16:40:57 db2197 mysqld[773472]: 2024-06-21 16:40:57 0 [Note] Event Scheduler: Stopped
Jun 21 16:40:57 db2197 mysqld[773472]: 2024-06-21 16:40:57 0 [Note] Event Scheduler: Waiting for the scheduler thread to reply
Jun 21 16:40:57 db2197 mysqld[773472]: 2024-06-21 16:40:57 0 [Note] Event Scheduler: Killing the scheduler thread, thread id 1
Jun 21 16:40:57 db2197 mysqld[773472]: 2024-06-21 16:40:57 0 [Note] /opt/wmf-mariadb106/bin/mysqld (initiated by: unknown): Normal shutdown
Jun 21 16:40:57 db2197 systemd[1]: Stopping mariadb@s6.service - mariadb database server...

They are back online and catching up.

jcrespo claimed this task.
jcrespo reassigned this task from jcrespo to Ladsgroup.