
dbstore2002 stopped providing mysql service despite the process being running
Closed, Resolved · Public

Description

These alerts suddenly popped up:

18:30 < icinga-wm> PROBLEM - MariaDB Slave IO: m3 on dbstore2002 is CRITICAL: CRITICAL slave_io_state could not connect
18:30 < icinga-wm> PROBLEM - MariaDB Slave IO: s1 on dbstore2002 is CRITICAL: CRITICAL slave_io_state could not connect
18:31 < icinga-wm> PROBLEM - MariaDB Slave SQL: s4 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state could not connect
18:31 < icinga-wm> PROBLEM - MariaDB Slave IO: s3 on dbstore2002 is CRITICAL: CRITICAL slave_io_state could not connect
18:31 < icinga-wm> PROBLEM - MariaDB Slave SQL: s5 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state could not connect
18:31 < icinga-wm> PROBLEM - MariaDB Slave SQL: s6 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state could not connect
18:31 < icinga-wm> PROBLEM - MariaDB Slave IO: s6 on dbstore2002 is CRITICAL: CRITICAL slave_io_state could not connect
18:31 < icinga-wm> PROBLEM - MariaDB Slave SQL: m2 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state could not connect
18:31 < icinga-wm> PROBLEM - MariaDB Slave IO: s7 on dbstore2002 is CRITICAL: CRITICAL slave_io_state could not connect
18:31 < icinga-wm> PROBLEM - MariaDB Slave SQL: x1 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state could not connect
18:31 < icinga-wm> PROBLEM - MariaDB Slave SQL: m3 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state could not connect
18:31 < icinga-wm> PROBLEM - MariaDB Slave IO: x1 on dbstore2002 is CRITICAL: CRITICAL slave_io_state could not connect
18:32 < icinga-wm> PROBLEM - MariaDB Slave IO: m2 on dbstore2002 is CRITICAL: CRITICAL slave_io_state could not connect
18:32 < icinga-wm> PROBLEM - MariaDB Slave SQL: s1 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state could not connect
18:32 < icinga-wm> PROBLEM - MariaDB Slave SQL: s2 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state could not connect
18:32 < icinga-wm> PROBLEM - MariaDB Slave SQL: s3 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state could not connect
18:32 < icinga-wm> PROBLEM - MariaDB Slave IO: s2 on dbstore2002 is CRITICAL: CRITICAL slave_io_state could not connect
18:33 < icinga-wm> PROBLEM - MariaDB Slave SQL: s7 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state could not connect
18:33 < icinga-wm> PROBLEM - MariaDB Slave IO: s4 on dbstore2002 is CRITICAL: CRITICAL slave_io_state could not connect
18:33 < icinga-wm> PROBLEM - MariaDB Slave IO: s5 on dbstore2002 is CRITICAL: CRITICAL slave_io_state could not connect

But dbstore2002 was up and running:

18:33 < mutante> that may look scary but it's all the same server, dbstore2002 and that is up with mysql running and it has --skip-slave-start
18:33 < mutante> i guess that can explain it. but doesnt mean i know why
18:33 < mutante> jynus: 
18:37 < mutante> i would call if it was the actual db servers, but with just one of the dbstores i think not
root@dbstore2002:~# ps aux | grep maria
root      3427  0.0  0.0   4440   656 ?        S    Mar18   0:00 /bin/sh /opt/wmf-mariadb10/bin/mysqld_safe --datadir=/srv/sqldata --pid-file=/srv/sqldata/dbstore2002.pid --skip-slave-start
root      4220  0.0  0.0  11868   924 pts/2    S+   01:35   0:00 grep maria
mysql    40350  212 51.0 146328192 67414708 ?  Sl   Mar18 429639:04 /opt/wmf-mariadb10/bin/mysqld --basedir=/opt/wmf-mariadb10 --datadir=/srv/sqldata --plugin-dir=/opt/wmf-mariadb10/lib/plugin --user=mysql --skip-slave-start --log-error=/srv/sqldata/dbstore2002.err --open-files-limit=400000 --pid-file=/srv/sqldata/dbstore2002.pid --socket=/tmp/mysql.sock --port=3306

Monitoring stopped collecting information at that time, too:

Screenshot from 2016-08-06 11-39-15.png (643×899 px, 79 KB)

Edit: (--skip-slave-start is normal; it is set like that whenever the server is started manually)

Event Timeline

Dzahn added a subscriber: jcrespo.

First of all, thanks for attending to it. It didn't page, which means it was not urgent, both because it is a dbstore and because it is not at the primary datacenter, so you made a good call. If you had called me, however, it would not have been a problem either; you wouldn't have woken me up.

Look at the error, though: CRITICAL slave_io_state could not connect. It doesn't say the IO/SQL thread is stopped, it says the check could not connect. That usually means the server is down, or up but the query timed out. For example, it happens when a server crashes: the process comes back up, but the database is still recovering and not yet accepting new connections. However, checking the uptime, the server itself didn't crash.
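
To make the distinction concrete, here is a minimal sketch of what such a check amounts to, run locally on the host (the timeout value and the use of the local socket are assumptions, not the actual Icinga plugin configuration):

# Connect with a short client timeout and inspect the replication threads
# for one of the multi-source connections (s1 as an example).
mysql --connect-timeout=10 --socket=/tmp/mysql.sock \
      -e "SHOW SLAVE 's1' STATUS\G" | grep -E 'Slave_(IO|SQL)_Running'
# "could not connect" corresponds to the client failing or timing out here,
# rather than the output showing Slave_IO_Running: No.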

Another case where I have seen this happen is when there is a long-running write transaction and STOP SLAVE is executed. In that case, SHOW SLAVE STATUS queries fail because they have to wait for the transaction to finish. I have only seen this happen on TokuDB hosts (T109069#2518719), which we are getting rid of.
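
If it happens again, a sketch like the following, run on the host itself, should show the pile-up (the socket path comes from the ps output below; the grep patterns are only illustrative):

mysql --socket=/tmp/mysql.sock -e "SHOW FULL PROCESSLIST" \
      | grep -iE 'stop slave|show slave status|waiting'
# A STOP SLAVE stuck waiting on a long write transaction, with SHOW SLAVE
# STATUS queries queued behind it, would match the behaviour described above.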

I do not see any special event in the log at that time; however, it seems that not only SHOW SLAVE STATUS but also our monitoring stopped answering between 1:33 and 2:13. With that in mind, I have three potential explanations:

  • Hardware glitch that made the system unavailable
  • Network glitch that made that specific port unavailable
  • Software glitch (if I had to put my money somewhere, it would be on TokuDB, as it has shown itself to be very unstable for the main s* replication; it is OK for eventlogging)

Those dbstore hosts have a lot on their plate (replicating from 8+ hosts, plus eventlogging), but in general they are not critical for the operation.

--skip-slave-start has nothing to do with this issue; when a server is started manually (for example, after an upgrade), that option is always used because we cannot restart replication until we execute mysql_upgrade. So we start the server with replication stopped, then start the slave manually with START SLAVE. This cannot cause any issue because, in the event of a crash/restart, the option is not set by default and replication starts automatically. To give you an idea of how common this is: all servers, including those in production, will have it among the run options, except those that have crashed or restarted after an upgrade.
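
For reference, the manual start-after-upgrade sequence described above looks roughly like this (a sketch; the paths come from the ps output earlier, and the exact location of mysql_upgrade is an assumption):

# Start mysqld with replication disabled, run mysql_upgrade, then resume
# replication for all multi-source connections.
/opt/wmf-mariadb10/bin/mysqld_safe --datadir=/srv/sqldata --skip-slave-start &
/opt/wmf-mariadb10/bin/mysql_upgrade --socket=/tmp/mysql.sock
mysql --socket=/tmp/mysql.sock -e "START ALL SLAVES"
# After a crash or unattended restart the flag is absent, so replication
# starts on its own.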

Again, thanks for responding. I have one last question: according to the logs, mysql started responding a few minutes later. Did you run any command that could have caused that, or did it start providing service again on its own?

jcrespo renamed this task from dbstore2002 , --skip-slave-start and Icinga alerts slave_sql_lag to dbstore2002 stopped providing mysql service despite the process being running. Aug 6 2016, 9:41 AM
jcrespo updated the task description.

The fact that there was no slave lag after the incident probably means the mysql server itself was up and replicating without problems (not frozen) and that this was a mere query pile-up, which points more and more strongly towards TokuDB (it has done strange things like that in the past).
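
For completeness, lag on all the multi-source connections can be eyeballed with something like this (a sketch, reusing the socket path from the ps output above):

mysql --socket=/tmp/mysql.sock -e "SHOW ALL SLAVES STATUS\G" \
      | grep -E 'Connection_name|Seconds_Behind_Master'
# Seconds_Behind_Master staying near zero on every connection is consistent
# with the server replicating normally while the status queries piled up.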

I would keep this ticket open for a while, in case it turns out to be a malfunctioning disk or something related to the prometheus rollout; but right now everything points to the "slowly replace TokuDB with compressed InnoDB" solution.

Thanks for the detailed explanation. No, I did not run any command; it must have started by itself again.

Disk replacement is handled on a separate task; no more to do here.