
Add icinga check for all MySQL/MariaDB hosts to check they have the right read_only value
Closed, Resolved (Public)

Description

It is very common to misconfigure or change the read_only variable during a failover and leave a host with the wrong value.

If they are masters, and there is load balancing, they could end up with read_only = 1, and we may not notice it immediately due to applications retrying on another host (this has happened on parser cache hosts).

If they are slaves, they could end up with read_only = 0 and potentially create out-of-band changes, as in T110115.

We do not want to automate this in puppet because there are a million edge cases (failovers, slaves that are also masters, etc.), but we do want a wrong value to be detected automatically. It should only be a warning, not an error. We may also want to start all servers with read_only = 1 by default and only change it manually, and this would help us avoid forgetting to change it on start.
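A minimal sketch of the comparison such a check could perform, assuming the expected value per host is supplied by configuration and the observed value comes from something like `SELECT @@global.read_only`. The function name `check_read_only` is hypothetical; the exit codes follow the usual Nagios/Icinga plugin convention. Fetching the value from the server is deliberately left out.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_read_only(expected, actual):
    """Compare the expected and observed read_only values (0 or 1).

    `actual` is the value read from the server, or None if the server
    could not be queried. Returns (exit_code, message).
    """
    if actual is None:
        return UNKNOWN, "UNKNOWN: could not read read_only"
    if int(actual) == int(expected):
        return OK, f"OK: read_only = {int(actual)}"
    # Per the task description, a mismatch is only a warning, never an error.
    return WARNING, f"WARNING: read_only = {int(actual)}, expected {int(expected)}"


# Hypothetical example: a slave (expected read_only = 1) found writable.
code, message = check_read_only(expected=1, actual=0)
print(message)  # WARNING: read_only = 0, expected 1
```

Never returning CRITICAL keeps the check advisory, which matches the request that it be a warning only.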

Event Timeline

jcrespo claimed this task.
jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description.
jcrespo added projects: acl*sre-team, DBA.
jcrespo subscribed.
jcrespo triaged this task as Medium priority. Sep 8 2015, 9:41 AM
jcrespo moved this task from Triage to Backlog on the DBA board.

Out of curiosity, I tried running the following query against Prometheus for eqiad:

mysql_global_variables_read_only{role="slave"} == 0
Element	Value
mysql_global_variables_read_only{instance="dbstore1001:9104",job="mysql-dbstore",role="slave",shard="multi"}	0
mysql_global_variables_read_only{instance="labsdb1001:9104",job="mysql-labs",role="slave",shard="multi"}	0
mysql_global_variables_read_only{instance="labsdb1010:9104",job="mysql-labs",role="slave",shard="multi"}	0
mysql_global_variables_read_only{instance="db1047:9104",job="mysql-misc",role="slave",shard="multi"}	0
mysql_global_variables_read_only{instance="labsdb1011:9104",job="mysql-labs",role="slave",shard="multi"}	0
mysql_global_variables_read_only{instance="labsdb1003:9104",job="mysql-labs",role="slave",shard="multi"}	0
mysql_global_variables_read_only{instance="labsdb1008:9104",job="mysql-labs",role="slave",shard="multi"}	0
mysql_global_variables_read_only{instance="dbstore1002:9104",job="mysql-dbstore",role="slave",shard="multi"}	0
mysql_global_variables_read_only{instance="labsdb1009:9104",job="mysql-labs",role="slave",shard="multi"}	0
jcrespo closed this task as Resolved. (Edited) Oct 1 2016, 4:40 PM

Sadly, those, except dbstore1001, are exceptions to the rule: slaves that are also masters due to analytics and labs particularities.
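One way to keep a query like the one above useful despite these exceptions is to filter its results against an explicit allowlist. The sketch below is only an illustration: the set `KNOWN_WRITABLE_SLAVES` is built from the hosts in the query result other than dbstore1001, and the function name and the `(labels, value)` sample shape are assumptions, not part of any existing tooling.

```python
# Slaves that are intentionally writable (also masters), taken from the
# query result above. Maintaining this list by hand is an assumption.
KNOWN_WRITABLE_SLAVES = {
    "labsdb1001", "labsdb1003", "labsdb1008", "labsdb1009",
    "labsdb1010", "labsdb1011", "db1047", "dbstore1002",
}


def unexpected_writable_slaves(samples):
    """Return hosts with role="slave" and read_only == 0 that are not allowlisted.

    `samples` is an iterable of (labels, value) pairs, where `labels` is a
    dict such as {"instance": "db1047:9104", "role": "slave"}.
    """
    flagged = []
    for labels, value in samples:
        host = labels["instance"].split(":")[0]  # strip the exporter port
        if labels.get("role") != "slave" or value != 0:
            continue
        if host not in KNOWN_WRITABLE_SLAVES:
            flagged.append(host)
    return flagged


samples = [
    ({"instance": "dbstore1001:9104", "role": "slave"}, 0),
    ({"instance": "labsdb1001:9104", "role": "slave"}, 0),
]
print(unexpected_writable_slaves(samples))  # ['dbstore1001']
```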

The only place where this was needed was in production (it happened on the parsercaches), but that has been fixed by putting them in multi-master mode and fixing the puppet defaults. If it happened in production on non-parsercache hosts, the 500 alarms would be noticed immediately.

This is not an issue anymore. I would not say that "the problem is technically fixed", but with the new puppet configuration and topology, this is hardly an issue, I think.

@fgiunchedi the functionality itself is cool, and I will use it to check other things, though. For example, soft alerts on threads_connected or failed connections.