Page MenuHomePhabricator

Monitor read_only on all databases, make it page on masters
Closed, ResolvedPublic

Description

This task is an actionable as a result of past master crashes and mistakes about its writable status, as well as split-brain scenarios due to more than one server on a replication topology being writtable.

Database masters are a single point of failure, normally if a master crashes it is restarted automatically, but it is on read_only mode until it is checked. Page when it happens (either after a crash, or after it is detected on read-only mode, or both).

Details

ProjectBranchLines +/-Subject
operations/puppetproduction+19 -0
operations/puppetproduction+6 -0
operations/puppetproduction+15 -11
operations/puppetproduction+6 -0
operations/puppetproduction+6 -0
operations/puppetproduction+5 -0
operations/puppetproduction+25 -20
operations/puppetproduction+1 -1
operations/puppetproduction+4 -0
operations/puppetproduction+1 -1
operations/puppetproduction+2 -1
operations/puppetproduction+0 -1
operations/puppetproduction+32 -32
operations/puppetproduction+3 -1
operations/puppetproduction+2 -0
operations/puppetproduction+24 -0
Show related patches Customize query in gerrit

Event Timeline

jcrespo renamed this task from Monitor read_only variable and/or uptime on atabase masters, make it page to Monitor read_only variable and/or uptime on database masters, make it page.Aug 4 2017, 8:32 AM

Copied from T171928:

$ check_mariadb.py -h db1052 --slave-status --primary-dc=eqiad
{"datetime": 1501777331.898183, "ssl_expiration": 1619276854.0, "connection": "ok", "connection_latency": 0.07626748085021973, "ssl": true, "total_queries": 15981662418, "heartbeat": {"s1": 0.400536}, "uptime": 16474250, "version": "10.0.28-MariaDB", "query_latency": 0.001131296157836914, "read_only": false, "threads_connected": 49}

So complete coverage is now available for all hosts- including connection checking (with 1 second timeout), ssl (including expiration time), slave status & heartbeat lag, QPS, read only mode, concurrency, uptime/recent restart, version, query latency and connection latency.

$ check_mariadb.py -h labsdb1009 --slave-status --primary-dc=eqiad
{"total_queries": 431519763, "read_only": false, "query_latency": 0.0015988349914550781, "threads_connected": 3, "version": "10.1.25-MariaDB", "ssl": true, "replication": {"s6": {"Last_IO_Error": null, "Seconds_Behind_Master": 0, "Slave_SQL_Running": "Yes", "Last_SQL_Error": null, "Slave_IO_Running": "Yes"}, "db1095": {"Last_IO_Error": null, "Seconds_Behind_Master": 0, "Slave_SQL_Running": "Yes", "Last_SQL_Error": null, "Slave_IO_Running": "Yes"}, "s7": {"Last_IO_Error": null, "Seconds_Behind_Master": 0, "Slave_SQL_Running": "Yes", "Last_SQL_Error": null, "Slave_IO_Running": "Yes"}, "s2": {"Last_IO_Error": null, "Seconds_Behind_Master": 0, "Slave_SQL_Running": "Yes", "Last_SQL_Error": null, "Slave_IO_Running": "Yes"}}, "datetime": 1501777578.464523, "connection": "ok", "ssl_expiration": 1626257725.0, "connection_latency": 0.05619382858276367, "uptime": 537221, "heartbeat": {"s3": 0.0, "s6": 0.0, "s7": 0.0, "s5": 0.0, "s1": 0.0, "s4": 0.0, "s2": 0.0}}

$ check_mariadb.py -h db1052 --slave-status --primary-dc=eqiad --icinga --check_read_only=0
Version 10.0.28-MariaDB, Uptime 16474706s, read_only: False, s1 lag: 0.00s, 39 client(s), 481.09 QPS, connection latency: 0.071849s, query latency: 0.001223s

The pending steps is how to use the tools to minimize downtime in the future.

This plugin does exactly what we need for now: https://github.com/Napsty/check_mysql_readonly/blob/master/check_mysql_readonly.sh so we'd only need to create the puppet module for it and we'd have it done pretty much.

Note that doesn't have support for socket, mine above does (plus it adds some extra checks such us TLS, TLS expiration, etc.).

Example:

root@neodymium:~$ ./check_mariadb.py -h db1067 --slave-status --primary-dc=eqiad --icinga --check_read_only=0
Version 10.1.34-MariaDB, Uptime 2590876s, read_only: False, s1 lag: 0.00s, 57 client(s), 1941.02 QPS, connection latency: 0.045309s, query latency: 0.000949s
root@neodymium:~$ ./check_mariadb.py -h db1067 --slave-status --primary-dc=eqiad --icinga --check_read_only=1
CRIT: read_only: "False", expected "True"; OK: Version 10.1.34-MariaDB, Uptime 2590881s, s1 lag: 0.17s, 58 client(s), 2176.45 QPS, connection latency: 0.048023s, query latency: 0.001163s

Note that doesn't have support for socket, mine above does (plus it adds some extra checks such us TLS, TLS expiration, etc.).

Yeah, I had to modify it to support socket (just one line).
I know yours does more things, I was trying to reduce the scope for what we just needed for this task (read only ON/OFF) as this script also detects automatically whether the host is a slave or a master and returns OK or CRITICAL based on that and we'd not need to implement more logic to auto-detect whether it is a master or not (although it should be easy once we've got etcd in place)

detects automatically whether the host is a slave or a master and returns OK or CRITICAL based on that

That is actually another reason not to use it: right now our masters can and do have replication running (s1-eqiad is not running, but will run again in a few months, etc.).

So, not against implementing this, but I would prefer if you could help me make check_mariadb.py work well, even if we need a rewrite (and you do that), and only use it for the read only parts if you want. I just added etcd support to the perl script https://gerrit.wikimedia.org/r/345346 - it would be nice to add it to the python one ;-)

I thought it would be a fast way of implementing this (as I still think it is important to monitor read_only flag and can bite us again) but I didn't think of the fact that replication codfw -> eqiad will be enabled in a couple of months, so that is a no-go for this script, so it is better not to spend any time on it. Instead, relaying on check_mariadb.py and etcd is the way to move forward

I am actyally ok with doing this, it is your call, but it will not be quick because we still need etcd integration. I would be happy if you could do both (implementing this AND etcd- but one will not work without the other). Later, we could improve it in python (at a later time), but the work on etcd would stay.

Change 450199 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Introduce a read_only check for multiinstance hosts

https://gerrit.wikimedia.org/r/450199

Change 450199 merged by Jcrespo:
[operations/puppet@production] mariadb: Introduce a read_only check for multiinstance hosts

https://gerrit.wikimedia.org/r/450199

Change 450205 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Force as read only dbstore_multiinstance and santarium_mi

https://gerrit.wikimedia.org/r/450205

Change 450206 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Force nagios as the user if check is not running as root

https://gerrit.wikimedia.org/r/450206

Change 450205 merged by Jcrespo:
[operations/puppet@production] mariadb: Force as read only dbstore_multiinstance and sanitarium_m.i.

https://gerrit.wikimedia.org/r/450205

Change 450206 merged by Jcrespo:
[operations/puppet@production] mariadb: Force nagios as the user if check is not running as root

https://gerrit.wikimedia.org/r/450206

Change 450209 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] check_mariadb: Disable for now checks that are not read-only

https://gerrit.wikimedia.org/r/450209

Change 450209 merged by Jcrespo:
[operations/puppet@production] check_mariadb: Disable for now checks that are not read-only

https://gerrit.wikimedia.org/r/450209

Change 450211 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-sanitarium: Remove duplicate log-slave-updates on config

https://gerrit.wikimedia.org/r/450211

Change 450211 merged by Jcrespo:
[operations/puppet@production] mariadb-sanitarium: Remove duplicate log-slave-updates on config

https://gerrit.wikimedia.org/r/450211

Change 450213 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-check: Allow duplicate definitions on /etc/my.cnf

https://gerrit.wikimedia.org/r/450213

Change 450213 merged by Jcrespo:
[operations/puppet@production] mariadb-check: Allow duplicate definitions on /etc/my.cnf

https://gerrit.wikimedia.org/r/450213

Change 450220 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-check: Get password from key clientlabsdb and not labsdb

https://gerrit.wikimedia.org/r/450220

Change 450220 merged by Jcrespo:
[operations/puppet@production] mariadb-check: Get password from key clientlabsdb and not labsdb

https://gerrit.wikimedia.org/r/450220

Change 450228 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Enable read_only monitoring on core mariadb hosts

https://gerrit.wikimedia.org/r/450228

Change 451304 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] wikireplicas: Monitor wikireplicas are always in read only

https://gerrit.wikimedia.org/r/451304

Change 451304 merged by Jcrespo:
[operations/puppet@production] wikireplicas: Monitor wikireplicas are always in read only

https://gerrit.wikimedia.org/r/451304

I think we should not include the monitoring of the uptime on this task, but just the read_only scope.
Don't think monitoring uptime will give us anything that read_only + monitor the number of mysql process isn't giving us already - maybe just more noise.

I think we should not include the monitoring of the uptime on this task, but just the read_only scope.

Who said the uptime was in scope? The check creation predates this task, that is why it has a lot of half-assed extra checks, it was not created for this (it existed before the ticket was created).

"jcrespo renamed this task from Monitor read_only variable and/or uptime on atabase masters, make it page to Monitor read_only variable and/or uptime on database masters, make it page."

I think the task should be renamed to "Monitor read_only on all databases, make it page for masters"

That is fine by me - I just wanted to give my opinion about what the current task description says.

Change 451307 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Make wikireplicas obey the current mariadb read_only config

https://gerrit.wikimedia.org/r/451307

jcrespo renamed this task from Monitor read_only variable and/or uptime on database masters, make it page to Monitor read_only on all databases, make it page on masters.Aug 8 2018, 12:57 PM
jcrespo updated the task description. (Show Details)

Change 451307 merged by Jcrespo:
[operations/puppet@production] mariadb: Make wikireplicas obey the current mariadb read_only config

https://gerrit.wikimedia.org/r/451307

Change 451629 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] wikireplicas: Add SUPER privileges to cloud admin accounts

https://gerrit.wikimedia.org/r/451629

Change 451629 merged by Jcrespo:
[operations/puppet@production] wikireplicas: Add SUPER privileges to cloud admin accounts

https://gerrit.wikimedia.org/r/451629

Change 450228 merged by Jcrespo:
[operations/puppet@production] mariadb: Enable read_only monitoring on core mariadb hosts

https://gerrit.wikimedia.org/r/450228

@jcrespo is this done after today's merge or still missing things?

jcrespo changed the task status from Open to Stalled.Sep 19 2018, 1:07 PM

The initial scope is not fulfilled:

Monitor read_only on all databases, make it page on masters

It has to be enabled, with lower priority on other db roles (misc, pc), and we have to think of a paging schema that works well- but I don't want to enable paging at least after eqiad is again the primary dc for testing purposes.

So most of it is done, the hard part, but it will be stalled until the latest things are tweaked.

jcrespo triaged this task as Medium priority.Sep 19 2018, 1:08 PM

Change 488504 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Set read_only monitoring for core_test hosts

https://gerrit.wikimedia.org/r/488504

Change 488504 merged by Jcrespo:
[operations/puppet@production] mariadb: Set read_only monitoring for core_test hosts

https://gerrit.wikimedia.org/r/488504

Change 593527 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: enable read_only monitoring in parsercaches

https://gerrit.wikimedia.org/r/593527

Change 593527 merged by Jcrespo:
[operations/puppet@production] mariadb: enable read_only monitoring in parsercaches

https://gerrit.wikimedia.org/r/593527

Change 594885 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Enable read_only check to page on primary masters

https://gerrit.wikimedia.org/r/594885

jcrespo changed the task status from Stalled to Open.May 7 2020, 7:03 AM
jcrespo claimed this task.
jcrespo moved this task from Backlog to In progress on the DBA board.

Change 594905 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Enable read_only monitoring on misc hosts

https://gerrit.wikimedia.org/r/594905

Mentioned in SAL (#wikimedia-operations) [2020-05-07T13:04:05Z] <jynus> disabling puppet on all db hosts to control deployment of new paging alert T172489

Change 594885 merged by Jcrespo:
[operations/puppet@production] mariadb: Enable read_only check to page on primary masters

https://gerrit.wikimedia.org/r/594885

This has been now fullfilled within the scope of the title. If a master server now crashes and comes back in read only (desired outcome to prevent further corruption) a page will be sent to all SREs of this degraded state (a wiki, or multiple ones- if it is a central service like wikidata, commons or centralauth), cannot process edits or user account changes/creations.

The response to it should be one of the following:

  • Wake up a DBA, data issues are not easy
  • Check data/server status, if everything is correct (mariadb runs with safe anti-crash options), set the master back into read-write
  • In particular, parsercache (included on paging) can be blindly set in read write (it boots like that by default), as the service it provides is more important than the transient data it holds. But it has to be in read-write, otherwise mediawiki will have very high latencies (close to, or outage-like)
  • If the server is unstable or there is data loss, failover master service to a different host
    • Performs a complete data check after failover. If there is further issues, recover from backup.

A followup will be handled for misc services, with no paging, at T237927.

Change 594905 merged by Jcrespo:
[operations/puppet@production] mariadb: Enable read_only monitoring on misc hosts

https://gerrit.wikimedia.org/r/594905

Change 595143 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Add read_only monitoring to other misc dbs: tendril, phab, event

https://gerrit.wikimedia.org/r/595143

Change 595143 merged by Jcrespo:
[operations/puppet@production] mariadb: Add read_only monitoring to other misc dbs: tendril, phab, event

https://gerrit.wikimedia.org/r/595143