The parsercache hosts don't have replication checks installed, which has already caused some headaches. Let's fix this problem (see T206740).
Description
Details
Subject | Repo | Branch | Lines +/- |
---|---|---|---|
mariadb: enable replication check on Parsercache hosts | operations/puppet | production | +12 -1 |
Event Timeline
I think we could also consider adding an alert based on the hit ratio of the parsercache hosts (we already have the data in Grafana).
Change 467959 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/puppet@production] mariadb: enable replication check on Parsercache hosts
Normally we put them on the Gerrit patch as a comment, so that people reviewing the patch can see the links there and keep the discussion in one place.
Mentioned in SAL (#wikimedia-operations) [2018-10-18T08:40:12Z] <banyek> adding replication monitoring checks to parsercache hosts (T206992)
Mentioned in SAL (#wikimedia-operations) [2018-10-18T08:41:03Z] <banyek> disabling puppet on parser caches (T206992)
Change 467959 merged by Banyek:
[operations/puppet@production] mariadb: enable replication check on Parsercache hosts
Mentioned in SAL (#wikimedia-operations) [2018-10-18T08:56:36Z] <banyek> enabling replication monitor check on pc2004 (T206992)
Mentioned in SAL (#wikimedia-operations) [2018-10-18T09:01:03Z] <banyek> enabling replication monitor check on pc1004 (T206992)
Mentioned in SAL (#wikimedia-operations) [2018-10-18T09:20:32Z] <banyek> enabling replication monitor check on pc1005 pc1006 pc2005 pc2006 (T206992)
Honestly, I am not sure how useful this is. This alert would not have prevented the issue at all:
MariaDB Slave IO: pc1 OK 2018-10-18 12:26:14 0d 3h 7m 42s 1/3 OK slave_io_state not a slave
I did check the parsercache hosts before the failover, to make sure they were all green - I would have seen that check and I would have remembered that they have to have replication enabled.
That is my point: the check is green now, even though replication isn't working. Even if it weren't green (there may be a parameter for that), the problem is not replication itself but the "freshness" of the data (the hit ratio, if the host were active). We stop replication all the time, so we need to check that replication is both working and recent, e.g. by adding it to the replication heartbeat checks on switchover (but not in the read-only phase), plus a "which % of the data is not expired" kind of check, in addition to this check.
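To make the distinction concrete, here is a minimal sketch of the kind of logic described above. It is not the existing Icinga check: the function names, the thresholds, and the `replication_expected` flag are all hypothetical. It assumes the caller has already fetched the `SHOW SLAVE STATUS` row (or `None` if the host is not configured as a replica) and a lag value derived from a heartbeat table.

```python
from typing import Optional, Tuple

OK, WARNING, CRITICAL = 0, 1, 2  # Nagios/Icinga-style exit states


def check_replica(
    status: Optional[dict],
    heartbeat_lag_s: Optional[float],
    warn_lag_s: float = 60.0,    # hypothetical threshold
    crit_lag_s: float = 300.0,   # hypothetical threshold
    replication_expected: bool = True,
) -> Tuple[int, str]:
    """Classify replication health from a SHOW SLAVE STATUS row and heartbeat lag."""
    if status is None:
        # No replication configured: only acceptable if we do not expect any.
        if replication_expected:
            return CRITICAL, "not a slave, but replication is expected"
        return OK, "not a slave"
    if (status.get("Slave_IO_Running") != "Yes"
            or status.get("Slave_SQL_Running") != "Yes"):
        return CRITICAL, "replication threads are not running"
    if heartbeat_lag_s is None:
        return CRITICAL, "no heartbeat row found"
    if heartbeat_lag_s >= crit_lag_s:
        return CRITICAL, f"replication lag {heartbeat_lag_s:.0f}s"
    if heartbeat_lag_s >= warn_lag_s:
        return WARNING, f"replication lag {heartbeat_lag_s:.0f}s"
    return OK, f"replication lag {heartbeat_lag_s:.0f}s"


def unexpired_fraction(total_rows: int, expired_rows: int) -> float:
    """Share of cached rows that have not expired yet (0.0 for an empty cache)."""
    if total_rows == 0:
        return 0.0
    return (total_rows - expired_rows) / total_rows
```

With `replication_expected=True`, the "not a slave" state shown in the Icinga output above would be reported as CRITICAL instead of OK. How the expired and total row counts would be obtained from the parsercache tables is left open here.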
I added a step to our DC failover checklist for checking replication a few days in advance, so that we also remember it for the next failover: https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacenter%2Fplanned_db_maintenance&type=revision&diff=1805377&oldid=1805376
That is good; I just don't think it is enough. I would add it as a switchdc recipe, and roll back the change if it fails.