Create replication icinga check for the Parsercache hosts
Closed, Resolved (Public)

Description

The parsercache hosts don't have replication checks installed, which has already caused some headaches. Let's fix this problem (see T206740).

Event Timeline

Marostegui triaged this task as Medium priority. Oct 16 2018, 5:42 AM
Marostegui added a project: Wikimedia-Incident.

I think we could also consider adding an alert based on the hit ratio of the parsercache caches (we already have the data in grafana)
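
A rough sketch of what such a hit-ratio alert could look like, in case it helps the discussion. The function name, the thresholds, and the idea of feeding it raw hit/miss counters are all assumptions for illustration; the real data would come from whatever grafana is already graphing. The return codes follow the usual Nagios/Icinga plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
# Hypothetical sketch: map parsercache hit/miss counters to an Icinga-style state.
# Thresholds and names are illustrative assumptions, not existing production values.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def hit_ratio_state(hits, misses, warn_below=0.8, crit_below=0.5):
    """Return (state, message) for the observed hit ratio."""
    total = hits + misses
    if total == 0:
        # No traffic means the ratio cannot be computed; do not report OK.
        return UNKNOWN, "no parsercache requests observed"
    ratio = hits / total
    if ratio < crit_below:
        return CRITICAL, f"hit ratio {ratio:.2f} is below {crit_below}"
    if ratio < warn_below:
        return WARNING, f"hit ratio {ratio:.2f} is below {warn_below}"
    return OK, f"hit ratio {ratio:.2f}"
```

Returning UNKNOWN when there is no traffic matters for the passive datacenter, where a host may receive no requests at all and a ratio would be meaningless.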

Change 467959 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/puppet@production] mariadb: enable replication check on Parsercache hosts

https://gerrit.wikimedia.org/r/467959

The check is ready to be deployed

Where is the compiler run link?

We normally put them on the Gerrit patch as a comment, so people reviewing the patch can see the links there and keep the discussion in one place.

Mentioned in SAL (#wikimedia-operations) [2018-10-18T08:40:12Z] <banyek> adding replication monitoring checks to parsercache hosts (T206992)

Mentioned in SAL (#wikimedia-operations) [2018-10-18T08:41:03Z] <banyek> disabling puppet on parser caches (T206992)

Change 467959 merged by Banyek:
[operations/puppet@production] mariadb: enable replication check on Parsercache hosts

https://gerrit.wikimedia.org/r/467959

Mentioned in SAL (#wikimedia-operations) [2018-10-18T08:56:36Z] <banyek> enabling replication monitor check on pc2004 (T206992)

Mentioned in SAL (#wikimedia-operations) [2018-10-18T09:01:03Z] <banyek> enabling replication monitor check on pc1004 (T206992)

Mentioned in SAL (#wikimedia-operations) [2018-10-18T09:20:32Z] <banyek> enabling replication monitor check on pc1005 pc1006 pc2005 pc2006 (T206992)

I am not sure how useful this is, honestly. This alert would not have prevented the issue at all:

MariaDB Slave IO: pc1	OK 	2018-10-18 12:26:14 	0d 3h 7m 42s 	1/3 	OK slave_io_state not a slave

I did check the parsercache hosts before the failover to make sure they were all green. I would have seen that check, and I would have remembered that they need to have replication enabled.

That is my point: the check is green now, even if replication isn't working. Even if it weren't green (there may be a parameter for that), the problem is not replication itself but the "freshness" of the data (the hit ratio, if the host were active). We stop replication all the time, so we need to check that replication is both working and recent, e.g. by adding it to the replication heartbeat checks on switchover (but not during the read-only phase), plus a "which % of the data is not expired" kind of check, in addition to this check.
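
To make the concern concrete, here is a minimal sketch of a stricter evaluation, assuming the check is handed one row of `SHOW SLAVE STATUS` as a dict (or `None` when the host is not configured as a replica). The function name, the `require_replica` switch, and the lag threshold are hypothetical and are not how the existing check works; only the `Slave_IO_Running`, `Slave_SQL_Running`, and `Seconds_Behind_Master` column names come from MariaDB itself.

```python
# Hypothetical sketch: treat "not a replica" as a failure on hosts that are
# expected to replicate, instead of reporting OK as in the output above.
from typing import Optional, Tuple

OK, WARNING, CRITICAL = 0, 1, 2


def evaluate_replication(
    status: Optional[dict],
    max_lag_seconds: int = 300,    # assumed threshold, not a production value
    require_replica: bool = True,  # assumed flag: host must be replicating
) -> Tuple[int, str]:
    """Return (state, message) for one SHOW SLAVE STATUS row."""
    if not status:
        if require_replica:
            return CRITICAL, "replication is not configured on this host"
        return OK, "not a replica"
    if (status.get("Slave_IO_Running") != "Yes"
            or status.get("Slave_SQL_Running") != "Yes"):
        return CRITICAL, "replication threads are not running"
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # MariaDB reports NULL lag when it cannot be determined.
        return CRITICAL, "replication lag is unknown"
    if int(lag) > max_lag_seconds:
        return WARNING, f"replication lag is {lag}s"
    return OK, f"replication running, lag {lag}s"
```

This only covers the "working and recent" half of the point; the freshness of the cached data would still need its own check.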

I added a step for checking replication a few days in advance to our DC failover checklist, so we also remember it for the next failover: https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacenter%2Fplanned_db_maintenance&type=revision&diff=1805377&oldid=1805376

That is good, I just don't think it is enough. I would add it as a switchdc recipe, and roll back the change if it fails.

@Volans ^ is that something we can do in the DC switchover script?

@jcrespo Yes, it is deployed, I was just waiting on close

I have created T207385 so we can follow the discussion there.