Maniphest T206992

Create replication icinga check for the Parsercache hosts
Closed, Resolved · Public

Assigned To

Authored By

	• Banyek
	Oct 15 2018, 8:06 AM

Description

The parsercache hosts don't have replication checks installed, which has already caused some headaches. Let's fix this problem (see T206740).

Details

	Subject	Repo	Branch	Lines +/-
	mariadb: enable replication check on Parsercache hosts	operations/puppet	production	+12 -1


Related Objects

Mentioned In: T207273: Parser cache hit ratio alerting
Mentioned Here: T207385: Create a check on the DC failover script to see if codfw -> eqiad replication is working before failing over to codfw (considering eqiad as the active DC by default)
T206740: parsercache used disk space increase

Event Timeline

• Banyek created this task. Oct 15 2018, 8:06 AM

Restricted Application added a subscriber: Aklapper. Oct 15 2018, 8:06 AM

• Banyek moved this task from Triage to Pending comment on the DBA board. Oct 15 2018, 3:59 PM

Marostegui triaged this task as Medium priority. Oct 16 2018, 5:42 AM

Marostegui added a project: Wikimedia-Incident.

I think we could also consider adding an alert based on the parsercache hit ratio (we already have the data in Grafana).
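To illustrate what such a hit ratio alert might evaluate, here is a minimal sketch in Python. It is not an existing check: the function name `check_hit_ratio`, the input counters, and both thresholds are assumptions chosen for illustration. Only the exit codes follow the usual Nagios/Icinga plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from typing import Tuple

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_hit_ratio(
    hits: int,
    misses: int,
    warn_below: float = 0.8,   # hypothetical threshold, not from the task
    crit_below: float = 0.6,   # hypothetical threshold, not from the task
) -> Tuple[int, str]:
    """Classify a parser cache hit ratio computed from hit and miss counters."""
    total = hits + misses
    if total <= 0:
        # Without lookups there is no ratio to judge.
        return UNKNOWN, "UNKNOWN no parser cache lookups in the sample window"

    ratio = hits / total
    if ratio < crit_below:
        return CRITICAL, f"CRITICAL hit ratio {ratio:.1%} is below {crit_below:.0%}"
    if ratio < warn_below:
        return WARNING, f"WARNING hit ratio {ratio:.1%} is below {warn_below:.0%}"
    return OK, f"OK hit ratio {ratio:.1%}"


if __name__ == "__main__":
    code, message = check_hit_ratio(hits=700, misses=300)
    print(code, message)
```

In practice the counters would come from whatever already feeds the Grafana dashboard, and the thresholds would need tuning against the normal hit ratio of an active cache.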

greg moved this task from Active investigation to Follow-up prevention on the Wikimedia-Incident board. Oct 16 2018, 6:49 PM

• Banyek moved this task from Backlog to next on the User-Banyek board. Oct 17 2018, 9:11 AM

• Banyek mentioned this in T207273: Parser cache hit ratio alerting. Oct 17 2018, 12:32 PM

Change 467959 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/puppet@production] mariadb: enable replication check on Parsercache hosts

https://gerrit.wikimedia.org/r/467959

gerritbot added a project: Patch-For-Review. Oct 17 2018, 12:53 PM

• Banyek moved this task from next to In progress on the User-Banyek board. Oct 17 2018, 3:38 PM

The check is ready to be deployed

Where is the compiler run link?

I made them earlier:

https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/13001/console
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/13002/console

In T206992#4674780, @Banyek wrote:

I made them earlier:

https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/13001/console
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/13002/console

Normally, put them on the Gerrit patch as a comment, so that people reviewing the patch can see the links and keep the discussion there.

I've updated the Gerrit patch, then.

Mentioned in SAL (#wikimedia-operations) [2018-10-18T08:40:12Z] <banyek> adding replication monitoring checks to parsercache hosts (T206992)

Mentioned in SAL (#wikimedia-operations) [2018-10-18T08:41:03Z] <banyek> disabling puppet on parser caches (T206992)

Change 467959 merged by Banyek:
[operations/puppet@production] mariadb: enable replication check on Parsercache hosts

https://gerrit.wikimedia.org/r/467959

Mentioned in SAL (#wikimedia-operations) [2018-10-18T08:56:36Z] <banyek> enabling replication monitor check on pc2004 (T206992)

Mentioned in SAL (#wikimedia-operations) [2018-10-18T09:01:03Z] <banyek> enabling replication monitor check on pc1004 (T206992)

Mentioned in SAL (#wikimedia-operations) [2018-10-18T09:20:32Z] <banyek> enabling replication monitor check on pc1005 pc1006 pc2005 pc2006 (T206992)

Is this alert fully deployed?

I am not sure how useful this is, honestly; this alert would not have prevented the issue at all:

MariaDB Slave IO: pc1	OK 	2018-10-18 12:26:14 	0d 3h 7m 42s 	1/3 	OK slave_io_state not a slave

In T206992#4677326, @jcrespo wrote:
I am not sure how useful this is, honestly; this alert would not have prevented the issue at all:
MariaDB Slave IO: pc1	OK 	2018-10-18 12:26:14 	0d 3h 7m 42s 	1/3 	OK slave_io_state not a slave

I did check the parsercache hosts before the failover, to make sure they were all green - I would have seen that check and I would have remembered that they have to have replication enabled.

I did check the parsercache hosts before the failover, to make sure they were all green - I would have seen that check and I would have remembered that they have to have replication enabled.

That is my point: the check is green now, even if replication isn't working. Even if it weren't (there may be a parameter for that), the error is not the replication, but the "freshness" of the data (hit ratio if it was active). We stop replication all the time; we need to check that replication is working and recent, e.g. maybe add it to the replication heartbeat checks on switchover (but not the read-only phase), plus a "which % of the data is not expired" kind of check, in addition to this check.
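The difference between "green because the host is not a replica" and "green because replication is working and recent" can be sketched as follows. This is an illustrative Python sketch, not the deployed Icinga check: the function name `check_replication`, the `expect_replica` flag, and the `max_lag_s` default are assumptions. The dictionary keys are the real column names returned by MariaDB's `SHOW SLAVE STATUS`, and `Seconds_Behind_Master` stands in here for the heartbeat-based lag mentioned above.

```python
from typing import Mapping, Optional, Tuple

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_replication(
    status: Optional[Mapping[str, object]],
    expect_replica: bool = True,  # hypothetical flag: host is supposed to replicate
    max_lag_s: int = 60,          # hypothetical lag threshold in seconds
) -> Tuple[int, str]:
    """Classify one row of SHOW SLAVE STATUS (None or empty if the host has no replica config)."""
    if not status:
        if expect_replica:
            # A host that should replicate but has no replication set up must not be green.
            return CRITICAL, "CRITICAL replication is not configured on a host that should replicate"
        return OK, "OK not a slave"

    io_running = status.get("Slave_IO_Running")
    sql_running = status.get("Slave_SQL_Running")
    if io_running != "Yes" or sql_running != "Yes":
        return CRITICAL, f"CRITICAL Slave_IO_Running={io_running} Slave_SQL_Running={sql_running}"

    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # MariaDB reports NULL when the lag cannot be determined.
        return CRITICAL, "CRITICAL replication lag is unknown"

    lag_s = int(str(lag))
    if lag_s > max_lag_s:
        return CRITICAL, f"CRITICAL replication lag {lag_s}s exceeds {max_lag_s}s"
    return OK, f"OK replication running, lag {lag_s}s"


if __name__ == "__main__":
    print(check_replication(None))
    print(check_replication({
        "Slave_IO_Running": "Yes",
        "Slave_SQL_Running": "Yes",
        "Seconds_Behind_Master": 3,
    }))
```

With `expect_replica=False` the sketch reproduces the "OK slave_io_state not a slave" behaviour shown in the alert output. This covers only the "working and recent" part; the data freshness concern would still need a separate check.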

In T206992#4677332, @jcrespo wrote:

I did check the parsercache hosts before the failover, to make sure they were all green - I would have seen that check and I would have remembered that they have to have replication enabled.

That is my point: the check is green now, even if replication isn't working. Even if it weren't (there may be a parameter for that), the error is not the replication, but the "freshness" of the data (hit ratio if it was active). We stop replication all the time; we need to check that replication is working and recent, e.g. maybe add it to the replication checks on switchover (but not the read-only phase), in addition to this check.

I added the step of checking replication a few days in advance to our DC failover checklist, so we can also remember that for the next failover: https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacenter%2Fplanned_db_maintenance&type=revision&diff=1805377&oldid=1805376

I added the step of checking replication a few days in advance to our DC failover checklist, so we can also remember that for the next failover: https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacenter%2Fplanned_db_maintenance&type=revision&diff=1805377&oldid=1805376

That is good; I just don't think it is enough. I would add it as a switchdc recipe, and roll back the change if it fails.

@Volans ^ is that something we can do in the DC switchover script?

@jcrespo Yes, it is deployed; I was just waiting to close the task.

I have created T207385 so we can follow the discussion there.

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident. Apr 28 2020, 9:50 PM

Maintenance_bot removed a project: Patch-For-Review. Apr 28 2020, 10:17 PM
