
Parser cache hit ratio alerting
Open, Medium, Public

Description

We had an incident related to the parser caches recently (T206740): after the data center switchover the parser caches were empty, which had several impacts on production (T206841).
We agreed in T206992 that we need to create several Icinga checks to avoid this in the future, and in T206992#4671263 @Volans suggested creating a check for the parser cache hit ratio, on the grounds that such telemetry would help us prevent and/or mitigate incidents like this one in the future.

The graphs that could serve as the basis for those Icinga checks are the following:
https://grafana.wikimedia.org/dashboard/db/parser-cache?refresh=5m&orgId=1&from=1539606697297&to=1539779497298

Event Timeline

Marostegui moved this task from Triage to Backlog on the DBA board.

This needs some thought in order to make it effective:

The hosts in the passive DC will always have a 0 hit ratio.
What will happen if we depool a host? Do we have to always downtime it or will it alert?
Should it page or just IRC?

Banyek renamed this task from Alert based on the hit ratio of the parsercache to Parser cache hit ration alerting.Oct 18 2018, 7:34 AM
Banyek updated the task description.

Parser cache hit ratio alerting is difficult, especially on a passive DC. A better option would be a script that checks that most of the content is not expired, aka "does not contain mostly garbage". That, together with "replication is running", would confirm the parser cache is in a good state without actually checking the hit ratio. A 2/3 ratio of useful content globally would be enough to allow a single server to fail without the alert going off. A minimal sketch of what such a check could look like follows below.
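To make the idea concrete, here is a rough sketch of such a "mostly not garbage" check. It assumes the parser cache hosts expose SQL tables with an exptime column; the database, table names, connection details and sampling are illustrative, not the actual production schema.

```
#!/usr/bin/env python3
"""Sketch: alert if most sampled parser cache entries are already expired.

Assumptions (illustrative, not confirmed by this task): entries live in
tables parsercache.pc000 ... pc255, each with an `exptime` column.
"""
import sys
import pymysql

# 2/3 of globally useful (unexpired) content allows one of the three
# pc hosts to be cold without the alert going off.
USEFUL_THRESHOLD = 2 / 3
SAMPLED_TABLES = ['pc%03d' % i for i in range(0, 256, 32)]  # sample a few shards


def useful_fraction(conn):
    """Return the fraction of sampled parser cache rows that are not expired."""
    total = useful = 0
    with conn.cursor() as cur:
        for table in SAMPLED_TABLES:
            cur.execute("SELECT COUNT(*), SUM(exptime > NOW()) FROM {}".format(table))
            rows, unexpired = cur.fetchone()
            total += rows or 0
            useful += int(unexpired or 0)
    return useful / total if total else 0.0


def main():
    conn = pymysql.connect(host='localhost', db='parsercache',
                           read_default_file='~/.my.cnf')  # placeholder credentials
    fraction = useful_fraction(conn)
    if fraction < USEFUL_THRESHOLD:
        print("CRITICAL: only %.0f%% of sampled parser cache entries are unexpired" % (fraction * 100))
        return 2  # Icinga critical
    print("OK: %.0f%% of sampled parser cache entries are unexpired" % (fraction * 100))
    return 0


if __name__ == '__main__':
    sys.exit(main())
```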

Banyek renamed this task from Parser cache hit ration alerting to Parser cache hit ratio alerting.Oct 18 2018, 12:53 PM

My suggestion for this kind of check was not for the passive DC, but mainly for the active one, to make sure that the parser caches are properly used. We might have changes in MediaWiki that change the hit ratio over time, and it could go below a threshold that causes issues.
I think it might be useful in general to have this check, and it would also have alerted immediately after the switchover, telling us the real cause of the issue. It's not meant to prevent the issue; for that we'll have the other checks (replication/heartbeat/cookbook).
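As a rough illustration of what this hit ratio check could look like when backed by Prometheus: the metric names, thresholds and endpoint below are assumptions for the sketch, not the actual production metrics.

```
#!/usr/bin/env python3
"""Sketch: Icinga-style check of the parser cache hit ratio via Prometheus.

The Prometheus URL and the hit/miss counter names are placeholders."""
import sys
import requests

PROMETHEUS_URL = 'http://prometheus.example.org/api/v1/query'  # placeholder
# Hypothetical counters; ratio of hits to total lookups over the last hour.
QUERY = (
    'sum(rate(parsercache_hits_total[1h]))'
    ' / '
    '(sum(rate(parsercache_hits_total[1h])) + sum(rate(parsercache_misses_total[1h])))'
)
WARN_THRESHOLD = 0.80
CRIT_THRESHOLD = 0.60


def main():
    resp = requests.get(PROMETHEUS_URL, params={'query': QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()['data']['result']
    if not result:
        print('UNKNOWN: no data returned for parser cache hit ratio')
        return 3
    ratio = float(result[0]['value'][1])
    if ratio < CRIT_THRESHOLD:
        print('CRITICAL: parser cache hit ratio %.2f' % ratio)
        return 2
    if ratio < WARN_THRESHOLD:
        print('WARNING: parser cache hit ratio %.2f' % ratio)
        return 1
    print('OK: parser cache hit ratio %.2f' % ratio)
    return 0


if __name__ == '__main__':
    sys.exit(main())
```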

I don't think having such an alarm is bad (it is easy to set up, just a Prometheus one), but it may arrive too late. I think a check in switchdc would prevent issues rather than identify them afterwards, and it should be trivial to add pc1, pc2 and pc3 to the list of checks to test; let me look at the code.

That's exactly what I meant: we should have this check independently, and add the other checks described in T207385 to prevent the issue in the first place.