
Parser cache hit ratio alerting
Open, Medium, Public

Description

We had an incident related to the parser caches recently (T206740): after the data center switchover the parser caches were empty, which had several impacts on production (T206841).
We agreed in T206992 that we need to create several Icinga checks to avoid this in the future, and in T206992#4671263 @Volans suggested creating a check for the parser cache hit ratio, on the grounds that such telemetry would help us prevent and/or mitigate incidents like this one in the future.

The graphs that could serve as the basis for those Icinga checks are the following:
https://grafana.wikimedia.org/dashboard/db/parser-cache?refresh=5m&orgId=1&from=1539606697297&to=1539779497298

Event Timeline

Marostegui moved this task from Triage to Backlog on the DBA board.

This needs some thought in order to make it effective:

The hosts in the passive DC will always have a 0 hit ratio.
What will happen if we depool a host? Do we have to always downtime it or will it alert?
Should it page or just IRC?

Banyek renamed this task from Alert based on the hit ratio of the parsercache to Parser cache hit ration alerting.Oct 18 2018, 7:34 AM
Banyek updated the task description.

Parser cache hit ratio alerting is difficult, especially on a passive DC. A better option would be a script that checks that most of the content is not expired, aka "does not contain mostly garbage". That, together with "replication is running", would confirm the parser cache is in a good state without actually checking the hit ratio. A 2/3 ratio of useful content globally would be enough to allow a single server to fail without the alert going off. A minimal sketch of what such a check could look like follows below.
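To make the idea concrete, here is a rough sketch of such a "mostly not garbage" check. It assumes the parser cache hosts expose SQL tables with an exptime column; the database, table names, connection details and sampling are illustrative, not the actual production schema.

```
#!/usr/bin/env python3
"""Sketch: alert if most sampled parser cache entries are already expired.

Assumptions (illustrative, not confirmed by this task): entries live in
tables parsercache.pc000 ... pc255, each with an `exptime` column.
"""
import sys
import pymysql

# 2/3 of globally useful (unexpired) content allows one of the three
# pc hosts to be cold without the alert going off.
USEFUL_THRESHOLD = 2 / 3
SAMPLED_TABLES = ['pc%03d' % i for i in range(0, 256, 32)]  # sample a few shards


def useful_fraction(conn):
    """Return the fraction of sampled parser cache rows that are not expired."""
    total = useful = 0
    with conn.cursor() as cur:
        for table in SAMPLED_TABLES:
            cur.execute("SELECT COUNT(*), SUM(exptime > NOW()) FROM {}".format(table))
            rows, unexpired = cur.fetchone()
            total += rows or 0
            useful += int(unexpired or 0)
    return useful / total if total else 0.0


def main():
    conn = pymysql.connect(host='localhost', db='parsercache',
                           read_default_file='~/.my.cnf')  # placeholder credentials
    fraction = useful_fraction(conn)
    if fraction < USEFUL_THRESHOLD:
        print("CRITICAL: only %.0f%% of sampled parser cache entries are unexpired" % (fraction * 100))
        return 2  # Icinga critical
    print("OK: %.0f%% of sampled parser cache entries are unexpired" % (fraction * 100))
    return 0


if __name__ == '__main__':
    sys.exit(main())
```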

Banyek renamed this task from Parser cache hit ration alerting to Parser cache hit ratio alerting.Oct 18 2018, 12:53 PM

My suggestion for this kind of check was not for the passive DC, but mainly for the active one, to make sure that the parser caches are properly used. We might have changes in MediaWiki that change the hit ratio over time, and it could go below a threshold that causes issues.
I think it might be useful in general to have this check, and it would also have alerted immediately after the switchover, telling us the real cause of the issue. It's not meant to prevent the issue; for that we'll have the other checks (replication/heartbeat/cookbook).
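As a rough illustration of what this hit ratio check could look like when backed by Prometheus: the metric names, thresholds and endpoint below are assumptions for the sketch, not the actual production metrics.

```
#!/usr/bin/env python3
"""Sketch: Icinga-style check of the parser cache hit ratio via Prometheus.

The Prometheus URL and the hit/miss counter names are placeholders."""
import sys
import requests

PROMETHEUS_URL = 'http://prometheus.example.org/api/v1/query'  # placeholder
# Hypothetical counters; ratio of hits to total lookups over the last hour.
QUERY = (
    'sum(rate(parsercache_hits_total[1h]))'
    ' / '
    '(sum(rate(parsercache_hits_total[1h])) + sum(rate(parsercache_misses_total[1h])))'
)
WARN_THRESHOLD = 0.80
CRIT_THRESHOLD = 0.60


def main():
    resp = requests.get(PROMETHEUS_URL, params={'query': QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()['data']['result']
    if not result:
        print('UNKNOWN: no data returned for parser cache hit ratio')
        return 3
    ratio = float(result[0]['value'][1])
    if ratio < CRIT_THRESHOLD:
        print('CRITICAL: parser cache hit ratio %.2f' % ratio)
        return 2
    if ratio < WARN_THRESHOLD:
        print('WARNING: parser cache hit ratio %.2f' % ratio)
        return 1
    print('OK: parser cache hit ratio %.2f' % ratio)
    return 0


if __name__ == '__main__':
    sys.exit(main())
```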

I don't think having such an alarm is bad (it is easy to set up, just a Prometheus one), but it may arrive too late. I think a check in switchdc would prevent issues rather than identify them afterwards, and it should be trivial to add pc1, pc2 and pc3 to the list of checks to test; let me look at the code.

That's exactly what I meant: we should have this check independently, and add the other checks described in T207385 to prevent the issue in the first place.