
pybal_monitor_down_results_total metric only created when PyBal goes down
Closed, Resolved, Public

Description

An AlertLintProblem alert has been firing for a few sites:

Screenshot at 2023-12-19 15-01-14.png (329×549 px, 46 KB)

This is caused by the absence of any metrics for the instances in the affected sites. The pybal_monitor_down_results_total metric is not initialized until a failure event occurs, so any PyBal instance that always remains healthy never exports it and is flagged as problematic by AlertLintProblem.

Compare a site with metrics:

# HELP pybal_monitor_down_results_total Monitor down result count
# TYPE pybal_monitor_down_results_total counter
pybal_monitor_down_results_total{host="cp4037.ulsfo.wmnet",monitor="IdleConnection",service="testlb_443"} 22.0
pybal_monitor_down_results_total{host="ncredir4001.ulsfo.wmnet",monitor="ProxyFetch",service="ncredirlb_443"} 1.0
[...]
# TYPE pybal_monitor_down_results_created gauge
[...]

With a problem site:

# HELP pybal_monitor_down_results_total Monitor down result count
# TYPE pybal_monitor_down_results_total counter
# HELP pybal_monitor_status Monitor up status
# TYPE pybal_monitor_status gauge
[...]

Prometheus advises against missing metrics, so we should initialize the counters to 0 when PyBal starts.

See the _resultDown method as a starting point.
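As a rough illustration of the proposed fix, here is a minimal standard-library sketch of a labeled counter. It is not PyBal's code: the LabeledCounter class, its init/inc/expose methods, and the init_down_counters helper are all hypothetical names made up for this example. It only shows why a label set that has never been incremented produces no sample line, and how registering each label set at startup makes the series appear with a value of 0.

```python
class LabeledCounter:
    """Tiny stand-in for a labeled Prometheus counter (illustration only)."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = {}  # tuple of label values -> float

    def _key(self, labels):
        return tuple(labels[n] for n in self.label_names)

    def init(self, **labels):
        """Create the series at 0 without counting an event."""
        self._values.setdefault(self._key(labels), 0.0)

    def inc(self, **labels):
        """Count one event, creating the series if it does not exist yet."""
        key = self._key(labels)
        self._values[key] = self._values.get(key, 0.0) + 1.0

    def expose(self):
        """Render the text exposition: HELP/TYPE lines, then one line per series."""
        lines = [
            "# HELP %s %s" % (self.name, self.help_text),
            "# TYPE %s counter" % self.name,
        ]
        for key in sorted(self._values):
            pairs = ",".join(
                '%s="%s"' % (n, v) for n, v in zip(self.label_names, key)
            )
            lines.append("%s{%s} %s" % (self.name, pairs, self._values[key]))
        return "\n".join(lines)


def init_down_counters(counter, monitors):
    """Hypothetical startup hook: register every known label set at 0."""
    for host, monitor, service in monitors:
        counter.init(host=host, monitor=monitor, service=service)


down_results = LabeledCounter(
    "pybal_monitor_down_results_total",
    "Monitor down result count",
    ("host", "monitor", "service"),
)

# Before initialization only the HELP/TYPE lines exist (the "problem site").
before = down_results.expose()

# Assumed list of (host, monitor, service) tuples known at startup.
init_down_counters(
    down_results,
    [("cp4037.ulsfo.wmnet", "IdleConnection", "testlb_443")],
)
after = down_results.expose()
```

If PyBal's metrics are backed by the prometheus_client library, the equivalent step would be calling .labels(...) for each known label combination at startup, which that library documents as the way to initialize a labeled metric. Whether that fits cleanly into PyBal's monitor setup is an assumption that would need checking against the code around _resultDown.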

Event Timeline

BCornwall triaged this task as Medium priority. Dec 19 2023, 11:07 PM
BCornwall moved this task from Backlog to Ready for work on the Traffic board.
Reedy renamed this task from LVS hosts have missing metrics when PyBal never goes down to LVS hosts have missing metrics even though PyBal never goes down. Dec 20 2023, 12:25 AM

The old title was more descriptive IMO: the metric, pybal_monitor_down_results_total, is missing specifically when PyBal never goes down.

BCornwall renamed this task from LVS hosts have missing metrics even though PyBal never goes down to LVS hosts have missing metrics when PyBal never goes down. Dec 20 2023, 12:29 AM
BCornwall updated the task description. (Show Details)

> The old title was more descriptive IMO - The metric, pybal_monitor_down_results_total, is missing specifically when PyBal never goes down.

It doesn't seem to read correctly though? If it was "LVS hosts have missing metrics when PyBal goes down", fine, that makes sense. But with the negation added by "never", it seems clunky.

BCornwall renamed this task from LVS hosts have missing metrics when PyBal never goes down to pybal_monitor_down_results_total metric only created when PyBal goes down. Dec 20 2023, 1:03 AM

Hopefully this helps!

It might be worth investigating whether we can exclude this alert from the linter. The problem shouldn't have any adverse effects (and is somewhat pedantic in any case). Additionally, PyBal is in pretty strict maintenance mode while Liberica is being built as its replacement.

@fgiunchedi Do you have any opinion on this?

Yeah, asking pint to ignore that alert seems like the right thing to do here. There are examples in alerts.git, @BCornwall!

Change 987499 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] pybal: Disable Pint promql/series checks

https://gerrit.wikimedia.org/r/987499

Mentioned in SAL (#wikimedia-operations) [2024-01-04T19:59:47Z] <brett> restarting pybal on lvs5006 for testing purposes - T353760

BCornwall claimed this task.

Okay, lvs5006 has been restarted and the metrics are missing. There aren't any linting alerts now!

Change 987499 merged by BCornwall:

[operations/alerts@master] pybal: Disable Pint promql/series checks

https://gerrit.wikimedia.org/r/987499