pybal_monitor_down_results_total metric only created when PyBal goes down
Closed, Resolved · Public

Assigned To

	None

Authored By

	BCornwall
	Dec 19 2023, 11:06 PM

Description

The AlertLintProblem alert has been firing for a few sites:

Screenshot at 2023-12-19 15-01-14.png (329×549 px, 46 KB)

This happens because no series exist for any of the instances at the affected sites. The pybal_monitor_down_results_total metric is not initialized until a failure event occurs, so any PyBal instance that always remains healthy is flagged by AlertLintProblem as problematic.

Compare a site with metrics:

# HELP pybal_monitor_down_results_total Monitor down result count
# TYPE pybal_monitor_down_results_total counter
pybal_monitor_down_results_total{host="cp4037.ulsfo.wmnet",monitor="IdleConnection",service="testlb_443"} 22.0
pybal_monitor_down_results_total{host="ncredir4001.ulsfo.wmnet",monitor="ProxyFetch",service="ncredirlb_443"} 1.0
[...]
# TYPE pybal_monitor_down_results_created gauge
[...]

With a problem site:

# HELP pybal_monitor_down_results_total Monitor down result count
# TYPE pybal_monitor_down_results_total counter
# HELP pybal_monitor_status Monitor up status
# TYPE pybal_monitor_status gauge
[...]

Prometheus best practices advise against metrics that stay missing until the first event, so we should initialize the counters to 0 when PyBal starts.

See the _resultDown method as a starting point.
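A minimal sketch of the idea, using only the standard library. LabelledCounter is a toy stand-in for a labelled counter, and init_monitor_series is a hypothetical startup hook; neither is PyBal's actual code or the Prometheus client's API. The point is only that touching each label combination at startup makes the series appear with a value of 0.0 before any failure is recorded.

```python
class LabelledCounter:
    """Toy model of a labelled counter: a series exists only once it is touched."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._series = {}  # tuple of label values -> current count

    def labels(self, **labels):
        # Creating the key is what makes the series visible, starting at 0.
        key = tuple(labels[n] for n in self.label_names)
        self._series.setdefault(key, 0.0)
        return key

    def inc(self, amount=1.0, **labels):
        self._series[self.labels(**labels)] += amount

    def expose(self):
        # Render a simplified text exposition of all known series.
        lines = [f"# TYPE {self.name} counter"]
        for key in sorted(self._series):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {self._series[key]}")
        return "\n".join(lines)


def init_monitor_series(counter, monitors):
    """Hypothetical startup hook: pre-create one series per configured monitor."""
    for host, monitor, service in monitors:
        counter.labels(host=host, monitor=monitor, service=service)


down_results = LabelledCounter(
    "pybal_monitor_down_results_total", ("host", "monitor", "service")
)
configured = [
    ("cp4037.ulsfo.wmnet", "IdleConnection", "testlb_443"),
    ("ncredir4001.ulsfo.wmnet", "ProxyFetch", "ncredirlb_443"),
]

before = down_results.expose()   # only the TYPE line, as on a problem site
init_monitor_series(down_results, configured)
after = down_results.expose()    # both series now present at 0.0

# A later failure simply increments the already-existing series.
down_results.inc(
    host="cp4037.ulsfo.wmnet", monitor="IdleConnection", service="testlb_443"
)
after_failure = down_results.expose()
```

In the Python Prometheus client, calling labels() on a labelled metric without incrementing it is the documented way to get the same effect, so a real fix would probably do that for each monitor at startup rather than waiting for the first call to _resultDown.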

Details

	Subject	Repo	Branch	Lines +/-
	pybal: Disable Pint promql/series checks	operations/alerts	master	+3 -0

Event Timeline

BCornwall created this task. Dec 19 2023, 11:06 PM

Restricted Application added a subscriber: Aklapper. Dec 19 2023, 11:06 PM

BCornwall triaged this task as Medium priority. Dec 19 2023, 11:07 PM

BCornwall moved this task from Backlog to Ready for work on the Traffic board.

Reedy renamed this task from "LVS hosts have missing metrics when PyBal never goes down" to "LVS hosts have missing metrics even though PyBal never goes down". Dec 20 2023, 12:25 AM

The old title was more descriptive IMO - The metric, pybal_monitor_down_results_total, is missing specifically when PyBal never goes down.

BCornwall renamed this task from "LVS hosts have missing metrics even though PyBal never goes down" to "LVS hosts have missing metrics when PyBal never goes down". Dec 20 2023, 12:29 AM

BCornwall updated the task description.

BCornwall removed BCornwall as the assignee of this task. Dec 20 2023, 12:37 AM

In T353760#9417365, @BCornwall wrote:

The old title was more descriptive IMO - The metric, pybal_monitor_down_results_total, is missing specifically when PyBal never goes down.

It doesn't seem to read correctly though? If it was "LVS hosts have missing metrics when PyBal goes down", fine, that makes sense. But with the negation added by "never", it seems clunky.

Hopefully this helps!

Vgutierrez added a project: PyBal. Dec 20 2023, 4:23 PM

It might be worth investigating whether we can exclude this alert from the linter. This problem shouldn't have any adverse effects (and it is somewhat pedantic in any case). Additionally, PyBal is in pretty strict maintenance mode while Liberica is being built as its replacement.

@fgiunchedi Do you have any opinion on this?

Yeah, asking pint to ignore that alert seems like the right thing to do here. There are examples in alerts.git, @BCornwall!

Change 987499 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] pybal: Disable Pint promql/series checks

https://gerrit.wikimedia.org/r/987499

gerritbot added a project: Patch-For-Review. Jan 3 2024, 11:39 PM

Mentioned in SAL (#wikimedia-operations) [2024-01-04T19:59:47Z] <brett> restarting pybal on lvs5006 for testing purposes - T353760

Okay, lvs5006 has been restarted and the metrics are missing. There aren't any linting alerts now!

Change 987499 merged by BCornwall: