Use DNS name instead of IP in PyBal alerts
Closed, Resolved (Public)

Description

While working through T321605, Search Platform SREs accidentally depooled all eqiad hosts for WCQS*.
This set off a PyBal alert, but we did not notice it because the alert did not include the hostname.

Luckily for us, @RhinosF1 dug up the IP address from our public-facing Puppet repo and contacted us.

Creating this ticket to request that we use DNS hostnames in PyBal VIP alerts instead of, or in addition to, IPs. For example, the current alert reads:

[15:55:54] <+icinga-wm> PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal

An rDNS query of that IP reveals the VIP DNS name as wcqs.svc.eqiad.wmnet, so we'd like to see that in the alert instead of the IP.

Thanks and please let us know if you have any questions.

*WCQS has no official SLO yet, and codfw was not affected, so we don't consider this an official "incident".

Event Timeline

bking added a subscriber: RKemper.

The check is defined in:

modules/pybal/manifests/monitoring.pp:    nrpe::plugin { 'check_pybal_ipvs_diff':

so it runs as a command via NRPE on the local machine. The script it executes is:

./modules/pybal/files/check_pybal_ipvs_diff.py

The code can be viewed via Gitiles at: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/pybal/files/check_pybal_ipvs_diff.py

and here is the latest change to it:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/705375

BCornwall changed the task status from Open to In Progress. (Apr 28 2023, 12:33 AM)
BCornwall claimed this task.
BCornwall triaged this task as Low priority.
BCornwall moved this task from Backlog to Traffic team actively servicing on the Traffic board.

Change 913004 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] pybal: Send service hostnames on alert

https://gerrit.wikimedia.org/r/913004

@bking: My patch still keeps the IP addresses around since I feel that some information is better than no information in the case of DNS lookup failures.
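For illustration, the "hostname plus IP, with a fallback" idea could look roughly like the sketch below. This is not the code from the patch: the function name `label_service` and the injectable `resolver` parameter are hypothetical, and only `socket.gethostbyaddr` is a real standard-library call.

```python
import socket


def label_service(ip, port, resolver=socket.gethostbyaddr):
    """Return 'hostname (ip:port)', or just 'ip:port' if the rDNS lookup fails.

    `resolver` is a hypothetical hook so the lookup can be stubbed out;
    by default it is the standard-library reverse lookup.
    """
    endpoint = f"{ip}:{port}"
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        hostname = resolver(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        # Some information is better than none: keep the bare endpoint.
        return endpoint
    return f"{hostname} ({endpoint})"
```

With a resolver that maps 10.2.2.67 to wcqs.svc.eqiad.wmnet, the label would read `wcqs.svc.eqiad.wmnet (10.2.2.67:443)`. If the lookup raises, the alert still carries `10.2.2.67:443`.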

Change 913004 merged by BCornwall:

[operations/puppet@production] pybal: Fix hostnames not being sent on alert

https://gerrit.wikimedia.org/r/913004

Change 933398 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] pybal: update check to conform to the nagios plugin api

https://gerrit.wikimedia.org/r/933398

Change 933398 merged by BCornwall:

[operations/puppet@production] pybal: Make check conform to the Nagios plugin API

https://gerrit.wikimedia.org/r/933398
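As a rough sketch of what conforming to the Nagios plugin API generally means: a check prints one status line to stdout and signals its state through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The example below is an assumption-laden illustration, not the merged change. The function `build_result`, its parameters, and the message wording other than "Services known to PyBal but not to IPVS" are hypothetical.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def build_result(only_in_pybal, only_in_ipvs):
    """Return (exit_code, status_line) for two sets of 'ip:port' strings.

    Hypothetical helper: the real check may compute and word this differently.
    """
    problems = []
    if only_in_pybal:
        problems.append(
            "Services known to PyBal but not to IPVS: "
            + ", ".join(sorted(only_in_pybal))
        )
    if only_in_ipvs:
        # Hypothetical wording for the opposite direction of the diff.
        problems.append(
            "Services known to IPVS but not to PyBal: "
            + ", ".join(sorted(only_in_ipvs))
        )
    if problems:
        return CRITICAL, "CRITICAL: " + "; ".join(problems)
    return OK, "OK: PyBal and IPVS agree"
```

A caller would print the returned status line and pass the code to `sys.exit()`, so that NRPE reports the state from the exit status rather than from parsing the text.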