Page MenuHomePhabricator

Move service::catalog checks (“monitoring” section) to blackbox exporter and Alertmanager
Closed, ResolvedPublic

Description

The monitoring section of service::catalog contains a significant amount of technical debt (e.g. repetition, historical artifacts, general opaqueness/friction in how to use it).

The scope of this work is to revamp (blackbox) monitoring for service::catalog so that:

  • Monitoring for common services (i.e. HTTP) requires minimal to no configuration
  • Alerting is relevant and general for all services (e.g. ssl expiration monitoring)

The concepts/principles adopted here should be also general enough to be reused in other parts of the infrastructure (e.g. services not in the catalog for one reason or another). In other words it should be easy to ask puppet "please probe this HTTP service".

The focus will be on HTTP(s) services since those are the most relevant and common.

Status update (Feb 2022): all internal HTTP ipv4 service::catalog services are being probed. For services offering both encrypted and unencrypted versions the former is probed since the latter should be deprecated/unused. Still TODO are "pop services" like text/upload/ncredir which are a bit special (non-default ip config, ipv6 enabled, deployed to all sites not only codfw/eqiad)

How do network probes will work?

There are two configurations: the blackbox-exporter configuration (called modules) and the Prometheus targets configuration. The former is used to set parameters for a probe "template" (e.g. prefer v4 over v6, which headers to send/expect, valid status codes, etc) while the latter is used to specify e.g. which host/address to talk to and the URL to probe (for HTTP services)

The prometheus::blackbox::modules::service_catalog (source) class generates suitable modules (one or more per HTTP service) from each service configuration; most data is inferred from the service itself (e.g. encryption true/false) while some data can be provided/overridden in the probes section of the catalog (e.g. which status codes are valid; see also wmflib::service::probe::http_module_options)

The "running" of probes is handled by prometheus::service_catalog_targets (source) which writes a Prometheus targets file containing a list of (module, target) pairs. Each service gets a icmp_ip4 module (i.e. probe template) to verify pings work as expected. Additionally a http_<service>_ip4 module is used to run the service-specific probe. The target is the URL to probe (which DNS name (or address), port and path to use). Something worth noting is that the probes are run from the Prometheus hosts within the same site (i.e. they don't cross the WAN anymore)

PoP services

The current implementation covers all internal services (with or without discovery records) and some public services (without lvs) and makes some assumptions (e.g. which DNS zones to use).

The missing services are the public LVS services, deployed to all sites: text/upload/ncredir. These are peculiar in some ways:

  • their hostnames are on "SITE.wikimedia.org"
  • need both ipv4 and ipv6 to be probed

Docs

Runbooks at https://wikitech.wikimedia.org/wiki/Network_monitoring#Blackbox_Probes_(Prometheus)

Details

ProjectBranchLines +/-Subject
operations/puppetproduction+0 -639
operations/alertsmaster+13 -13
operations/puppetproduction+10 -14
operations/puppetproduction+25 -99
operations/puppetproduction+0 -14
operations/puppetproduction+3 -3
operations/puppetproduction+1 -0
operations/puppetproduction+32 -2
operations/puppetproduction+1 -0
operations/alertsmaster+27 -0
operations/alertsmaster+86 -3
operations/alertsmaster+32 -12
operations/puppetproduction+4 -0
operations/puppetproduction+108 -0
operations/puppetproduction+4 -14
operations/puppetproduction+1 -26
operations/alertsmaster+19 -7
operations/puppetproduction+36 -31
operations/puppetproduction+6 -4
operations/puppetproduction+219 -237
operations/puppetproduction+1 -0
operations/puppetproduction+21 -1
operations/puppetproduction+5 -0
operations/puppetproduction+2 -0
operations/puppetproduction+0 -1
operations/puppetproduction+22 -14
operations/puppetproduction+6 -0
operations/puppetproduction+0 -21
operations/alertsmaster+23 -0
operations/puppetproduction+1 -1
operations/puppetproduction+24 -0
operations/puppetproduction+5 -5
operations/puppetproduction+2 -2
operations/puppetproduction+24 -1
operations/puppetproduction+50 -87
operations/puppetproduction+3 -0
operations/puppetproduction+23 -15
operations/puppetproduction+23 -15
operations/puppetproduction+3 -1
operations/puppetproduction+2 -0
operations/puppetproduction+10 -5
operations/puppetproduction+24 -9
operations/puppetproduction+14 -1
operations/puppetproduction+10 -0
operations/puppetproduction+7 -9
operations/puppetproduction+32 -0
operations/puppetproduction+17 -3
operations/puppetproduction+0 -9
operations/puppetproduction+25 -6
operations/puppetproduction+1 -2
operations/puppetproduction+99 -0
operations/puppetproduction+8 -0
operations/puppetproduction+96 -54
operations/puppetproduction+17 -8
operations/puppetproduction+32 -1
operations/puppetproduction+143 -1
operations/puppetproduction+6 -1
operations/puppetproduction+34 -16
operations/puppetproduction+1 -1
operations/puppetproduction+2 -2
operations/puppetproduction+62 -1
operations/puppetproduction+23 -0
operations/puppetproduction+21 -0
operations/puppetproduction+181 -1
operations/puppetproduction+8 -4
operations/puppetproduction+0 -14
operations/puppetproduction+12 -0
operations/puppetproduction+273 -74
operations/puppetproduction+16 -0
Show related patches Customize query in gerrit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 757447 merged by Filippo Giunchedi:

[operations/puppet@production] service catalog: introduce 'page' field

https://gerrit.wikimedia.org/r/757447

Change 761279 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: add 'tcp' probe type

https://gerrit.wikimedia.org/r/761279

Change 761279 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: add 'tcp' probe type

https://gerrit.wikimedia.org/r/761279

Change 761888 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] network: make slice_network_constants work outside of module

https://gerrit.wikimedia.org/r/761888

Change 761889 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: add probes for \"pop services\"

https://gerrit.wikimedia.org/r/761889

Change 761890 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: split probes by network sphere and address family

https://gerrit.wikimedia.org/r/761890

Change 761910 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] role: reorder prometheus profile inclusion

https://gerrit.wikimedia.org/r/761910

Change 761888 merged by Filippo Giunchedi:

[operations/puppet@production] network: make slice_network_constants work outside of module

https://gerrit.wikimedia.org/r/761888

Change 761910 merged by Filippo Giunchedi:

[operations/puppet@production] role: reorder prometheus profile inclusion

https://gerrit.wikimedia.org/r/761910

Change 761889 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: add probes for \"pop services\"

https://gerrit.wikimedia.org/r/761889

Change 761890 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: split probes by network sphere and address family

https://gerrit.wikimedia.org/r/761890

Change 762406 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] wmflib: do not quote probe icmp address

https://gerrit.wikimedia.org/r/762406

Change 762406 merged by Filippo Giunchedi:

[operations/puppet@production] wmflib: do not quote probe icmp address

https://gerrit.wikimedia.org/r/762406

Change 762407 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] role: move probe alerts to alerts.git

https://gerrit.wikimedia.org/r/762407

Change 762408 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: add probes alerts

https://gerrit.wikimedia.org/r/762408

Change 762408 merged by Filippo Giunchedi:

[operations/alerts@master] sre: add probes alerts

https://gerrit.wikimedia.org/r/762408

Change 762407 merged by Filippo Giunchedi:

[operations/puppet@production] role: move probe alerts to alerts.git

https://gerrit.wikimedia.org/r/762407

Change 762413 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hiera: expect 301 from text/upload/ncredir on http

https://gerrit.wikimedia.org/r/762413

Change 762413 merged by Filippo Giunchedi:

[operations/puppet@production] hiera: expect 301 from text/upload/ncredir on http

https://gerrit.wikimedia.org/r/762413

Change 762415 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] wmflib: add 'expect_redirect' option to probes

https://gerrit.wikimedia.org/r/762415

Change 762415 merged by Filippo Giunchedi:

[operations/puppet@production] wmflib: add 'expect_redirect' option to probes

https://gerrit.wikimedia.org/r/762415

Change 762416 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hiera: ncredir's /_status doesn't redirect to https

https://gerrit.wikimedia.org/r/762416

Change 762416 merged by Filippo Giunchedi:

[operations/puppet@production] hiera: ncredir's /_status doesn't redirect to https

https://gerrit.wikimedia.org/r/762416

Change 762417 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hiera: fix librenms probe

https://gerrit.wikimedia.org/r/762417

Change 762417 merged by Filippo Giunchedi:

[operations/puppet@production] hiera: fix librenms probe

https://gerrit.wikimedia.org/r/762417

Change 762437 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: fix missing probes configuration

https://gerrit.wikimedia.org/r/762437

Change 762437 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: fix missing probes configuration

https://gerrit.wikimedia.org/r/762437

Change 762446 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: per-service icmp prober module

https://gerrit.wikimedia.org/r/762446

Change 762447 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: fix labweb probe

https://gerrit.wikimedia.org/r/762447

Change 762446 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: per-service icmp prober module

https://gerrit.wikimedia.org/r/762446

Change 762447 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: fix labweb probe

https://gerrit.wikimedia.org/r/762447

Change 762782 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: override probe service address

https://gerrit.wikimedia.org/r/762782

Change 762782 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: override probe service address

https://gerrit.wikimedia.org/r/762782

Change 764371 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: default Host header to SNI

https://gerrit.wikimedia.org/r/764371

Change 764371 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: default Host header to SNI

https://gerrit.wikimedia.org/r/764371

Change 764726 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: adjust ProbeDown alert

https://gerrit.wikimedia.org/r/764726

Change 764726 merged by Filippo Giunchedi:

[operations/alerts@master] sre: adjust ProbeDown alert

https://gerrit.wikimedia.org/r/764726

Moving up to the parent task as requested:

I'm happy to discuss (and tend to agree) whether checking L7/L4 is enough in this case, though IMHO not in this task to not scope creep (this task's parent or other means). I'm not sure what you are suggesting with ""unlimited pings" as an intentionally-supported service" since AFAIK that is noone's expectation nor plan

My point is that ICMP Echo is not the production service on offer at these IPs, and while they do happen to support some ICMP Echo traffic, there's known ratelimiters in place for it, and so I mean very literally that it's known to not be un-limited.

Whatever icinga was doing was (I guess) working, and this quadrupled rate (see T302265#7732588) with the prometheus replacement isn't working. We could tune it on either end, maybe, but I question the value of these pings in general, even when done by icinga. Host IP pings make sense for IPs that belong to a single host and are representative of its basic layer 3 interface connectivity, but IMHO they don't make a ton of sense for L4-routed virtual service IP:ports. It's unclear what exactly the ICMP response really means (that the L4 router is ok? that at least 1/N destination hosts is ok? neither or both? for which of multiple independent service ports on that virtual service IP, which might have different clusters behind them?), and doesn't seem like it would offer any differential diagnostic utility either, vs just having the real L4+ service checks operating on the real production services.

Moving up to the parent task as requested:

I'm happy to discuss (and tend to agree) whether checking L7/L4 is enough in this case, though IMHO not in this task to not scope creep (this task's parent or other means). I'm not sure what you are suggesting with ""unlimited pings" as an intentionally-supported service" since AFAIK that is noone's expectation nor plan

My point is that ICMP Echo is not the production service on offer at these IPs, and while they do happen to support some ICMP Echo traffic, there's known ratelimiters in place for it, and so I mean very literally that it's known to not be un-limited.

Whatever icinga was doing was (I guess) working, and this quadrupled rate (see T302265#7732588) with the prometheus replacement isn't working. We could tune it on either end, maybe, but I question the value of these pings in general, even when done by icinga. Host IP pings make sense for IPs that belong to a single host and are representative of its basic layer 3 interface connectivity, but IMHO they don't make a ton of sense for L4-routed virtual service IP:ports. It's unclear what exactly the ICMP response really means (that the L4 router is ok? that at least 1/N destination hosts is ok? neither or both? for which of multiple independent service ports on that virtual service IP, which might have different clusters behind them?), and doesn't seem like it would offer any differential diagnostic utility either, vs just having the real L4+ service checks operating on the real production services.

Thank you for expanding on the rationale, it makes sense to me to rely on L4+ checks to catch multiple conditions too. The pings could mean "is the address even announced/routed on the network?" but as you said the higher layer checks would catch that too anyways. I'll followup with a patch to disable the icmp probes (which incidentally also simplifies things a little bit)

Change 765548 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: ditch automatic icmp probes for service::catalog

https://gerrit.wikimedia.org/r/765548

Change 765548 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: ditch automatic icmp probes for service::catalog

https://gerrit.wikimedia.org/r/765548

Change 765476 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] logstash: add blackbox-exporter filter config

https://gerrit.wikimedia.org/r/765476

Change 767775 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: use single probes/service job

https://gerrit.wikimedia.org/r/767775

Change 767775 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: use single probes/service job

https://gerrit.wikimedia.org/r/767775

Change 765476 merged by Filippo Giunchedi:

[operations/puppet@production] logstash: add blackbox-exporter filter config

https://gerrit.wikimedia.org/r/765476

Mentioned in SAL (#wikimedia-operations) [2022-03-03T16:30:57Z] <godog> roll-restart logstash to pick up config changes - T291946

Change 768039 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hiera: allow access to am api to cumin hosts

https://gerrit.wikimedia.org/r/768039

Change 768039 merged by Filippo Giunchedi:

[operations/puppet@production] hiera: allow access to am api to cumin hosts

https://gerrit.wikimedia.org/r/768039

Change 771883 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: add dashboard to network probes alerts

https://gerrit.wikimedia.org/r/771883

Change 771883 merged by Filippo Giunchedi:

[operations/alerts@master] sre: add dashboard to network probes alerts

https://gerrit.wikimedia.org/r/771883

Change 773747 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: add ProbeDown paging alert for enabled services

https://gerrit.wikimedia.org/r/773747

Change 773747 merged by Filippo Giunchedi:

[operations/alerts@master] sre: add ProbeDown paging alert for enabled services

https://gerrit.wikimedia.org/r/773747

Change 788346 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] team-sre: introduce paging probe down

https://gerrit.wikimedia.org/r/788346

Change 788346 merged by Filippo Giunchedi:

[operations/alerts@master] team-sre: introduce paging probe down

https://gerrit.wikimedia.org/r/788346

Change 790269 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: set ncredir as non-paging

https://gerrit.wikimedia.org/r/790269

Change 790269 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: set ncredir as non-paging

https://gerrit.wikimedia.org/r/790269

Change 790291 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: add 'timeout' override for service::catalog probes

https://gerrit.wikimedia.org/r/790291

Change 790292 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: set thumbor probe timeout

https://gerrit.wikimedia.org/r/790292

Change 790291 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: add 'timeout' override for service::catalog probes

https://gerrit.wikimedia.org/r/790291

Change 790292 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: set thumbor probe timeout

https://gerrit.wikimedia.org/r/790292

Change 790624 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Fix thumbor probe timeout value

https://gerrit.wikimedia.org/r/790624

Change 790624 merged by Filippo Giunchedi:

[operations/puppet@production] Fix thumbor probe timeout value

https://gerrit.wikimedia.org/r/790624

Change 793815 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] lvs: stop double-checking catalog services from Icinga

https://gerrit.wikimedia.org/r/793815

Change 793816 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] icinga: deprecate service::monitor class

https://gerrit.wikimedia.org/r/793816

Change 793817 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] icinga: remove 'monitoring' from service::catalog

https://gerrit.wikimedia.org/r/793817

The new checks/alerts appear to work well: same (or better) SNR, more timely and aggregated by default (i.e. no alert/page spam). I'm planning to deprecate monitoring from service::catalog next week.

Change 797201 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alerts: allow deploying site-specific alerts

https://gerrit.wikimedia.org/r/797201

I am attempting to remove monitoring from service::catalog now. The core functionality is in service::monitor which takes care of creating both the Icinga host and then attach services to those. Additionally, some checks defined elsewhere rely on the service hosts to be present (openapi/swagger checks and cirrus cluster checks)

As I see it, there are two avenues forward:

  1. Remove the Icinga services and leave the Icinga hosts in place. The host definition uses the hostname value from monitoring to set the hostname in Icinga. We don't have that map anywhere else in the catalog, and we could probably get away with using "<servicename>.svc.<site>.wmnet" for everything, except maps/restbase "external" openapi checks defined at the end of lvs::monitor_services.
  2. Remove both Icinga services and hosts. Has the advantage of cleaning up more legacy, but we need to find solutions for monitoring::openapi_service and icinga::monitor::elasticsearch::cirrus_cluster_checks. I believe for now we can shim a @monitoring::host where needed like I tried to do in https://gerrit.wikimedia.org/r/c/operations/puppet/+/793816/2/modules/icinga/manifests/monitor/elasticsearch/cirrus_cluster_checks.pp#19.

Moving the cirrus and openapi checks is outside the scope of this task, hence I'm leaning towards solution 2 and shim service hosts where needed.

  1. Remove both Icinga services and hosts. Has the advantage of cleaning up more legacy, but we need to find solutions for monitoring::openapi_service and icinga::monitor::elasticsearch::cirrus_cluster_checks. I believe for now we can shim a @monitoring::host where needed like I tried to do in https://gerrit.wikimedia.org/r/c/operations/puppet/+/793816/2/modules/icinga/manifests/monitor/elasticsearch/cirrus_cluster_checks.pp#19.

Moving the cirrus and openapi checks is outside the scope of this task, hence I'm leaning towards solution 2 and shim service hosts where needed.

An implementation of this solution is at https://gerrit.wikimedia.org/r/q/topic:bug%252FT291946-monitoring-and-host-removal

Change 793815 merged by Filippo Giunchedi:

[operations/puppet@production] lvs: stop double-checking docker registry from Icinga

https://gerrit.wikimedia.org/r/793815

Change 793816 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: deprecate service::monitor class

https://gerrit.wikimedia.org/r/793816

Change 793816 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: deprecate service::monitor class

https://gerrit.wikimedia.org/r/793816

With this change merged the service::catalog pages from Icinga are effectively disabled \o/

Change 803231 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Deprecate 'monitoring_setup' service state

https://gerrit.wikimedia.org/r/803231

Change 803231 merged by Filippo Giunchedi:

[operations/puppet@production] Deprecate 'monitoring_setup' service state

https://gerrit.wikimedia.org/r/803231

Technically monitoring is not used anymore, however it is still in the catalog pending these changes:

Leaving the task open for now however this is essentially complete

Change 803902 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: include tcp probes in alerts

https://gerrit.wikimedia.org/r/803902

Change 803902 merged by Filippo Giunchedi:

[operations/alerts@master] sre: include tcp probes in alerts

https://gerrit.wikimedia.org/r/803902

Change 805887 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] Traffic: Port IPsec/Strongswan connection alert

https://gerrit.wikimedia.org/r/805887

Change 793817 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: remove 'monitoring' from service::catalog

https://gerrit.wikimedia.org/r/793817

fgiunchedi claimed this task.

It has been a long journey and the effort is already paying off, monitoring is no longer a thing in service::catalog and fully replaced by probes instead. Thanks to all involved!