Maniphest T291946

Move service::catalog checks (“monitoring” section) to blackbox exporter and Alertmanager
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	lmata
	Sep 28 2021, 1:44 PM

Description

The monitoring section of service::catalog contains a significant amount of technical debt (e.g. repetition, historical artifacts, general opaqueness/friction in how to use it).

The scope of this work is to revamp (blackbox) monitoring for service::catalog so that:

Monitoring for common services (i.e. HTTP) requires minimal to no configuration
Alerting is relevant and general for all services (e.g. ssl expiration monitoring)

The concepts/principles adopted here should be also general enough to be reused in other parts of the infrastructure (e.g. services not in the catalog for one reason or another). In other words it should be easy to ask puppet "please probe this HTTP service".

The focus will be on HTTP(s) services since those are the most relevant and common.

Status update (Feb 2022): all internal HTTP ipv4 service::catalog services are being probed. For services offering both encrypted and unencrypted versions the former is probed since the latter should be deprecated/unused. Still TODO are "pop services" like text/upload/ncredir which are a bit special (non-default ip config, ipv6 enabled, deployed to all sites not only codfw/eqiad)

How do network probes will work?

There are two configurations: the blackbox-exporter configuration (called modules) and the Prometheus targets configuration. The former is used to set parameters for a probe "template" (e.g. prefer v4 over v6, which headers to send/expect, valid status codes, etc) while the latter is used to specify e.g. which host/address to talk to and the URL to probe (for HTTP services)

The prometheus::blackbox::modules::service_catalog (source) class generates suitable modules (one or more per HTTP service) from each service configuration; most data is inferred from the service itself (e.g. encryption true/false) while some data can be provided/overridden in the probes section of the catalog (e.g. which status codes are valid; see also wmflib::service::probe::http_module_options)

The "running" of probes is handled by prometheus::service_catalog_targets (source) which writes a Prometheus targets file containing a list of (module, target) pairs. Each service gets a icmp_ip4 module (i.e. probe template) to verify pings work as expected. Additionally a http_<service>_ip4 module is used to run the service-specific probe. The target is the URL to probe (which DNS name (or address), port and path to use). Something worth noting is that the probes are run from the Prometheus hosts within the same site (i.e. they don't cross the WAN anymore)

PoP services

The current implementation covers all internal services (with or without discovery records) and some public services (without lvs) and makes some assumptions (e.g. which DNS zones to use).

The missing services are the public LVS services, deployed to all sites: text/upload/ncredir. These are peculiar in some ways:

their hostnames are on "SITE.wikimedia.org"
need both ipv4 and ipv6 to be probed

Docs

Runbooks at https://wikitech.wikimedia.org/wiki/Network_monitoring#Blackbox_Probes_(Prometheus)

Details

Subject	Repo	Branch	Lines +/-
icinga: remove 'monitoring' from service::catalog	operations/puppet	production	+0 -639
sre: include tcp probes in alerts	operations/alerts	master	+13 -13
Deprecate 'monitoring_setup' service state	operations/puppet	production	+10 -14
icinga: deprecate service::monitor class	operations/puppet	production	+25 -99
lvs: stop double-checking docker registry from Icinga	operations/puppet	production	+0 -14
Fix thumbor probe timeout value	operations/puppet	production	+3 -3
hieradata: set thumbor probe timeout	operations/puppet	production	+1 -0
prometheus: add 'timeout' override for service::catalog probes	operations/puppet	production	+32 -2
hieradata: set ncredir as non-paging	operations/puppet	production	+1 -0
team-sre: introduce paging probe down	operations/alerts	master	+27 -0
sre: add ProbeDown paging alert for enabled services	operations/alerts	master	+86 -3
sre: add dashboard to network probes alerts	operations/alerts	master	+32 -12
hiera: allow access to am api to cumin hosts	operations/puppet	production	+4 -0
logstash: add blackbox-exporter filter config	operations/puppet	production	+108 -0
prometheus: use single probes/service job	operations/puppet	production	+4 -14
prometheus: ditch automatic icmp probes for service::catalog	operations/puppet	production	+1 -26
sre: adjust ProbeDown alert	operations/alerts	master	+19 -7
prometheus: default Host header to SNI	operations/puppet	production	+36 -31
prometheus: override probe service address	operations/puppet	production	+6 -4
prometheus: split probes by network sphere and address family	operations/puppet	production	+219 -237
hieradata: fix labweb probe	operations/puppet	production	+1 -0
prometheus: per-service icmp prober module	operations/puppet	production	+21 -1
hieradata: fix missing probes configuration	operations/puppet	production	+5 -0
hiera: fix librenms probe	operations/puppet	production	+2 -0
hiera: ncredir's /_status doesn't redirect to https	operations/puppet	production	+0 -1
wmflib: add 'expect_redirect' option to probes	operations/puppet	production	+22 -14
hiera: expect 301 from text/upload/ncredir on http	operations/puppet	production	+6 -0
role: move probe alerts to alerts.git	operations/puppet	production	+0 -21
sre: add probes alerts	operations/alerts	master	+23 -0
wmflib: do not quote probe icmp address	operations/puppet	production	+1 -1
hieradata: add probes for "pop services"	operations/puppet	production	+24 -0
role: reorder prometheus profile inclusion	operations/puppet	production	+5 -5
network: make slice_network_constants work outside of module	operations/puppet	production	+2 -2
prometheus: add 'tcp' probe type	operations/puppet	production	+24 -1
service catalog: introduce 'page' field	operations/puppet	production	+50 -87
hieradata: force Host header for thanos-swift	operations/puppet	production	+3 -0
P:base::firewall: Add prometheus hosts to catch all ferm rule	operations/puppet	production	+23 -15
P:base::firewall: Add prometheus hosts to catch all ferm rule	operations/puppet	production	+23 -15
P:base::firewall: Add proemethous hosts to catch all ferm rule	operations/puppet	production	+3 -1
hieradata: expect 404 when probing puppetdb-api/	operations/puppet	production	+2 -0
puppetdb-api: allow prometheus_nodes via ferm	operations/puppet	production	+10 -5
prometheus: support probes behind SSO	operations/puppet	production	+24 -9
prometheus: override valid status codes for http probes	operations/puppet	production	+14 -1
prometheus: add probes for non-lvs services	operations/puppet	production	+10 -0
prometheus: match SNI with Host when overridden	operations/puppet	production	+7 -9
hieradata: probe with http host override	operations/puppet	production	+32 -0
prometheus: add Host header support to probes	operations/puppet	production	+17 -3
prometheus: skip tcp module, already in http module	operations/puppet	production	+0 -9
prometheus: handle non-LVS service::catalog entries	operations/puppet	production	+25 -6
hieradata: use / as miscweb health check	operations/puppet	production	+1 -2
hieradata: add more network probes for internal services	operations/puppet	production	+99 -0
hieradata: add zotero and helm-charts probes	operations/puppet	production	+8 -0
prometheus: extend blackbox probes options	operations/puppet	production	+96 -54
prometheus: consider non-discovery case when sending SNI in blackbox	operations/puppet	production	+17 -8
prometheus: export service catalog metrics	operations/puppet	production	+32 -1
prometheus: add mini-textfile-exporter	operations/puppet	production	+143 -1
prometheus: don't probe services not deployed in the current site	operations/puppet	production	+6 -1
prometheus: pin discovery probes to their site	operations/puppet	production	+34 -16
hieradata: fix probe url for schema service	operations/puppet	production	+1 -1
prometheus: move probe alerts to severity 'warning'	operations/puppet	production	+2 -2
hieradata: add probes for non-critical catalog services	operations/puppet	production	+62 -1
alertmanager: add inhibit rules for network probes	operations/puppet	production	+23 -0
prometheus: add alerts for network probes	operations/puppet	production	+21 -0
prometheus: add blackbox/discovery jobs	operations/puppet	production	+181 -1
prometheus: bump logging level for blackbox-exporter	operations/puppet	production	+8 -4
prometheus: remove broken blackbox check for tools.wmflabs.org	operations/puppet	production	+0 -14
prometheus: use ip_protocol_fallback only on newer blackbox-exporter versions	operations/puppet	production	+12 -0
prometheus: support for blackbox configuration fragments	operations/puppet	production	+273 -74
wmflib: add 'probes' to service::catalog type	operations/puppet	production	+16 -0

Related Objects
Search...

Status	Assigned	Task
Resolved	fgiunchedi	T291946 Move service::catalog checks (“monitoring” section) to blackbox exporter and Alertmanager
Resolved	fgiunchedi	T297604 cergen should include the cert's name in SAN too
Resolved	fgiunchedi	T299633 Add search.discovery.wmnet to search certs SAN
Open	None	T300119 Using port in Host header for thanos-swift / thanos-query breaks vhost selection
Resolved	fgiunchedi	T302265 blackbox-exporter no icmp replies on prometheus1006 for a few services
Resolved	fgiunchedi	T309107 Set "timeout" value per-probe
Open	fgiunchedi	T316022 Clean up check_ssl checks from puppet also covered by blackbox prober

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 757447 merged by Filippo Giunchedi:

[operations/puppet@production] service catalog: introduce 'page' field

https://gerrit.wikimedia.org/r/757447

fgiunchedi updated the task description. (Show Details)Feb 2 2022, 10:39 AM

Change 761279 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: add 'tcp' probe type

https://gerrit.wikimedia.org/r/761279

fgiunchedi updated the task description. (Show Details)Feb 9 2022, 2:06 PM

Change 761279 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: add 'tcp' probe type

https://gerrit.wikimedia.org/r/761279

Change 761888 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] network: make slice_network_constants work outside of module

https://gerrit.wikimedia.org/r/761888

Change 761889 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: add probes for \"pop services\"

https://gerrit.wikimedia.org/r/761889

Change 761890 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: split probes by network sphere and address family

https://gerrit.wikimedia.org/r/761890

Change 761910 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] role: reorder prometheus profile inclusion

https://gerrit.wikimedia.org/r/761910

Change 761888 merged by Filippo Giunchedi:

[operations/puppet@production] network: make slice_network_constants work outside of module

https://gerrit.wikimedia.org/r/761888

Change 761910 merged by Filippo Giunchedi:

[operations/puppet@production] role: reorder prometheus profile inclusion

https://gerrit.wikimedia.org/r/761910

Change 761889 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: add probes for \"pop services\"

https://gerrit.wikimedia.org/r/761889

Change 761890 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: split probes by network sphere and address family

https://gerrit.wikimedia.org/r/761890

Change 762406 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] wmflib: do not quote probe icmp address

https://gerrit.wikimedia.org/r/762406

Change 762406 merged by Filippo Giunchedi:

[operations/puppet@production] wmflib: do not quote probe icmp address

https://gerrit.wikimedia.org/r/762406

Change 762407 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] role: move probe alerts to alerts.git

https://gerrit.wikimedia.org/r/762407

Change 762408 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: add probes alerts

https://gerrit.wikimedia.org/r/762408

Change 762408 merged by Filippo Giunchedi:

[operations/alerts@master] sre: add probes alerts

https://gerrit.wikimedia.org/r/762408

Change 762407 merged by Filippo Giunchedi:

[operations/puppet@production] role: move probe alerts to alerts.git

https://gerrit.wikimedia.org/r/762407

Change 762413 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hiera: expect 301 from text/upload/ncredir on http

https://gerrit.wikimedia.org/r/762413

Change 762413 merged by Filippo Giunchedi:

[operations/puppet@production] hiera: expect 301 from text/upload/ncredir on http

https://gerrit.wikimedia.org/r/762413

Change 762415 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] wmflib: add 'expect_redirect' option to probes

https://gerrit.wikimedia.org/r/762415

JMeybohm unsubscribed.Feb 14 2022, 10:55 AM

Change 762415 merged by Filippo Giunchedi:

[operations/puppet@production] wmflib: add 'expect_redirect' option to probes

https://gerrit.wikimedia.org/r/762415

Change 762416 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hiera: ncredir's /_status doesn't redirect to https

https://gerrit.wikimedia.org/r/762416

Change 762416 merged by Filippo Giunchedi:

[operations/puppet@production] hiera: ncredir's /_status doesn't redirect to https

https://gerrit.wikimedia.org/r/762416

Change 762417 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hiera: fix librenms probe

https://gerrit.wikimedia.org/r/762417

Change 762417 merged by Filippo Giunchedi:

[operations/puppet@production] hiera: fix librenms probe

https://gerrit.wikimedia.org/r/762417

Change 762437 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: fix missing probes configuration

https://gerrit.wikimedia.org/r/762437

Change 762437 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: fix missing probes configuration

https://gerrit.wikimedia.org/r/762437

Change 762446 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: per-service icmp prober module

https://gerrit.wikimedia.org/r/762446

Change 762447 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: fix labweb probe

https://gerrit.wikimedia.org/r/762447

Change 762446 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: per-service icmp prober module

https://gerrit.wikimedia.org/r/762446

Change 762447 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: fix labweb probe

https://gerrit.wikimedia.org/r/762447

Change 762782 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: override probe service address

https://gerrit.wikimedia.org/r/762782

Change 762782 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: override probe service address

https://gerrit.wikimedia.org/r/762782

Change 764371 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: default Host header to SNI

https://gerrit.wikimedia.org/r/764371

Change 764371 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: default Host header to SNI

https://gerrit.wikimedia.org/r/764371

Change 764726 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: adjust ProbeDown alert

https://gerrit.wikimedia.org/r/764726

Change 764726 merged by Filippo Giunchedi:

[operations/alerts@master] sre: adjust ProbeDown alert

https://gerrit.wikimedia.org/r/764726

Moving up to the parent task as requested:

I'm happy to discuss (and tend to agree) whether checking L7/L4 is enough in this case, though IMHO not in this task to not scope creep (this task's parent or other means). I'm not sure what you are suggesting with ""unlimited pings" as an intentionally-supported service" since AFAIK that is noone's expectation nor plan

My point is that ICMP Echo is not the production service on offer at these IPs, and while they do happen to support some ICMP Echo traffic, there's known ratelimiters in place for it, and so I mean very literally that it's known to not be un-limited.

Whatever icinga was doing was (I guess) working, and this quadrupled rate (see T302265#7732588) with the prometheus replacement isn't working. We could tune it on either end, maybe, but I question the value of these pings in general, even when done by icinga. Host IP pings make sense for IPs that belong to a single host and are representative of its basic layer 3 interface connectivity, but IMHO they don't make a ton of sense for L4-routed virtual service IP:ports. It's unclear what exactly the ICMP response really means (that the L4 router is ok? that at least 1/N destination hosts is ok? neither or both? for which of multiple independent service ports on that virtual service IP, which might have different clusters behind them?), and doesn't seem like it would offer any differential diagnostic utility either, vs just having the real L4+ service checks operating on the real production services.

In T291946#7732595, @BBlack wrote:

Moving up to the parent task as requested:

I'm happy to discuss (and tend to agree) whether checking L7/L4 is enough in this case, though IMHO not in this task to not scope creep (this task's parent or other means). I'm not sure what you are suggesting with ""unlimited pings" as an intentionally-supported service" since AFAIK that is noone's expectation nor plan

My point is that ICMP Echo is not the production service on offer at these IPs, and while they do happen to support some ICMP Echo traffic, there's known ratelimiters in place for it, and so I mean very literally that it's known to not be un-limited.

Whatever icinga was doing was (I guess) working, and this quadrupled rate (see T302265#7732588) with the prometheus replacement isn't working. We could tune it on either end, maybe, but I question the value of these pings in general, even when done by icinga. Host IP pings make sense for IPs that belong to a single host and are representative of its basic layer 3 interface connectivity, but IMHO they don't make a ton of sense for L4-routed virtual service IP:ports. It's unclear what exactly the ICMP response really means (that the L4 router is ok? that at least 1/N destination hosts is ok? neither or both? for which of multiple independent service ports on that virtual service IP, which might have different clusters behind them?), and doesn't seem like it would offer any differential diagnostic utility either, vs just having the real L4+ service checks operating on the real production services.

Thank you for expanding on the rationale, it makes sense to me to rely on L4+ checks to catch multiple conditions too. The pings could mean "is the address even announced/routed on the network?" but as you said the higher layer checks would catch that too anyways. I'll followup with a patch to disable the icmp probes (which incidentally also simplifies things a little bit)

Change 765548 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: ditch automatic icmp probes for service::catalog

https://gerrit.wikimedia.org/r/765548

Change 765548 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: ditch automatic icmp probes for service::catalog

https://gerrit.wikimedia.org/r/765548

Change 765476 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] logstash: add blackbox-exporter filter config

https://gerrit.wikimedia.org/r/765476

fgiunchedi closed subtask T302265: blackbox-exporter no icmp replies on prometheus1006 for a few services as Resolved.Feb 25 2022, 2:21 PM

Change 767775 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: use single probes/service job

https://gerrit.wikimedia.org/r/767775

Change 767775 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: use single probes/service job

https://gerrit.wikimedia.org/r/767775

Change 765476 merged by Filippo Giunchedi:

[operations/puppet@production] logstash: add blackbox-exporter filter config

https://gerrit.wikimedia.org/r/765476

Mentioned in SAL (#wikimedia-operations) [2022-03-03T16:30:57Z] <godog> roll-restart logstash to pick up config changes - T291946

Change 768039 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hiera: allow access to am api to cumin hosts

https://gerrit.wikimedia.org/r/768039

Change 768039 merged by Filippo Giunchedi:

[operations/puppet@production] hiera: allow access to am api to cumin hosts

https://gerrit.wikimedia.org/r/768039

Change 771883 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: add dashboard to network probes alerts

https://gerrit.wikimedia.org/r/771883

Change 771883 merged by Filippo Giunchedi:

[operations/alerts@master] sre: add dashboard to network probes alerts

https://gerrit.wikimedia.org/r/771883

fgiunchedi mentioned this in T304237: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet.Mar 21 2022, 8:46 AM

Change 773747 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: add ProbeDown paging alert for enabled services

https://gerrit.wikimedia.org/r/773747

BTullis subscribed.Mar 25 2022, 7:41 PM

fgiunchedi updated the task description. (Show Details)Mar 28 2022, 7:40 AM

Change 773747 merged by Filippo Giunchedi:

[operations/alerts@master] sre: add ProbeDown paging alert for enabled services

https://gerrit.wikimedia.org/r/773747

JMeybohm mentioned this in T305358: service::catalog entries and dnsdisc for Kubernetes services under Ingress.Apr 4 2022, 10:00 AM

fgiunchedi moved this task from FY2021/2022-Q3 to FY2021/2022-Q4 on the SRE Observability board.Apr 11 2022, 12:55 PM

fgiunchedi edited projects, added SRE Observability (FY2021/2022-Q4); removed SRE Observability (FY2021/2022-Q3).

fgiunchedi mentioned this in T305847: Migrate SRE paging alerts off Icinga and to Alertmanager.Apr 11 2022, 1:26 PM

Change 788346 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] team-sre: introduce paging probe down

https://gerrit.wikimedia.org/r/788346

Change 788346 merged by Filippo Giunchedi:

[operations/alerts@master] team-sre: introduce paging probe down

https://gerrit.wikimedia.org/r/788346

Change 790269 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: set ncredir as non-paging

https://gerrit.wikimedia.org/r/790269

Change 790269 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: set ncredir as non-paging

https://gerrit.wikimedia.org/r/790269

Change 790291 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: add 'timeout' override for service::catalog probes

https://gerrit.wikimedia.org/r/790291

Change 790292 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: set thumbor probe timeout

https://gerrit.wikimedia.org/r/790292

Change 790291 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: add 'timeout' override for service::catalog probes

https://gerrit.wikimedia.org/r/790291

Change 790292 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: set thumbor probe timeout

https://gerrit.wikimedia.org/r/790292

Change 790624 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Fix thumbor probe timeout value

https://gerrit.wikimedia.org/r/790624

Change 790624 merged by Filippo Giunchedi:

[operations/puppet@production] Fix thumbor probe timeout value

https://gerrit.wikimedia.org/r/790624

Change 793815 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] lvs: stop double-checking catalog services from Icinga

https://gerrit.wikimedia.org/r/793815

Change 793816 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] icinga: deprecate service::monitor class

https://gerrit.wikimedia.org/r/793816

Change 793817 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] icinga: remove 'monitoring' from service::catalog

https://gerrit.wikimedia.org/r/793817

The new checks/alerts appear to work well: same (or better) SNR, more timely and aggregated by default (i.e. no alert/page spam). I'm planning to deprecate monitoring from service::catalog next week.

Change 797201 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alerts: allow deploying site-specific alerts

https://gerrit.wikimedia.org/r/797201

I am attempting to remove monitoring from service::catalog now. The core functionality is in service::monitor which takes care of creating both the Icinga host and then attach services to those. Additionally, some checks defined elsewhere rely on the service hosts to be present (openapi/swagger checks and cirrus cluster checks)

As I see it, there are two avenues forward:

Remove the Icinga services and leave the Icinga hosts in place. The host definition uses the hostname value from monitoring to set the hostname in Icinga. We don't have that map anywhere else in the catalog, and we could probably get away with using "<servicename>.svc.<site>.wmnet" for everything, except maps/restbase "external" openapi checks defined at the end of lvs::monitor_services.
Remove both Icinga services and hosts. Has the advantage of cleaning up more legacy, but we need to find solutions for monitoring::openapi_service and icinga::monitor::elasticsearch::cirrus_cluster_checks. I believe for now we can shim a @monitoring::host where needed like I tried to do in https://gerrit.wikimedia.org/r/c/operations/puppet/+/793816/2/modules/icinga/manifests/monitor/elasticsearch/cirrus_cluster_checks.pp#19.

Moving the cirrus and openapi checks is outside the scope of this task, hence I'm leaning towards solution 2 and shim service hosts where needed.

herron awarded a token.May 23 2022, 2:08 PM

In T291946#7949983, @fgiunchedi wrote:

Remove both Icinga services and hosts. Has the advantage of cleaning up more legacy, but we need to find solutions for monitoring::openapi_service and icinga::monitor::elasticsearch::cirrus_cluster_checks. I believe for now we can shim a @monitoring::host where needed like I tried to do in https://gerrit.wikimedia.org/r/c/operations/puppet/+/793816/2/modules/icinga/manifests/monitor/elasticsearch/cirrus_cluster_checks.pp#19.

Moving the cirrus and openapi checks is outside the scope of this task, hence I'm leaning towards solution 2 and shim service hosts where needed.

An implementation of this solution is at https://gerrit.wikimedia.org/r/q/topic:bug%252FT291946-monitoring-and-host-removal

Change 793815 merged by Filippo Giunchedi:

[operations/puppet@production] lvs: stop double-checking docker registry from Icinga

https://gerrit.wikimedia.org/r/793815

fgiunchedi closed subtask T309107: Set "timeout" value per-probe as Resolved.May 25 2022, 1:09 PM

Change 793816 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: deprecate service::monitor class

https://gerrit.wikimedia.org/r/793816

In T291946#7956942, @gerritbot wrote:

Change 793816 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: deprecate service::monitor class

https://gerrit.wikimedia.org/r/793816

With this change merged the service::catalog pages from Icinga are effectively disabled \o/

Change 803231 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Deprecate 'monitoring_setup' service state

https://gerrit.wikimedia.org/r/803231

Change 803231 merged by Filippo Giunchedi:

[operations/puppet@production] Deprecate 'monitoring_setup' service state

https://gerrit.wikimedia.org/r/803231

Technically monitoring is not used anymore, however it is still in the catalog pending these changes:

Leaving the task open for now however this is essentially complete

Change 803902 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: include tcp probes in alerts

https://gerrit.wikimedia.org/r/803902

Change 803902 merged by Filippo Giunchedi:

[operations/alerts@master] sre: include tcp probes in alerts

https://gerrit.wikimedia.org/r/803902

Change 805887 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] Traffic: Port IPsec/Strongswan connection alert

https://gerrit.wikimedia.org/r/805887

Change 793817 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: remove 'monitoring' from service::catalog

https://gerrit.wikimedia.org/r/793817

It has been a long journey and the effort is already paying off, monitoring is no longer a thing in service::catalog and fully replaced by probes instead. Thanks to all involved!

colewhite moved this task from Inbox to Done on the SRE Observability (FY2021/2022-Q4) board.Jul 1 2022, 11:03 PM

Clement_Goubert mentioned this in T213689: Create a readiness probe for zotero.Apr 10 2024, 12:17 PM

Move service::catalog checks (“monitoring” section) to blackbox exporter and AlertmanagerClosed, ResolvedPublicActions