The monitoring section of service::catalog contains a significant amount of technical debt (e.g. repetition, historical artifacts, general opaqueness/friction in how to use it).
The scope of this work is to revamp (blackbox) monitoring for service::catalog so that:
- Monitoring for common services (e.g. HTTP) requires little to no configuration
- Alerting is relevant and applies to all services (e.g. SSL expiration monitoring)
The concepts/principles adopted here should also be general enough to be reused in other parts of the infrastructure (e.g. services that are not in the catalog for one reason or another). In other words, it should be easy to ask Puppet "please probe this HTTP service".
The focus will be on HTTP(S) services since those are the most relevant and common.
Status update (Feb 2022): all internal HTTP IPv4 service::catalog services are being probed. For services offering both encrypted and unencrypted versions, the encrypted one is probed since the unencrypted one should be deprecated/unused. Still TODO are "PoP services" like text/upload/ncredir, which are a bit special (non-default IP config, IPv6 enabled, deployed to all sites rather than only codfw/eqiad).
How do network probes work?
There are two configurations: the blackbox-exporter configuration (whose entries are called modules) and the Prometheus targets configuration. The former sets the parameters of a probe "template" (e.g. prefer v4 over v6, which headers to send/expect, valid status codes, etc.), while the latter specifies what to probe, e.g. which host/address to talk to and the URL to probe (for HTTP services).
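To make the split concrete, here is a minimal Python sketch of the two pieces of data and of how they meet in a single probe request. The module fields follow the upstream blackbox-exporter configuration format, and `/probe?module=...&target=...` is the exporter's standard endpoint. The module name, hostname, path, and the `probe_url` helper are hypothetical and used for illustration only.

```python
from urllib.parse import urlencode

# Blackbox-exporter side: a module is a named probe "template".
# Keys follow the upstream config format; the values are made up.
MODULES = {
    "http_myservice_ip4": {
        "prober": "http",
        "timeout": "3s",
        "http": {
            "preferred_ip_protocol": "ip4",
            "ip_protocol_fallback": False,
            "fail_if_not_ssl": True,
            "valid_status_codes": [200],
            "headers": {"Host": "myservice.example.org"},
        },
    },
}

# Prometheus side: which template to apply to which target.
# Representing a target as a (module, target) dict is an assumption.
TARGETS = [
    {"module": "http_myservice_ip4",
     "target": "https://myservice.example.org:443/healthz"},
]


def probe_url(exporter: str, module: str, target: str) -> str:
    """Build the request Prometheus would send to the exporter."""
    if module not in MODULES:
        raise KeyError(f"unknown module: {module}")
    return f"http://{exporter}/probe?" + urlencode(
        {"module": module, "target": target}
    )


for entry in TARGETS:
    print(probe_url("localhost:9115", entry["module"], entry["target"]))
```

The point of the sketch is that a module never names a host and a target never carries probe settings, so one module can be reused across many targets.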
The prometheus::blackbox::modules::service_catalog (source) class generates suitable modules (one or more per HTTP service) from each service's configuration. Most data is inferred from the service itself (e.g. encryption true/false), while some data can be provided or overridden in the probes section of the catalog (e.g. which status codes are valid; see also wmflib::service::probe::http_module_options).
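The inference-plus-override idea can be sketched in Python as follows. This is not the Puppet implementation: the shape of the service entry (`encryption`, a `probe` dict with optional overrides) and the `http_module` helper are assumptions made for the example, while the emitted keys are standard blackbox-exporter HTTP prober options.

```python
def http_module(name: str, service: dict, probe: dict) -> dict:
    """Derive one blackbox-exporter HTTP module from a service entry.

    `service` and `probe` are hypothetical stand-ins for a catalog
    entry and one item of its probes section.
    """
    encrypted = service["encryption"]
    http = {
        "preferred_ip_protocol": "ip4",
        "ip_protocol_fallback": False,
        # Inferred from the service itself.
        "fail_if_ssl": not encrypted,
        "fail_if_not_ssl": encrypted,
    }
    # Optional overrides taken from the probe definition.
    for key in ("valid_status_codes", "headers"):
        if key in probe:
            http[key] = probe[key]
    return {f"http_{name}_ip4": {"prober": "http", "http": http}}


module = http_module(
    "myservice",
    {"encryption": True},
    {"valid_status_codes": [200, 301]},
)
print(module)
```

A service that needs nothing special passes an empty `probe` dict and still gets a usable module, which is the "little to no configuration" goal stated above.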
The "running" of probes is handled by prometheus::service_catalog_targets (source), which writes a Prometheus targets file containing a list of (module, target) pairs. Each service gets an icmp_ip4 module (i.e. probe template) to verify that pings work as expected. Additionally, an http_<service>_ip4 module is used to run the service-specific probe. The target is the URL to probe, i.e. which DNS name (or address), port and path to use. Note that the probes are run from the Prometheus hosts within the same site (i.e. they no longer cross the WAN).
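As a rough illustration of the (module, target) pairs, the Python sketch below builds them for one service and renders them in the shape of a Prometheus file-based service discovery file (a list of `targets`/`labels` groups). The service fields, the hostname, and the choice to carry the module name in a `module` label are assumptions for the example; the real targets file may be laid out differently.

```python
import json


def probe_targets(name: str, service: dict) -> list[tuple[str, str]]:
    """Return (module, target) pairs for one hypothetical service entry."""
    scheme = "https" if service["encryption"] else "http"
    path = service.get("path", "/")
    url = f"{scheme}://{service['host']}:{service['port']}{path}"
    return [
        ("icmp_ip4", service["host"]),   # generic ping check
        (f"http_{name}_ip4", url),       # service-specific HTTP probe
    ]


def to_file_sd(pairs: list[tuple[str, str]]) -> list[dict]:
    """Render pairs as file_sd-style target groups (module kept as a label)."""
    return [
        {"targets": [target], "labels": {"module": module}}
        for module, target in pairs
    ]


service = {
    "host": "myservice.example.org",
    "port": 4443,
    "encryption": True,
    "path": "/healthz",
}
pairs = probe_targets("myservice", service)
print(json.dumps(to_file_sd(pairs), indent=2))
```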
PoP services
The current implementation covers all internal services (with or without discovery records) and some public services (those without LVS), and it makes some assumptions (e.g. which DNS zones to use).
The missing services are the public LVS services, deployed to all sites: text/upload/ncredir. These are peculiar in some ways:
- their hostnames are on "SITE.wikimedia.org"
- both IPv4 and IPv6 need to be probed
Docs
Runbooks at https://wikitech.wikimedia.org/wiki/Network_monitoring#Blackbox_Probes_(Prometheus)