The `monitoring` section of `service::catalog` contains a significant amount of technical debt (e.g. repetition, historical artifacts, general opaqueness/friction in how to use it).
The scope of this work is to revamp (blackbox) monitoring for `service::catalog` so that:
* Monitoring for common services (e.g. HTTP) requires minimal to no configuration
* Alerting is relevant and applies generally to all services (e.g. SSL certificate expiration monitoring)
The concepts/principles adopted here should also be general enough to be reused in other parts of the infrastructure (e.g. services not in the catalog for one reason or another). In other words, it should be easy to ask Puppet to "please probe this HTTP service".
The focus will be on HTTP(S) services since those are the most relevant and common.
Status update (Feb 2022): all internal HTTP IPv4 `service::catalog` services are being probed. For services offering both encrypted and unencrypted versions, only the former is probed, since the latter should be deprecated/unused. Still TODO are "PoP services" like text/upload/ncredir, which are a bit special (non-default IP configuration, IPv6 enabled, deployed to all sites rather than only codfw/eqiad).
=== How do network probes work now? ===
There are two configurations: the `blackbox-exporter` configuration (called `modules`) and the Prometheus targets configuration. The former sets parameters for a probe "template" (e.g. prefer v4 over v6, which headers to send/expect, which status codes are valid, etc.), while the latter specifies e.g. which host/address to talk to and which URL to probe (for HTTP services).
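To make the split concrete, here is a minimal sketch of the `modules` side, i.e. a probe template in the upstream blackbox-exporter configuration format. The module name and all values are illustrative, not the actual generated output:

```yaml
# Illustrative blackbox-exporter "modules" entry (probe template).
modules:
  http_example_ip4:
    prober: http
    timeout: 3s
    http:
      preferred_ip_protocol: ip4       # prefer v4 over v6
      valid_status_codes: [200]        # which status codes are valid
      fail_if_not_ssl: true            # expect an encrypted connection
      headers:
        Host: example.discovery.wmnet  # which headers to send
```

The targets configuration then pairs a module like `http_example_ip4` with the concrete URL(s) to probe.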
The `prometheus::blackbox::modules::service_catalog` ([[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/prometheus/manifests/blackbox/modules/service_catalog.pp | source ]]) class generates suitable `modules` (one or more per HTTP service) from each service configuration; most data is inferred from the service itself (e.g. `encryption` true/false), while some data can be provided/overridden in the `probes` section of the catalog (e.g. which status codes are valid; see also `wmflib::service::probe::http_module_options`).
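As a hypothetical example of what such a catalog entry might look like, the sketch below shows a service with an optional `probes` override section. Only the general shape is meant to match the catalog; the service name, paths, and values are invented for illustration:

```yaml
# Hypothetical service::catalog entry (values are illustrative).
example-service:
  description: An example internal HTTP service
  encryption: true     # inferred by the module generator: probe over HTTPS
  port: 443
  probes:
    - type: http
      path: /healthz                  # override the probed path
      valid_status_codes: [200, 301]  # override accepted status codes
```

Anything not overridden here falls back to values inferred from the service definition itself.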
The "running" of probes is handled by `prometheus::service_catalog_targets` ([[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/prometheus/manifests/service_catalog_targets.pp | source ]]), which writes a Prometheus targets file containing a list of (module, target) pairs. Each service gets an `icmp_ip4` module (i.e. probe template) to verify that pings work as expected; additionally, an `http_<service>_ip4` module is used to run the service-specific probe. The `target` is the URL to probe (encoding which DNS name (or address), port, and path to use). Worth noting: the probes are run from the Prometheus hosts within the same site (i.e. they no longer cross the WAN).
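A sketch of the resulting targets file, assuming the standard Prometheus `file_sd` format: each service contributes an ICMP probe plus its service-specific HTTP probe. Names and addresses are illustrative, not taken from the actual generated file:

```yaml
# Illustrative generated targets file: (module, target) pairs.
- labels:
    module: icmp_ip4                     # ping check, shared by all services
  targets:
    - example.svc.eqiad.wmnet
- labels:
    module: http_example-service_ip4     # service-specific HTTP probe
  targets:
    - https://example.svc.eqiad.wmnet:443/healthz
```

The `module` label tells blackbox-exporter which probe template to apply to each target URL.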
==== PoP services ====
The current implementation covers all internal services (with or without discovery records) and some public services (without LVS), and makes some assumptions (e.g. which DNS zones to use).
The missing services are the public LVS services deployed to all sites: text/upload/ncredir. These are peculiar in a few ways:
* their hostnames live under `SITE.wikimedia.org`
* they need both IPv4 and IPv6 to be probed