Background
In T422455, a regression in coredns resolution latency lasting several weeks was only discovered when periodic jobs started failing more frequently.
As summarized in #2 in T422455#11806562, one reason we did not notice is that the primary symptom by volume, i.e., EtcdConfig fetch timeouts that occur early in CommonSettings.php, is invisible to the mediawiki error and exception log channels unless both of the following hold: (1) the timeout is fatal (i.e., it results in an uncaught ConfigException when no stale APCu-cached config is available) and (2) the request is served by PHP-FPM (where wmerrors will forward the exception to rsyslog). Otherwise, the associated E_USER_WARNING is only visible in the PHP-FPM errorlog channel (there is no custom handler yet).
Indeed, for PHP-FPM workloads, the caching is extremely effective. Although we did see elevated rates of ConfigExceptions reported, the overall rate was far lower than the rate of fetch timeouts or other failures implicating DNS as seen in the errorlog. Note, however, that the latter still have an impact: those are queries that have already spent 2s attempting to refresh the config.
Proposal
We should introduce some way to detect and surface an unusually high rate of EtcdConfig fetch timeouts / failures, even though they occur so early in the request lifetime.
One option would be to configure an elasticsearch-exporter rule that counts the number of instances of these classes of warnings, and then configure an alert based on that count. Given that this is a "we should look into this" problem rather than something requiring immediate attention (see above re: caching), we should configure the alert to open a Phabricator task.
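
As a rough illustration of the counting-and-threshold idea above, the sketch below builds an Elasticsearch query body that counts matching warnings in a recent window, and then applies a simple threshold to decide whether a task should be opened. Everything specific here is an assumption made for illustration: the field names (message, @timestamp), the placeholder warning text, the 15-minute window, and the threshold of 100 are hypothetical and would need to be replaced with values taken from the actual errorlog documents. It is not the exporter's configuration format, only the shape of the query and the decision.

```python
from typing import Any

# Hypothetical values: none of these come from the task or from production.
WARNING_PHRASE = "PLACEHOLDER: EtcdConfig fetch warning text"
WINDOW = "15m"
THRESHOLD_PER_WINDOW = 100


def build_count_query(phrase: str, window: str) -> dict[str, Any]:
    """Build a query body that counts documents containing `phrase`
    logged within the last `window` (Elasticsearch date math, e.g. "15m")."""
    return {
        "size": 0,  # only the hit count is needed, not the documents
        "track_total_hits": True,  # ask for an exact total rather than a capped one
        "query": {
            "bool": {
                "filter": [
                    # Assumed field name for the log message text.
                    {"match_phrase": {"message": phrase}},
                    # Assumed field name for the event timestamp.
                    {"range": {"@timestamp": {"gte": f"now-{window}"}}},
                ]
            }
        },
    }


def should_open_task(hit_count: int, threshold: int = THRESHOLD_PER_WINDOW) -> bool:
    """Return True when the number of warnings in the window exceeds the threshold."""
    return hit_count > threshold


query = build_count_query(WARNING_PHRASE, WINDOW)
```

In practice the threshold would probably be derived from the baseline rate seen in the errorlog during normal operation, so that only a sustained elevation (as in T422455) results in a task.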