Detect elevated rates of EtcdConfig fetch failures
Open, Medium, Public

Description

Background

In T422455, a regression in coredns resolution latency lasting several weeks was only discovered when periodic jobs started failing more frequently.

As summarized in point #2 of T422455#11806562, one reason we did not notice is that the primary symptom by volume (i.e., EtcdConfig fetch timeouts that occur early in CommonSettings.php) is invisible to the mediawiki error and exception log channels, except when the failure is both (1) fatal (i.e., it results in an uncaught ConfigException because no stale APCu-cached config is available) and (2) in PHP-FPM (where wmerrors will forward the exception to rsyslog). Otherwise, the associated E_USER_WARNING is only visible in the PHP-FPM errorlog channel (there is no custom handler yet).

Indeed, for PHP-FPM workloads, the caching is extremely effective. Although we did see elevated rates of ConfigExceptions reported, the overall rate was far lower than the rate of fetch timeouts or other failures implicating DNS seen in the errorlog. Note that the latter are not harmless either: those are queries that have already spent 2s attempting to refresh the config.

Proposal

We should introduce some way to detect and surface an unusually high rate of EtcdConfig fetch timeouts / failures, despite the fact that they occur so early in request lifetime.

One option would be to configure an elasticsearch-exporter rule that counts the number of instances of these classes of warnings, and then configure an alert based on that count. Given that this is a "we should look into this" problem rather than something requiring urgent attention (see above re: caching), we should configure the alert to open a Phabricator task.
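To make the count-then-alert idea concrete, here is a minimal Python sketch of the logic such a rule would encode. It is not an elasticsearch-exporter configuration, and the message pattern, the 5-minute window, and the threshold are all assumptions for illustration; the real warning text emitted for EtcdConfig fetch failures may differ.

```python
import re
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical pattern: the actual warning text in the errorlog may differ.
FETCH_FAILURE = re.compile(r"EtcdConfig.*(?:timed out|failed)", re.IGNORECASE)


def count_fetch_failures(records, window=timedelta(minutes=5)):
    """Count matching warnings per fixed window.

    `records` is an iterable of (timezone-aware datetime, message) pairs.
    Returns a Counter keyed by window start, in epoch seconds.
    """
    step = int(window.total_seconds())
    counts = Counter()
    for ts, message in records:
        if FETCH_FAILURE.search(message):
            bucket = int(ts.timestamp()) // step * step
            counts[bucket] += 1
    return counts


def windows_over_threshold(counts, threshold):
    """Return the sorted window starts whose count exceeds `threshold`."""
    return sorted(start for start, n in counts.items() if n > threshold)
```

In this sketch, a non-empty result from `windows_over_threshold` would correspond to the condition that opens a task rather than paging anyone, matching the low urgency described above.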

Event Timeline

Scott_French moved this task from Inbox to Needs Info / Blocked on the ServiceOps new board.

Moving this to Needs Info while we converge on whether this sounds reasonable. If it does, I'd propose we schedule it for this quarter.

@Clement_Goubert @JMeybohm - Does this sound reasonable to you? I think it should be relatively low effort (e.g., exporter rule -> task-severity alert), and is a surprising enough gap in our monitoring that it probably makes sense to prioritize soon (i.e., errorlog is basically /dev/null).

Yep that sounds reasonable.

I agree. In parallel we should probably open a task for the MW folks to improve the error reporting so that we can get more meaningful/precise error messages (like you described in T346971#11791136).

Thank you both! Optimistically moving this to "this Q" given the relative implementation cost vs. benefit.

[...] In parallel we should probably open a task for the MW folks to improve the error reporting so that we can get more meaningful/precise error messages (like you described in T346971#11791136).

Good idea - just opened T424280 for that.

MLechvien-WMF subscribed.

Per Slack discussion, assigning to Scott as he has the right context, and moving this to next quarter as we won't have the capacity in the current one.