
Test EtcdConfig in different failure scenarios
Closed, Resolved · Public

Description

We need to be sure some of the guarantees that EtcdConfig promises hold true, specifically:

  1. If the remote server is unresponsive (see T156924#3269464 for more details):
    a. the call to etcd will time out and the data in cache will be used;
    b. at most one concurrent call to etcd will be made, and all other requests will use the cache;
    c. no etcd connections are leaked.
  2. If the remote server sends an empty response, the data in cache will be used.
  3. If the remote server is down at startup of the application server, our monitoring requests will fail, depooling the server.
  4. If more than one server is listed (via DNS discovery or an explicit list of servers) and one is unresponsive, the next one will be called.
  5. If the server sends a response that takes longer than our timeouts to receive, the cache will be used.
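The cache-fallback behaviour described in guarantees 1–2 can be sketched as follows. This is an illustrative sketch, not the real MediaWiki EtcdConfig code: the class and method names (EtcdConfigSketch, fetchFromEtcd) are hypothetical, and the boolean flag stands in for whatever locking the real client uses to ensure at most one concurrent fetch.

```php
<?php
// Hypothetical sketch: serve stale cached data when the etcd fetch fails
// or times out, and allow at most one fetch attempt at a time.

class EtcdConfigSketch {
    private $cache;             // last successfully fetched config values
    private $fetching = false;  // stand-in for "at most one concurrent fetch"

    public function __construct( array $staleCache ) {
        $this->cache = $staleCache;
    }

    public function get( $key ) {
        if ( !$this->fetching ) {
            $this->fetching = true;
            try {
                // The real client would apply a cURL timeout here; any
                // failure falls through to the cached value below.
                $this->cache = $this->fetchFromEtcd();
            } catch ( Exception $e ) {
                // Guarantees 1.a and 2: keep serving stale data.
            } finally {
                $this->fetching = false;
            }
        }
        return isset( $this->cache[$key] ) ? $this->cache[$key] : null;
    }

    protected function fetchFromEtcd() {
        // Simulate an unreachable etcd cluster.
        throw new Exception( 'etcd timeout' );
    }
}
```

With a seeded cache, `( new EtcdConfigSketch( [ 'pooled' => true ] ) )->get( 'pooled' )` keeps returning the stale value even though every fetch attempt fails.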

A good chunk of these tests can be easily run in deployment-prep.

Event Timeline

Joe triaged this task as Normal priority. Jan 17 2018, 9:31 AM
Joe created this task.

Mentioned in SAL (#wikimedia-operations) [2018-02-21T16:40:43Z] <_joe_> testing various etcd failure scenarios on mwdebug1001, T185078

Mentioned in SAL (#wikimedia-operations) [2018-02-22T10:37:46Z] <_joe_> benchmarking EtcdConfig failure scenarios on mwdebug1001, T185078

Joe claimed this task. Feb 22 2018, 10:38 AM
Joe added a project: User-Joe.
Joe moved this task from Backlog to Doing on the User-Joe board.

Mentioned in SAL (#wikimedia-operations) [2018-02-22T12:24:18Z] <_joe_> live-hacking ProductionServices.php on mwdebug1001 for testing (T185078)

Mentioned in SAL (#wikimedia-operations) [2018-02-22T12:42:07Z] <_joe_> ended live-hacking on mwdebug1001 (T185078)

Joe added a comment. Feb 22 2018, 12:54 PM

First test to fail:

If I declare an invalid hostname as the etcd server in a config change, an exception is thrown and not caught at runtime:

Feb 22 12:30:21 mwdebug1001 hhvm[2942]: [Thu Feb 22 12:30:21 2018] [hphp] [2942:7f7b1d3ff700:16772:000001] [] 
Fatal error: Uncaught exception 'ConfigException' with message 'Failed to load configuration from etcd: (curl error: 6) Couldn't resolve host name' in /srv/mediawiki/php-1.31.0-wmf.21/includes/config/EtcdConfig.php:191
Stack trace:
#0 /srv/mediawiki/php-1.31.0-wmf.21/includes/config/EtcdConfig.php(113): EtcdConfig->load()
#1 /srv/mediawiki/wmf-config/etcd.php(28): EtcdConfig->get()
#2 /srv/mediawiki/wmf-config/etcd.php(35): wmfSetupEtcd()
#3 /srv/mediawiki/wmf-config/CommonSettings.php(113): include()
#4 /srv/mediawiki/php-1.31.0-wmf.21/LocalSettings.php(4): include()
#5 /srv/mediawiki/php-1.31.0-wmf.21/includes/Setup.php(94): include()
#6 /srv/mediawiki/php-1.31.0-wmf.21/includes/WebStart.php(94): include()
#7 /srv/mediawiki/php-1.31.0-wmf.21/index.php(39): include()
#8 /srv/mediawiki/w/index.php(3): include()
#9 {main}

So I guess we need to catch this exception and serve the stale values from the cache, if available.
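A minimal sketch of that fix might look like the following. The function name echoes wmfSetupEtcd() from the stack trace above, but the body and parameters are invented for illustration only:

```php
<?php
// Hypothetical sketch: catch the ConfigException seen in the trace above
// and fall back to previously cached values instead of a fatal error.

class ConfigException extends Exception {}

function wmfSetupEtcdSketch( callable $loadFromEtcd, array $staleValues ) {
    try {
        return $loadFromEtcd();
    } catch ( ConfigException $e ) {
        // e.g. "(curl error: 6) Couldn't resolve host name":
        // serve the stale values instead of an uncaught exception.
        return $staleValues;
    }
}
```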

Joe added a comment. Feb 23 2018, 11:11 AM

I dug into the code a bit and it turns out my testing strategy was flawed: since the cache key depends on the hostname, changing it just created a situation where the cache was empty and the etcd cluster was unreachable, which made us fall into an expected failure scenario.

So far all tests are giving us functionally good results. I'll analyze my benchmark data to better assess the performance penalty we suffer in case of failures on the etcd side, though.
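The flaw in the test can be made concrete with a small sketch: if the cache key embeds the etcd host, then pointing the config at a bogus host selects a different (empty) cache entry, rather than exercising the stale-cache fallback for the original host. makeCacheKeySketch() below is illustrative and does not reflect the real key-derivation code:

```php
<?php
// Hypothetical sketch of host-dependent cache keys: a changed host means a
// different key, so the process starts from a cold cache, not a stale one.

function makeCacheKeySketch( $host, $directory ) {
    return "etcd-config:$host:$directory";
}
```

For example, the key for a hypothetical real host and the key for an invalid host differ, so the invalid host finds nothing cached under its key.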

Joe added a comment. Mar 7 2018, 11:40 AM

Some results from different test runs, using ab to render https://en.wikipedia.org/wiki/Francesco_Totti on mwdebug1001 at various concurrency levels, covering the following failure scenarios:

  • 1 server down (iptables DROP)
  • 1 server where etcd has crashed/is stopped (iptables REJECT)
  • all 3 servers down
  • all 3 etcd instances crashed/failed

At all load levels except the smallest one (5 requests/s), throughput barely notices the failures, but some requests (less than 1%) can be very slow, in particular when all servers are down, which admittedly is not a very probable scenario.

Joe closed this task as Resolved. Mar 7 2018, 12:01 PM