
Test EtcdConfig in different failure scenarios
Closed, Resolved · Public

Description

We need to be sure some of the guarantees that EtcdConfig promises hold true, specifically:

  1. If the remote server is unresponsive (see T156924#3269464 for more details):
    a. the call to etcd will time out and the data in cache will be used;
    b. at most one concurrent call to etcd will be made, and all other requests will use the cache;
    c. no etcd connections are leaked.
  2. If the remote server sends an empty response, the data in cache will be used.
  3. If the remote server is down at startup of the application server, our monitoring requests will fail, depooling the server.
  4. If more than one server is listed (via DNS discovery or an explicit list of servers) and one is unresponsive, the next one will be called.
  5. If the server sends a response that takes longer than our timeouts to receive, the cache will be used.
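The cache-fallback behaviour described in guarantees 1–2 can be sketched as follows. This is an illustrative sketch, not the real MediaWiki EtcdConfig code: the class and method names (EtcdConfigSketch, fetchFromEtcd) are hypothetical, and the boolean flag stands in for whatever locking the real client uses to ensure at most one concurrent fetch.

```php
<?php
// Hypothetical sketch: serve stale cached data when the etcd fetch fails
// or times out, and allow at most one fetch attempt at a time.

class EtcdConfigSketch {
    private $cache;             // last successfully fetched config values
    private $fetching = false;  // stand-in for "at most one concurrent fetch"

    public function __construct( array $staleCache ) {
        $this->cache = $staleCache;
    }

    public function get( $key ) {
        if ( !$this->fetching ) {
            $this->fetching = true;
            try {
                // The real client would apply a cURL timeout here; any
                // failure falls through to the cached value below.
                $this->cache = $this->fetchFromEtcd();
            } catch ( Exception $e ) {
                // Guarantees 1.a and 2: keep serving stale data.
            } finally {
                $this->fetching = false;
            }
        }
        return isset( $this->cache[$key] ) ? $this->cache[$key] : null;
    }

    protected function fetchFromEtcd() {
        // Simulate an unreachable etcd cluster.
        throw new Exception( 'etcd timeout' );
    }
}
```

With a seeded cache, `( new EtcdConfigSketch( [ 'pooled' => true ] ) )->get( 'pooled' )` keeps returning the stale value even though every fetch attempt fails.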

A good chunk of these tests can be easily run in deployment-prep.

Event Timeline

Joe triaged this task as Normal priority. Jan 17 2018, 9:31 AM
Joe created this task.

Mentioned in SAL (#wikimedia-operations) [2018-02-21T16:40:43Z] <_joe_> testing various etcd failure scenarios on mwdebug1001, T185078

Mentioned in SAL (#wikimedia-operations) [2018-02-22T10:37:46Z] <_joe_> benchmarking EtcdConfig failure scenarios on mwdebug1001, T185078

Joe claimed this task. Feb 22 2018, 10:38 AM
Joe added a project: User-Joe.
Joe moved this task from Backlog to Doing on the User-Joe board.

Mentioned in SAL (#wikimedia-operations) [2018-02-22T12:24:18Z] <_joe_> live-hacking ProductionServices.php on mwdebug1001 for testing (T185078)

Mentioned in SAL (#wikimedia-operations) [2018-02-22T12:42:07Z] <_joe_> ended live-hacking on mwdebug1001 (T185078)

Joe added a comment. Feb 22 2018, 12:54 PM

First test to fail:

If I declare an invalid hostname as the etcd server in a config change, an exception is thrown and not caught at runtime:

Feb 22 12:30:21 mwdebug1001 hhvm[2942]: [Thu Feb 22 12:30:21 2018] [hphp] [2942:7f7b1d3ff700:16772:000001] [] 
Fatal error: Uncaught exception 'ConfigException' with message 'Failed to load configuration from etcd: (curl error: 6) Couldn't resolve host name' in /srv/mediawiki/php-1.31.0-wmf.21/includes/config/EtcdConfig.php:191
Stack trace:
#0 /srv/mediawiki/php-1.31.0-wmf.21/includes/config/EtcdConfig.php(113): EtcdConfig->load()
#1 /srv/mediawiki/wmf-config/etcd.php(28): EtcdConfig->get()
#2 /srv/mediawiki/wmf-config/etcd.php(35): wmfSetupEtcd()
#3 /srv/mediawiki/wmf-config/CommonSettings.php(113): include()
#4 /srv/mediawiki/php-1.31.0-wmf.21/LocalSettings.php(4): include()
#5 /srv/mediawiki/php-1.31.0-wmf.21/includes/Setup.php(94): include()
#6 /srv/mediawiki/php-1.31.0-wmf.21/includes/WebStart.php(94): include()
#7 /srv/mediawiki/php-1.31.0-wmf.21/index.php(39): include()
#8 /srv/mediawiki/w/index.php(3): include()
#9 {main}

So I guess we need to catch this exception and serve the stale values from the cache, if available.
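A minimal sketch of that fix might look like the following. The function name echoes wmfSetupEtcd() from the stack trace above, but the body and parameters are invented for illustration only:

```php
<?php
// Hypothetical sketch: catch the ConfigException seen in the trace above
// and fall back to previously cached values instead of a fatal error.

class ConfigException extends Exception {}

function wmfSetupEtcdSketch( callable $loadFromEtcd, array $staleValues ) {
    try {
        return $loadFromEtcd();
    } catch ( ConfigException $e ) {
        // e.g. "(curl error: 6) Couldn't resolve host name":
        // serve the stale values instead of an uncaught exception.
        return $staleValues;
    }
}
```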

Joe added a comment. Feb 23 2018, 11:11 AM

I dug into the code a bit and it turns out my testing strategy was flawed: since the cache key depends on the hostname, changing it just created a situation where the cache was empty and the etcd cluster was unreachable, which made us fall into an expected failure scenario.

So far all tests are giving us functionally good results. I'll analyze my benchmark data to better assess the performance penalty we suffer in case of failures on the etcd side, though.
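The flaw in the test can be made concrete with a small sketch: if the cache key embeds the etcd host, then pointing the config at a bogus host selects a different (empty) cache entry, rather than exercising the stale-cache fallback for the original host. makeCacheKeySketch() below is illustrative and does not reflect the real key-derivation code:

```php
<?php
// Hypothetical sketch of host-dependent cache keys: a changed host means a
// different key, so the process starts from a cold cache, not a stale one.

function makeCacheKeySketch( $host, $directory ) {
    return "etcd-config:$host:$directory";
}
```

For example, the key for a hypothetical real host and the key for an invalid host differ, so the invalid host finds nothing cached under its key.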

Joe added a comment. Mar 7 2018, 11:40 AM

Some results from different test runs, using ab to render https://en.wikipedia.org/wiki/Francesco_Totti on mwdebug1001 at various concurrency levels, covering the following failure scenarios:

  • 1 server down (iptables DROP)
  • 1 server where etcd has crashed/is stopped (iptables REJECT)
  • all 3 servers down
  • all 3 etcd instances crashed/failed

At all load levels except the smallest one (5 requests/s), throughput barely notices the failures, but some requests (less than 1%) can be very slow, in particular when all servers are down, which admittedly is not a very probable scenario.

Joe closed this task as Resolved. Mar 7 2018, 12:01 PM