
Choose a consistent, distributed k/v storage for configuration management/discovery
Closed, ResolvedPublic

Description

We strongly need a consistent key-value store to use in cluster discovery/coordination and configuration.

The potential candidates that, in my opinion, currently offer substantial adoption, a healthy development community, and feature-completeness are:

  • etcd
  • Consul
  • ZooKeeper

The chosen product should at least:

  • Guarantee read availability during a node failure, with consistency and recovery after a failure that are easy to understand and manage
  • Have decent write performance, and excellent read performance with very low latency under all operating conditions
  • Allow clients to watch a key or a whole tree of keys for changes (see the sketch below)
  • Allow easy backups
  • Work cross-datacenter (even with some limitations)
  • Have clean client libraries in most of the languages we use at the WMF
  • Allow (force?) encrypted connections from clients

Bonuses:

  • Easy to query from the CLI
  • Provide some level of authentication/authorization (grants)
  • Packaged in Debian
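
To make the watch and encrypted-connection requirements more concrete, here is a minimal sketch of the client-side interaction we are after, assuming etcd and the python-etcd client library; the hostname and key names are invented purely for illustration:

    # Minimal sketch, assuming etcd and python-etcd; hostname and keys are hypothetical.
    import etcd

    # TLS-enabled client connection (the "encrypted connections" requirement).
    client = etcd.Client(host='conf1001.example.wmnet', port=2379, protocol='https')

    # Write and read back a configuration value.
    client.write('/conf/pybal/appservers/mw1018', '{"pooled": true, "weight": 10}')
    print(client.read('/conf/pybal/appservers/mw1018').value)

    # Block until anything under the tree changes (long-poll style watch).
    change = client.read('/conf/pybal/appservers', recursive=True, wait=True)
    print(change.key, change.value)

The same read/write/watch operations map to one-line etcdctl invocations, which would also cover the "easy to query from the CLI" bonus.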

Event Timeline

Joe raised the priority of this task from to Needs Triage.
Joe updated the task description.
Joe added a project: acl*sre-team.
Joe subscribed.

Out of curiosity, since analytics already uses zookeeper for hive/kafka, maybe it should be given a try first and other solutions looked at if zookeeper does not match our needs. That would be one less technology introduced to the cluster. 0.02€

Andrew triaged this task as Medium priority. Apr 11 2015, 9:27 PM
Andrew set Security to None.

The ZK needs of analytics are completely different from the ones we have here, or I would surely have followed that path, @hashar.

But given that analytics is basically doing apt-get install zookeeper and leaving it to be managed and interacted with from Hadoop, it would be like choosing Apache to serve Wikipedia because we use it to serve Gerrit.

@Joe thank you for the explanation.

Probably one thing we should also think about is integrating config management with rolling deployments, where a subset of nodes might need different (new) configuration directives. Example: a service's v1 is deployed using configuration cA on 10 nodes, but the new version of the service, v2, uses config cB. The differences between cA and cB might be of various kinds (added/removed/changed keys, a new format, etc.), but it is easily conceivable that feeding cA to v2 or cB to v1 might bring the service down (or cause it to malfunction). In a rolling-deployment scenario, we usually have only a subset of machines using the new version (until it is confirmed to work) at any given point in time:

  • 3 machines running v2 using cB
  • 7 machines running v1 using cA

We thus need a way to provide both configs, or to ensure that each machine is using the correct one. Current approaches to config management include:

  • putting the configuration directly with the code to be deployed
    • good: factors in config version changes, can be changed at will by devs
    • bad: need one config per environment where everything is practically hard-coded, and, well, can be changed at will by devs :)
  • putting the configuration in ops/puppet
    • good: less hard-coded config directives, can be adapted dynamically based on the environment (prod, labs, beta, staging, etc), better config supervision as opsens need to +2 it
    • bad: once merged, the config is installed on all concerned machines regardless of their state (in terms of cA-vs-cB needs), and, well, opsens need to +2 it even for ultra-small changes

It follows that neither is entirely acceptable. With that in mind, I am not sure which approach should be taken wrt the discovery mechanism. How should it be fed the config - should it read it from puppet, or should the service provide it on start-up? How can we ensure rolling deploys will work with it in place?

I realise breaking config changes are evil and ideally should not happen at all. I'm kind of more thinking out loud here and fishing for other people's thoughts on this.
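
One way a k/v store could accommodate this, sketched below under the same assumptions as the earlier sketch (etcd with python-etcd, and an entirely hypothetical service name and key layout), is to namespace the configuration tree by deployed service version, so that v1 and v2 machines read, and watch, different subtrees during a rolling deploy:

    # Hypothetical layout: /conf/<service>/<deployed version>/<key>, so that nodes
    # running v1 and v2 consume different config subtrees during a rolling deploy.
    import etcd

    client = etcd.Client(host='conf1001.example.wmnet', port=2379, protocol='https')

    # Both config versions live in the store at the same time.
    client.write('/conf/someservice/v1/workers', '24')
    client.write('/conf/someservice/v2/workers', '32')
    client.write('/conf/someservice/v2/new_feature_flag', 'true')

    # Each node reads (and watches) only the subtree matching the version it runs;
    # the deployed version would come from the deploy tooling on that node.
    deployed_version = 'v2'
    subtree = '/conf/someservice/%s' % deployed_version
    for node in client.read(subtree, recursive=True).leaves:
        print(node.key, node.value)

Once the rolling deploy has completed and v1 is gone, its subtree can simply be deleted, which avoids ever feeding cA to v2 or cB to v1.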

@marko I think everything you state here is something that will be enabled by this software, once integrated with our tools (salt, puppet, pybal, etc.).

The specifics will need to be ironed out for sure, but probably not in the ticket about the configuration store :)

In T95656#1221186, @Joe wrote:

@marko I think everything you state here is something that will be enabled by this software, once integrated with our tools (salt, puppet, pybal, etc.).

Good, thnx.

The specifics will need to be ironed out for sure, but probably not in the ticket about the configuration store :)

Probably, but I was just putting it out there for consideration (better safe than sorry).

Out of curiosity, since analytics already uses zookeeper for hive/kafka, maybe it should be given a try first and other solutions looked at if zookeeper does not match our needs. That would be one less technology introduced to the cluster. 0.02€

Just adding my own $0.02 here since this keeps coming up in related IRC/email conversations: I don't think analytics' use of ZK is much of an argument here, either. What we're looking to do here is a very specialized thing that will be deeply integrated with some of our front-line / outage-sensitive infrastructure, and the requirements are completely different in terms of interfaces, data size/schema, geographic/replication issues, fault/isolation tolerance, etc.

I don't think analytics' use of ZK is much of an argument here, either.

+1. We have 3 ZK servers that are used by Kafka for leader election and for occasional non-production consumer offset management. ZooKeeper works great, but that is because Kafka has been coded to work with it. I have a hunch that it would be a pain to use for these other opsy things.

Since no one really complained about my evaluation, we'll go on with etcd for now.

I was really just wondering about pre-existing usage of ZooKeeper. @Joe promptly addressed it at T95656#1220342 :-]

Welcome etcd!