[RFC] Define the on-disk and live structure of etcd pool data
Closed, Declined · Public

Assigned To

None

Authored By

	Joe
	May 29 2015, 1:10 PM

Description

We need to create an on-disk structure, kept under revision control, that gets synced to the etcd cluster. This structure will contain the "static" config, while the dynamic "state" of each single key will be determined at runtime by changing etcd.

Terminology used in the remainder of this document:

cluster: The value of the $cluster variable in puppet for the given node (either explicitly set or fetched from hiera)
datacenter: The value of the $::site variable in puppet for the given node
pool: the ensemble of nodes that are the backends of e.g. a pybal virtual IP
service: an individual service running on a specific node

So a single "pool" would be identified by the (datacenter, cluster, service) tuple. For example, eqiad, cache_text, varnish-frontend identifies the list of hosts that a specific pybal service (text-lb.svc.eqiad.wikimedia.org:80 IIRC) uses as backends.

Proposal

While on disk we'd probably like to have a structure that is based on nodes rather than on services, so that the information is as normalized as possible, on etcd we'd like to have the data aggregated by-service, so that they are easy to query, reducing the amount of parsing the clients have to perform.

One possible structure, maintained as one YAML file per datacenter (e.g. eqiad/pools.yaml), would be:

cache_text:
  cp1052:
    services: ['nginx', 'varnish-fe', 'varnish-be']
  cp1053:
    services: [...]
...
appservers:
  mw1018:
    services: ['apache']
  ...

and have an additional file for describing individual services (this time a single file, services.yaml):

cache_text:
  varnish-fe:
    port: 80
    default_values: { pooled: no, weight: 10 }
...
appservers:
  apache:
    user: foo
    default_values: { pooled: yes, weight: 1 }

A dedicated sync script will then replicate this, in denormalized form, to a more convenient structure on the distributed k/v store:

/pools/datacenter/cluster/service/node

/pools/eqiad/cache_text/varnish-fe/cp1052 => {pooled: yes, weight: 10000}
/pools/eqiad/cache_text/varnish-fe/cp1053 => {pooled: yes, weight: 10000}
...
/pools/eqiad/cache_text/varnish-be/cp1052 => {pooled: yes, weight: 10000}
/pools/eqiad/cache_text/varnish-be/cp1053 => {pooled: yes, weight: 10000}
...
/pools/eqiad/appservers/apache/mw1018 => {pooled: no, weight: 10000}
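
As a rough illustration of the denormalization step, here is a minimal Python sketch. It is not the actual sync script: the function name `denormalize` is made up, the two YAML files are replaced by in-memory dicts, the `varnish-be` defaults are invented for the example, and `pooled` is kept as a plain string. It only shows how per-node data plus per-service defaults could expand into one key per (datacenter, cluster, service, node).

```python
# Hypothetical in-memory stand-ins for eqiad/pools.yaml and services.yaml.
pools = {
    "cache_text": {
        "cp1052": {"services": ["varnish-fe", "varnish-be"]},
        "cp1053": {"services": ["varnish-fe"]},
    },
    "appservers": {"mw1018": {"services": ["apache"]}},
}
services = {
    "cache_text": {
        "varnish-fe": {"port": 80, "default_values": {"pooled": "no", "weight": 10}},
        # Assumed defaults: the RFC does not spell out varnish-be.
        "varnish-be": {"default_values": {"pooled": "no", "weight": 10}},
    },
    "appservers": {
        "apache": {"default_values": {"pooled": "yes", "weight": 1}},
    },
}


def denormalize(datacenter, pools, services):
    """Return {key: seed value} for every (cluster, service, node) triple."""
    entries = {}
    for cluster, nodes in pools.items():
        for node, node_data in nodes.items():
            for service in node_data["services"]:
                defaults = services[cluster][service]["default_values"]
                key = f"/pools/{datacenter}/{cluster}/{service}/{node}"
                entries[key] = dict(defaults)  # copy, so keys do not share state
    return entries


entries = denormalize("eqiad", pools, services)
for key in sorted(entries):
    print(key, "=>", entries[key])
```

A real sync would presumably only seed keys that do not exist yet, so that state changed at runtime in etcd is not overwritten; that part is left out here.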

A command-line tool will be provided to easily change the state of a resource. Its syntax will be something like

$ conftool --datacenter eqiad --cluster cache_text --service varnish-fe "cp1052 pool=no; cp1053 pool=yes:weight=20000"
  Node cp1052 depooled from service varnish-fe
  Node cp1053 pooled in service varnish-fe with weight 20000
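
To make the proposed argument syntax concrete, here is a small Python sketch of how the quoted string could be parsed. The grammar is only inferred from the single example above (clauses separated by `;`, node name first, `key=value` pairs joined by `:`), and `parse_changes` is a hypothetical name, not part of any existing tool.

```python
def parse_changes(spec):
    """Parse 'node key=val:key=val; node key=val' into {node: {key: value}}."""
    changes = {}
    for clause in spec.split(";"):
        clause = clause.strip()
        if not clause:
            continue  # tolerate a trailing ';'
        node, _, assignments = clause.partition(" ")
        if not assignments.strip():
            raise ValueError(f"no assignments for node {node!r}")
        values = {}
        for assignment in assignments.strip().split(":"):
            key, sep, value = assignment.partition("=")
            if not sep:
                raise ValueError(f"expected key=value, got {assignment!r}")
            values[key.strip()] = value.strip()
        changes[node] = values
    return changes


changes = parse_changes("cp1052 pool=no; cp1053 pool=yes:weight=20000")
print(changes)
```

Values are left as strings on purpose; converting `weight` to an integer or `pool` to a boolean would be a separate validation step.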

Note that referring to such a structure in puppet will be very easy, as (apart from the "service" label) everything has a 1:1 correspondence in puppet.
Any suggestion is welcome. I am building the base blocks for both tools right now, so some decisions would be useful in the near future.

Also note that any structure we choose now can be prefixed with a version number, like /v1/pools so that any future migration will be easier. (I am leaving the "/v1/pools vs /pools_v1" yakshaving session to another ticket)
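
One way to keep the version prefix cheap to change is to build every key through a single helper. The sketch below assumes the `/v1/pools` form purely for illustration; `pool_key` and `SCHEMA_VERSION` are hypothetical names.

```python
SCHEMA_VERSION = "v1"  # assumption: the "/v1/pools" form, not "/pools_v1"


def pool_key(datacenter, cluster, service, node, version=SCHEMA_VERSION):
    """Build the etcd key for one node of one pool under a versioned prefix."""
    return "/" + "/".join([version, "pools", datacenter, cluster, service, node])


print(pool_key("eqiad", "cache_text", "varnish-fe", "cp1052"))
```

With all callers going through one function, a later move to a different prefix or layout touches a single place.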

Related Objects

Mentioned In: T103344: RESTBase deployment process
T101858: Create a confd template for pybal files that will work with our etcd schema.
T94620: [EPIC] The future of MediaWiki deployment: Tooling

Event Timeline

Joe created this task. · May 29 2015, 1:10 PM

Joe raised the priority of this task from to High.

Joe updated the task description.

Joe added projects: services-tooling, discovery-system, acl*sre-team.

Joe added subscribers: Joe, faidon, BBlack and 3 others.

Restricted Application added a subscriber: Aklapper. · May 29 2015, 1:10 PM

Joe claimed this task. · May 31 2015, 2:50 PM

Joe set Security to None.

Joe updated the task description. · May 31 2015, 9:09 PM

I know this is bikesheddy, but you asked for it ;)

Nomenclature:

dcs: datacenters; ex: eqiad
services: functional service names / LVS IPs; ex: varnish-text-fe
instances: individual instances of each service: IP / port pairs; potentially configured as PyBal backends, or used directly in case of services like Cassandra
hwnodes: hardware nodes running instances; usually the least interesting to other services, but important for puppet; IP address or host name
roles: setup / config unit a la puppet roles; typically set up one instance, but can sometimes set up multiple instances or 'pods' of co-located services too
hwclusters: groups of hardware nodes with similar physical properties; can be used to map roles to clusters of hwnodes; ex: sca,

Some use cases:

list instances per DC / service:
- pybal needs to list active (pooled) instances per DC / service; ex: /eqiad/services/varnish-text-fe/target/
- service instances need to list other services' active instances per DC / service (ex: Varnish backends, Cassandra); ex: /eqiad/services/varnish-text-fe/actual/
add / remove active instances:
- many config changes are fire & forget; clients will eventually pick them up & apply them; write to /target/{ip:port} for each instance
- rolling deploys will need an ack that an instance is indeed depooled before proceeding to restart it: check 'actual' state as in /eqiad/services/varnish-text-fe/actual/{ip:port} until state change has happened
list roles to apply to a physical node: /eqiad/hwnodes/{ip}/roles/target/
list hwclusters: /eqiad/hwclusters/{cluster}/
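
The rolling-deploy use case above (write the desired state under /target/, then poll /actual/ until the change is acknowledged) can be sketched as follows. This is only an illustration: a plain dict stands in for etcd, the state values "pooled"/"depooled", the function names, and the example address are all assumptions.

```python
import time

store = {}  # hypothetical stand-in for etcd: {key: value}


def set_target(dc, service, instance, state):
    """Fire-and-forget write of the desired state for one ip:port instance."""
    store[f"/{dc}/services/{service}/target/{instance}"] = state


def wait_for_actual(dc, service, instance, state, timeout=5.0, interval=0.1):
    """Poll the 'actual' key until it matches `state`; False on timeout."""
    key = f"/{dc}/services/{service}/actual/{instance}"
    deadline = time.monotonic() + timeout
    while True:
        if store.get(key) == state:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


instance = "192.0.2.10:80"  # documentation-range address, purely illustrative
set_target("eqiad", "varnish-text-fe", instance, "depooled")
# Simulate the load balancer acknowledging the change.
store[f"/eqiad/services/varnish-text-fe/actual/{instance}"] = "depooled"
print(wait_for_actual("eqiad", "varnish-text-fe", instance, "depooled"))
```

Against a real etcd cluster a watch on the key would likely replace the polling loop, but the target/actual split would look the same.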

@GWicke I think you have a point in considering a (host, port) tuple as a possible future expansion of the model, but I'd like a good match between puppet and what we'll have to look up here. Apart from that, right now pybal does not support defining a host:port combination for the backends separately for each instance, so for now there is no point in changing that paradigm.

Also, right now pybal will *not* ack that it has depooled an instance. Doing that will require more work than just sending an event to etcd: we need to verify that the ipvsadm command actually succeeded, something that pybal doesn't do right now.

Note that if we use a version prefix it would be relatively easy to move to that paradigm at a slightly later time. At the moment I'm structuring the code so that changing the paradigm would not be harder than doing a schema change in a database. My idea in the end would be to make pybal write a "dump" of its current state and regularly check it against the desired state that comes from etcd. That could make use of what you just proposed.
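
The "dump and compare" idea could look roughly like the Python sketch below. It is not how pybal works today; it just assumes both the desired state (from etcd) and the dumped state are available as `{node: {"pooled": ..., "weight": ...}}` dicts and reports where they disagree. The name `diff_state` is made up.

```python
def diff_state(desired, actual):
    """Compare desired state with a dumped state, both keyed by node name."""
    desired_nodes, actual_nodes = set(desired), set(actual)
    return {
        # Nodes etcd knows about but the dump does not contain.
        "missing": sorted(desired_nodes - actual_nodes),
        # Nodes present in the dump but absent from etcd.
        "unexpected": sorted(actual_nodes - desired_nodes),
        # Nodes present on both sides with different pooled/weight values.
        "changed": sorted(
            node for node in desired_nodes & actual_nodes
            if desired[node] != actual[node]
        ),
    }


desired = {
    "cp1052": {"pooled": "no", "weight": 10000},
    "cp1053": {"pooled": "yes", "weight": 20000},
}
actual = {
    "cp1052": {"pooled": "yes", "weight": 10000},
    "cp1054": {"pooled": "yes", "weight": 10000},
}
print(diff_state(desired, actual))
```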

So for now I think `/pools/eqiad/cache_text/varnish-fe/<node-name>` is ok, and we can move to a schema along the lines you propose once we reach a second stage, when we *may* have instances running on arbitrary hardware nodes/ports according to some scheduler (a la marathon/mesos, I mean).

Also in that case, we won't have any static config to read from for instances/pools, I guess.

I will do my best to keep the code objectified and modular enough to make it easy for us to perform such schema changes with the smallest possible effort.

Apart from that, right now pybal does not support defining a host:port combination for the backends separately for each instance, so for now there is no point in changing that paradigm.

We could still start with the restriction that the port needs to be the same across all instances used by pybal. I mainly care about keeping this regular / extensible, so that we can use the same general pattern for pybal and other services. It's not too hard to only use the IP in PyBal for now, but writing different query & update clients per service would be more work and entropy.

Re acking: Maybe we could start with /target/ only, and then add /actual/ later once that's supported? Clients with a need for acks can then switch over, while those who don't care will keep working with /target/.

Regarding puppet: The 'role' abstraction is pretty close to what puppet and other config tools commonly use. My feeling is that it's better to separate the roles from instances, as those are two different concerns / use cases. With a separation we might have an easier time adapting to mesos or whatever that might not be dealing with 'roles' in the future.

I guess my broader point is that it's worth trying to avoid having to adapt a lot of config code from one API version to another. Granted, our guesses right now might still turn out to be wrong; but if we can start with something that has at least a chance of being right longer term without adding significant costs in the short term, then let's do that.

• GWicke added a comment. · Jun 1 2015, 10:21 AM

This comment was removed by • GWicke.

(removed double-post from shaky cell phone connection on train)

Joe moved this task from Backlog to In progress on the discovery-system board. · Jun 1 2015, 2:28 PM

I think roles in your view are an approximation of what I put in the /services hierarchy. That's a collection of (mostly immutable) data about one service.

My idea would be that a port info would be attached to the service in general, and could be in the host key in my example above, becoming host:port.

I dislike the idea of keeping /target/ and /actual/ in the same place, or even in the same structure. For the pybal confirmation (wherever we want it to be) we'll use something like /pools/state/eqiad/cache_text/varnish-fe which will just be a list of host:port combinations (this is the easiest thing to do from pybal, for CaS/locking purposes too).

This way, it will be easier to query the data from confd or any other client for application purposes and also to refer to it from puppet (where we'd need to configure confd).
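
To illustrate why a single list of host:port entries per pool is convenient for compare-and-swap, here is a toy Python sketch. `VersionedStore` is an in-memory stand-in and not an etcd client; its index-based `compare_and_swap` only mimics the general idea of a conditional write, and `publish_state` is a hypothetical name.

```python
class VersionedStore:
    """Tiny in-memory stand-in for a k/v store with compare-and-swap."""

    def __init__(self):
        self._data = {}  # key -> (index, value)

    def get(self, key):
        """Return (index, value); index 0 means the key does not exist."""
        return self._data.get(key, (0, None))

    def compare_and_swap(self, key, expected_index, value):
        """Write only if the key is still at expected_index."""
        index, _ = self.get(key)
        if index != expected_index:
            return False
        self._data[key] = (index + 1, value)
        return True


def publish_state(store, key, pooled):
    """Write the sorted host:port list, retrying if another writer got in first."""
    value = sorted(pooled)
    while True:
        index, current = store.get(key)
        if current == value:
            return index  # nothing to do
        if store.compare_and_swap(key, index, value):
            return index + 1


store = VersionedStore()
key = "/pools/state/eqiad/cache_text/varnish-fe"
print(publish_state(store, key, ["cp1053:80", "cp1052:80"]))
print(store.get(key))
```

Because the whole pool state lives under one key, a single conditional write is enough to publish it atomically.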

I want to underline that single applications can and will use their own namespaces on etcd, although it's a good idea to keep the running dataset to ~100 MB per cluster (note this is true for all strongly consistent k/v stores: they don't scale very well with a lot of data). So consider this data structure discussion to be mostly limited to pooling/depooling in pybal.

Something that somehow escaped me in my previous comments:

This is all designed for static configuration that we read from a file on the filesystem. If we move to a more "elastic" setup in the future, we'll have something resembling the coreos sidekicks auto-registering a service to etcd when it starts, so that specific use case is out of scope for this tool. For all practical purposes here, we will not have multiple instances of the same service on different ports that will need to sync from this tool.

• Gage subscribed. · Jun 1 2015, 11:03 PM

I'm approaching this discussion more from the perspective of a discovery API consumer. As a consumer, I'm more concerned about ease-of-use and regularity, and less about how much work it is to get the data into the system / make it regular. I don't care whether the information fed into the API is coming from static files or something more dynamic.

The roles I was talking about refer to puppet (or ansible) roles. The concept is specific to the way we set up instances (via puppet), and not the same as the instances themselves. Future methods of spinning up instances might not use roles.

• GWicke mentioned this in T94620: [EPIC] The future of MediaWiki deployment: Tooling. · Jun 4 2015, 8:18 AM

Joe mentioned this in T101858: Create a confd template for pybal files that will work with our etcd schema. · Jun 9 2015, 5:21 PM

@GWicke I think for service discovery, as long as everything is mediated via pybal, this data structure is ok. When we need different use cases (like true service discovery, which I guess would include the ability to fetch the service swagger spec), we can add a new part of the schema, which, given how I am writing the sync tool, is going to be relatively easy to do.

As for pybal acknowledging pooling/depooling, I do hear your concern and sure, we'll address that ASAP, but it's something for the next iteration I guess.

I'm leaving the task open as what we're doing right now is only stage 1, and GWicke's suggestions are very valuable for a 2.0 (or even 1.1) version.

Joe moved this task from In progress to Done on the discovery-system board. · Jun 11 2015, 11:03 AM

• GWicke mentioned this in T103344: RESTBase deployment process. · Jun 22 2015, 4:32 PM

Joe lowered the priority of this task from High to Low. · Jul 13 2015, 5:05 PM

Restricted Application added a subscriber: Matanya. · Jul 13 2015, 5:05 PM

Danny_B added a project: Proposal. · May 2 2016, 10:46 PM

Joe removed Joe as the assignee of this task. · Oct 5 2016, 8:53 AM

Joe added a project: User-Joe.

• GWicke added a project: Services. · Oct 12 2016, 11:22 PM

• Pchelolo moved this task from Backlog to watching on the Services board. · Oct 12 2016, 11:43 PM

• Pchelolo edited projects, added Services (watching); removed Services.

This has been practically superseded by so many specific tickets that it doesn't really make much sense anymore.
