While it's easy for anyone to query the logical status of a server in etcd, that doesn't mean pybal has actually depooled or repooled it.
For several reasons, there can be a significant delay between a state change in etcd and the corresponding state change in pybal.
Applications like cookbooks, local restart scripts, and even maintenance scripts might need to know, with different levels of accuracy, what the situation is on the load-balancers.
While pybal itself has an HTTP API, querying it directly is inconvenient for a number of reasons:
- Configuration would be complex, as every application needs to know which load-balancers to connect to
- Getting the details right requires much more pybal knowledge than should be needed
Ideally, all this information should be easily queryable from a unified API that returns a simple response.
My first idea was to have pybal export this information to prometheus, but that wouldn't guarantee the time granularity we need for things like our cookbooks.
So my proposal is to create a very simple service that aggregates information from all load-balancers for each service and returns data that is easy to parse for both humans and machines.
Stub API
- GET /host/ should return a 404 - you need to specify a server name
- GET /host/:servername - Returns a dictionary in the form {serviceA: true, serviceB: false} to indicate which of the services defined for that server are serving traffic
- GET /service/ returns a list of links to the actual service urls
- GET /service/:servicename returns a dictionary in the form {dcA: {serverfoo: true, serverbar: false}}, giving a complete view of the state of pools in all datacenters
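To illustrate the response shapes above, here is a minimal sketch of the read side in python. The function names and the nested STATE dictionary are hypothetical; the real service would read the state from the datastore rather than an in-memory dict.

```python
# Hypothetical in-memory state: {datacenter: {server: {service: pooled}}}.
# In the real service this would come from the datastore.
STATE = {
    "dcA": {"serverfoo": {"serviceA": True, "serviceB": False}},
    "dcB": {"serverbar": {"serviceA": True}},
}

def host_view(servername):
    """GET /host/:servername -> {service: pooled}, or None (maps to a 404)."""
    for servers in STATE.values():
        if servername in servers:
            return servers[servername]
    return None

def service_list():
    """GET /service/ -> list of links to the actual service URLs."""
    services = {svc for servers in STATE.values()
                for state in servers.values() for svc in state}
    return sorted("/service/%s" % svc for svc in services)

def service_view(servicename):
    """GET /service/:servicename -> {dc: {server: pooled}} across all DCs."""
    return {
        dc: {srv: state[servicename]
             for srv, state in servers.items() if servicename in state}
        for dc, servers in STATE.items()
    }
```

Wrapping these three lookups in any tiny HTTP layer then gives exactly the four routes listed above, with GET /host/ itself returning a 404.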
Data flow
For the implementation, I think we need to separate a very simple web interface (probably a few lines of python, go, or php) from the scraper job that fills in the datastore.
Ideally, we will have pybal emit an event for any pooled-status change to our MEP (so using eventgate), and have the service listen to this event stream to integrate changes. The current state will be kept in a datastore (mysql?) so that we only need one client updating it. This can even be a separate job from the actual public API service. The job will also need to be capable of scraping the pybal APIs when needed (for instance, when first populating its content).
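The scrape-on-demand path could be sketched as follows. This assumes pybal's instrumentation API returns one line per server in the form "hostname: enabled/up/pooled" (an assumption about the output format; only the parsing step is shown, with fetching and datastore writes left out):

```python
import re

# Assumed pybal pool-output line format: "hostname: flag/flag/flag".
LINE_RE = re.compile(r"^(?P<host>\S+):\s+(?P<flags>.+)$")

def parse_pool(text):
    """Turn the output of a pybal pool endpoint into {server: is_pooled},
    the shape the scraper job would write to the datastore."""
    pool = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        # A server counts as serving traffic only if the "pooled" flag
        # is present; splitting on "/" avoids matching "not pooled".
        pool[m.group("host")] = "pooled" in m.group("flags").split("/")
    return pool
```

The same function can normalize both the initial full scrape and any later reconciliation pass, so the event-stream consumer and the scraper write identical records.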