
Applications and scripts need to be able to understand the pooled status of servers in our load balancers.
Open, MediumPublic

Description

While it's easy for anyone to query the logical status of a server in etcd, that doesn't mean pybal has actually depooled or repooled it.

There are plenty of reasons why there can be a significant delay between a state change in etcd and a state change in pybal.

Applications like cookbooks, local scripts for restarts, and even maintenance scripts might need to know, with different levels of accuracy, what the situation on the load balancers actually is.

While pybal itself has an HTTP API, querying that directly is inconvenient for a series of reasons:

  • Configuration is going to be complex, as every application needs to know the load-balancers to connect to
  • Getting the details right requires much more pybal knowledge than should be needed

Ideally, all this information should be easily queryable from a unified API that returns a simple response.

My first idea was to export this information from pybal to prometheus, but that won't guarantee the time granularity we need for things like our cookbooks.

So my proposal is to create a very simple service that aggregates information from all load-balancers for each service and returns data that's easy to parse both for a human and a machine.

Stub API

  • GET /host/ should return a 404 - you need to specify a server name
  • GET /host/:servername - Returns a dictionary in the form {serviceA: true, serviceB: false} to indicate which services defined here are serving traffic
  • GET /service/ returns a list of links to the actual service urls
  • GET /service/:servicename returns a dictionary in the form {dcA: {serverfoo: true, serverbar: false}} giving a complete view of the state of pools in all datacenters
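The stub API above could be sketched as a handful of functions over an in-memory state store keyed by service, datacenter and server. This is only an illustration of the response shapes described in the task; the service names, datacenter names and state layout are made up for the example:

```python
import json

# Hypothetical state store: {service: {datacenter: {server: pooled}}}.
# In the real service this would be filled in by the scraper/updater job.
STATE = {
    "serviceA": {"dcA": {"serverfoo": True, "serverbar": False}},
    "serviceB": {"dcA": {"serverfoo": False}},
}

def get_host(servername):
    """GET /host/:servername -> {service: pooled} for this server,
    or a 404 when the server is unknown (including GET /host/ itself)."""
    result = {}
    for service, dcs in STATE.items():
        for servers in dcs.values():
            if servername in servers:
                result[service] = servers[servername]
    if not result:
        return (404, "{}")
    return (200, json.dumps(result))

def list_services():
    """GET /service/ -> list of links to the actual service URLs."""
    return [f"/service/{name}" for name in sorted(STATE)]

def get_service(servicename):
    """GET /service/:servicename -> {datacenter: {server: pooled}}."""
    if servicename not in STATE:
        return (404, "{}")
    return (200, json.dumps(STATE[servicename]))
```

Wrapping these in an actual HTTP frontend is then trivial in any of the languages mentioned below.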

Data flow

For implementation, I think we need to separate a very simple web interface (probably a few lines of python or go or php) from the scraper job that fills in the datastore.

Ideally, we will have pybal emit an event for any pooled status change to our MEP (so using eventgate), and have the service listen to this event stream to integrate changes. The current state will be kept in a datastore (mysql?) so that we only need one client updating it. This can even be a separate job from the actual public API service. The job will also need to be capable of scraping the pybal APIs when needed (for instance when first populating its content).
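The updater side of that data flow could look something like the sketch below. The event field names (service, datacenter, server, pooled) are assumptions, since the pybal event schema doesn't exist yet; the point is only that event integration and bootstrap-by-scrape write into the same store:

```python
# Sketch of the updater job. Events are assumed to carry the fields
# service, datacenter, server and pooled (hypothetical schema).

def apply_event(store, event):
    """Integrate a single pooled-status-change event into the store."""
    service = store.setdefault(event["service"], {})
    dc = service.setdefault(event["datacenter"], {})
    dc[event["server"]] = event["pooled"]

def bootstrap_from_scrape(store, scraped):
    """Initial population from a full scrape of the pybal APIs.

    `scraped` is assumed to already have the {service: {dc: {server:
    pooled}}} shape; a full scrape replaces any stale state.
    """
    store.clear()
    store.update(scraped)
```

With a real datastore (mysql?) these would become an UPSERT per event and a bulk load respectively, but the reconciliation logic is the same.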

Event Timeline

ema triaged this task as Medium priority.Nov 28 2019, 10:29 AM

We could also think of writing a sort of HTTP router that returns a list of PyBal API endpoints for a node. For instance:

GET /host/cp3060.esams.wmnet
http://lvs3005.esams.wmnet:9090/pools/textlb6_80/cp3060.esams.wmnet
http://lvs3005.esams.wmnet:9090/pools/textlb6_443/cp3060.esams.wmnet
[...]

This approach has the advantage, compared to the solution outlined in the task, of not duplicating state, and the disadvantage of hitting the PyBal API directly, which might be a deal breaker though?
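Since the router holds no pooled state, it reduces to a host-to-pools lookup plus URL construction. A minimal sketch, assuming the mapping is derived from pybal configuration (the entries below just mirror the example output above and are illustrative only):

```python
# Hypothetical mapping from node to (LVS host, pool) pairs, which the
# router would derive from pybal configuration.
POOLS_BY_HOST = {
    "cp3060.esams.wmnet": [
        ("lvs3005.esams.wmnet", "textlb6_80"),
        ("lvs3005.esams.wmnet", "textlb6_443"),
    ],
}

def pybal_urls(host):
    """Return the PyBal API endpoints for a node, one per pool."""
    return [
        f"http://{lvs}:9090/pools/{pool}/{host}"
        for lvs, pool in POOLS_BY_HOST.get(host, [])
    ]
```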

This still leaves the client with the burden of understanding pybal's logic and of reconciling results from multiple LVS servers.

Even if we go with a simple solution, I'd rather proxy requests for our users and give a more understandable response back.

need to be able to understand the pooled status

I have to question this. Why do they need to do so? Do they have different criteria than what pybal uses to pool nodes? And if they do, would it make more sense to add support for that to pybal, e.g. by allowing the nodes to expose more information about their status (e.g. Lag == RED) and have pybal pool/depool based on that?

To be clearer: a sidecar approach would be valid here, and since this already has nginx it could just route e.g. /healthz to the sidecar, which would make all the decisions internally, talking to whatever it needs to talk to. That would mean the clients wouldn't need to know anything about the pooled status of backends.

BBlack subscribed.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!