
Magnum control plane trying to access https://discovery.etcd.io/new?size=1
Closed, Resolved, Public

Description

Creation of k8s clusters no longer works on codfw1dev since the control nodes moved off of public IPs. According to the docs for 'coe cluster create':

--discovery-url <discovery-url>

    The custom discovery url for node discovery. This is used by the COE to discover the servers that have been created to host the containers. The actual discovery mechanism varies with the COE. In some cases, Magnum fills in the server info in the discovery service. In other cases, if the discovery-url is not specified, Magnum will use the public discovery service at:

    https://discovery.etcd.io

    In this case, Magnum will generate a unique url here for each cluster and store the info for the servers.

Naturally that doesn't work since cloudcontrol nodes don't have access to the wider internet.

I don't entirely understand what this is for, so I'm not sure what the right solution is. We could set up a proxy, or (I assume) create our own internal discovery service to use instead. It looks like we can override the default in config, so we wouldn't require users to always specify a custom --discovery-url.

From magnum.conf:

# Url for etcd public discovery endpoint. (string value)
#etcd_discovery_service_endpoint_format = https://discovery.etcd.io/new?size=%(size)d
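For illustration, here is a minimal sketch of how that template expands, assuming Magnum fills it in with ordinary Python %-style formatting (the `%(size)d` placeholder suggests this, but the sketch is not taken from Magnum's code). The function name is hypothetical, and the internal endpoint reuses a hostname suggested elsewhere in this task with an assumed `/new?size=` path.

```python
# Default template copied from the magnum.conf comment above.
DEFAULT_FORMAT = "https://discovery.etcd.io/new?size=%(size)d"

# Assumption: an internal endpoint that mirrors the upstream URL layout.
INTERNAL_FORMAT = "https://etcd-discovery.codfw1dev.wmcloud.org/new?size=%(size)d"


def discovery_request_url(endpoint_format: str, size: int) -> str:
    """Expand the %(size)d placeholder with the requested cluster size."""
    return endpoint_format % {"size": size}


if __name__ == "__main__":
    print(discovery_request_url(DEFAULT_FORMAT, 1))
    print(discovery_request_url(INTERNAL_FORMAT, 3))
```

With the default template and size 1 this yields the URL from the task title, which is why the control plane attempts to reach the public internet.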

Event Timeline

I guess the easiest would be to deploy an instance of https://github.com/etcd-io/discoveryserver to some virtual machine in the cloudinfra project and point the magnum config to it.

Indeed. It's even packaged in Debian https://packages.debian.org/bookworm/etcd-discovery for this very exact purpose!

Right, so even easier:

  • create a puppet role/profile to install this package and the configuration.
  • create a VM using the above role
  • point some proxy address to the VM, for example etcd-discovery.codfw1dev.wmcloud.org
  • refresh magnum config to include the new endpoint
  • do the same in eqiad1

Is running it on a VM better than just running it alongside magnum on the cloudcontrols? I'm guessing there's some coordination involved here so having multiple discovery servers requires them to coordinate somehow?

/me reads the docs

Is running it on a VM better than just running it alongside magnum on the cloudcontrols?

Yes: hardware is mostly pets. VMs are mostly cattle.

I don't see any benefit in running it on hardware.

  • point some proxy address to the VM, for example etcd-discovery.codfw1dev.wmcloud.org

We don't even need this public FQDN. It can be something like etcd-discovery.svc.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud, given that the connections won't leave the openstack virtual network.

I have this service running on etcd-discovery-1.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud behind proxy etcd-discovery.codfw1dev.wmcloud.org (for https purposes).

Whenever etcd-discovery tries to talk to the local etcd node it fails with

setupToken returned: Couldn't setup state <nil> client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.

I'm guessing this means that etcd needs more setup than 'apt install etcd' is providing. Can't find any docs to that effect though.

If I'm reading the docs correctly, the discovery server requires a working etcd backend :-(

I guess you need to install the etcd-client Debian package, which contains the etcdctl tool, to bootstrap the etcd cluster before the discovery service will work.

Here is how this is done for toolforge: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Deploying#etcd_nodes

In case you think this is too much work, we may reconsider other options, like creating a reverse proxy for the upstream https://discovery.etcd.io (i.e., a small VM with haproxy, nginx, or similar that just proxies to the upstream one, and therefore uses the egress NAT and can contact the internet).

We have existing puppetization for etcd that we can reuse - I don't think it should be very complicated to set up a single-node etcd cluster.

I just checked the reverse proxy option. It is definitely simple. I don't think it is worth deploying our own etcd server.

With this nginx config file:

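server {
	# Listen on plain HTTP for both IPv4 and IPv6.
	listen 80 default_server;
	listen [::]:80 default_server;
	location / {
		# Forward every request to the public discovery service.
		proxy_pass https://discovery.etcd.io/;
		# Rewrite upstream URLs in response bodies so clients keep using the proxy.
		sub_filter "https://discovery.etcd.io/" "http://localhost/";
		sub_filter_types *;
		sub_filter_once off;
	}
}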

You can do this:

aborrero@etcd-discovery-1:~$ curl -X PUT http://localhost:80/new?size=1 ; echo
http://localhost/f9c8c3ee4cecc76a3ec5281cca21ee69
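The same request can be sketched in Python with only the standard library. This mirrors the curl call above (a PUT to `/new?size=N`); the function names are hypothetical, and it is only a rough equivalent for testing the proxy, not how Magnum itself talks to the service.

```python
import urllib.request


def build_new_cluster_request(base_url: str, size: int) -> urllib.request.Request:
    """Build (but do not send) the PUT request that asks for a new discovery URL."""
    url = f"{base_url.rstrip('/')}/new?size={size}"
    return urllib.request.Request(url, method="PUT")


def request_discovery_url(base_url: str, size: int, timeout: float = 10.0) -> str:
    """Send the request and return the response body (the discovery URL)."""
    req = build_new_cluster_request(base_url, size)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


if __name__ == "__main__":
    # Assumes the nginx proxy from this comment is listening on localhost:80.
    print(request_discovery_url("http://localhost:80", 1))
```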

It should be trivial to add https everywhere and have our own service FQDN.

Note that the sub_filter is required; otherwise the response contains the external endpoint https://discovery.etcd.io/96051d49d10377b76163cd7565f4e45[..].
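To make the effect of the rewrite concrete, here is a small Python emulation of the substitution. It is plain string replacement standing in for `sub_filter` with `sub_filter_once off`, not nginx itself; the function name is hypothetical, and the token is the one from the curl output above.

```python
UPSTREAM = "https://discovery.etcd.io/"
LOCAL = "http://localhost/"


def rewrite_body(body: str, upstream: str = UPSTREAM, local: str = LOCAL) -> str:
    """Replace every occurrence of the upstream base URL with the proxy's URL."""
    return body.replace(upstream, local)


if __name__ == "__main__":
    upstream_response = "https://discovery.etcd.io/f9c8c3ee4cecc76a3ec5281cca21ee69"
    print(rewrite_body(upstream_response))
```

Without this step the cluster nodes would be handed a URL on discovery.etcd.io directly and would bypass the proxy.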

Change 937104 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Add puppet role and profile for etcd_discovery service

https://gerrit.wikimedia.org/r/937104

Change 937138 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Magnum: allow configuration of etcd discovery service host

https://gerrit.wikimedia.org/r/937138

Change 937104 merged by Andrew Bogott:

[operations/puppet@production] Add puppet role and profile for etcd_discovery service

https://gerrit.wikimedia.org/r/937104

Change 937172 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] etcd-discovery: restart etcd after config change

https://gerrit.wikimedia.org/r/937172

Change 937138 merged by Andrew Bogott:

[operations/puppet@production] Magnum: allow configuration of etcd discovery service host

https://gerrit.wikimedia.org/r/937138

Change 937172 merged by Andrew Bogott:

[operations/puppet@production] etcd-discovery: restart etcd after config change

https://gerrit.wikimedia.org/r/937172

Change 937176 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] magnum: use eqiad1-hosted etcd discovery service

https://gerrit.wikimedia.org/r/937176

Change 937176 merged by Andrew Bogott:

[operations/puppet@production] magnum: use eqiad1-hosted etcd discovery service

https://gerrit.wikimedia.org/r/937176