
cloud-init: figure out how to change /etc/hosts from cloud-init/vendordata
Closed, Resolved, Public

Description

The recent nova cloud-init/vendor data setup for /etc/hosts generates a file that doesn't work for grid engine, see T277653#6926509:

127.0.1.1      toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud       toolsbeta-sgegrid-master
127.0.0.1       localhost

This could work:

#127.0.1.1      toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud       toolsbeta-sgegrid-master
127.0.0.1       localhost

This also works:

172.16.0.126    toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud       toolsbeta-sgegrid-master
127.0.0.1       localhost

This does not work:

127.0.0.1       localhost  toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud       toolsbeta-sgegrid-master

This does not work either:

127.0.0.1      toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud       toolsbeta-sgegrid-master
127.0.0.1       localhost

As of today, the /etc/hosts file is generated by cloud-init from the template /etc/cloud/templates/hosts.debian.tmpl:

{# The value '{{hostname}}' will be replaced with the local-hostname -#}
127.0.1.1 {{fqdn}} {{hostname}}
127.0.0.1 localhost

But we don't have that template in our puppet repo (or I couldn't find it), so we need to figure out what to do to make the grid happier.

Event Timeline

aborrero triaged this task as Medium priority. Mar 19 2021, 12:48 PM
aborrero created this task.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Change 673475 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: nova: disable /etc/host management from cloud-init

https://gerrit.wikimedia.org/r/673475

Another option is to manage the template /etc/cloud/templates/hosts.debian.tmpl via puppet before cloud-init runs at VM creation time.

If {{fqdn}} {{hostname}} aren't really getting substituted then something interesting/bad is happening :(

I'd like to try to understand why cloud-init is misfiring. My recollection is that I added this setting to cloud-init to simplify VM startup; previously we had to update /etc/hosts with a firstboot script which was quite fragile.

To clarify: the issue is not that {{fqdn}} and the related variables are not being substituted. It is that the grid does not like an entry where the host's IP is not its external IP (I think it tries to resolve the hostname forwards and backwards in some fancy way, and with an entry like that the check fails). A chunk of the strace:

connect(3, {sa_family=AF_INET, sin_port=htons(6444), sin_addr=inet_addr("127.0.1.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=3, events=POLLIN|POLLPRI|POLLOUT}], 1, 1000) = 1 ([{fd=3, revents=POLLOUT}])
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
write(3, "<gmsh><dl>280</dl></gmsh><cm version=\"0.4\"><df>bin</df><ct>message</ct><dst host=\"toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud\" comp=\"qmaster\" id=\"1\"></dst><rdata host=\"toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud\" comp=\"qconf\" id=\"0\"></rdata><port>0</port><ac>disabled</ac></cm>", 305) = 305
read(3, "<gmsh><dl>369</dl></gm", 22)   = 22
read(3, "s", 1)                         = 1
read(3, "h", 1)                         = 1
read(3, ">", 1)                         = 1
read(3, "<crm version=\"0.3\"><cs condition=\"access denied\">client IP resolved to host name \"localhost\". This is not identical to clients host name \"toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud\"</cs><rdata host=\"toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud\" comp=\"qconf\" id=\"6537\"></rdata><params>gdi_timeout=60:gdi_retries=0:cl_ping=false</params></crm>", 369) = 369
futex(0x564a426732a4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(2, "error: commlib error: access denied (client IP resolved to host name \"localhost\". This is not identical to clients host name \"toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud\")\n", 186) = 186
write(2, "unable to send message to qmaster using port 6444 on host \"toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud\": got send error\n", 134) = 134
shutdown(3, SHUT_RDWR)                  = 0
close(3)                                = 0
exit_group(1)                           = ?
+++ exited with 1 +++

Can you provide me with a before and after picture of a 'good' /etc/hosts and a 'bad' /etc/hosts? I'm having trouble following T277653#6926509

thanks!

This does not work (generated config):

127.0.1.1      toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud       toolsbeta-sgegrid-master
127.0.0.1       localhost

This works:

#127.0.1.1      toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud       toolsbeta-sgegrid-master
127.0.0.1       localhost

This also works:

172.16.0.126    toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud       toolsbeta-sgegrid-master
127.0.0.1       localhost

This does not work:

127.0.0.1       localhost  toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud       toolsbeta-sgegrid-master

This does not work either:

127.0.0.1      toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud       toolsbeta-sgegrid-master
127.0.0.1       localhost
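
One pattern that fits all five cases listed above: the file fails whenever the FQDN appears on a line whose address is a loopback address, and works when the FQDN is absent or mapped to the VM's own address. The sketch below only encodes that observation as a quick check. It is not how Grid Engine itself performs its lookup, and the function name `loopback_mappings` is made up for illustration.

```python
import ipaddress


def loopback_mappings(hosts_text, name):
    """Return the loopback addresses that `name` is mapped to in hosts-file text."""
    hits = []
    for raw_line in hosts_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            continue  # not a plain IP address; ignore the line in this sketch
        if name in names and ip.is_loopback:
            hits.append(address)
    return hits


FQDN = "toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud"
SHORT = "toolsbeta-sgegrid-master"

# The file generated by cloud-init (reported as not working).
generated = f"127.0.1.1 {FQDN} {SHORT}\n127.0.0.1 localhost\n"
# The variant with the VM address (reported as working).
with_vm_address = f"172.16.0.126 {FQDN} {SHORT}\n127.0.0.1 localhost\n"

print(loopback_mappings(generated, FQDN))        # ['127.0.1.1']
print(loopback_mappings(with_vm_address, FQDN))  # []
```

An empty result corresponds to the variants reported as working; a non-empty one corresponds to the variants reported as failing.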

If {{fqdn}} {{hostname}} aren't really getting substituted then something interesting/bad is happening :(

I'd like to try to understand why cloud-init is misfiring. My recollection is that I added this setting to cloud-init to simplify VM startup; previously we had to update /etc/hosts with a firstboot script which was quite fragile.

Sorry, the task description is wrong, I'll update it now. I used {{fqdn}} and {{hostname}} as simple placeholders.

The actual problem is what @dcaro described in T277866#6933745

The cloud-init docs point you to update /etc/cloud/templates/hosts.debian.tmpl before cloud-init runs. But if we update that template via puppet, there could be a chicken-and-egg problem (to run puppet we need a good /etc/hosts file, which is provided by cloud-init, which in turn needs a puppet run first). If that's true, we might as well craft the template at image building time, which is something we are trying to avoid, right?

I'm out of ideas :-P

There's also this:

If manage_etc_hosts is set to localhost, then cloud-init will not rewrite /etc/hosts entirely, but rather will ensure that a entry for the fqdn with a distribution dependent ip is present in /etc/hosts (i.e. ping <hostname> will ping 127.0.0.1 or 127.0.1.1 or other ip).

That doesn't sound ideal but might get us what we need; I can experiment.

The cloud-init docs point you to update /etc/cloud/templates/hosts.debian.tmpl before cloud-init runs. But if we update that template via puppet, there could be a chicken-and-egg problem (to run puppet we need a good /etc/hosts file, which is provided by cloud-init, which in turn needs a puppet run first). If that's true, we might as well craft the template at image building time, which is something we are trying to avoid, right?

I'm not sure the chicken/egg issue is very serious here. We already know that the /etc/hosts file generated by cloud-init is adequate for puppetization. If puppet modifies the template during its initial setup, then subsequent puppet runs will get us the /etc/hosts file we really want. The question then is "when does cloud-init refresh /etc/hosts"? I fear the answer is not until a reboot but there are two ways around that:

  1. Puppet could automatically force a cloud-init refresh anytime the template is updated, or
  2. Puppet could manage /etc/hosts AND the template file so that post first puppet run any action by either puppet or cloud-init is a no-op.
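
The second option relies on both writers producing identical content, so that whoever runs second has nothing to change. A minimal sketch of that idea, assuming a hypothetical helper `ensure_content` and a temporary file standing in for /etc/hosts (neither puppet nor cloud-init is involved here):

```python
import os
import tempfile


def ensure_content(path, desired):
    """Write `desired` to `path` only if it differs; return True if a change was made."""
    try:
        with open(path, encoding="utf-8") as handle:
            if handle.read() == desired:
                return False  # already converged: this run is a no-op
    except FileNotFoundError:
        pass  # file does not exist yet, so it has to be written
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(desired)
    return True


desired_hosts = "127.0.0.1 localhost\n"

with tempfile.TemporaryDirectory() as workdir:
    hosts_path = os.path.join(workdir, "hosts")  # stand-in for /etc/hosts
    first_run = ensure_content(hosts_path, desired_hosts)   # first writer creates the file
    second_run = ensure_content(hosts_path, desired_hosts)  # second writer finds nothing to do
    print(first_run, second_run)
```

As long as the template and the puppet-managed file render to the same text, the order in which the two tools run stops mattering.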

Two more questions:

  • If you could choose one option from the list in the task description, what would it be?
  • Is there reason to think that whatever change is made should be made cloud-wide, or only on the grid masters?

I think the final format that makes the most sense to me is this one:

172.16.0.126    toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud       toolsbeta-sgegrid-master
127.0.0.1       localhost

I think this change should be OK for adoption cloud-wide. Why would anyone use the 127.0.1.1 entry anyway?

After some digging, it's not obvious to me that the current ipv4 is available to the template language used by cloud-init. Would it be adequate to just do

127.0.0.1       localhost

?

The .1.1 entry is documented here:

https://www.debian.org/doc/manuals/debian-reference/ch05.en.html#_the_hostname_resolution

As far as I can tell we don't care about that particular work-around.

Change 674091 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs instances: override /etc/cloud/templates/hosts.debian.tmpl

https://gerrit.wikimedia.org/r/674091

Ok, I see. The document also mentions:

For a system with a permanent IP address, that permanent IP address should be used here instead of 127.0.1.1.

I think it should be safe to replace 127.0.1.1 with the proper VM address.

This is the current grid master (stretch) /etc/hosts file:

# HEADER: This file was autogenerated at 2019-12-04 20:47:27 +0000
# HEADER: by puppet.  While it can still be managed manually, it
# HEADER: is definitely not recommended.
127.0.0.1	localhost
::1	localhost	ip6-localhost ip6-loopback
ff02::1	ip6-allnodes
ff02::2	ip6-allrouters

172.16.4.197	tools-sgegrid-master.tools.eqiad.wmflabs	tools-sgegrid-master
10.64.16.149	statsd.eqiad.wmnet	statsd

As for gathering this data, it seems to be known to cloud-init. At least I can run this:

buster VM
aborrero@toolsbeta-sgegrid-master:~ $ cloud-init query ds.ec2_metadata.local_ipv4
172.16.0.126

Not sure if we can query this JSON (try cloud-init query ds) at the appropriate moment.
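
For readers unfamiliar with the dotted path used in that query, here is a rough sketch of how such a path walks nested JSON. The `sample` document is hand-written for illustration and only mirrors the key names from the command above; the real instance data may be laid out differently, and `lookup` is a hypothetical helper, not part of cloud-init.

```python
import json


def lookup(document, dotted_key):
    """Follow a dotted path such as 'a.b.c' through nested dictionaries."""
    node = document
    for part in dotted_key.split("."):
        node = node[part]  # raises KeyError if any path component is missing
    return node


# Hand-written sample, NOT real cloud-init output.
sample = json.loads('{"ds": {"ec2_metadata": {"local_ipv4": "172.16.0.126"}}}')

print(lookup(sample, "ds.ec2_metadata.local_ipv4"))  # 172.16.0.126
```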

Reading this https://cloudinit.readthedocs.io/en/latest/topics/instancedata.html#using-instance-data suggests we can use that information in any cloud-init template: {{ds.ec2_metadata.local_ipv4}} so perhaps the file would be something like:

{{ds.ec2_metadata.local_ipv4}} {{fqdn}} {{hostname}}
127.0.0.1 localhost

It doesn't seem to be using the full cloud-init templating language, only fqdn and hostname. At least, when I try other options I get errors and/or 'CI_MISSING_JINJA_VAR/<whatever>' inserted into the file.
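
To make the observed behaviour concrete, here is a toy placeholder substitution that mimics it: known variables are filled in and unknown ones are replaced by a marker. This is not cloud-init's renderer; `render` is a made-up function, and the marker string is simply copied from the comment above.

```python
import re

# Matches {{name}} or {{ dotted.name }} placeholders.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template, variables):
    """Substitute {{name}} placeholders; flag unknown names with a marker."""
    def substitute(match):
        key = match.group(1)
        if key in variables:
            return str(variables[key])
        return "CI_MISSING_JINJA_VAR/" + key  # marker text as reported in this thread
    return PLACEHOLDER.sub(substitute, template)


TEMPLATE = "{{ds.ec2_metadata.local_ipv4}} {{fqdn}} {{hostname}}\n127.0.0.1 localhost\n"

# Only fqdn and hostname are available, as described above.
only_names = {
    "fqdn": "toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud",
    "hostname": "toolsbeta-sgegrid-master",
}
print(render(TEMPLATE, only_names))

# What the file would look like if the address were available too.
with_address = dict(only_names, **{"ds.ec2_metadata.local_ipv4": "172.16.0.126"})
print(render(TEMPLATE, with_address))
```

The first rendering shows why the proposed template is not usable as long as only fqdn and hostname are exposed to it.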

Change 674091 merged by Andrew Bogott:
[operations/puppet@production] wmcs instances: override /etc/cloud/templates/hosts.debian.tmpl

https://gerrit.wikimedia.org/r/674091

Change 674126 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] hosts.debian.tmpl: further attempt to handle VMs without cloud-init

https://gerrit.wikimedia.org/r/674126

Change 674126 merged by Andrew Bogott:
[operations/puppet@production] hosts.debian.tmpl: further attempt to handle VMs without cloud-init

https://gerrit.wikimedia.org/r/674126

Change 673475 abandoned by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: nova: disable /etc/host management from cloud-init

Reason:
merged https://gerrit.wikimedia.org/r/c/674091 instead

https://gerrit.wikimedia.org/r/673475

I just double-checked this and it is solved! Thanks everyone!