
some hosts provisioned with 127.0.1.1 entries in /etc/hosts
Closed, Declined
Public

Description

From looking at an issue with elastic1008, it seems some hosts have a wrong
entry in /etc/hosts, with the hostname itself pointing to 127.0.1.1 as
opposed to the real IP:
<root at palladium:~# salt '*' cmd>
127.0.1.1 carbon.wikimedia.org carbon
127.0.1.1 elastic1008.eqiad.wmnet elastic1008
127.0.1.1 ssl1005.wikimedia.org ssl1005
127.0.1.1 ytterbium.wikimedia.org ytterbium
127.0.1.1 labstore1001.eqiad.wmnet labstore1001
127.0.1.1 searchidx1001.eqiad.wmnet searchidx1001
127.0.1.1 ssl1009.wikimedia.org ssl1009
127.0.1.1 ssl1006.wikimedia.org ssl1006
127.0.1.1 labsdb1002.eqiad.wmnet labsdb1002
127.0.1.1 labsdb1001.eqiad.wmnet labsdb1001
127.0.1.1 virt1001.eqiad.wmnet virt1001
127.0.1.1 gadolinium.wikimedia.org gadolinium
127.0.1.1 hafnium.wikimedia.org hafnium
127.0.1.1 labsdb1003.eqiad.wmnet labsdb1003
Apparently debian-installer's netcfg does this when it can't find an address
for the hostname, so a temporary failure during provisioning might be one of
the reasons:
https://www.debian.org/doc/manuals/debian-reference/ch05.en.html#_the_hostname_resolution
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=316099
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=719621
The full impact is not clear. Resolving either the qualified or the unqualified
name through libc's resolver will return 127.0.1.1, so applications that bind
on * or localhost will still work. It will likely break applications that need
to bind specifically to a non-localhost interface (or that we force in the
config to use the DNS address, for example).
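
To make the impact concrete, here is a minimal Python sketch (not part of the original report) of a first-match lookup in hosts-file style text. The names `lookup_in_hosts` and `resolves_to_loopback` are made up for illustration, and the parsing is a simplification of what libc's files lookup does. The sample entry is one of the lines reported above.

```python
import ipaddress

# One of the entries reported above, plus the usual localhost line.
SAMPLE = """\
127.0.0.1 localhost
127.0.1.1 elastic1008.eqiad.wmnet elastic1008
"""


def lookup_in_hosts(hosts_text, name):
    """Return the first address the hosts text maps `name` to, or None."""
    wanted = name.lower()
    for raw in hosts_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) < 2:
            continue
        address, names = fields[0], fields[1:]
        if wanted in (n.lower() for n in names):
            return address
    return None


def resolves_to_loopback(hosts_text, name):
    """True if `name` is present and maps to an address in a loopback range."""
    address = lookup_in_hosts(hosts_text, name)
    return address is not None and ipaddress.ip_address(address).is_loopback


for name in ("elastic1008.eqiad.wmnet", "elastic1008"):
    print(name, lookup_in_hosts(SAMPLE, name), resolves_to_loopback(SAMPLE, name))
```

Both the qualified and the unqualified name come back as a loopback address here, which is why a service told to bind to "the host's own name" would end up listening only on loopback.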

Details

Reference
rt8130

Event Timeline

rtimport raised the priority of this task to Medium. Dec 18 2014, 2:13 AM
rtimport added a project: ops-core.
rtimport set Reference to rt8130.

On Thu Aug 14 14:22:06 2014, fgiunchedi wrote:

from looking at an issue with elastic1008 it seems there are some hosts with
the wrong entry in /etc/hosts for the hostname itself pointing to 127.0.1.1 as

Yeah, while the reason is bigger as you explain below, I think it was gnome that uncovered this first.

opposed to the real ip:

<root at palladium:~# salt '*' cmd>
so that will still work for applications that bind on * or localhost. It will
likely break for applications that want to bind specifically a non-localhost
interface (or which we force in the config to the dns address for example)

Status changed from 'new' to 'open' by RT_System

On Mon Aug 25 09:31:17 2014, akosiaris wrote:

this apparently is done when debian-installer netcfg can't find an address for
the hostname, so that might be one of the reasons (temporary failure during
provisioning)

I might be making a logical leap here, but this reminds me of a case Faidon
and I debugged some months ago (shortly after brewster was shut down and
carbon introduced), where a lot of machines did not have a proper
/etc/network/interfaces configuration stanza but had DHCP configured instead.
It was later traced to a human error (mine) on carbon and to our puppet config
missing configurations for various subnets in
modules/install-server/files/autoinstall/subnets/

Some were fixed inadvertently by me (like
https://gerrit.wikimedia.org/r/#/c/115154/), others were fixed on purpose.

Given the machines involved, I am pretty sure this is connected. In that case
we might have already solved it? And as Faidon says in a comment in the commit
above, we need to automate/template these things more.

ha-ha! that would explain it indeed! and it is much more reasonable than "dns
resolution temporarily failed"
anyway, since it might be bound to happen again, what I had in mind is a puppet
check that would fail if it finds an entry like "127.0.1.1 <hostname> <fqdn>"
in /etc/hosts, on the basis that it would start failing right after
provisioning (and not later, since we don't touch /etc/hosts for good reasons)
It isn't a real solution for sure, more like an item in the "check provisioning
did the right thing" checklist, thoughts?
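
A rough sketch of the kind of check described above, written in Python rather than as the puppet check that was actually proposed. `find_bad_entries` and `check_hosts_file` are hypothetical names, and treating any match as a failure is an assumption about how the check would be wired up. Note that the entries reported in the description list the FQDN before the short name, so the sketch matches either order.

```python
import socket

BAD_ADDRESS = "127.0.1.1"


def find_bad_entries(hosts_text, hostname, fqdn):
    """Return hosts-file lines that map this host's own names to 127.0.1.1."""
    own_names = {hostname.lower(), fqdn.lower()}
    bad = []
    for raw in hosts_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # ignore comments
        if len(fields) < 2 or fields[0] != BAD_ADDRESS:
            continue
        if own_names & {name.lower() for name in fields[1:]}:
            bad.append(raw.strip())
    return bad


def check_hosts_file(path="/etc/hosts"):
    """Run the check against a real hosts file using the local host's names."""
    with open(path) as handle:
        text = handle.read()
    return find_bad_entries(text, socket.gethostname(), socket.getfqdn())
```

A wrapper could call `check_hosts_file()` right after provisioning and exit non-zero when the returned list is not empty, which fits the "checklist" role described above.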

On Tue Aug 26 10:13:51 2014, fgiunchedi wrote:

On Mon Aug 25 09:31:17 2014, akosiaris wrote:

this apparently is done when debian-installer netcfg can't find an address for
the hostname, so that might be one of the reasons (temporary failure during
provisioning)

I might be making a logical leap here, but this reminds me of a case Faidon
and I debugged some months ago (shortly after brewster was shut down and
carbon introduced), where a lot of machines did not have a proper
/etc/network/interfaces configuration stanza but had DHCP configured instead.
It was later traced to a human error (mine) on carbon and to our puppet config
missing configurations for various subnets in
modules/install-server/files/autoinstall/subnets/

Some were fixed inadvertently by me (like
https://gerrit.wikimedia.org/r/#/c/115154/), others were fixed on purpose.

Given the machines involved, I am pretty sure this is connected. In that case
we might have already solved it? And as Faidon says in a comment in the commit
above, we need to automate/template these things more.

ha-ha! that would explain it indeed! and it is much more reasonable than "dns
resolution temporarily failed"

anyway, since it might be bound to happen again, what I had in mind is a puppet
check that would fail if it finds an entry like "127.0.1.1 <hostname> <fqdn>"
in /etc/hosts, on the basis that it would start failing right after
provisioning (and not later, since we don't touch /etc/hosts for good reasons)

It isn't a real solution for sure, more like an item in the "check provisioning
did the right thing" checklist, thoughts?

proposed a check in https://gerrit.wikimedia.org/r/#/c/157795/1
another run from today:
<root at palladium:~# salt --timeout 20 --show-timeout --output=raw '*' cmd>

fgiunchedi changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)". May 29 2015, 3:15 PM
fgiunchedi changed the edit policy from "WMF-NDA (Project)" to "All Users".
fgiunchedi set Security to None.

Unlikely this is still relevant