Move LabsDB aliases to DNS
Closed, Resolved, Public

Description

At the moment, Labs instances that want to connect to "dewiki.labsdb" and end up at the DB replica server hosting the German Wikipedia have to copy /etc/hosts from a Tools instance. This system (per-database hostname) comes from the toolserver setup and is in active use by (most) tools that need to connect to project databases - in particular, tools which can operate on more than one project use the project name as a 'selector' in this way.

The ideal solution would be to do away with this system entirely, and have tools resolve the proper database dynamically at runtime according to the information made available in every replica in the meta_p.wiki table. This, however, requires altering the code that runs those tools and possibly restructuring it according to the new scheme, something which is difficult to demand of every maintainer (not all of whom are active regularly).
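
A minimal sketch of such a runtime lookup, assuming meta_p.wiki exposes a `dbname` column and a `slice` column holding the host alias (the column names and the sample rows are assumptions, not taken from this task). An in-memory SQLite table stands in for the replica so the example is self-contained:

```python
import sqlite3


def host_for(conn, dbname):
    """Return the host alias serving `dbname`, or None if it is unknown."""
    row = conn.execute(
        "SELECT slice FROM wiki WHERE dbname = ?", (dbname,)
    ).fetchone()
    return row[0] if row else None


# Stand-in for meta_p.wiki; the rows below are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wiki (dbname TEXT PRIMARY KEY, slice TEXT)")
conn.executemany(
    "INSERT INTO wiki VALUES (?, ?)",
    [("enwiki", "s1.labsdb"), ("dewiki", "s5.labsdb")],
)

print(host_for(conn, "dewiki"))
print(host_for(conn, "nosuchwiki"))
```

A real tool would run the same query against whichever replica it can already reach, then open its connection to the returned host.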

A decent intermediate solution is to have those aliases served properly by DNS as part of a subdomain, where only one copy of the data exists and needs to be maintained (and, since it's in git, can be maintained automatically if wanted).

The current changeset proposes allocating labs.$site.wmnet for that purpose, and places the aliases in the zone file accordingly. resolv.conf on labs instances has already been taken over by puppet to include 'labs.$site.wmnet' in the search paths and to set ndots to 2, meaning that the same hostnames will resolve through DNS the same way they did with /etc/hosts.
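
To illustrate why ndots matters here, the following is a simplified model of how a stub resolver orders its queries given `search` and `ndots` from resolv.conf. It is a sketch of the documented behaviour, not the resolver's actual code; the function name is made up for this example, and `labs.eqiad.wmnet` is the eqiad instance of the proposed subdomain:

```python
def candidate_names(name, search, ndots):
    """Return the query names a stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [absolute] + searched
    # Too few dots: walk the search list first, try the bare name last.
    return searched + [absolute]


print(candidate_names("dewiki.labsdb", ["labs.eqiad.wmnet"], ndots=2))
print(candidate_names("dewiki.labsdb", ["labs.eqiad.wmnet"], ndots=1))
```

With ndots set to 2, "dewiki.labsdb" (one dot) is first tried as dewiki.labsdb.labs.eqiad.wmnet; with the default of 1 it would first be sent out as an absolute name under a non-existent labsdb. TLD.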

Possible improvements include delegating the subdomain to a labs server.

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 2:53 AM
bzimport added a project: Cloud-VPS.
bzimport set Reference to bz61897.
bzimport added a subscriber: Unknown Object (MLST).
scfc created this task. Feb 25 2014, 8:56 AM
scfc added a comment. Feb 1 2015, 6:30 PM

IIRC the last time I thought about that there were basically two alternatives:

  1. Put the aliases in operations/dns:templates/wmnet under labsdb.svc.eqiad.wmnet and append svc.eqiad.wmnet to search in /etc/resolv.conf. For consistency, there would need to be some monitoring that checks that the template (or can the actual zone be queried?) is in sync with operations/mediawiki-config:all.dblist. While this is rather easy to set up, this would have the side-effects of a) affecting the resolution for all servers and b) resolving for example appservers to appservers.svc.eqiad.wmnet which might not be intuitive?
  2. Write a backend for pdns which I think serves Labs instances (?) that for *.labsdb reads a git clone of all.dblist, or use the MySQL backend and point it to the existing meta_p.wiki at any Labs database server.

I would prefer the first option as the currentness of data is probably sufficient and there is a lot less possibility for surprises there.
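
The consistency check mentioned under option 1 could look roughly like this. It is a sketch only: it assumes all.dblist is a plain text file with one database name per line and optional `#` comments, and that the served aliases have already been collected into a list (how to collect them from the zone is left open above). The function names are made up for this example:

```python
def parse_dblist(text):
    """Return the set of database names in a dblist-style file."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            names.add(line)
    return names


def alias_drift(dblist_text, zone_aliases, suffix=".labsdb"):
    """Compare expected aliases with those served; return (missing, stale)."""
    expected = {name + suffix for name in parse_dblist(dblist_text)}
    served = set(zone_aliases)
    return sorted(expected - served), sorted(served - expected)


missing, stale = alias_drift(
    "enwiki\ndewiki\n", ["enwiki.labsdb", "oldwiki.labsdb"]
)
print("missing:", missing)
print("stale:", stale)
```

Aliases that are not derived from the dblist, such as s1.labsdb, would show up as stale and would need to be excluded before alerting.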

I think a simpler solution is to use the -A option of dnsmasq to have it return A records for these entries. We can set that in puppet now (see dnsmasq-nova.conf.erb).

scfc added a comment. Feb 2 2015, 12:25 AM

That would indeed be *much* simpler :-):

-A, --address=/<domain>/[domain/]<ipaddr>
       Specify an IP address to return for any host
       in the given domains. […]

But how are the connections then distributed over the three (?) replica servers? AFAICS, currently enwiki goes to one, then some to another and the rest to the third.

We'll just have to replicate how things are set up in the current /etc/hosts (which does *not* use NAT rules, but just picks one of labsdb1001-3 as the default for a particular wiki).

scfc added a comment. Feb 2 2015, 1:16 AM

Ah, you mean one -A option per wiki! I thought just one -A /labsdb/10.something.

ah, yeah. per wiki I meant.
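
As a rough sketch of the "one -A per wiki" idea, the per-wiki entries could be rendered from a mapping into dnsmasq configuration-file lines (`address=` is the config-file form of `-A`). The function name is invented for this example, and only the enwiki address appears in this task; the other one is a placeholder from the documentation range:

```python
import ipaddress


def dnsmasq_address_lines(aliases):
    """Render one `address=/<name>/<ip>` line per alias, sorted by name."""
    lines = []
    for name, ip in sorted(aliases.items()):
        ipaddress.ip_address(ip)  # raises ValueError on a malformed address
        lines.append(f"address=/{name}/{ip}")
    return lines


aliases = {
    "enwiki.labsdb": "10.64.4.11",  # address quoted elsewhere in this task
    "dewiki.labsdb": "192.0.2.12",  # placeholder, not a real replica address
}
print("\n".join(dnsmasq_address_lines(aliases)))
```

Keeping the mapping as data (for example in hiera) and generating the lines from it would preserve the current per-wiki assignment to one of the replica servers.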

Copying my comment from elsewhere:

Currently, aliases for things like enwiki.labsdb and s1.labsdb are maintained as /etc/hosts entries, manually. Why aren't they just DNS records? This causes a multitude of problems, such as:

  • No easily verifiable single source of truth if two instances differ
  • One more step to be able to forget when creating new toollabs nodes
  • Problematic for other projects on labs

This should be in DNS and in puppet (see how public / private aliases are set up with dnsmasq right now).

mark assigned this task to coren. Mar 3 2015, 12:35 PM

Coren, could you correct this to use proper configuration management (and probably DNS)?

coren added a comment. Mar 3 2015, 2:14 PM

There are only two issues with using DNS atm (neither of which is insurmountable; they only need a decision): what domain to put those under (requires setting ndots to 2 labs-wide and adding that domain to search), or whether we create a local TLD (labsdb.) for them.

Opinions?

Can you explain what you mean by 'which domain'? I guess we would want the current names (enwiki.labsdb etc.) to just resolve already. The -A solution described a few comments ago should just work, no?

coren added a comment. Mar 3 2015, 4:20 PM

Resolution won't work with hostnames containing a dot unless they are FQDNs or resolv.conf is set to try the search domains when a name has fewer than a set number of dots. Just setting, say, enwiki.labsdb with a -A effectively makes labsdb. a TLD (solution 2).

That said, it works for me even though it's "wrong" from a DNS pov.

I think that ship sailed when those names were put in /etc/hosts, so let's go with 2. That would also make it easier on other projects.

scfc added a comment. Mar 6 2015, 6:52 AM

I found some old notes regarding possible migration (non-)issues, based on an imaginary world after https://gerrit.wikimedia.org/r/#/c/156599/ was merged, that apply to the real world as well (just because it's always better to be wrong in writing :-)):

  • For instances where enwiki.labsdb & Co. already resolve to 10.64.4.11, etc. there will be no change.
  • For instances where enwiki.labsdb & Co. resolve to 192.168.* that have NAT rules in effect, on trimming down /etc/hosts new processes will use the new aliases. Old processes will continue to use the NAT rules until the next reboot where they can be safely purged or not re-applied.

One issue that I found just now is that /usr/bin/sql greps /etc/hosts to determine whether a database has been replicated. I'll replace that by a call to getent hosts which queries /etc/hosts and DNS.
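
For illustration, the same idea expressed in Python: ask the system resolver whether the alias exists rather than grepping /etc/hosts. This is a sketch, not the actual /usr/bin/sql code; `is_replicated` is a made-up name, and `socket.getaddrinfo` goes through the system resolver, which on a typical setup consults both /etc/hosts and DNS, roughly like `getent hosts`:

```python
import socket


def is_replicated(dbname, suffix=".labsdb", resolve=socket.getaddrinfo):
    """Return True if `<dbname><suffix>` resolves via the system resolver."""
    try:
        resolve(dbname + suffix, None)
    except socket.gaierror:
        return False
    return True


# Typical use on an instance (performs a real lookup):
#     if not is_replicated("enwiki"):
#         raise SystemExit("enwiki is not replicated here")
```

The `resolve` parameter exists only so the check can be exercised without network access.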

scfc added a comment. Mar 6 2015, 8:03 AM

Forgot: I had uploaded a small script at https://gerrit.wikimedia.org/r/#/c/191846/ that transforms text files to YAML (for hiera) which might be useful here.

Change 194858 had a related patch set uploaded (by coren):
Labs: manage resolv.conf in labs also

https://gerrit.wikimedia.org/r/194858

Change 194858 merged by coren:
Labs: manage resolv.conf in labs also

https://gerrit.wikimedia.org/r/194858

Change 194865 had a related patch set uploaded (by coren):
Add labs.eqiad.wmnet. subnet

https://gerrit.wikimedia.org/r/194865

coren renamed this task from "Move LabsDB aliases and NAT to DNS and LabsDB servers" to "Move LabsDB aliases to DNS". Mar 6 2015, 6:31 PM
coren updated the task description.
coren set Security to None.

Edited task description to be clearer about the current status.

@coren: Any updates on this? I had to manually copy another /etc/hosts file again today...

Change 194865 abandoned by coren:
Add labs.eqiad.wmnet. subnet

Reason:
Not going to use this approach

https://gerrit.wikimedia.org/r/194865

coren triaged this task as Medium priority. Mar 25 2015, 2:12 PM

My original approach was to tackle this at the prod DNS level, which had issues. Now that we are nearing having a real DNS server for labs, this is where it should properly belong. I'll sync up with @Andrew to create a properly delegated zone for them.

Another option is to point all *.labsdb to localhost, and then put a mysql proxy there that intelligently routes things. This lets us dynamically adjust weights, and even do failover very nicely. @Springle thoughts? Does such a proxy exist at all? :)

coren added a comment. Mar 25 2015, 2:18 PM

That adds a moving part. I'm pretty sure that's not something we should be gunning for.

Oh totally :) I just remember @Springle asking (a long time ago) for an easy way to shift traffic around.

However, I just realized that if we use DNS properly, *that* will work fine too, so ignore my proxy question :)

coren moved this task from Triage to In Progress on the Cloud-Services board.
scfc added a comment. Mar 25 2015, 5:19 PM

At Toolserver, all connections went through a central HA load-balancing proxy, and inter alia because that HA proxy had to be rebooted daily (!) and took down long-running queries with it every night, I have a very strong aversion to introducing new SPOFs :-).

Yes, please ignore my comment about the proxy :) I realized when writing my second comment that DNS will give us what we want (specifically, the ability to slowly drain connections away from one host if necessary), and that's good enough...

However, note that if we *did* have a proxy, it wouldn't be a SPOF at all - it would run on each exec / submit host, and would be listening on localhost only. So it would go down only when the machine itself has issues. However, it does add another moving part, so let's not do that.

Change 210000 had a related patch set uploaded (by Tim Landscheidt):
Tools: Puppetize database aliases as host resources

https://gerrit.wikimedia.org/r/210000

Change 210000 merged by Yuvipanda:
tools: Puppetize database aliases as host resources

https://gerrit.wikimedia.org/r/210000

We still need to move these to DNS. Need to set up a bunch of blocker tasks for that (moving to designate, split horizon, etc)

Designate is done, split horizon is done, etc. We've switched over to designate - time to pick this back up.

Krenair claimed this task. Sep 16 2015, 5:03 AM

Change 238672 had a related patch set uploaded (by Alex Monk):
Move *.labsdb aliases into DNS

https://gerrit.wikimedia.org/r/238672

Change 238672 merged by Yuvipanda:
labs: Move *.labsdb aliases into DNS

https://gerrit.wikimedia.org/r/238672

scfc added a comment. Sep 18 2015, 6:51 PM

On a host where /etc/hosts does not contain the aliases:

scfc@toolsbeta-vmbuilder-precise:~$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 ubuntu.openstack.eqiad.wmflabs ubuntu

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
scfc@toolsbeta-vmbuilder-precise:~$ host enwiki.labsdb
enwiki.labsdb has address 10.64.4.11
scfc@toolsbeta-vmbuilder-precise:~$

So what remains (for a true move, where nothing is left at the source) is:

  1. Remove the puppetization of /etc/hosts for Toolforge instances.
  2. Remove the aliases from the existing /etc/hosts on Toolforge instances (probably a script and pdsh/salt).

scfc added a comment. Sep 18 2015, 6:54 PM

I retract that; the template for /etc/hosts for Toolforge instances contains a dynamic mapping for tools-redis, so this would need testing first if the IP change can be done in pure Puppet.

Change 239447 had a related patch set uploaded (by Alex Monk):
Pull *.labsdb out of /etc/hosts on tools

https://gerrit.wikimedia.org/r/239447

We still have to keep an /etc/hosts because of tools-db, tools-redis and tools-redis.eqiad.wmflabs. The *.labsdb entries can go though.
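
A sketch of the kind of clean-up script this implies: drop every `*.labsdb` name from hosts-format text while leaving other entries (tools-db, tools-redis, localhost) alone. This is not the merged Puppet change, just an illustration; the function name and the sample addresses are made up:

```python
def strip_labsdb(hosts_text, suffix=".labsdb"):
    """Drop `*.labsdb` names from hosts-format text, keeping everything else."""
    out = []
    for line in hosts_text.splitlines():
        body, sep, comment = line.partition("#")
        fields = body.split()
        if len(fields) < 2:
            out.append(line)  # blank, comment-only, or malformed: keep as-is
            continue
        ip, names = fields[0], fields[1:]
        kept = [n for n in names if not n.endswith(suffix)]
        if not kept:
            continue  # every name on the line was an alias: drop the line
        if len(kept) == len(names):
            out.append(line)  # untouched lines keep their original spacing
        else:
            tail = f" {sep}{comment}" if sep else ""
            out.append(" ".join([ip] + kept) + tail)
    return "".join(entry + "\n" for entry in out)


sample = (
    "127.0.0.1 localhost\n"
    "10.64.4.11 enwiki.labsdb s1.labsdb\n"
    "192.0.2.20 tools-redis tools-redis.eqiad.wmflabs\n"
)
print(strip_labsdb(sample), end="")
```

Because lines that lose all their names are removed and all other lines are passed through, running it twice gives the same result as running it once.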

Change 239447 merged by Yuvipanda:
tools: Pull *.labsdb out of /etc/hosts on tools

https://gerrit.wikimedia.org/r/239447

I think that cleans it out?

scfc closed this task as Resolved. Sep 18 2015, 8:25 PM

Yes.