
Move labs-recursors in WMCS
Closed, Declined · Public

Description

Cloud VPS instances are currently pointed at a couple of recursor IPs, labs-recursor0 and labs-recursor1, which run on the labservices cluster, itself part of production.

This is non-ideal, as it creates yet another cloud->prod flow and an escalation vector in case of exploitation. These recursors should move to instances within WMCS, e.g. a couple of instances in the cloudinfra project, similar to what we did with the smarthosts in T41785.

The only gotcha seems to be that the recursor runs some custom Lua code that uses data generated by a Python script, which in turn seems to gather that data from Nova's API. I'm not sure if that's accessible publicly or from within WMCS.

SRE can possibly help with the move, as long as we all agree on the plan and the Nova API -> VPS data flow gets figured out :)

Note that this is different from Cloud's authoritative DNS servers (labs-ns0/labs-ns1); currently both the recursors and the authoritatives run as distinct services on the same servers, but there is no reason for that to continue to be the case (and they are distinct in production, for example).

Also see: T119660, T200358.

Event Timeline

faidon triaged this task as Medium priority. Oct 20 2018, 9:53 AM
faidon created this task.

The only gotcha seems to be that the recursor runs some custom Lua code that uses data generated by a Python script, which in turn seems to gather that data from Nova's API. I'm not sure if that's accessible publicly or from within WMCS.

It is: there are guest credentials known as 'novaobserver' which VMs can use, and indeed that appears to be what the current setup uses, so we should be fine there.
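
For reference, a minimal sketch of what that lookup can look like from inside a VM, using keystoneauth1 and novaclient against the Keystone public endpoint. The auth URL, project, domain and password values below are placeholders, not the real WMCS settings:

# Sketch: list instances using the read-only 'novaobserver' guest account
# against the Keystone public endpoint (all values are placeholders).
from keystoneauth1.identity import v3
from keystoneauth1 import session
from novaclient import client as nova_client

auth = v3.Password(
    auth_url='https://keystone.example.org:5000/v3',  # placeholder endpoint
    username='novaobserver',
    password='REDACTED',               # guest read-only password
    project_name='observer',           # placeholder project
    user_domain_id='default',
    project_domain_id='default',
)
sess = session.Session(auth=auth)
nova = nova_client.Client('2.1', session=sess)

# Print each instance name with its fixed IPs.
for server in nova.servers.list():
    addrs = [a['addr'] for nets in server.addresses.values() for a in nets]
    print(server.name, addrs)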

Change 468709 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] labsaliaser: use keystone public port instead of admin port

https://gerrit.wikimedia.org/r/468709

Change 468714 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] labs dnsrecursor: require clientlib before labsaliaser

https://gerrit.wikimedia.org/r/468714

Created labs-dnsrecursor-alex-test.openstack.eqiad.wmflabs (edit: since shut down, but these instructions are still helpful) and applied profile::openstack::base::pdns::recursor::service as well as this hieradata to make it as similar to a random prod labs-recursor as possible:

profile::openstack::base::keystone_host: cloudcontrol1003.wikimedia.org
profile::openstack::base::pdns::recursor: labs-recursor2.wikimedia.org
profile::openstack::base::nova_controller: cloudcontrol1003.wikimedia.org
profile::openstack::base::pdns::tld: wmflabs
profile::openstack::base::pdns::recursor_aliaser_extra_records:
  tools-db.tools.eqiad.wmflabs.: 10.64.37.9 # labsdb1005.eqiad.wmnet / tools-db
  tools-redis.tools.eqiad.wmflabs.: 10.68.22.56 # tools-redis-1001.tools.eqiad.wmflabs
  tools-redis.eqiad.wmflabs.: 10.68.22.56 # tools-redis-1001.tools.eqiad.wmflabs
  puppet.: 208.80.154.158 # labpuppetmaster1001.wikimedia.org
profile::openstack::base::pdns::use_metal_resolver: True
profile::openstack::base::observer_project: observer
profile::openstack::base::pdns::host: labs-ns2.wikimedia.org
profile::openstack::base::pdns::private_reverse_zones:
- '68.10.in-addr.arpa'
- '16.172.in-addr.arpa'
- '56.15.185.in-addr.arpa'
profile::openstack::base::observer_password: Fs6Dq2RtG8KwmM2Z
profile::openstack::base::observer_user: novaobserver

(the password is publicly available elsewhere; it's a guest read-only account)
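
For anyone following along, the data flow here is roughly: a Python script queries Nova with the novaobserver credentials and writes out a name-to-IP mapping (merged with the recursor_aliaser_extra_records above) that the recursor's custom Lua code then reads. A rough sketch of that merge-and-dump step, with a hypothetical output path and JSON layout, not claiming to match the real labsaliaser script:

# Sketch only: gather instance addresses from Nova, merge in the static
# extra records from the hieradata above, and dump them for the Lua hook.
# The naming scheme, output path and format are hypothetical.
import json

EXTRA_RECORDS = {
    'tools-db.tools.eqiad.wmflabs.': '10.64.37.9',
    'tools-redis.tools.eqiad.wmflabs.': '10.68.22.56',
    'tools-redis.eqiad.wmflabs.': '10.68.22.56',
    'puppet.': '208.80.154.158',
}

def write_aliases(nova, path, tld='wmflabs'):
    """Write <instance>.<project>.eqiad.<tld>. -> IP plus the static extras.

    `nova` is an authenticated novaclient client (see the earlier sketch).
    """
    aliases = dict(EXTRA_RECORDS)
    for server in nova.servers.list(search_opts={'all_tenants': True}):
        # The real script presumably resolves tenant_id to a project name;
        # the id is used here only to keep the sketch short.
        for addresses in server.addresses.values():
            for addr in addresses:
                name = '%s.%s.eqiad.%s.' % (server.name, server.tenant_id, tld)
                aliases[name] = addr['addr']
    with open(path, 'w') as f:
        json.dump(aliases, f, indent=2)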

Change 468709 merged by Andrew Bogott:
[operations/puppet@production] labsaliaser: use keystone public port instead of admin port

https://gerrit.wikimedia.org/r/468709

My only concern about this is that those recursors are used about every second on every VM, so they're a huge, vital point of failure and I'm a bit reluctant to rock the boat.

In theory having redundant recursors will work around the issue with us needing to occasionally restart/move VMs but it will still make things a lot more fragile. In general I'd prefer to address any concerns with the recursors directly.

My only concern about this is that those recursors are used about every second on every VM, so they're a huge, vital point of failure and I'm a bit reluctant to rock the boat.

In theory having redundant recursors will work around the issue with us needing to occasionally restart/move VMs but it will still make things a lot more fragile. In general I'd prefer to address any concerns with the recursors directly.

Is it possible to pin an instance to a specific labvirt/cloudvirt node (or a set of nodes) to ensure that the recursor instances are always on different virt hosts? Then we'd have availability guarantees roughly comparable to what we do with serpens/seaborgium; if one of the two (or three) recursors is down, the VPS host would query the next server configured in resolv.conf once the current timeout has passed (currently configured to two seconds).
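
To illustrate the failover behaviour being relied on there, here is a small sketch with dnspython (>= 2.0) that mimics the stub resolver's per-server timeout and next-server fallback; the recursor IPs are placeholders:

# Sketch: query two recursors in order with a 2 second per-server timeout,
# roughly what "options timeout:2" plus two nameserver lines in resolv.conf
# gives the glibc stub resolver. IPs are placeholders.
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ['10.68.16.1', '10.68.16.2']  # placeholder recursor IPs
resolver.timeout = 2.0    # per-nameserver timeout
resolver.lifetime = 6.0   # overall deadline across retries

for rr in resolver.resolve('wikitech.wikimedia.org', 'A'):
    print(rr.address)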

My only concern about this is that those recursors are used about every second on every VM, so they're a huge, vital point of failure and I'm a bit reluctant to rock the boat.

In theory having redundant recursors will work around the issue with us needing to occasionally restart/move VMs but it will still make things a lot more fragile. In general I'd prefer to address any concerns with the recursors directly.

Is it possible to pin an instance to a specific labvirt/cloudvirt node (or a set of nodes) to ensure that the recursor instances are always on different virt hosts?

I think this has been done before.

Then we'd have availability guarantees roughly comparable to what we do with serpens/seaborgium; if one of the two (or three) recursors is down, the VPS host would query the next server configured in resolv.conf once the current timeout has passed (currently configured to two seconds).

Apparently this is not always the most reliable mechanism. I'm wondering if we can use anycast routing with neutron and have each labs-recursor* IP backed by multiple instances on different hosts.
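
On the pinning question above: the generic OpenStack mechanism for "always on different virt hosts" is an anti-affinity server group passed as a scheduler hint at boot time. A hedged sketch with novaclient follows, assuming the scheduler has the anti-affinity filter enabled; instance names, image and flavor IDs are made up:

# Sketch: boot two recursor VMs into an anti-affinity server group so the
# scheduler places them on different hypervisors. Assumes `nova` is an
# authenticated novaclient client (as in the earlier sketch) and that
# ServerGroupAntiAffinityFilter is enabled.
group = nova.server_groups.create(name='dns-recursors', policies=['anti-affinity'])

for name in ('cloudinfra-recursor-1', 'cloudinfra-recursor-2'):  # hypothetical
    nova.servers.create(
        name=name,
        image='IMAGE-UUID',     # placeholder
        flavor='FLAVOR-ID',     # placeholder
        scheduler_hints={'group': group.id},
    )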

My only concern about this is that those recursors are used about every second on every VM, so they're a huge, vital point of failure and I'm a bit reluctant to rock the boat.

In theory having redundant recursors will work around the issue with us needing to occasionally restart/move VMs but it will still make things a lot more fragile. In general I'd prefer to address any concerns with the recursors directly.

It sounds like your concern is about SPOFs and the reliability of those new VPSes. Is this related to the underlying reliability expectations of cloudvirts, the reliability of Cloud VPSes in general, or the possibility that those resolvers will end up running on the same cloudvirts and sharing fate?

Regardless, this sounds like a broader concern than just recursors, and one that may affect other sibling tasks of this one as well as other existing critical infrastructure (e.g. the MXes?), so we should find ways to address it anyway. Do you agree?

I don't think SMTP and DNS are on the same level, so perhaps it's not a fair comparison. No VM needs SMTP to work; a failure in the SMTP servers is not a disaster. Most stuff running on Cloud VPS will keep working even if the SMTP servers are down.
Moreover, a failure in an SMTP server wouldn't cause any trouble when fixing the SMTP server itself.

On the other hand, as you all know, DNS servers are a vital part of the infra. If the VPS resolvers run inside OpenStack and we have any issue with the resolvers themselves, we may have severe trouble trying to fix them, which could lead to extended downtime while rebuilding stuff, etc. Being unable to SSH to most VPS instances is probably the first symptom of a failure in the DNS servers, which becomes an additional problem when trying to fix any other issue (think of local puppetmasters, for example).

My concern is with the design approach: avoiding the chicken-and-egg problem. Also, I think cloudvirts are reliable enough to run anything, but we lack a key feature, which is live migration (right now, in order to move a VM from one virt server to another, we have to shut it down). BTW, we won't have live migration in the short term for other reasons.

I will follow up in the parent task with some general comments not related to this specific task.

I'm not sure why there would be a chicken-and-egg problem. Prod recursors run in prod, right? Why is this different?

Also, while I can see a recursor outage cascading into various random issues across the infrastructure, I'm unsure why such an issue would prevent one from ssh'ing to the recursors themselves and fixing them. Is there a specific issue you're thinking of? If an underlying hidden dependency like this exists, perhaps it needs to be addressed regardless.

while I can see a recursor outage cascading into various random issues across the infrastructure, I'm unsure why such an issue would prevent one from ssh'ing to the recursors themselves and fixing them. Is there a specific issue you're thinking of? If an underlying hidden dependency like this exists, perhaps it needs to be addressed regardless.

The most obvious one is that, to SSH in normally as your usual user, the instance needs to be able to do a DNS lookup for ldap-labs.(eqiad|codfw).wikimedia.org so it can find out whether you're in project-${::labsproject} and what the permitted SSH public keys for your username are. But I think you can get around that if you have access to SSH directly as root.

Another issue is that we typically ssh via a bastion -- if the bastion is unable to resolve the target host then the connection will fail.

Yeah, but for that one you can at least do the DNS lookup yourself.

Change 468714 merged by Andrew Bogott:
[operations/puppet@production] labs dnsrecursor: require clientlib before labsaliaser

https://gerrit.wikimedia.org/r/468714

LSobanski subscribed.

Removing SRE, please add the tag back when there are specific actions to be performed.

Declining per work done on T307357: Move cloud vps ns-recursor IPs to host/row-independent addressing

The recursor got a private address from the cloud realm, but lives on a hardware server.