
Move labs-recursors in WMCS
Closed, Declined · Public

Description

Cloud VPS instances are currently pointed at a couple of recursor IPs, labs-recursor0 and labs-recursor1, which run on the labservices cluster, itself part of production.

This is non-ideal, as it creates yet another cloud->prod flow and an escalation vector in case of exploitation. These recursors should move to instances within WMCS, e.g. a couple of instances in the cloudinfra project, similar to what we did with the smarthosts in T41785.

The only gotcha seems to be that the recursor runs some custom Lua code that uses data generated by a Python script, which in turn seems to gather that data from Nova's API. I'm not sure if that's accessible publicly or from within WMCS.

SRE can possibly help with the move, as long as we all agree on the plan and the Nova API -> VPS data flow gets figured out :)

Note that this is different from Cloud's authoritative DNS servers (labs-ns0/labs-ns1); currently both the recursors and the authoritatives run as distinct services on the same servers, but there is no reason for that to continue to be the case (and they are distinct in production, for example).

Also see: T119660, T200358.

Event Timeline

faidon triaged this task as Medium priority. Oct 20 2018, 9:53 AM
faidon created this task.

The only gotcha seems to be that the recursor runs some custom Lua code that uses data generated by a Python script, which in turn seems to gather that data from Nova's API. I'm not sure if that's accessible publicly or from within WMCS.

It is: there are guest credentials known as 'novaobserver' which VMs can use, and indeed that appears to be what the current setup uses, so we should be fine there.
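
For reference, a minimal sketch of what that lookup can look like from inside a VM, using keystoneauth1 and novaclient against the Keystone public endpoint. The auth URL, project, domain and password values below are placeholders, not the real WMCS settings:

# Sketch: list instances using the read-only 'novaobserver' guest account
# against the Keystone public endpoint (all values are placeholders).
from keystoneauth1.identity import v3
from keystoneauth1 import session
from novaclient import client as nova_client

auth = v3.Password(
    auth_url='https://keystone.example.org:5000/v3',  # placeholder endpoint
    username='novaobserver',
    password='REDACTED',               # guest read-only password
    project_name='observer',           # placeholder project
    user_domain_id='default',
    project_domain_id='default',
)
sess = session.Session(auth=auth)
nova = nova_client.Client('2.1', session=sess)

# Print each instance name with its fixed IPs.
for server in nova.servers.list():
    addrs = [a['addr'] for nets in server.addresses.values() for a in nets]
    print(server.name, addrs)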

Change 468709 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] labsaliaser: use keystone public port instead of admin port

https://gerrit.wikimedia.org/r/468709

Change 468714 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] labs dnsrecursor: require clientlib before labsaliaser

https://gerrit.wikimedia.org/r/468714

Created labs-dnsrecursor-alex-test.openstack.eqiad.wmflabs (edit: since shut down, but these instructions are still helpful) and applied profile::openstack::base::pdns::recursor::service as well as this hieradata to make it as similar to a random prod labs-recursor as possible:

profile::openstack::base::keystone_host: cloudcontrol1003.wikimedia.org
profile::openstack::base::pdns::recursor: labs-recursor2.wikimedia.org
profile::openstack::base::nova_controller: cloudcontrol1003.wikimedia.org
profile::openstack::base::pdns::tld: wmflabs
profile::openstack::base::pdns::recursor_aliaser_extra_records:
  tools-db.tools.eqiad.wmflabs.: 10.64.37.9 # labsdb1005.eqiad.wmnet / tools-db
  tools-redis.tools.eqiad.wmflabs.: 10.68.22.56 # tools-redis-1001.tools.eqiad.wmflabs
  tools-redis.eqiad.wmflabs.: 10.68.22.56 # tools-redis-1001.tools.eqiad.wmflabs
  puppet.: 208.80.154.158 # labpuppetmaster1001.wikimedia.org
profile::openstack::base::pdns::use_metal_resolver: True
profile::openstack::base::observer_project: observer
profile::openstack::base::pdns::host: labs-ns2.wikimedia.org
profile::openstack::base::pdns::private_reverse_zones:
- '68.10.in-addr.arpa'
- '16.172.in-addr.arpa'
- '56.15.185.in-addr.arpa'
profile::openstack::base::observer_password: Fs6Dq2RtG8KwmM2Z
profile::openstack::base::observer_user: novaobserver

(the password is publicly available elsewhere; it's a guest read-only account)
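
For anyone following along, the data flow here is roughly: a Python script queries Nova with the novaobserver credentials and writes out a name-to-IP mapping (merged with the recursor_aliaser_extra_records above) that the recursor's custom Lua code then reads. A rough sketch of that merge-and-dump step, with a hypothetical output path and JSON layout, not claiming to match the real labsaliaser script:

# Sketch only: gather instance addresses from Nova, merge in the static
# extra records from the hieradata above, and dump them for the Lua hook.
# The naming scheme, output path and format are hypothetical.
import json

EXTRA_RECORDS = {
    'tools-db.tools.eqiad.wmflabs.': '10.64.37.9',
    'tools-redis.tools.eqiad.wmflabs.': '10.68.22.56',
    'tools-redis.eqiad.wmflabs.': '10.68.22.56',
    'puppet.': '208.80.154.158',
}

def write_aliases(nova, path, tld='wmflabs'):
    """Write <instance>.<project>.eqiad.<tld>. -> IP plus the static extras.

    `nova` is an authenticated novaclient client (see the earlier sketch).
    """
    aliases = dict(EXTRA_RECORDS)
    for server in nova.servers.list(search_opts={'all_tenants': True}):
        # The real script presumably resolves tenant_id to a project name;
        # the id is used here only to keep the sketch short.
        for addresses in server.addresses.values():
            for addr in addresses:
                name = '%s.%s.eqiad.%s.' % (server.name, server.tenant_id, tld)
                aliases[name] = addr['addr']
    with open(path, 'w') as f:
        json.dump(aliases, f, indent=2)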

Change 468709 merged by Andrew Bogott:
[operations/puppet@production] labsaliaser: use keystone public port instead of admin port

https://gerrit.wikimedia.org/r/468709

My only concern about this is that those recursors are used about every second on every VM, so they're a huge, vital point of failure and I'm a bit reluctant to rock the boat.

In theory having redundant recursors will work around the issue with us needing to occasionally restart/move VMs but it will still make things a lot more fragile. In general I'd prefer to address any concerns with the recursors directly.

My only concern about this is that those recursors are used about every second on every VM, so they're a huge, vital point of failure and I'm a bit reluctant to rock the boat.

In theory having redundant recursors will work around the issue with us needing to occasionally restart/move VMs but it will still make things a lot more fragile. In general I'd prefer to address any concerns with the recursors directly.

Is it possible to pin an instance to a specific labvirt/cloudvirt node (or a set of nodes) to ensure that the recursor instances are always on different virt hosts? Then we'd have availability guarantees roughly comparable to what we do with serpens/seaborgium; if one of the two (or three) recursors is down, the VPS host would query the next server configured in resolv.conf once the current timeout has passed (currently configured to two seconds).
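
To illustrate the failover behaviour being relied on there, here is a small sketch with dnspython (>= 2.0) that mimics the stub resolver's per-server timeout and next-server fallback; the recursor IPs are placeholders:

# Sketch: query two recursors in order with a 2 second per-server timeout,
# roughly what "options timeout:2" plus two nameserver lines in resolv.conf
# gives the glibc stub resolver. IPs are placeholders.
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ['10.68.16.1', '10.68.16.2']  # placeholder recursor IPs
resolver.timeout = 2.0    # per-nameserver timeout
resolver.lifetime = 6.0   # overall deadline across retries

for rr in resolver.resolve('wikitech.wikimedia.org', 'A'):
    print(rr.address)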

My only concern about this is that those recursors are used about every second on every VM, so they're a huge, vital point of failure and I'm a bit reluctant to rock the boat.

In theory having redundant recursors will work around the issue with us needing to occasionally restart/move VMs but it will still make things a lot more fragile. In general I'd prefer to address any concerns with the recursors directly.

Is it possible to pin an instance to a specific labvirt/cloudvirt node (or a set of nodes) to ensure that the recursor instances are always on different virt hosts?

I think this has been done before.

Then we'd have availability guarantees roughly comparable to what we do with serpens/seaborgium; if one of the two (or three) recursors is down, the VPS host would query the next server configured in resolv.conf once the current timeout has passed (currently configured to two seconds).

Apparently this is not always the most reliable mechanism. I'm wondering if we can use anycast routing with neutron and have each labs-recursor* IP backed by multiple instances on different hosts.
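
On the pinning question above: the generic OpenStack mechanism for "always on different virt hosts" is an anti-affinity server group passed as a scheduler hint at boot time. A hedged sketch with novaclient follows, assuming the scheduler has the anti-affinity filter enabled; instance names, image and flavor IDs are made up:

# Sketch: boot two recursor VMs into an anti-affinity server group so the
# scheduler places them on different hypervisors. Assumes `nova` is an
# authenticated novaclient client (as in the earlier sketch) and that
# ServerGroupAntiAffinityFilter is enabled.
group = nova.server_groups.create(name='dns-recursors', policies=['anti-affinity'])

for name in ('cloudinfra-recursor-1', 'cloudinfra-recursor-2'):  # hypothetical
    nova.servers.create(
        name=name,
        image='IMAGE-UUID',     # placeholder
        flavor='FLAVOR-ID',     # placeholder
        scheduler_hints={'group': group.id},
    )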

My only concern about this is that those recursors are used about every second on every VM, so they're a huge, vital point of failure and I'm a bit reluctant to rock the boat.

In theory having redundant recursors will work around the issue with us needing to occasionally restart/move VMs but it will still make things a lot more fragile. In general I'd prefer to address any concerns with the recursors directly.

It sounds like your concern is about SPOFs and the reliability of those new VPSes. Is this related to the underlying reliability expectations of cloudvirts, the reliability of Cloud VPSes in general, or the possibility that those resolvers will end up running on the same cloudvirts and sharing fate?

Regardless, this sounds like a broader concern than just recursors, and one that may affect other sibling tasks of this one as well as other existing critical infrastructure (e.g. the MXes?), so we should find ways to address it anyway. Do you agree?

I don't think SMTP and DNS are on the same level, so perhaps it's not a fair comparison. No VM needs SMTP to work; a failure in the SMTP servers is not a disaster. Most stuff running on Cloud VPS will keep working even if the SMTP servers are down.
Moreover, a failure in an SMTP server wouldn't cause any trouble when fixing the SMTP server itself.

On the other hand, as you all know, DNS servers are a vital part of the infra. If the VPS resolvers run inside OpenStack and we have any issue with the resolvers themselves, we may have severe trouble trying to fix them, which could lead to extended downtime while rebuilding stuff, etc. Being unable to SSH to most VPS instances is probably the first symptom of a failure in the DNS servers, which becomes an additional problem when trying to fix any other issue (think of local puppetmasters, for example).

My concern is with the design approach: avoiding the chicken-and-egg problem. Also, I think cloudvirts are reliable enough to run anything, but we lack a key feature, which is live migration (right now, in order to move a VM from one virt server to another, we have to shut it down). BTW, we won't have live migration in the short term for other reasons.

I will follow up in the parent task with some general comments not related to this specific task.

I'm not sure why there would be a chicken-and-egg problem. Prod recursors run in prod, right? Why is this different?

Also, while I can see a recursor outage cascading into various random issues across the infrastructure, I'm unsure why such an issue would prevent one from ssh'ing to the recursors themselves and fixing them. Is there a specific issue you're thinking of? If an underlying hidden dependency like this exists, perhaps it needs to be addressed regardless.

while I can see a recursor outage cascading into various random issues across the infrastructure, I'm unsure why such an issue would prevent one from ssh'ing to the recursors themselves and fixing them. Is there a specific issue you're thinking of? If an underlying hidden dependency like this exists, perhaps it needs to be addressed regardless.

The most obvious one is that, to SSH in normally as your usual user, the instance needs to be able to do a DNS lookup for ldap-labs.(eqiad|codfw).wikimedia.org so it can find out whether you're in project-${::labsproject} and what the permitted SSH public keys for your username are. But I think you can get around that if you have access to SSH directly as root.

Another issue is that we typically ssh via a bastion -- if the bastion is unable to resolve the target host then the connection will fail.

Yeah, but for that one you can at least do the DNS lookup yourself.

Change 468714 merged by Andrew Bogott:
[operations/puppet@production] labs dnsrecursor: require clientlib before labsaliaser

https://gerrit.wikimedia.org/r/468714

LSobanski subscribed.

Removing SRE, please add the tag back when there are specific actions to be performed.

Declining per work done on T307357: Move cloud vps ns-recursor IPs to host/row-independent addressing

The recursor got a private address from the cloud realm, but lives on a hardware server.