
Ensure we can survive a loss of labservices1001
Closed, Resolved (Public)

Description

Previously, an outage of labservices1001 caused several problems for Puppet in Labs and for CI. We need to test this scenario before T148506: Rack and setup new eqiad row D switch stack (EX4300/QFX5100), which is scheduled for the 26th of April, so that we know how to handle the outage.

Services:

  • Authoritative DNS
  • Recursive DNS
  • Designate sink (which registers DNS records for new instances)
  • Designate API (supports the Horizon DNS and Proxy UI)

Event Timeline

chasemp updated the task description.

@chasemp or @Andrew hi, I'm not sure if you already thought of this, but what about switching Labs services to the second labservices host, if there is a secondary one? I am unsure whether nodepool would be able to detect that the main labservices host is down and switch to the secondary. (Sorry if I am asking a question that has already been thought of.)

For the most part I'd expect DNS to fail over gracefully -- in the cases where it doesn't, that's misbehavior (or misconfig) on the part of the clients. That is https://phabricator.wikimedia.org/T119660.
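As a rough illustration of what "fail over gracefully" means on the client side, here is a minimal sketch of a resolver probe that tries each DNS server in order and moves on when one times out. It is not how any Labs client is actually implemented; the function names, the timeout value, and the injectable `exchange` transport are assumptions made for the example.

```python
import socket
import struct


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet for an A record (RFC 1035 layout)."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed by its length, terminated by 0.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    )
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def udp_exchange(server, packet, timeout):
    """Send one UDP query to port 53 and return the raw reply bytes."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (server, 53))
        data, _ = sock.recvfrom(512)
        return data


def first_responding_server(servers, hostname, timeout=2.0, exchange=udp_exchange):
    """Return the first server that answers the query, or None if all fail."""
    packet = build_query(hostname)
    for server in servers:
        try:
            reply = exchange(server, packet, timeout)
        except OSError:
            # Timeouts and unreachable hosts both land here; try the next one.
            continue
        # Accept the reply only if it echoes our query id.
        if reply[:2] == packet[:2]:
            return server
    return None
```

A client that only ever queries the first server in its list, or that waits a very long time before trying the next one, is the kind of misbehavior described above.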

Designate doesn't have live failover; switching requires a hiera change and Puppet runs on californium, labservices100* and possibly labcontrol.

Change 350239 had a related patch set uploaded (by Andrew Bogott):
[operations/puppet@production] Switch labservices1002 to the primary designate/dns server.

https://gerrit.wikimedia.org/r/350239

Change 350239 merged by Andrew Bogott:
[operations/puppet@production] Switch labservices1002 to the primary designate/dns server.

https://gerrit.wikimedia.org/r/350239

Change 350251 had a related patch set uploaded (by Andrew Bogott):
[operations/puppet@production] Switch labservices1002 to the primary designate/dns server.

https://gerrit.wikimedia.org/r/350251

Change 350251 merged by Andrew Bogott:
[operations/puppet@production] Switch labservices1002 to the primary designate/dns server.

https://gerrit.wikimedia.org/r/350251

The answer to the question:

Loss of a labservices node does not seem to cause any DNS outages. New instances are unable to launch properly, though, because firstboot has labvirt1001 hardcoded as the primary DNS resolver.
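One way to spot this class of problem is to check whether an instance's resolver configuration lists more than one distinct nameserver. The sketch below assumes the hardcoded resolver ends up in a standard `/etc/resolv.conf`, which this task does not state; the function names and the addresses in the usage example are hypothetical.

```python
def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # resolv.conf treats lines starting with '#' or ';' as comments.
        if not stripped or stripped[0] in "#;":
            continue
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def has_resolver_fallback(resolv_conf_text):
    """True if at least two distinct nameservers are configured."""
    return len(set(parse_nameservers(resolv_conf_text))) >= 2
```

For example, `has_resolver_fallback("nameserver 192.0.2.10\n")` returns `False`, which flags a configuration that depends on a single resolver.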

Andrew subscribed.

The particular maintenance that prompted this task is now complete. We still need to improve here, though.

FYI, I opened a new task to discuss and perform a similar maintenance (T172459).

Andrew claimed this task.

We have, unfortunately, demonstrated that we can live for hours without this box without suffering anything serious.