How can labs-ns0 live on labcontrol1001?
Closed, Resolved, Public

Description

labs-ns0 was on virt1000. virt1000 is being retired in favor of labcontrol1001, but the labs-ns0 IP is in a different subnet from labcontrol1001, so it can't simply be moved there.

This is further complicated by the fact that virt1000 is slated to become a hot spare for labcontrol1001; this suggests that we need all three IPs (labcontrol1001, its spare, and labs-ns0) in the same subnet.

Event Timeline

Andrew raised the priority of this task to Needs Triage.
Andrew updated the task description.
Andrew added projects: Cloud-Services, Cloud-VPS.
Andrew added a subscriber: Andrew.

If you want your hot spares in separate rows, the correct solution is probably not to allocate the service IP from a specific row's subnet, but instead from our very limited supply of out-of-subnet space, and then route it to the correct host. That disassociates the IP from any fixed notion of a specific host, row, or subnet, and gives you the freedom to reassign it to any host in the datacenter as appropriate. This is what we do for the primary nameservers for the production domains, and if we did that for labs-ns[01] they would probably be in the Misc subnet at the bottom of the two zonefiles.
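To illustrate the idea of routing an out-of-subnet address to a host, here is a minimal Python sketch of longest-prefix-match route selection. Everything in it is an assumption for illustration: the addresses come from the RFC 5737 documentation ranges rather than the real allocations, and the next-hop names and the `next_hop` and `reassign` helpers are hypothetical. It does not describe how the production routers are actually configured.

```python
import ipaddress

# Hypothetical routing table as (prefix, next hop) pairs. The addresses are
# RFC 5737 documentation ranges, not real allocations.
ROUTES = [
    (ipaddress.ip_network("192.0.2.0/24"), "row-a-gateway"),
    (ipaddress.ip_network("198.51.100.0/24"), "row-b-gateway"),
    # Aggregate covering the (hypothetical) out-of-subnet space.
    (ipaddress.ip_network("203.0.113.0/24"), "discard"),
    # Host route pinning one service address to a specific machine.
    (ipaddress.ip_network("203.0.113.53/32"), "labcontrol1001"),
]


def next_hop(address, routes):
    """Return the next hop of the most specific route matching address."""
    addr = ipaddress.ip_address(address)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None
    # Longest prefix wins, so the /32 host route beats the /24 aggregate.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


def reassign(routes, prefix, new_hop):
    """Return a copy of routes with the given prefix pointed at new_hop."""
    target = ipaddress.ip_network(prefix)
    return [(net, new_hop if net == target else hop) for net, hop in routes]
```

In this sketch, moving the service to a spare is a single route change, for example `reassign(ROUTES, "203.0.113.53/32", "labcontrol1002")`, and no host has to be renumbered, regardless of which row the spare sits in.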

A far simpler option (avoids allocating precious out-of-subnet space and dealing with special routing) would be to simply put both spares in the same subnet/row within a DC, or to not have local spares at all, seeing as you already have cross-site redundancy with labs-ns0 in eqiad and labs-ns1 in codfw.

Keeping things in separate rows is probably not important, as Labs is already critically vulnerable to a failure of that row and pretty much always will be.

Also, you're right that we have a spare for labs-dns in codfw, but we nevertheless need a spare for all the other services on labcontrol1001. So, I propose:

  1. We move labs-ns0 to a new IP that's in the same subnet as labcontrol1001.
  2. When the virt1000 hardware is renamed to labcontrol1002, it is also physically moved to the same rack as labcontrol1001 and given an IP in the same subnet.
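The end state of the two steps above is that all three addresses fall within one subnet. A minimal Python sketch of that check follows. The subnet, the addresses (RFC 5737 documentation ranges), and the `same_subnet` helper are assumptions for illustration, not the real values.

```python
import ipaddress


def same_subnet(subnet, addresses):
    """Return True if every address lies inside the given subnet."""
    net = ipaddress.ip_network(subnet)
    return all(ipaddress.ip_address(addr) in net for addr in addresses)


# Hypothetical values: a placeholder subnet and placeholder addresses.
SUBNET = "192.0.2.0/27"
PROPOSED = {
    "labcontrol1001": "192.0.2.10",
    "labcontrol1002": "192.0.2.11",
    "labs-ns0": "192.0.2.12",
}
# Placeholder for the current labs-ns0 address, which sits in another subnet.
CURRENT_LABS_NS0 = "198.51.100.20"

proposed_ok = same_subnet(SUBNET, PROPOSED.values())
current_ok = same_subnet(SUBNET, [PROPOSED["labcontrol1001"], CURRENT_LABS_NS0])
```

With these placeholder values `proposed_ok` is true and `current_ok` is false, which mirrors the problem in the task description: the service address can only live on a host whose subnet contains it.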

I'm not entirely sure how to do step 1 without a prolonged outage of labs-ns0, but we have a fallback in labs-ns1...

@BBlack, can you confirm that T102411 and https://gerrit.wikimedia.org/r/#/c/218139/ are an appropriate fix for this? thx!