
Fix internal TLD in use in codfw1dev
Closed, Resolved, Public

Description

As discussed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/566735, currently hostnames in this deployment look like:
$vmname.$project.codfw1dev.cloud
There should be a wikimedia label between codfw1dev and cloud.
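The following minimal Python sketch makes the naming change concrete. It takes a hostname in the current `$vmname.$project.codfw1dev.cloud` form and inserts the missing `wikimedia` label. The helper name `add_wikimedia_label` is hypothetical. It only illustrates the string transformation and is not part of any Puppet or OpenStack code.

```python
def add_wikimedia_label(fqdn: str, deployment: str = "codfw1dev") -> str:
    """Rewrite $vmname.$project.<deployment>.cloud as
    $vmname.$project.<deployment>.wikimedia.cloud (hypothetical helper)."""
    # Remember whether the name was absolute (trailing root dot) and strip it.
    absolute = fqdn.endswith(".")
    name = fqdn.rstrip(".")
    old_suffix = f".{deployment}.cloud"
    new_suffix = f".{deployment}.wikimedia.cloud"
    if name.endswith(new_suffix):
        result = name  # already in the new domain, nothing to do
    elif name.endswith(old_suffix):
        result = name[: -len(old_suffix)] + new_suffix
    else:
        raise ValueError(f"{fqdn!r} is not under {deployment}.cloud")
    # Restore the trailing dot for absolute names.
    return result + "." if absolute else result
```

For example, `add_wikimedia_label("arturo-new-domain-test.admin.codfw1dev.cloud")` returns `"arturo-new-domain-test.admin.codfw1dev.wikimedia.cloud"`.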

Event Timeline

I had a look at this today.

The Puppet change seems rather simple, just a couple of hiera keys:

profile::openstack::codfw1dev::designate::domain_id_internal_forward: '8d67f102-5984-4e4d-a15a-e94ccd6d9eaf'
profile::openstack::codfw1dev::nova::dhcp_domain: 'codfw1dev.cloud'

The current domain in use is:

root@cloudcontrol2001-dev:~# designate domain-get 8d67f102-5984-4e4d-a15a-e94ccd6d9eaf --all-tenants
+-------------+--------------------------------------+
| Field       | Value                                |
+-------------+--------------------------------------+
| description | None                                 |
| created_at  | 2019-09-06T20:06:14.000000           |
| updated_at  | 2020-01-14T00:08:24.000000           |
| email       | abogott@wikimedia.org                |
| ttl         | 3600                                 |
| serial      | 1578960500                           |
| id          | 8d67f102-5984-4e4d-a15a-e94ccd6d9eaf |
| name        | codfw1dev.cloud.                     |
+-------------+--------------------------------------+
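The two hiera keys above must describe the same domain. `domain_id_internal_forward` points at the Designate zone shown here, whose name is absolute (`codfw1dev.cloud.`, with a trailing dot), while `dhcp_domain` is written without the trailing dot. A minimal sketch of a consistency check, assuming you already have both strings at hand (the helper names are hypothetical and nothing here queries Designate):

```python
def normalise(name: str) -> str:
    """Lower-case a DNS name and drop the trailing root dot."""
    return name.lower().rstrip(".")


def dhcp_domain_matches_zone(dhcp_domain: str, zone_name: str) -> bool:
    """True if nova's dhcp_domain value names the same domain as the
    Designate zone referenced by domain_id_internal_forward."""
    return normalise(dhcp_domain) == normalise(zone_name)
```

With the values above, `dhcp_domain_matches_zone("codfw1dev.cloud", "codfw1dev.cloud.")` is `True`. Changing only one of the two keys to the new domain would make it `False`.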

The change could simply introduce the new domain, but @Andrew told me that we would need a special hack to create the new domain so that it belongs to the noauth tenant.
I decided that I didn't want this special hack. Instead, the new codfw1dev.wikimedia.cloud domain should belong to the admin tenant (or wmflabsdotorg, or whatever), with no special hacks.
Then @Andrew commented that we would need some updates to designate-sink for it to work with a zone that belongs to a project rather than to noauth.

Since this is something we would eventually do to eqiad1 as well, my proposal is to:

  • identify what changes are required for designate-sink to work with zones that belong to proper projects rather than noauth.
  • evaluate zones with noauth and consider moving them to a proper project. Some options:
    • use admin project to hold all base domains. May not be read by novaobserver, so this option may be wrong.
    • use wmflabsdotorg, which makes me nervous because it holds more than just the wmflabs.org domain which eventually will go away anyway.
    • create a new infra project to hold base domain names. Naming things is hard, but perhaps something like the dns-infra project can hold all the new base domains <deployment>.wikimedia.cloud and .wmcloud.org.
  • evaluate what it would take to implement these changes (especially downtime and user impact). Do them in codfw1dev first.
aborrero triaged this task as Medium priority. Jan 24 2020, 6:09 PM
aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

I've re-read the code a bit, and it's not obvious to me that we need to specify a tenant to sink (outside of the implicit tenant associated with the zone id from the config). So for starters let's just try swapping in a tenant-owned zone and see if everything just works.
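A simplified model helps show what a sink handler needs to know for each instance. This sketch assumes the handler creates an A record for `<vm>.<project>.<zone>`, a shorter `<vm>.<zone>` A record, and a PTR record pointing back at the full name. Those are the records observed for a test VM in this task. `expected_records` is a hypothetical function, not designate-sink's actual handler code. In this model, the only tenant-bound input is the zone itself, which is consistent with the reading above that sink only needs the zone id from its config.

```python
import ipaddress


def expected_records(vmname: str, project: str, ip: str, zone: str) -> set:
    """Return the (name, type, value) records one instance is expected to get.

    Hypothetical model of the lookups seen in this task; names are absolute.
    """
    zone = zone.rstrip(".")
    fqdn = f"{vmname}.{project}.{zone}."
    short = f"{vmname}.{zone}."
    # ipaddress builds the in-addr.arpa name, e.g. 12.128.16.172.in-addr.arpa
    ptr = ipaddress.ip_address(ip).reverse_pointer + "."
    return {
        (fqdn, "A", ip),
        (short, "A", ip),
        (ptr, "PTR", fqdn),
    }
```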

> • evaluate zones with noauth and consider moving them to a proper project. Some options:
>   • use admin project to hold all base domains. May not be read by novaobserver, so this option may be wrong.
>   • use wmflabsdotorg, which makes me nervous because it holds more than just the wmflabs.org domain which eventually will go away anyway.
>   • create a new infra project to hold base domain names. Naming things is hard, but perhaps something like the dns-infra project can hold all the new base domains <deployment>.wikimedia.cloud and .wmcloud.org.

I agree, let's not use the admin tenant. IIRC it already has some special meaning, and novaobserver would not work there (unless we fixed novaobserver to work in that tenant; I don't recall why it doesn't...).

Let's not expand the wmflabsdotorg tenant either.
We could use cloudinfra, which already exists, is used for things with a cloud-wide impact, and isn't a special case at the OpenStack technical level. Alternatively, yes, we could make a new one and transfer things over if necessary; dnsinfra (without the hyphen) sounds fine.

Mentioned in SAL (#wikimedia-cloud) [2020-01-27T12:44:31Z] <arturo> [codfw1dev] root@cloudcontrol2001-dev:~# openstack zone create --description "main DNS domain for VMs" --email "root@wmflabs.org" --type PRIMARY --ttl 3600 codfw1dev.wikimedia.cloud. T243556

Mentioned in SAL (#wikimedia-cloud) [2020-01-27T12:45:49Z] <arturo> [codfw1dev] manually move the new domain to the cloudinfra-codfw1dev project clouddb2001-dev: [designate]> update zones set tenant_id='cloudinfra-codfw1dev' where id = '4c75410017904858a5839de93c9e8b3d'; T243556

Change 567453 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/dns@master] cloud: introduce delegation for codfw1dev.wikimedia.cloud

https://gerrit.wikimedia.org/r/567453

Change 567454 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: codfw1dev: fix .cloud domain

https://gerrit.wikimedia.org/r/567454

Change 567453 merged by Arturo Borrero Gonzalez:
[operations/dns@master] cloud: introduce delegation for codfw1dev.wikimedia.cloud

https://gerrit.wikimedia.org/r/567453

Change 567454 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] wmcs: codfw1dev: fix .cloud domain

https://gerrit.wikimedia.org/r/567454

Ok, I got to this point:

  • the new domain codfw1dev.wikimedia.cloud has been delegated to designate @ cloudservices2002-dev.wikimedia.org. I did this by means of the proposal at T243766 (ns0.openstack.codfw1dev.wikimediacloud.org)
  • the new domain belongs to the cloudinfra-codfw1dev project
  • with this patch https://gerrit.wikimedia.org/r/567454 openstack is instructed to issue FQDNs in the new domain for new VMs
  • per-instance record creation/deletion seems to work:
root@arturo-new-domain-test:~# host 172.16.128.12
12.128.16.172.in-addr.arpa domain name pointer arturo-new-domain-test.admin.codfw1dev.wikimedia.cloud.
root@arturo-new-domain-test:~# host arturo-new-domain-test.admin.codfw1dev.wikimedia.cloud.
arturo-new-domain-test.admin.codfw1dev.wikimedia.cloud has address 172.16.128.12
root@arturo-new-domain-test:~# host arturo-new-domain-test.codfw1dev.wikimedia.cloud
arturo-new-domain-test.codfw1dev.wikimedia.cloud has address 172.16.128.12
  • what doesn't work is puppet certs. Somehow the VM is requesting a certificate for the wrong FQDN:
root@labtestpuppetmaster2001:~# puppet cert list
  "arturo-new-domain-test.admin.wikimedia.cloud"   (SHA256) 29:CC:18:B3:86:29:2B:1F:D5:6A:AE:67:C4:97:C9:40:2D:EC:DD:F7:F2:F3:34:C0:46:87:25:EC:D6:FA:87:10
  ^^^ NOTE missing .codfw1dev. in there ^^^
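To catch this class of problem early, you can scan the pending requests for certnames outside the expected domain. This is a minimal sketch that assumes the pending-request format shown above: a quoted certname followed by `(SHA256)` and a fingerprint. `misnamed_requests` is a hypothetical helper. It only parses text and does not talk to the puppetmaster. Signed entries in `--all` listings use a different prefix and are not matched.

```python
import re

# Matches a pending-request line such as:
#   "name"   (SHA256) 29:CC:18:...
CERT_LINE = re.compile(r'^\s*"(?P<certname>[^"]+)"\s+\(SHA256\)')


def misnamed_requests(cert_list_output: str, expected_suffix: str) -> list:
    """Return pending certnames that are not under expected_suffix."""
    suffix = "." + expected_suffix.strip(".")
    bad = []
    for line in cert_list_output.splitlines():
        match = CERT_LINE.match(line)
        if match and not match.group("certname").endswith(suffix):
            bad.append(match.group("certname"))
    return bad
```

Fed the listing above with `expected_suffix="codfw1dev.wikimedia.cloud"`, it reports `arturo-new-domain-test.admin.wikimedia.cloud`.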

Change 568473 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] CloudVPS: set domain for VM instances using nova-api metadata

https://gerrit.wikimedia.org/r/568473

Change 568493 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: puppet-enc: allow hostnames from the new domain

https://gerrit.wikimedia.org/r/568493

Change 568493 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: puppet-enc: allow hostnames from the new domain

https://gerrit.wikimedia.org/r/568493

Change 568473 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] CloudVPS: set domain for VM instances using nova-api metadata

https://gerrit.wikimedia.org/r/568473
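Change 568473 makes VMs take their domain from nova-api metadata. The patch itself is not reproduced here, so the following is only a sketch of the idea. It assumes the instance reads `meta_data.json` from the OpenStack metadata service (normally at `http://169.254.169.254/openstack/latest/meta_data.json`) and that its `hostname` field carries the instance FQDN. `domain_from_metadata` is a hypothetical name. The function takes the JSON as a string so the example runs offline.

```python
import json


def domain_from_metadata(meta_data_json: str) -> str:
    """Derive the DNS domain (everything after the first label) from the
    'hostname' field of an OpenStack meta_data.json document."""
    meta = json.loads(meta_data_json)
    hostname = meta["hostname"].rstrip(".")
    _, sep, domain = hostname.partition(".")
    if not sep or not domain:
        # A bare short name gives no domain to configure.
        raise ValueError(f"hostname {hostname!r} carries no domain")
    return domain
```

For a VM named like the test instance above, the derived domain would be `admin.codfw1dev.wikimedia.cloud`, that is, the project label plus the new deployment domain.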

Change 568535 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] hieradata: openstack: introduce basic keys for codfw1dev

https://gerrit.wikimedia.org/r/568535

Change 568535 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] hieradata: openstack: introduce basic keys for codfw1dev

https://gerrit.wikimedia.org/r/568535

Things are far better now:

  • create a VM
  • accept the puppet cert in the puppetmaster (labtestpuppetmaster2001)
  • do the first puppet run
  • ssh to the VM!

Basically, this is solved. The process is very similar to what we have in eqiad1. Auto-signing, if appropriate, would be another improvement, but I'll leave that for another Phabricator task.
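Auto-signing is left for another task, but one possible direction is Puppet's policy-based autosigning. In that mode, the `autosign` setting points at an executable that receives the certname as its argument and the CSR on stdin, and signs the request only if the executable exits 0. The sketch below is hypothetical. The allowed suffix and the `<vm>.<project>` naming rule are assumptions taken from the hostnames in this task. Note that a name check alone does not prove the request really came from that VM.

```python
import sys

# Assumption: only VMs in this deployment's new domain are eligible.
ALLOWED_SUFFIX = ".codfw1dev.wikimedia.cloud"


def should_autosign(certname: str, allowed_suffix: str = ALLOWED_SUFFIX) -> bool:
    """Approve only certnames shaped like <vm>.<project><allowed_suffix>."""
    if not certname.endswith(allowed_suffix):
        return False
    labels = certname[: -len(allowed_suffix)].split(".")
    # Expect exactly two non-empty labels: <vm> and <project>.
    return len(labels) == 2 and all(labels)


if __name__ == "__main__" and len(sys.argv) == 2:
    # Policy executable mode: certname is argv[1]; the CSR on stdin is ignored
    # here. Exit status 0 means "sign", anything else means "don't sign".
    sys.exit(0 if should_autosign(sys.argv[1]) else 1)
```

Under this rule, the misnamed request seen earlier (`arturo-new-domain-test.admin.wikimedia.cloud`) would be rejected, while `arturo-new-domain-test.admin.codfw1dev.wikimedia.cloud` would be signed.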

aborrero claimed this task.

Closing task now, feel free to reopen if required.