Deploy durum: check service for Wikidough
Closed, Resolved (Public)

Description

durum (named after https://en.wikipedia.org/wiki/Durum) is the web application that will power check.wikimedia-dns.org, the check service for Wikidough.

This service will allow users to test if they have configured Wikidough in their stub resolver for DoH or DoT, and is similar to https://dnsleaktest.com and https://1.1.1.1/help. Since we cannot recommend external services for a project like Wikidough, we decided to design and deploy our own to solve the problem of finding out the recursor a user has configured. As of now, a user has no way of finding out if they have configured Wikidough, other than using the third-party websites above, or capturing their traffic, neither of which is ideal.
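
The task does not spell out how durum reaches its verdict. The following is only a minimal sketch, assuming the browser resolves a random per-visit name under check.wikimedia-dns.org and that the answer is the yes.check address when the query arrived through Wikidough and the no.check address otherwise. The two addresses are the ones listed in the zonefile output later in this task; `probe_name` and `interpret` are hypothetical helper names, not part of durum.

```python
import uuid

CHECK_DOMAIN = "check.wikimedia-dns.org"
YES_ADDR = "185.71.138.140"  # yes.check.wikimedia-dns.org, per the zonefile in this task
NO_ADDR = "185.71.138.141"   # no.check.wikimedia-dns.org, per the zonefile in this task


def probe_name() -> str:
    """Build a unique name so that resolver caches cannot answer the probe."""
    return f"{uuid.uuid4()}.{CHECK_DOMAIN}"


def interpret(resolved_addr: str) -> str:
    """Map the address the probe resolved to onto a verdict (assumed scheme)."""
    if resolved_addr == YES_ADDR:
        return "using Wikidough"
    if resolved_addr == NO_ADDR:
        return "not using Wikidough"
    return "unknown"
```

A random label per visit matters in this sketch because a cached answer from an earlier check would otherwise hide a change of resolver.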

This task tracks the development and deployment of durum, both in its web application form and the related Puppet configurations.

Event Timeline

Change 714998 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] durum: add role insetup

https://gerrit.wikimedia.org/r/714998

Change 714998 merged by Dzahn:

[operations/puppet@production] durum: add role insetup

https://gerrit.wikimedia.org/r/714998

Change 715001 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] install_server: add durum to partman, standard VM recipe

https://gerrit.wikimedia.org/r/715001

Change 715001 merged by Dzahn:

[operations/puppet@production] install_server: add durum to partman, standard VM recipe

https://gerrit.wikimedia.org/r/715001

Mentioned in SAL (#wikimedia-operations) [2021-08-26T12:21:21Z] <sukhe> running puppet initial run on durum1001.eqiad.wmnet - T289536

Change 715007 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] Add durum1001 to BGP anycast in eqiad

https://gerrit.wikimedia.org/r/715007

Change 715029 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] acme_chief: authorize durum1001 host for durum

https://gerrit.wikimedia.org/r/715029

Change 715038 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] durum: intial commit

https://gerrit.wikimedia.org/r/715038

Change 715007 merged by jenkins-bot:

[operations/homer/public@master] Add durum1001 to BGP anycast in eqiad

https://gerrit.wikimedia.org/r/715007

Change 715029 merged by Ssingh:

[operations/puppet@production] acme_chief: authorize durum1001 host for durum

https://gerrit.wikimedia.org/r/715029

Change 715038 merged by Ssingh:

[operations/puppet@production] durum: intial commit

https://gerrit.wikimedia.org/r/715038

Change 715260 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] durum: update results page and remove redundant code

https://gerrit.wikimedia.org/r/715260

Change 715260 merged by Ssingh:

[operations/puppet@production] durum: update results page and remove redundant code

https://gerrit.wikimedia.org/r/715260

Change 715285 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] durum: bikeshedding CSS fixes

https://gerrit.wikimedia.org/r/715285

Change 715285 merged by Ssingh:

[operations/puppet@production] durum: bikeshedding CSS fixes

https://gerrit.wikimedia.org/r/715285

Note that DNS PTRs are missing at least for:

185.71.138.139
185.71.138.141
185.71.138.140

Oh right, good catch, thanks! I will fix this today.
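
For anyone wanting to verify the PTRs, here is a small sketch of how the reverse-lookup names for the three addresses above can be derived with Python's standard `ipaddress` module. `ptr_name` is a hypothetical helper; the sketch only builds the names and performs no DNS query.

```python
import ipaddress


def ptr_name(ip: str) -> str:
    """Return the in-addr.arpa name under which the PTR record for `ip` lives."""
    return ipaddress.ip_address(ip).reverse_pointer


for ip in ("185.71.138.139", "185.71.138.141", "185.71.138.140"):
    print(ip, "->", ptr_name(ip))
# first line printed: 185.71.138.139 -> 139.138.71.185.in-addr.arpa
```

The resulting names can then be queried for PTR records with any DNS lookup tool.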

Change 715499 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] wikimedia-dns: update PTR records for durum

https://gerrit.wikimedia.org/r/715499

Change 715499 merged by Ssingh:

[operations/dns@master] wikimedia-dns: update PTR records for durum

https://gerrit.wikimedia.org/r/715499

Change 715561 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/software/knead-wikidough@master] test_dns: add tests for durum check service

https://gerrit.wikimedia.org/r/715561

Change 715561 merged by Ssingh:

[operations/software/knead-wikidough@master] test_dns: add tests for durum check service

https://gerrit.wikimedia.org/r/715561

Change 719538 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] durum: switch to client-side UUID generation

https://gerrit.wikimedia.org/r/719538

Change 719538 merged by Ssingh:

[operations/puppet@production] durum: switch to client-side UUID generation

https://gerrit.wikimedia.org/r/719538

Change 720368 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/software/knead-wikidough@master] test_dns: update test_durum() to test the web application

https://gerrit.wikimedia.org/r/720368

Change 720368 merged by Ssingh:

[operations/software/knead-wikidough@master] test_dns: update test_durum() to test the web application

https://gerrit.wikimedia.org/r/720368

Change 721018 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] Add durum hosts durum[123]00[12] to BGP anycast in eqiad, codfw, esams

https://gerrit.wikimedia.org/r/721018

Change 721022 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] acme_chief: update authorized_regexes for durum hosts

https://gerrit.wikimedia.org/r/721022

Change 721022 merged by Ssingh:

[operations/puppet@production] acme_chief: update authorized_regexes for durum hosts

https://gerrit.wikimedia.org/r/721022

Change 721021 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] site: update role for durum[12345]00[12]

https://gerrit.wikimedia.org/r/721021

Change 721018 merged by jenkins-bot:

[operations/homer/public@master] Add durum hosts durum[12345]00[12] to BGP anycast

https://gerrit.wikimedia.org/r/721018

Change 721021 merged by Ssingh:

[operations/puppet@production] site: update role for durum[12345]00[12]

https://gerrit.wikimedia.org/r/721021

ssingh claimed this task.
ssingh added a subscriber: Dzahn.

durum has been deployed and is now running on all our PoPs. Marking this as closed. Thanks to @Dzahn for helping create all the VMs!

Volans reopened this task as Open. Edited Sep 20 2021, 8:42 AM
Volans subscribed.

@ssingh I understand there was some issue with the DNS setup between Netbox automation and manual records. I'll try to shed some light here:

  • When an IP in Netbox has the DNS Name set, it will be used during the DNS zonefiles generation. As the durum records have the DNS Name set, the wikimedia-dns.org-global and 138.71.185.in-addr.arpa zonefiles are generated when running the sre.dns.netbox cookbook.

So we need to decide if this zonefile should be managed manually or dynamically. If there is no specific reason not to, the dynamic way is the preferred way.

  • If we want to keep this zone manually managed for some reason, we should empty the DNS Name field on each of those records and run the cookbook, leaving the description as is, since it correctly marks the records as manually managed.
  • If instead we want to use that data automatically, we can just add the $INCLUDE line in the manual templates/wikimedia-dns.org zonefile and remove the manually added records, which will be replaced by the dynamically included ones. We would then empty the description field in Netbox, as it would no longer be valid.

Feel free to ping me if you have any follow-up questions.

Thanks for the clarity, makes a lot of sense!

We can make this work in either direction, I think (manual or automatic for this handful of IPs/hostnames which occupy these two zones); it's mostly just a taste/design question at this point, with a couple of prickly points:

  • On the root entry for wikimedia-dns.org A 185.71.138.138 - Does the automation cover this correctly? I'm not certain if we have any existing cases of this nature, and the template output would be slightly different (the left hand name would need to be either the fully-spelled-out zone name with a terminal dot, as in wikimedia-dns.org., or one of @ or @Z or @F; the latter might be the most universally applicable here, since it would work for the "root" of a subdomain include file in other cases too, where the $INCLUDE is within a $ORIGIN).
  • The forward zone also has a special wildcarded geoip record which isn't being covered (yet) by netbox at all, which exists in the zonefile as: *.check 5/5 IN DYNA geoip!checkdoh-addrs. I'm kind of assuming this touches too many edge-cases (the DYNA format and data in general, the wildcard, and that the wildcard overlaps two explicit names which are in netbox as yes.check and no.check) to be worth adding new code with only one example case.
  • Style/clarity bikeshedding: given that the wildcard and the explicit yes/no records are part of a related scheme, having them exist together side-by-side in the zonefile makes a certain sense to me (I could go either way, I think, though), but if we leave them manual, we've removed two of the four cases that could've been automated in the forward zone (the others being the domain root from the first bullet point above and the record for "check" itself). (Edited: I initially forgot about the fourth case.)
  • There's also sort of a broader question here about (especially small) public domains for public-facing canonical names like this vs domains that tend to have larger record counts, are less-public-facing, and/or contain mostly entries for actual hosts, etc. Is our intent with the netbox IPAM +automation to mostly focus on the host-like cases, or also to eventually include every case where an IP address is assigned, even for these more canonical/static/public cases? There's a part of my brain that says "if we have IPAM, it should cover everything", but that does raise the specter of all the edge cases above, at least. Parts of that will probably always realistically not be worth it, e.g. understanding or generating the underlying geoip config that ties real IPs to the DYNA record. There's also just the complexity/risk tradeoffs of introducing more automation into a critical recordset which is low-volume and slow-changing, where the impacts of any mis-automation are high (even with the high quality the netbox automation has demonstrated to date!).
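
To make the zone-root question in the first bullet concrete, here is a minimal sketch of how a generator could pick the left-hand name for a record. It uses only the standard master-file shorthand "@" for the current origin; the @Z and @F variants mentioned above are not modelled, and `owner_name` is a hypothetical helper rather than anything in the existing automation.

```python
def owner_name(fqdn: str, origin: str) -> str:
    """Return the left-hand name for `fqdn` in a zonefile whose origin is `origin`."""
    fqdn = fqdn.rstrip(".").lower()
    origin = origin.rstrip(".").lower()
    if fqdn == origin:
        # The zone root itself: refer to it via the current origin.
        return "@"
    suffix = "." + origin
    if fqdn.endswith(suffix):
        # A name below the origin: emit it relative to the origin.
        return fqdn[: -len(suffix)]
    # A name outside the origin: spell it out with a terminal dot.
    return fqdn + "."
```

With an origin of wikimedia-dns.org, the name wikimedia-dns.org maps to "@" and yes.check.wikimedia-dns.org maps to "yes.check".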

I'm not really offering solutions or decisions here, this is more just an expansive-phase / questioning sort of response (sorry!)

  • On the root entry for wikimedia-dns.org A 185.71.138.138 - Does the automation cover this correctly? I'm not certain if we have any existing cases of this nature, and the template output would be slightly different (the left hand name would need to be either the fully-spelled-out zone name with a terminal dot, as in wikimedia-dns.org., or one of @ or @Z or @F; the latter might be the most universally applicable here, since it would work for the "root" of a subdomain include file in other cases too, where the $INCLUDE is within a $ORIGIN).

No, the automation doesn't currently generate the record for the zone itself. This is good and bad at the same time: it leaves us the flexibility to manage that record manually without interference, allowing us to $INCLUDE a file even outside of any $ORIGIN stanza. The bad part is that it leaves a manual record.
The only two forward zones we're covering with automation right now are wmnet and wikimedia.org (where we have a manually crafted @ and include Netbox-generated files).

  • The forward zone also has a special wildcarded geoip record which isn't being covered (yet) by netbox at all, which exists in the zonefile as: *.check 5/5 IN DYNA geoip!checkdoh-addrs. I'm kind of assuming this touches too many edge-cases (the DYNA format and data in general, the wildcard, and that the wildcard overlaps two explicit names which are in netbox as yes.check and no.check) to be worth adding new code with only one example case.

Absolutely, that's not an IP but a special syntax, and Netbox is not capable of managing anything more complex than basic IP <-> DNS name mappings. So that will stay manual anyway.

  • Style/clarity bikeshedding: given that the wildcard and the explicit yes/no records are part of a related scheme, having them exist together side-by-side in the zonefile makes a certain sense to me (I could go either way, I think, though), but if we leave them manual, we've removed two of the three cases that could've been automated in the forward zone (the other being the domain root from the first bullet point above).

The current contents of the automated zonefiles are:

$ cat wikimedia-dns.org-global
check                                    1H IN A 185.71.138.139
no.check                                 1H IN A 185.71.138.141
yes.check                                1H IN A 185.71.138.140

$ cat 138.71.185.in-addr.arpa
138 1H IN PTR wikimedia-dns.org.
139 1H IN PTR check.wikimedia-dns.org.
140 1H IN PTR yes.check.wikimedia-dns.org.
141 1H IN PTR no.check.wikimedia-dns.org.

Anything else will need to be manual in the zonefile.
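
As an illustration of what "pure IP <-> DNS name mapping" means in practice, here is a minimal sketch that renders records in the shape of the output above from a plain mapping. It is not the sre.dns.netbox cookbook's code; `MAPPINGS`, `forward_lines` and `reverse_lines` are hypothetical names, the column width is a guess, and the zone root is skipped in the forward file to mirror the behaviour described earlier in this comment.

```python
import ipaddress

ZONE = "wikimedia-dns.org"
TTL = "1H"

# IP -> fully qualified name, copied from the zonefile output in this task.
MAPPINGS = {
    "185.71.138.138": "wikimedia-dns.org",
    "185.71.138.139": "check.wikimedia-dns.org",
    "185.71.138.140": "yes.check.wikimedia-dns.org",
    "185.71.138.141": "no.check.wikimedia-dns.org",
}


def forward_lines(mappings, zone=ZONE, ttl=TTL):
    """Render A records relative to the zone, skipping the zone root."""
    suffix = "." + zone
    lines = []
    for ip, fqdn in mappings.items():
        if not fqdn.endswith(suffix):
            continue  # the zone root (or a foreign name) is left to the manual file
        label = fqdn[: -len(suffix)]
        lines.append(f"{label:<40} {ttl} IN A {ip}")
    return sorted(lines)


def reverse_lines(mappings, ttl=TTL):
    """Render PTR records for a /24 reverse zone, keyed by the last octet."""
    lines = []
    for ip in sorted(mappings, key=ipaddress.ip_address):
        last_octet = ip.rsplit(".", 1)[1]
        lines.append(f"{last_octet} {ttl} IN PTR {mappings[ip]}.")
    return lines


print("\n".join(forward_lines(MAPPINGS)))
print("\n".join(reverse_lines(MAPPINGS)))
```

Records such as the DYNA wildcard have no place in a mapping like this, which is why they would stay in the manual file either way.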

  • There's also sort of a broader question here about (especially small) public domains for public-facing canonical names like this vs domains that tend to have larger record counts, are less-public-facing, and/or contain mostly entries for actual hosts, etc. Is our intent with the netbox IPAM +automation to mostly focus on the host-like cases, or also to eventually include every case where an IP address is assigned, even for these more canonical/static/public cases? There's a part of my brain that says "if we have IPAM, it should cover everything", but that does raise the specter of all the edge cases above, at least. Parts of that will probably always realistically not be worth it, e.g. understanding or generating the underlying geoip config that ties real IPs to the DYNA record. There's also just the complexity/risk tradeoffs of introducing more automation into a critical recordset which is low-volume and slow-changing, where the impacts of any mis-automation are high (even with the high quality the netbox automation has demonstrated to date!).

I'd like @ayounsi and @cmooney to weigh in on the "IPAM should cover everything" part. I think we agreed that everything should be in Netbox (with the major exception of fr-tech so far) and that we'll generate as many records as possible from there.
That said, Netbox is not and will probably never be (from upstream comments) a DNS source of truth. We already have cases that are not well covered and that cause us confusion (see T270071). But in theory anything that is a pure IP <-> DNS record mapping should be manageable by the automation.
This should be the list of the actual exceptions:
https://netbox.wikimedia.org/ipam/ip-addresses/?q=keep%20manual
(as you can see, the wikimedia-dns ones are the exceptions, as they have both the comment and the DNS name; they should have one or the other).
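
Since this either/or rule is not enforced, a check along the following lines could flag offending records. This is only a sketch: `IPRecord` is a hypothetical stand-in for a Netbox IP address object (it does not use the Netbox API), and the "keep manual" marker is taken from the search query above.

```python
from dataclasses import dataclass

MANUAL_MARKER = "keep manual"  # text of the search query used above


@dataclass
class IPRecord:
    """Hypothetical stand-in for the relevant fields of a Netbox IP address."""
    address: str
    dns_name: str = ""
    description: str = ""


def classify(record: IPRecord) -> str:
    """Report how a record would be treated, flagging the ambiguous case."""
    has_name = bool(record.dns_name.strip())
    is_manual = MANUAL_MARKER in record.description.lower()
    if has_name and is_manual:
        return "conflict"   # both set: the state the durum records are in
    if has_name:
        return "automated"  # picked up during zonefile generation
    if is_manual:
        return "manual"     # record lives in the manually managed zonefile
    return "unset"


durum = IPRecord("185.71.138.139", "check.wikimedia-dns.org", "Keep manual DNS")
print(classify(durum))  # conflict
```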

That said, Netbox is not and will probably never be (from upstream comments) a DNS source of truth. We already have cases that are not well covered and that cause us confusion (see T270071). But in theory anything that is a pure IP <-> DNS record mapping should be manageable by the automation.

Thanks for clarifying this part. To me, this was indeed a source of confusion, especially after I did the wikimedia-dns.org change manually through operations/dns instead of Netbox (where it didn't work for reasons we have discussed, but that's beside the point since I wasn't even aware of it :).

At least in this case, and pending further comments on this topic, given that we can't handle the special DYNA record, that the IP addresses are anycasted and unlikely to change, and that the number of records is small (and won't grow), perhaps it does make sense to keep managing this manually and to run the cookbook to reflect that.

Thanks for all the background info here.

Regarding the use cases for the manual entries, yes, there are probably some (like wikimedia-dns.org) for which we could adjust the scripting to manage them directly from Netbox (associating that name with each Anycast IP object in Netbox, and working out a way to support records at a zone apex). But then there are others, like the DYNA entries, that are not so straightforward and that we can't represent using the built-in Netbox types. The same goes for MX, CNAME, TXT, SRV records, etc.

So I think we will always need some way to augment or override the Netbox data with "manual" entries. In my last place we came to a similar conclusion, building our zones largely from Netbox IP objects, with some additional YAML data (stored in Netbox as a config context) for a small number of edge cases.

Is our intent with the netbox IPAM +automation to mostly focus on the host-like cases, or also to eventually include every case where an IP address is assigned, even for these more canonical/static/public cases?

I think with these, even if there is no associated "host" or "interface" in Netbox, the IP address should be assigned there. In that case, if the DNS is just a regular A/AAAA record and associated reverse, my sense is to record the DNS entry in Netbox and allow it to generate the entries. Basically just use the manual repo for things that don't fit neatly into the Netbox schema.

If we wanted to spend time on it, we could maybe rework the scripts generating the entries from Netbox so that they read the data in the manual repo and produce a full, unified zone file (no INCLUDEs etc.). But I'm not sure it'd be worth the effort; what we have now seems fairly straightforward.

Change 722926 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] durum: add and set CSP headers for check.wikimedia-dns.org

https://gerrit.wikimedia.org/r/722926

Change 722926 merged by Ssingh:

[operations/puppet@production] durum: add and set CSP headers for check.wikimedia-dns.org

https://gerrit.wikimedia.org/r/722926

Note that a few of the durum IPs have both the "DNS name" field set, and "Keep manual DNS" as comment, which I think are mutually exclusive (but not enforced).
https://netbox.wikimedia.org/ipam/ip-addresses/?q=Keep+manual+DNS

Yes, this was mentioned above and was pending the decision on which way to go (manual zonefile or semi-automated) before fixing them. Any news on the decision?

Thanks to everyone for helping with the task. We just discussed this on IRC, but for those following along: we have decided to manage the records manually, and Netbox has been updated accordingly. Thanks to @Volans for taking care of that and for flagging this issue!