
Enable DNSSEC validation in Wikidough
Closed, Resolved · Public

Description

Wikidough currently does not perform any DNSSEC validation through pdns-recursor (the dnssec configuration option is set to off). To be good internet citizens, we should enable DNSSEC for Wikidough by fetching and validating DNSSEC signatures, just like other major resolver services. However, there are a few concerns we should address, and we should also think about the level of DNSSEC validation we want to support.

Even though Wikidough provides confidentiality and may further check for integrity of the DNS records from auth servers by validating the DNSSEC signatures, clients like Firefox do not perform end-to-end validation and therefore rely on the TRR (trusted recursive resolver; Wikidough) to validate the records for them. Cloudflare as the TRR in Firefox does the same so while this seems to be the standard and clients are already trusting their TRRs, it is important to note that there is currently no end-to-end validation of these records in the browsers themselves unless external extensions/add-ons are used.

As per https://support.mozilla.org/en-US/kb/dns-over-https-doh-faqs#w_do-you-validate-dnssec, if Wikidough (or some other TRR) returns a SERVFAIL response in case it fails to validate the DNSSEC signature due to a misconfigured DNSSEC domain (see below), Firefox falls back to the native resolver to complete the request. (This behaviour is defined by the network.trr.mode preference, set to 2 by default. If set to 3, Firefox will only use TRR and will never use the native resolver. See https://wiki.mozilla.org/Trusted_Recursive_Resolver for more details.) The user is not made aware of this transition from TRR to the native resolver, so while the user may think that they are resolving domains through Wikidough, in such a case when Firefox gets a SERVFAIL, the domain name resolution may happen through the native resolver instead, which may not be desired as the native resolver may leak the query in plain text.
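The fallback described above can be sketched as a small decision function. This is only a rough model of the behaviour as stated in this task, not Firefox code; the function name `trr_resolver_used` and its output labels are made up for illustration, and only the two modes and two response codes mentioned here are covered.

```shell
# Hypothetical model of the fallback described above; this is not Firefox code.
# Usage: trr_resolver_used MODE RCODE
#   MODE  - value of network.trr.mode (only 2 and 3 are modelled)
#   RCODE - response code returned by the TRR (only NOERROR and SERVFAIL are modelled)
trr_resolver_used() {
    mode=$1
    rcode=$2
    case "$mode:$rcode" in
        2:NOERROR|3:NOERROR) echo "trr" ;;        # the TRR answered, nothing else happens
        2:SERVFAIL)          echo "native" ;;     # silent fallback to the native resolver
        3:SERVFAIL)          echo "fail" ;;       # TRR-only: the lookup fails, no fallback
        *)                   echo "unmodelled" ;; # anything this sketch does not cover
    esac
}
```

For example, `trr_resolver_used 2 SERVFAIL` prints `native`, which is the case where the query may leave the encrypted channel without the user noticing.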

Given that outages due to misconfigured DNSSEC domains are all too common (see https://ianix.com/pub/dnssec-outages.html for a list) and that Firefox will default to the native resolver in case it gets a SERVFAIL response, we should not enable strict validation in pdns-recursor, where all queries are validated regardless of the client's intention to validate, and a SERVFAIL response is returned in case of an incorrect validation. Firefox has no way of distinguishing between a SERVFAIL response that resulted from a misconfigured auth server or from an actual bogus response.

To start with, we should enable DNSSEC for Wikidough as an experiment. pdns-recursor has a log-fail mode (see https://docs.powerdns.com/recursor/dnssec.html), in which it validates all DNSSEC data it retrieves from authoritative servers and logs the validation result, irrespective of whether a client like Firefox asked for it, but it doesn't send a SERVFAIL response (or the AD bit) unless the client set the AD and/or DO bits. This allows us to experiment with DNSSEC validation to measure what percentage of validations actually fail while not affecting the experience for Firefox/Android users. The other option is "full-blown validation", the validate mode in pdns-recursor, which validates all queries and responds with a SERVFAIL irrespective of whether the client requested the DNSSEC records and/or the validation.

To summarize:

  • If we set log-fail, we will always perform validation and log the result, and send SERVFAIL in case of an invalid response iff the client set +AD or +DO. Since Firefox does not set these bits, nothing changes for its users. Users who care about DNSSEC can set the bits and let Wikidough perform validation for them.
  • If we enable strict validation, we will always perform validation and send SERVFAIL in case of an invalid response, irrespective of the client's request. But in case of misconfigured auth servers, users on Firefox will get SERVFAIL responses and then look up the domain using their native resolver.
    • Or, we can enable strict validation and ask users to change the network.trr.mode preference to 3. This makes their entire experience more secure even outside of the DNSSEC issue as in this case if TRR lookup fails, the complete lookup fails with no fallback, but requiring special configurations may not scale to all users.
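The difference between the two modes summarized above can be written down as a tiny decision function. This is a sketch of the behaviour as described in this task, not of pdns-recursor internals; the function name `recursor_reply` and its argument values are hypothetical.

```shell
# Hypothetical model of the summary above; not pdns-recursor code.
# Usage: recursor_reply MODE CLIENT_BITS RESULT
#   MODE        - "log-fail" or "validate"
#   CLIENT_BITS - "yes" if the client set +AD or +DO, otherwise "no"
#   RESULT      - "secure" or "bogus" (outcome of the validation)
recursor_reply() {
    mode=$1
    client_bits=$2
    result=$3
    if [ "$result" != "bogus" ]; then
        echo "answer"              # validation passed, the records are returned
        return
    fi
    case "$mode" in
        validate)
            echo "SERVFAIL" ;;     # strict: bogus data is never handed out
        log-fail)
            if [ "$client_bits" = "yes" ]; then
                echo "SERVFAIL"    # the client asked, so the failure is surfaced
            else
                echo "answer"      # failure is only logged, the client still gets data
            fi ;;
        *)
            echo "unmodelled" ;;
    esac
}
```

`recursor_reply log-fail no bogus` prints `answer`, which is the Firefox case: the user still gets a response even though validation failed.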

The majority of Wikidough users will be on Firefox and Chrome (once it has proper DoH support), so log-fail is an acceptable compromise for now; more advanced users who run their own stub resolvers can set the bits accordingly, in which case Wikidough will respond with the validation data. Depending on how this experiment goes, we can switch from log-fail to complete validation, or switch to process-no-validate, in which we send the DNSSEC RRSIGs in the response but do not perform any validation and don't set the AD bit.

Or we can continue to keep DNSSEC disabled!

Event Timeline


Given that outages due to misconfigured DNSSEC domains are all too common (see https://ianix.com/pub/dnssec-outages.html for a list)

I'm not sure I would agree that they are "all too common". The list referenced shows failures in fewer than 30 "Major sites"*. Further, the TLDs shown, other than ru and pl, seem pretty small.

Firefox has no way of distinguishing between a SERVFAIL response that resulted from a misconfigured auth server or from an actual bogus response.

The client can ask the question again with +CD to rule out validation, or to try to perform it themselves; however, I doubt FF/Chrome do this.

unless the client set the AD and/or DO bits

The DO bit should have no effect on whether the recursor performs DNSSEC validation or not; it just determines whether the recursor should return the DNSSEC records (i.e. the RRSIG). The CD bit is what is used to request no validation and is mostly used for debugging (I doubt FF ever sets it). Check the response of the following to see the various differences (+[no]dnssec toggles the DO bit):

dig +cd  +nodnssec ns dnssec-failed.org @1.1.1.1   
dig +cd  +dnssec ns dnssec-failed.org @1.1.1.1   
dig  +nodnssec ns dnssec-failed.org @1.1.1.1   
dig  +dnssec ns dnssec-failed.org @1.1.1.1
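To compare the four invocations above side by side, a small helper that pulls the response code and header flags out of dig's output may be handy. This is only a sketch: the function names `dig_summary` and `compare_cd_and_do` are made up, and the parsing assumes dig's usual `->>HEADER<<-` and `;; flags:` header lines.

```shell
# Print "status=<RCODE> flags=<header flags>" from dig output read on stdin.
dig_summary() {
    awk '
        /->>HEADER<<-/ { sub(/.*status: /, ""); sub(/,.*/, ""); status = $0 }
        /^;; flags:/   { sub(/^;; flags: */, ""); sub(/;.*/, ""); flags = $0 }
        END            { printf "status=%s flags=%s\n", status, flags }
    '
}

# Run the four combinations from the comment above (needs network access).
compare_cd_and_do() {
    for opts in "+cd +nodnssec" "+cd +dnssec" "+nodnssec" "+dnssec"; do
        printf '%-16s ' "$opts"
        # $opts is intentionally unquoted so that it splits into separate dig options
        dig $opts ns dnssec-failed.org @1.1.1.1 | dig_summary
    done
}
```

Calling `compare_cd_and_do` prints one line per combination, which makes it easier to see which bit changes the response code and which only changes the records returned.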

irrespective of whether a client like Firefox asked for it [DNSSEC validation]

I'm not sure that a client (as specified in the RFCs) can explicitly ask for DNSSEC validation. The AD bit asks resolvers to also set the AD bit in responses if validation is successful, but the lack of it in a query does not indicate no DNSSEC validation.

That said, in pdns-recursor with log-fail, the lack of the AD bit means that the recursor will validate and, on failure, will return the unvalidated answer with the AD bit unset. This is different from Cloudflare, which will respond with SERVFAIL in both cases. PDNS with log-fail ultimately leaves the client with no way of knowing whether the query is protected by DNSSEC or not.

dig +ad  ns dnssec-failed.org @1.1.1.1  
dig +noad  ns dnssec-failed.org @1.1.1.1

unless the client set the AD and/or DO bits

Do we know what Chrome/FF set on queries? If they set either of these bits (and I can see how AD could be useful), this conversation is a bit redundant.

This allows us to experiment with DNSSEC validation to measure what percentage of validations actually fail while not affecting the experience for Firefox/Android users.

I think that the log-fail config in pdns is very much an experimentation parameter to ease deployment, and if we use it we should make it clear that we are using it, as it is a potential downgrade to user security. We have no way of knowing whether a user has enabled network.trr.mode: 3 or what the local resolver would do if queries fall back to it, i.e. the local resolver may also return SERVFAIL but log the transaction.

So in brief, yes, it allows us to experiment; however, it would not be in line with a user's expectation of how DNSSEC should work/fail and could give a false sense of security in the event of Wikidough or an upstream domain being subject to some type of attack.

If we set log-fail

Although this is useful for us and would potentially reduce user complaints, it also hides issues from the user and prevents them from making their own local decisions on how to handle DNSSEC failures.

If we enable strict validation

From my testing, this seems to be how CF is configured. I also think that the expectation of users (at least mine) is that a public DoH server would be performing DNSSEC validation and as such should respond with SERVFAIL for domains which fail validation.

I have also checked Google (8.8.8.8), Quad9 (9.9.9.9), Verisign (64.6.64.6) and Cisco (208.67.220.220), all of which seem to respond with SERVFAIL on DNSSEC failures regardless of whether the AD/DO bit is set or not. As such, my vote would be to use validate.

*I'm not sure how these were classed as major sites, but eyeballing the list I count somewhere between 3 and 10 "major sites". Ironically, the first site I tried, nohats.ca (Paul Wouters' site), is currently failing HSTS.

unless the client set the AD and/or DO bits

Do we know what Chrome/FF set on queries?

I'm really not familiar with the FF/Chrome code, but this looks like a no:
FF: https://searchfox.org/mozilla-central/source/netwerk/dns/TRR.cpp#74
Chrome: https://github.com/chromium/chromium/blob/master/net/dns/dns_query.cc#L116

Given that outages due to misconfigured DNSSEC domains are all too common (see https://ianix.com/pub/dnssec-outages.html for a list)

I'm not sure I would agree that they are "all too common". The list referenced shows failures in fewer than 30 "Major sites"*. Further, the TLDs shown, other than ru and pl, seem pretty small.

Firefox has no way of distinguishing between a SERVFAIL response that resulted from a misconfigured auth server or from an actual bogus response.

The client can ask the question again with +CD to rule out validation, or to try to perform it themselves; however, I doubt FF/Chrome do this.

Firefox and Chrome do not perform any validation by themselves, except through external extensions. Also, at least in the case of Firefox (given that's our target browser for now as it allows users to specify custom DoH providers), https://support.mozilla.org/en-US/kb/dns-over-https-doh-faqs#w_do-you-validate-dnssec indicates that they leave the validation to the trusted recursive resolver. (Also confirmed by the code, as your comment below says.)

unless the client set the AD and/or DO bits

The DO bit should have no effect on whether the recursor performs DNSSEC validation or not; it just determines whether the recursor should return the DNSSEC records (i.e. the RRSIG). The CD bit is what is used to request no validation and is mostly used for debugging (I doubt FF ever sets it). Check the response of the following to see the various differences (+[no]dnssec toggles the DO bit):

dig +cd  +nodnssec ns dnssec-failed.org @1.1.1.1   
dig +cd  +dnssec ns dnssec-failed.org @1.1.1.1   
dig  +nodnssec ns dnssec-failed.org @1.1.1.1   
dig  +dnssec ns dnssec-failed.org @1.1.1.1

That was my understanding as well, however, the pdns-recursor documentation (https://docs.powerdns.com/recursor/dnssec.html) says this:

However, the recursor will try to validate the data if at least one of the DO or AD bits is set in the query; in that case,

So this means that they treat the DO bit to not only return the DNSSEC records but also to validate them? I can check this in the code but I just wanted to confirm if I am understanding this correctly. There is a table here https://docs.powerdns.com/recursor/dnssec.html#what-when that explains what happens and when re: DNSSEC validations.

irrespective of whether a client like Firefox asked for it [DNSSEC validation]

I'm not sure that a client (as specified in the RFCs) can explicitly ask for DNSSEC validation. The AD bit asks resolvers to also set the AD bit in responses if validation is successful, but the lack of it in a query does not indicate no DNSSEC validation.

My understanding was that the role of the AD flag was redefined and can now also be used by the client in the DNS query to explicitly ask for validation, meaning that it signals that it can understand and is interested in the response of the AD bit. (https://tools.ietf.org/html/rfc6840#section-5.7)

That said, in pdns-recursor with log-fail, the lack of the AD bit means that the recursor will validate and, on failure, will return the unvalidated answer with the AD bit unset. This is different from Cloudflare, which will respond with SERVFAIL in both cases. PDNS with log-fail ultimately leaves the client with no way of knowing whether the query is protected by DNSSEC or not.

dig +ad  ns dnssec-failed.org @1.1.1.1  
dig +noad  ns dnssec-failed.org @1.1.1.1

Yes, that's true. log-fail validates all data but only returns SERVFAIL in case the client sets the AD bit. (The documentation again mentions the DO bit here and says this: "Only on +AD or +DO from client".)

unless the client set the AD and/or DO bits

Do we know what Chrome/FF set on queries? If they set either of these bits (and I can see how AD could be useful), this conversation is a bit redundant.

Neither Firefox nor Chrome sets these bits; both leave the validation to the recursor.

This allows us to experiment with DNSSEC validation to measure what percentage of validations actually fail while not affecting the experience for Firefox/Android users.

I think that the log-fail config in pdns is very much an experimentation parameter to ease deployment, and if we use it we should make it clear that we are using it, as it is a potential downgrade to user security. We have no way of knowing whether a user has enabled network.trr.mode: 3 or what the local resolver would do if queries fall back to it, i.e. the local resolver may also return SERVFAIL but log the transaction.

That is true. I feel that *most* users will not have network.trr.mode set to 3 as that is not the default and we cannot and should not expect users to manually make this change, unless really desired.

So in brief, yes, it allows us to experiment; however, it would not be in line with a user's expectation of how DNSSEC should work/fail and could give a false sense of security in the event of Wikidough or an upstream domain being subject to some type of attack.

Yeah, the part of the fallback to the native resolver is what worries me the most about this.

If we set log-fail

Although this is useful for us and would potentially reduce user complaints, it also hides issues from the user and prevents them from making their own local decisions on how to handle DNSSEC failures.

If we enable strict validation

From my testing, this seems to be how CF is configured. I also think that the expectation of users (at least mine) is that a public DoH server would be performing DNSSEC validation and as such should respond with SERVFAIL for domains which fail validation.

I have also checked Google (8.8.8.8), Quad9 (9.9.9.9), Verisign (64.6.64.6) and Cisco (208.67.220.220), all of which seem to respond with SERVFAIL on DNSSEC failures regardless of whether the AD/DO bit is set or not. As such, my vote would be to use validate.

Thanks very much for the response and your feedback, as well as confirming the behaviour of the different resolvers. Let's do validate.

Change 621531 had a related patch set uploaded (by Ssingh; owner: Ssingh):
[operations/puppet@production] wikidough: enable DNSSEC validation in pdns-recursor

https://gerrit.wikimedia.org/r/621531

So this means that they treat the DO bit to not only return the DNSSEC records but also to validate them? I can check this in the code but I just wanted to confirm if I am understanding this correctly. There is a table here https://docs.powerdns.com/recursor/dnssec.html#what-when that explains what happens and when re: DNSSEC validations.

DO or AD, but yes, I agree; however, I believe this is a PDNS implementation detail and not something specified in the RFCs.

My understanding was that the role of the AD flag was redefined and can now also be used by the client in the DNS query to explicitly ask for validation, meaning that it signals that it can understand and is interested in the response of the AD bit. (https://tools.ietf.org/html/rfc6840#section-5.7)

Indeed it was redefined; however, its new definition does not ask for validation, it just asks for the AD bit to be returned in the answer. This could be used, for instance, to indicate in the browser toolbar that a domain is DNSSEC validated. For example, if I do dig +noad ns ripe.net @1.1.1.1, DNSSEC validation is still performed by the Cloudflare servers, but the AD bit is not set in the answer, so the client has no way of knowing that.
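One way to see this for yourself is to check whether the ad flag shows up in the response header for the +ad and +noad variants of the same query. The helper below is only a sketch; the function name `ad_flag_state` is made up, and it assumes dig's usual `;; flags:` header line.

```shell
# Report whether the AD flag is present in the header of dig output read on stdin.
ad_flag_state() {
    # Keep only the flag list between ";; flags: " and the first ";"
    flags=$(sed -n 's/^;; flags: \([^;]*\);.*/\1/p')
    case " $flags " in
        *" ad "*) echo "ad-set" ;;
        *)        echo "ad-unset" ;;
    esac
}

# Example usage (needs network access):
#   dig +ad   ns ripe.net @1.1.1.1 | ad_flag_state
#   dig +noad ns ripe.net @1.1.1.1 | ad_flag_state
```

A differing result between the two calls would only reflect whether the resolver reports the validation outcome, not whether validation took place.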

Yeah, the part of the fallback to the native resolver is what worries me the most about this.

Yes, I understand that worry, i.e. if a user has chosen to use an encrypted channel for their DNS, then in the event of a DNSSEC failure the DNS query could leak. However, we have to consider what the failure is and what the best behaviour is. If someone is actively attacking a DNSSEC-enabled domain and we ignore the failed validation, then we are aiding this attacker and in effect disabling all the protections that DNSSEC enables; as such, we may as well leave it off.

Yeah, the part of the fallback to the native resolver is what worries me the most about this.

It seems like you are trying to overcome a weakness in the FF default by weakening a PDNS default. Ultimately, IMO, all we can and should do is make our side as secure as we can and advise on a sane configuration. It is up to either Firefox to change its default and use network.trr.mode: 3, or for the user to make that choice and update the default themselves.

Thanks very much for the response and your feedback, as well as confirming the behaviour of the different resolvers. Let's do validate.

Cool, and feel free to ping me on IRC if you want to go through this any more.

Change 621531 merged by Ssingh:
[operations/puppet@production] wikidough: enable DNSSEC validation in pdns-recursor

https://gerrit.wikimedia.org/r/621531

So this means that they treat the DO bit to not only return the DNSSEC records but also to validate them? I can check this in the code but I just wanted to confirm if I am understanding this correctly. There is a table here https://docs.powerdns.com/recursor/dnssec.html#what-when that explains what happens and when re: DNSSEC validations.

DO or AD, but yes, I agree; however, I believe this is a PDNS implementation detail and not something specified in the RFCs.

My understanding was that the role of the AD flag was redefined and can now also be used by the client in the DNS query to explicitly ask for validation, meaning that it signals that it can understand and is interested in the response of the AD bit. (https://tools.ietf.org/html/rfc6840#section-5.7)

Indeed it was redefined; however, its new definition does not ask for validation, it just asks for the AD bit to be returned in the answer. This could be used, for instance, to indicate in the browser toolbar that a domain is DNSSEC validated. For example, if I do dig +noad ns ripe.net @1.1.1.1, DNSSEC validation is still performed by the Cloudflare servers, but the AD bit is not set in the answer, so the client has no way of knowing that.

Got it, thanks.

Yeah, the part of the fallback to the native resolver is what worries me the most about this.

Yes, I understand that worry, i.e. if a user has chosen to use an encrypted channel for their DNS, then in the event of a DNSSEC failure the DNS query could leak. However, we have to consider what the failure is and what the best behaviour is. If someone is actively attacking a DNSSEC-enabled domain and we ignore the failed validation, then we are aiding this attacker and in effect disabling all the protections that DNSSEC enables; as such, we may as well leave it off.

Yeah, the part of the fallback to the native resolver is what worries me the most about this.

It seems like you are trying to overcome a weakness in the FF default by weakening a PDNS default. Ultimately, IMO, all we can and should do is make our side as secure as we can and advise on a sane configuration. It is up to either Firefox to change its default and use network.trr.mode: 3, or for the user to make that choice and update the default themselves.

Yeah, makes sense. Hopefully Firefox or Chrome start doing proper DNSSEC validation :)

Thanks very much for the response and your feedback, as well as confirming the behaviour of the different resolvers. Let's do validate.

Cool, and feel free to ping me on IRC if you want to go through this any more.

Change applied, thanks!

$ kdig @208.80.153.43 +tls-ca +tls-host=malmok.wikimedia.org dnssec-failed.org +short
$ kdig @208.80.153.43 +tls-ca +tls-host=malmok.wikimedia.org dnssec-failed.org +short +cd
69.252.80.75
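The check above relies on the plain query returning nothing while the +cd query returns an address. That reading can be captured in a small helper; this is a sketch only, the function names `validation_verdict` and `check_wikidough` are made up, and the wrapper simply reuses the kdig options shown above.

```shell
# Interpret the two +short outputs: PLAIN is the normal query, WITH_CD the +cd query.
validation_verdict() {
    plain=$1
    with_cd=$2
    if [ -z "$plain" ] && [ -n "$with_cd" ]; then
        echo "enforced"        # data only comes back once checking is disabled
    else
        echo "inconclusive"    # this sketch draws no conclusion from other outcomes
    fi
}

# Hypothetical wrapper reusing the kdig options from above (needs network access).
# Usage: check_wikidough SERVER_IP TLS_HOSTNAME NAME
check_wikidough() {
    server=$1
    tls_host=$2
    name=$3
    plain=$(kdig "@$server" +tls-ca "+tls-host=$tls_host" "$name" +short)
    with_cd=$(kdig "@$server" +tls-ca "+tls-host=$tls_host" "$name" +short +cd)
    validation_verdict "$plain" "$with_cd"
}
```

With the output shown above (an empty answer without +cd, 69.252.80.75 with +cd), `validation_verdict` prints `enforced`.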

Seems to be working. Can you please confirm as well before I mark this as resolved in case I am missing something?

Seems to be working. Can you please confirm as well before I mark this as resolved in case I am missing something?

LGTM thanks