
SOA serial numbers returned by authoritative nameservers differ
Closed, Resolved (Public)

Description

The .IS registry checks whether the SOA serial numbers returned by a domain's authoritative nameservers are the same.

For the domains wikipedia.is and wikimedia.is, they reported an issue to MarkMonitor, who pinged WMF Legal, who then pinged me. (There is room to optimize that workflow.)

wikipedia.is
++++++++++++++++++++
The following errors are registered when ISNIC attempts to check the zone for domain wikimedia.is:

SOA serial numbers returned by authoritative nameservers differ - ns2.wikimedia.org:2018080211 ns1.wikimedia.org:2018080211 ns0.wikimedia.org:2018080223

++++++++++++++++++++

I can confirm this, for both wikimedia.is and wikipedia.is and also wikipedia.org itself.

for ns in 0 1 2; do dig SOA wikimedia.is @ns${ns}.wikimedia.org | grep hostmaster; done

wikimedia.is. 3600 IN SOA ns0.wikimedia.org. hostmaster.wikimedia.org. 2018080223 43200 7200 1209600 3600
wikimedia.is. 3600 IN SOA ns0.wikimedia.org. hostmaster.wikimedia.org. 2018080211 43200 7200 1209600 3600
wikimedia.is. 3600 IN SOA ns0.wikimedia.org. hostmaster.wikimedia.org. 2018092019 43200 7200 1209600 3600

for ns in 0 1 2; do dig SOA wikipedia.is @ns${ns}.wikimedia.org | grep hostmaster; done

wikipedia.is. 3600 IN SOA ns0.wikimedia.org. hostmaster.wikimedia.org. 2018081012 43200 7200 1209600 3600
wikipedia.is. 3600 IN SOA ns0.wikimedia.org. hostmaster.wikimedia.org. 2018081012 43200 7200 1209600 3600
wikipedia.is. 3600 IN SOA ns0.wikimedia.org. hostmaster.wikimedia.org. 2018092019 43200 7200 1209600 3600

for ns in 0 1 2; do dig SOA wikipedia.org @ns${ns}.wikimedia.org | grep hostmaster; done

wikipedia.org. 3600 IN SOA ns0.wikimedia.org. hostmaster.wikimedia.org. 2018081012 43200 7200 1209600 3600
wikipedia.org. 3600 IN SOA ns0.wikimedia.org. hostmaster.wikimedia.org. 2018081012 43200 7200 1209600 3600
wikipedia.org. 3600 IN SOA ns0.wikimedia.org. hostmaster.wikimedia.org. 2018092019 43200 7200 1209600 3600

This is not very surprising, since both are just symlinks to wikimedia.com and wikipedia.org, respectively:

wikimedia.is -> wikimedia.com
wikipedia.is -> wikipedia.org

So this does not seem to be specific to .is at all; it is just that they are the only registry checking this(?).
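The comparison the registry performs can be approximated offline. The following is a minimal sketch in Python, not ISNIC's actual implementation: it parses SOA lines as printed by dig (here the wikimedia.is answers captured above) and reports whether the serials agree. The names soa_serial and serials_agree are hypothetical helpers introduced for illustration.

```python
import re

# Matches an SOA record line as printed by dig: owner name, TTL, class, type,
# primary nameserver, responsible mailbox, then the serial number.
SOA_LINE = re.compile(
    r"^\S+\s+\d+\s+IN\s+SOA\s+\S+\s+\S+\s+(\d+)\s", re.MULTILINE)


def soa_serial(dig_output):
    """Return the SOA serial found in one dig answer, or None if absent."""
    match = SOA_LINE.search(dig_output)
    return int(match.group(1)) if match else None


def serials_agree(answers):
    """answers maps a nameserver name to the dig output it returned."""
    serials = {ns: soa_serial(text) for ns, text in answers.items()}
    return len(set(serials.values())) == 1, serials


# The three wikimedia.is answers from the dig loop, in ns0, ns1, ns2 order.
answers = {
    "ns0.wikimedia.org": "wikimedia.is. 3600 IN SOA ns0.wikimedia.org. "
                         "hostmaster.wikimedia.org. 2018080223 43200 7200 1209600 3600",
    "ns1.wikimedia.org": "wikimedia.is. 3600 IN SOA ns0.wikimedia.org. "
                         "hostmaster.wikimedia.org. 2018080211 43200 7200 1209600 3600",
    "ns2.wikimedia.org": "wikimedia.is. 3600 IN SOA ns0.wikimedia.org. "
                         "hostmaster.wikimedia.org. 2018092019 43200 7200 1209600 3600",
}

ok, serials = serials_agree(answers)
print(ok, serials)
```

Feeding it the live output of dig for each nameserver, rather than pasted text, would turn this into a quick pre-check before a registry notices a mismatch.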

There is some urgency here, since the notice says: "Please note that the .IS registry will suspend the domain if this problem is not fixed within the next 6 weeks."

Event Timeline

SOA serial values only have meaning to the administrators of a zone, and to servers with which they authorize legacy zone transfers. The registry is neither of these parties. We could serve random garbage digits for a serial number that are unique on every request and never match across our servers or across time, and it would have absolutely no bearing on correct operation. Therefore, it's a pretty silly policy for the .IS registry to care about them, or warn about them, or especially to threaten suspension over them.

All of that aside: the crux of the issue is that we auto-generate serial numbers (using our templated {{serial}} field). The workflow is that authdns-update (executed from any nameserver when we push a change) executes authdns-local-update on each of our nameservers. That in turn syncs the zone templates down via git to the server's local filesystem, and then runs authdns-gen-zones to generate the zone files from the templates using Jinja templating.

The value templated in comes from this line of Python: context['serial'] = time.strftime('%Y%m%d%H', time.gmtime()), which means it reflects the current time (to the hour, in UTC) when authdns-gen-zones was run. The script also saves time by skipping the regeneration of files whose template content hasn't changed since it last templated them, using this check:

try:
    if not args.force and (
            os.path.getmtime(templatepath) <=
            os.path.getmtime(zonepath)):
        continue

The net of all of these things: if you reinstall all your nameservers (which we recently did) but they're initially installed at different dates/times (also true!), they do their initial pull of the zone data at different times, and each picks different serial values for all of the initially generated zone data. Because of the optimization against redoing work above, the servers won't regenerate the serials until a file is actually touched by changes (after the point where all the servers were reinstalled), or until the script is run with the -f flag to force regeneration.
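The divergence described above can be reproduced in miniature. This is a rough sketch under stated assumptions, not the real authdns-gen-zones: gen_zone, read and hour are hypothetical helpers, plain str.replace stands in for Jinja, the clock is passed in as a parameter so that install times can be simulated, and a missing zone file is assumed to always be generated (the handler of the real script's try block is not shown in the excerpt).

```python
import calendar
import os
import tempfile
import time


def gen_zone(templatepath, zonepath, now, force=False):
    """Render one zone file; return True if it was (re)written."""
    # Same mtime comparison as the excerpt: skip when the zone file is at
    # least as new as its template, unless regeneration is forced.
    if not force and os.path.exists(zonepath) and (
            os.path.getmtime(templatepath) <= os.path.getmtime(zonepath)):
        return False
    serial = time.strftime('%Y%m%d%H', time.gmtime(now))
    with open(templatepath) as f:
        rendered = f.read().replace('{{serial}}', serial)
    with open(zonepath, 'w') as f:
        f.write(rendered)
    return True


def read(path):
    with open(path) as f:
        return f.read()


def hour(y, m, d, h):
    """UTC timestamp for the start of the given hour."""
    return calendar.timegm((y, m, d, h, 0, 0))


with tempfile.TemporaryDirectory() as tmp:
    template = os.path.join(tmp, 'wikimedia.is.tmpl')
    with open(template, 'w') as f:
        f.write('@ SOA ns0 hostmaster {{serial}} 43200 7200 1209600 3600\n')
    # Pretend the template was last changed long before any reinstall.
    os.utime(template, (1_000_000_000, 1_000_000_000))

    # Two servers reinstalled at different times do their first generation.
    installs = {'ns1': hour(2018, 8, 2, 11), 'ns2': hour(2018, 9, 20, 19)}
    zones = {s: os.path.join(tmp, s + '.zone') for s in installs}
    for server, installed_at in installs.items():
        gen_zone(template, zones[server], installed_at)
    before = {s: read(p) for s, p in zones.items()}

    # A later run without force is skipped: the template is not newer.
    later = hour(2018, 10, 10, 19)
    rewritten = [gen_zone(template, p, later) for p in zones.values()]

    # A forced run rolls every serial to the same hour.
    for p in zones.values():
        gen_zone(template, p, later, force=True)
    after = {s: read(p) for s, p in zones.items()}

print(before)
print(rewritten)
print(after)
```

The two simulated servers keep different serials until the forced run, which is the same effect the -f flag has on the real servers.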

Anyways, it's fixed for now.

I fixed it by manually running authdns-gen-zones -f /srv/authdns/git/templates /etc/gdnsd/zones && gdnsdctl reload-zones as root on each authdns server, which rolled all SOA serial values on all of our zones up to the present hour.

We'll need to remember to do this again after any authdns server install or reinstall, until we fix the underlying problems here in a more elegant way. This is also a variant of the same underlying templating issue as T97051.

I'd like to take advantage of a few features in gdnsd-3.x while refactoring to fix some of these things, but before we can move any further forward on all things authdns, we have a CI-level blocker at T205439!

ema triaged this task as Medium priority. Oct 11 2018, 9:45 AM

Thank you for the detailed explanation. I will get back to Legal and MarkMonitor about it.

I have replied by email and referred to this ticket. I also pointed out we don't agree that this check is even useful (but that it was fixed nevertheless).

I can also confirm all the numbers are the same now.

wikipedia.is.		3600	IN	SOA	ns0.wikimedia.org. hostmaster.wikimedia.org. 2018101019 43200 7200 1209600 3600
wikipedia.is.		3600	IN	SOA	ns0.wikimedia.org. hostmaster.wikimedia.org. 2018101019 43200 7200 1209600 3600
wikipedia.is.		3600	IN	SOA	ns0.wikimedia.org. hostmaster.wikimedia.org. 2018101019 43200 7200 1209600 3600

So there should be no more issue and we can resolve this.

Thanks again. As always it's greatly appreciated that you add all the technical details and background.

They mailed again with the same complaint as before: wikimedia.is isn't compliant because the SOA serials differ, etc.

Then MarkMonitor mailed me about it.

SOA serial numbers returned by authoritative nameservers differ - ns2.wikimedia.org:2018101917 ns0.wikimedia.org:2018101919 ns1.wikimedia.org:2018101919 .

Fixed again. Copying my whole terminal output for posterity. This runs a read-only command that md5sums the zones directory to check whether all servers have exactly the same zone data, then runs the same regeneration command that fixed them before, then confirms that the hashes now match:

bblack@cumin1001:~$ sudo cumin 'C:role::authdns::server' 'find /etc/gdnsd/zones -type f -exec md5sum {} \; |sort -k 2|md5sum'
3 hosts will be targeted:
authdns[1001,2001].wikimedia.org,multatuli.wikimedia.org
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                                                        
(1) multatuli.wikimedia.org                                                                                                                                                   
----- OUTPUT of 'find /etc/gdnsd/...sort -k 2|md5sum' -----                                                                                                                   
ab66d08220b2475065d38c7c3bffc311  -                                                                                                                                           
===== NODE GROUP =====                                                                                                                                                        
(2) authdns[1001,2001].wikimedia.org                                                                                                                                          
----- OUTPUT of 'find /etc/gdnsd/...sort -k 2|md5sum' -----                                                                                                                   
749a6448e31706eab82740cfdab0cf5a  -                                                                                                                                           
================                                                                                                                                                              
PASS:  |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (3/3) [00:01<00:00,  2.01hosts/s]     
FAIL:  |                                                                                                                                 |   0% (0/3) [00:01<?, ?hosts/s]     
100.0% (3/3) success ratio (>= 100.0% threshold) for command: 'find /etc/gdnsd/...sort -k 2|md5sum'.
100.0% (3/3) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
bblack@cumin1001:~$ sudo cumin 'C:role::authdns::server' 'authdns-gen-zones -f /srv/authdns/git/templates /etc/gdnsd/zones && gdnsdctl reload-zones'
3 hosts will be targeted:
authdns[1001,2001].wikimedia.org,multatuli.wikimedia.org
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                                                        
(3) authdns[1001,2001].wikimedia.org,multatuli.wikimedia.org                                                                                                                  
----- OUTPUT of 'authdns-gen-zone...ctl reload-zones' -----                                                                                                                   
info: Zone data reloaded                                                                                                                                                      
================                                                                                                                                                              
PASS:  |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (3/3) [00:05<00:00,  1.96s/hosts]     
FAIL:  |                                                                                                                                 |   0% (0/3) [00:05<?, ?hosts/s]     
100.0% (3/3) success ratio (>= 100.0% threshold) for command: 'authdns-gen-zone...ctl reload-zones'.
100.0% (3/3) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
bblack@cumin1001:~$ sudo cumin 'C:role::authdns::server' 'find /etc/gdnsd/zones -type f -exec md5sum {} \; |sort -k 2|md5sum'
3 hosts will be targeted:
authdns[1001,2001].wikimedia.org,multatuli.wikimedia.org
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                                                        
(3) authdns[1001,2001].wikimedia.org,multatuli.wikimedia.org                                                                                                                  
----- OUTPUT of 'find /etc/gdnsd/...sort -k 2|md5sum' -----                                                                                                                   
edb7c18c736c92f6f34fd73850a001b5  -                                                                                                                                           
================                                                                                                                                                              
PASS:  |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (3/3) [00:01<00:00,  2.01hosts/s]     
FAIL:  |                                                                                                                                 |   0% (0/3) [00:01<?, ?hosts/s]     
100.0% (3/3) success ratio (>= 100.0% threshold) for command: 'find /etc/gdnsd/...sort -k 2|md5sum'.
100.0% (3/3) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
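The read-only consistency check in the terminal output above (hash every zone file, sort by path, hash the listing) can also be sketched in Python. This is only an illustration of the same idea and not a drop-in replacement: zones_digest and write_zone are hypothetical names, SHA-256 is swapped in for md5sum, and the digests will not match the shell pipeline's output.

```python
import hashlib
import os
import tempfile


def zones_digest(root):
    """Return one hex digest summarising every regular file under root."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # like find -type f, ignore symlinks
            with open(path, 'rb') as f:
                file_hash = hashlib.sha256(f.read()).hexdigest()
            entries.append((os.path.relpath(path, root), file_hash))
    entries.sort()  # order by path so the result does not depend on walk order
    summary = hashlib.sha256()
    for relpath, file_hash in entries:
        summary.update(f'{file_hash}  {relpath}\n'.encode())
    return summary.hexdigest()


def write_zone(directory, name, serial):
    """Write a one-line stand-in zone file carrying the given serial."""
    with open(os.path.join(directory, name), 'w') as f:
        f.write(f'@ SOA ns0 hostmaster {serial} 43200 7200 1209600 3600\n')


# Two directories stand in for the zones directory on two servers.
with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    write_zone(a, 'wikimedia.is', '2018101919')
    write_zone(b, 'wikimedia.is', '2018101917')
    differ = zones_digest(a) != zones_digest(b)

    # After "regeneration" with the same serial, the digests line up.
    write_zone(b, 'wikimedia.is', '2018101919')
    match = zones_digest(a) == zones_digest(b)

print(differ, match)
```

A single differing serial in any zone file changes the summary digest, which is why the first cumin run splits the three hosts into two node groups.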

Update for the record: with recent changes to authdns CI and deployment scripts, this scenario should no longer be possible and workarounds shouldn't be necessary! (see also related distant past incident T103915)

all these cool little updates before year end, nice!