Enable NTP for drmrs network devices
Closed, Resolved · Public

Description

The network devices currently installed in drmrs do not seem to be successfully syncing time over NTP from our DNS servers.

For mr1-drmrs this seems fairly straightforward: it is using a public IPv4 address in 185.15.58.128/27 to query the servers, and that range is not allowed in /etc/ntp.conf on the DNS servers.

For asw1-b12-drmrs and asw1-b13-drmrs the situation is slightly more confusing. When you try to query the status or associations it fails like this:

cmooney@asw1-b13-drmrs> show ntp associations no-resolve    
localhost: timed out, nothing received
***Request timed out

But the same is true of the CR routers, for instance cr1-eqiad:

cmooney@re0.cr1-eqiad> show ntp associations 
localhost: timed out, nothing received
***Request timed out

So I'm not sure if that's simply cosmetic. The switches do seem to have the correct time.

Event Timeline

cmooney created this task.
Restricted Application added a subscriber: Aklapper.

Ok so for the switches I can see requests hitting the DNS servers, and the servers are responding:

cmooney@dns1001:~$ sudo tcpdump -i ens2f0np0 -l -p -nn host 10.136.128.4
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens2f0np0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:15:14.446038 IP 10.136.128.4.123 > 208.80.154.10.123: NTPv4, Client, length 48
11:15:14.446209 IP 208.80.154.10.123 > 10.136.128.4.123: NTPv4, Server, length 48
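
For reference, the "length 48" in the capture is the size of a bare NTP header with no extension fields, and tcpdump's Client/Server labels come from the mode bits in the first byte. A minimal Python sketch of that layout follows. The function names are made up for illustration, and it only builds and decodes bytes, it does not talk to any server:

```python
from __future__ import annotations

import struct

NTP_PACKET_LEN = 48  # bare NTP header, matching "length 48" in the capture


def build_client_request(version: int = 4) -> bytes:
    """Return a minimal NTP client request (mode 3) with all other fields zeroed."""
    leap, mode = 0, 3  # leap indicator 0 (no warning); mode 3 = client
    first_byte = (leap << 6) | (version << 3) | mode
    # One header byte followed by 47 zero pad bytes (stratum, poll, timestamps, ...).
    return struct.pack("!B47x", first_byte)


def describe(packet: bytes) -> tuple[int, str]:
    """Decode the NTP version and mode from the first byte of a packet."""
    modes = {3: "Client", 4: "Server"}
    first = packet[0]
    version = (first >> 3) & 0x7
    mode = first & 0x7
    return version, modes.get(mode, f"mode {mode}")
```

A request built this way has the same size and "NTPv4, Client" header bits as the packets the switch sends. A real client would also fill in the transmit timestamp.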

So I think NTP is working, but we have hit this known issue:

https://kb.juniper.net/InfoCenter/index?page=content&id=KB11436

Solution seems to be to modify our loopback firewall filter so I will investigate how we can do that.

Ok yes, it is the loopback filter. Testing the change on asw1-b13-drmrs, adding a new term as advised in the KB article fixed it:

cmooney@asw1-b13-drmrs# show | compare 
[edit firewall family inet filter loopback4]
   term allow_ntp4 { ... }
+  term allow_ntp4_local {
+      from {
+          source-address {
+              185.15.58.132/32;
+          }
+          protocol udp;
+          port ntp;
+      }
+      then accept;
+  }
   term allow_snmp4 { ... }

{master:0}[edit firewall family inet filter loopback4]
cmooney@asw1-b13-drmrs# commit
cmooney@asw1-b13-drmrs> show ntp associations                                 
   remote         refid           st t when poll reach   delay   offset  jitter
===============================================================================
*dns1001.wikimedia.org
                  138.236.128.36   3 -    2   64    1   85.259    0.337   0.046
+dns1002.wikimedia.org
                  64.79.100.196    3 -    1   64    1   85.332   -0.671   0.023
-dns2001.wikimedia.org
                  47.190.36.235    3 -    2   64    1  116.595    2.118   0.026
+dns2002.wikimedia.org
                  209.58.140.18    3 -    1   64    1  116.624   -0.425   0.082
+dns3001.wikimedia.org
                  94.228.220.14    3 -    2   64    1   20.035    0.374   0.048
-dns3002.wikimedia.org
                  83.163.190.85    3 -    1   64    1   20.026    1.237   0.022

I'll prep a change to add this via homer to relevant devices.

Change 742460 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Modified loopback4 filter to allow NTP commands to run

https://gerrit.wikimedia.org/r/742460

This comment was removed by cmooney.

Change 742462 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add drmrs loopbacks and interconnect range to ntp allowed config

https://gerrit.wikimedia.org/r/742462

Change 742462 merged by Cathal Mooney:

[operations/puppet@production] Add drmrs public prefix to ntp allowed config

https://gerrit.wikimedia.org/r/742462

Change 742460 merged by jenkins-bot:

[operations/homer/public@master] Modified loopback4 filter to allow NTP commands to run

https://gerrit.wikimedia.org/r/742460

Mentioned in SAL (#wikimedia-operations) [2021-11-30T13:05:02Z] <topranks> Running homer against CR routers to adjust loopback4 filter enabling local NTP queries for status. T296623

Ok so this has been addressed for CR routers. You can view the NTP status as follows:

cmooney@cr2-eqord> show ntp associations 
   remote         refid           st t when poll reach   delay   offset  jitter
===============================================================================
-dns1001.wikimedia.org
                  108.61.73.243    3 -   27   64  377   28.531   -2.562  10.667
+dns1002.wikimedia.org
                  108.61.73.243    3 -   16   64  377   28.589    0.174   1.028
+dns2001.wikimedia.org
                  104.131.155.175  3 -   22   64  377   23.685    0.677   0.271
*dns2002.wikimedia.org
                  217.180.209.214  2 -   23   64  377   23.661    1.083   9.276
+dns3001.wikimedia.org
                  84.245.9.254     2 -    8   64  377  136.809    0.603   3.327
-dns3002.wikimedia.org
                  94.198.159.10    2 -   18   64  377  136.380    2.225   3.325
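
If we ever want to check this output from a script rather than by eye, something like the following Python sketch could parse it. It assumes the two layouts seen in this task (remote name wrapped onto its own line, or everything on one line with no-resolve) and only recognises the `*`, `+` and `-` tally characters that appear here. The function names are hypothetical.

```python
TALLY_CHARS = "*+-"  # only the tally characters seen in this task

SAMPLE = """\
   remote         refid           st t when poll reach   delay   offset  jitter
===============================================================================
-dns1001.wikimedia.org
                  108.61.73.243    3 -   27   64  377   28.531   -2.562  10.667
*dns2002.wikimedia.org
                  217.180.209.214  2 -   23   64  377   23.661    1.083   9.276
 208.80.153.77    104.131.155.175  3 -   15  128    7    0.000  -63136. 63138.2
"""


def parse_associations(text):
    """Parse 'show ntp associations' output into a list of peer dicts."""
    peers, pending = [], None
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "remote" or fields[0].startswith("="):
            continue  # blank line, column header, or separator
        if len(fields) == 1:
            pending = fields[0]  # long remote name wrapped onto its own line
            continue
        if len(fields) == 9 and pending is not None:
            fields = [pending] + fields  # re-join name with its data line
        pending = None
        if len(fields) != 10:
            continue  # not a peer line in a layout this sketch understands
        remote = fields[0]
        tally = remote[0] if remote[0] in TALLY_CHARS else " "
        peers.append({
            "tally": tally,
            "remote": remote[1:] if tally != " " else remote,
            "refid": fields[1],
            "stratum": int(fields[2]),
            "reach": int(fields[6], 8),  # reach is printed in octal
            "delay_ms": float(fields[7]),
            "offset_ms": float(fields[8]),
            "jitter_ms": float(fields[9]),
        })
    return peers


def system_peer(peers):
    """Return the remote marked '*' (the peer currently synced to), or None."""
    for peer in peers:
        if peer["tally"] == "*":
            return peer["remote"]
    return None
```

Calling `system_peer(parse_associations(SAMPLE))` picks out the peer marked with `*`. If no line carries that marker the function returns None, which is the case worth alerting on.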

There were some complications on cr1-eqiad, as this device had two IPv4 addresses configured on its loopback interface. The second of these, 185.212.145.2, was in place to terminate tunneled traffic over a private peering service, but this had never been used. For whatever reason cr1-eqiad was using this address as the source for the NTP query to itself, despite the actual loopback being configured as "preferred". After discussion with a.younsi on IRC we decided to remove this second IP, and the command now works on cr1-eqiad.

cmooney@re0.cr1-eqiad> show ntp associations no-resolve    
   remote         refid           st t when poll reach   delay   offset  jitter
===============================================================================
+208.80.154.10    108.61.73.243    3 -   64   64    1    0.413   -0.208   0.028
+208.80.155.108   108.61.73.243    3 -   65   64    1    0.407    1.992   0.214
 208.80.153.77    104.131.155.175  3 -   15  128    7    0.000  -63136. 63138.2
*208.80.153.111   217.180.209.214  2 -   65   64    1   31.915    2.016   0.647
 91.198.174.61    84.245.9.254     2 -   14  128    7    0.000  -63135. 63137.7
+91.198.174.62    94.198.159.10    2 -   62   64    1   79.570    3.794   0.724

Worth mentioning that this problem meant cr1-eqiad wasn't actually synced to NTP before the change, presumably because it was using the wrong source IP for its queries and getting blocked on the DNS servers. This was evident in the first check I did after making the change, where some peers weren't yet reachable and the offset to others was large:

cmooney@re0.cr1-eqiad> show ntp associations    
   remote         refid           st t when poll reach   delay   offset  jitter
===============================================================================
 208.80.154.10    .STEP.          16 - 1962   64    0    0.000    0.000 4000.00
 208.80.155.108   .STEP.          16 -  914   64    0    0.000    0.000 4000.00
 208.80.153.77    104.131.155.175  3 -   45   64    1    0.000  -63136.   0.000
 208.80.153.111   .STEP.          16 -  439   64    0    0.000    0.000 4000.00
 91.198.174.61    84.245.9.254     2 -   12   64    3    0.000  -63135. 63137.9
+91.198.174.62    94.198.159.10    2 -   10   64    1   79.570    3.794   0.724
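
To read the numbers above: reach is an octal view of an 8-bit shift register recording the last eight polls, and offset is reported in milliseconds. A rough Python sketch of that arithmetic (helper names are illustrative only):

```python
def successful_polls(reach: str) -> int:
    """Count answered polls among the last eight, given the octal reach value."""
    register = int(reach, 8) & 0xFF  # e.g. "377" -> 0b11111111
    return bin(register).count("1")


def offset_seconds(offset_ms: str) -> float:
    """Convert the offset column (milliseconds) to seconds."""
    return float(offset_ms) / 1000.0
```

So reach 0 means no replies at all, 377 means all of the last eight polls were answered, and an offset of -63136. is a disagreement of roughly 63 seconds between the router clock and that server.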

The change to the DNS servers has also been made, so they should now allow queries from the DRMRS range:

cmooney@dns1001:~$ grep 185.15.58 /etc/ntp.conf 
restrict 185.15.58.0 mask 255.255.255.0 notrap nomodify noquery nopeer
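
As a sanity check that this restrict line covers the addresses mentioned in this task, a small Python sketch using the standard ipaddress module (the variable and function names are just for illustration):

```python
import ipaddress


def restrict_network(address: str, mask: str) -> ipaddress.IPv4Network:
    """Turn an ntp.conf 'restrict <address> mask <mask>' pair into a network."""
    return ipaddress.ip_network(f"{address}/{mask}")


# Values copied from this task.
allowed = restrict_network("185.15.58.0", "255.255.255.0")
mr1_range = ipaddress.ip_network("185.15.58.128/27")     # range mr1-drmrs queries from
filter_source = ipaddress.ip_address("185.15.58.132")    # address in the new filter term

print(mr1_range.subnet_of(allowed))   # is the whole /27 inside the restrict line?
print(filter_source in allowed)       # is the single address inside it?
```

Both checks should print True, since the restrict line is effectively 185.15.58.0/24.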

mr1-drmrs can now query the DNS servers for NTP:

cmooney@mr1-drmrs> set date ntp 208.80.154.10     
30 Nov 13:57:14 ntpdate[75720]: step time server 208.80.154.10 offset -15.965107 sec

Despite this I still cannot verify from the mr1-drmrs command line that the system time is synced to NTP. Same problem as on the CRs. There is no filter applied to the loopback interface, but obviously the SRX works slightly differently, so I will dig into it.

Scrap that, it does seem to be working. Perhaps it only failed to query itself right after the initial change.

cmooney@mr1-drmrs> show ntp associations 
   remote         refid           st t when poll reach   delay   offset  jitter
===============================================================================
+dns1001.wikimedia.org
                  108.61.73.243    3 -   64   64    3   86.000    0.284   0.985
+dns1002.wikimedia.org
                  198.137.202.11   3 -   61   64    3   86.143    2.666  34.060
+dns2001.wikimedia.org
                  104.131.155.175  3 -   63   64    3  117.313    2.599   3.992
*dns2002.wikimedia.org
                  217.180.209.214  2 -   62   64    3  117.322    3.215  40.739
-dns3001.wikimedia.org
                  84.245.9.254     2 -   59   64    3   94.428  -32.151   3.965
-dns3002.wikimedia.org
                  94.198.159.10    2 -   58   64    3   94.459  -30.902  47.348