Maniphest T208244

ntp broken in new region
Closed, Resolved · Public

Assigned To: Andrew
Authored By: Krenair, Oct 29 2018, 5:26 PM

Description

Apparently xtools-prod03.xtools.eqiad.wmflabs which sits in eqiad1 has problems with systemd-timesyncd:
Timed out waiting for reply from dns2002.wikimedia.org
same for dns1001

I bet the prod DNS servers won't let the new region talk NTP. In the main deployment I get Synchronized to time server 208.80.154.10:123 (dns1001.wikimedia.org).
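To tell "server silently drops our packets" apart from a local systemd-timesyncd problem, a bare SNTP query can be sent by hand. The following is only a minimal sketch, assuming Python 3 on the VM and IPv4 reachability; the helper names (build_sntp_request, parse_transmit_time, probe) are made up for illustration and are not part of any existing tooling.

```python
import socket
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def build_sntp_request() -> bytes:
    """Return a 48-byte SNTP client request.

    First byte 0x23 encodes LI=0, VN=4, Mode=3 (client); the rest is zero.
    """
    return b"\x23" + 47 * b"\x00"


def parse_transmit_time(reply: bytes) -> float:
    """Extract the server transmit timestamp (bytes 40..47) as Unix time."""
    if len(reply) < 48:
        raise ValueError("short NTP reply")
    seconds, fraction = struct.unpack("!II", reply[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def probe(server: str, timeout: float = 2.0) -> float:
    """Query `server` on udp/123 (IPv4 only in this sketch).

    Raises socket.timeout if no reply arrives within `timeout` seconds,
    which is the symptom described in this task.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_sntp_request(), (server, 123))
        reply, _ = sock.recvfrom(512)
    return parse_transmit_time(reply)
```

Calling probe("dns1001.wikimedia.org") from an affected VM would be expected to raise a timeout if the reply is being dropped somewhere, while the same call from the main deployment should return a timestamp.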

Thanks to @Tgr, @fgiunchedi and @MusikAnimal for help finding this.

Details

Subject	Repo	Branch	Lines +/-
ntp: use cloud-specific ntp servers for cloud VMS	operations/puppet	production	+17 -0
cloudservices: include ntp servers.	operations/puppet	production	+3 -0
cloud ntp: allow eqiad VMs to access these servers	operations/puppet	production	+1 -1
Add role/profile for a set of in-cloud ntp servers	operations/puppet	production	+38 -0


Related Objects

Status	Assigned	Task
Open	None	T53494 Use Beta cluster as a true canary for code deployments (epic)
Open	None	T87220 Minimize infrastructure differences between Beta Cluster and production
Open	None	T196662 Set up LVS in beta like prod
Resolved	bd808	T166396 Program 1 Outcome 4: VPS hosting
Resolved	None	T167293 Nova-network to Neutron migration
Resolved	MusikAnimal	T208135 OAuth login failing
Resolved	Andrew	T208244 ntp broken in new region

Event Timeline

Krenair triaged this task as High priority. Oct 29 2018, 5:26 PM

Krenair created this task.

Restricted Application added a subscriber: Aklapper. Oct 29 2018, 5:26 PM

Krenair added a parent task: T167293: Nova-network to Neutron migration. Oct 29 2018, 5:28 PM

Paladox subscribed. Oct 29 2018, 5:28 PM

So prod dns* servers are assumed to include role::recursor (this holds true for nescio and maerlant, the others are all hard to track down to puppet nodes), which includes not just role::dnsrecursor but also role::ntp. role::ntp includes profile::ntp and that gives us:

ferm::service { 'ntp':
    proto => 'udp',
    port  => 'ntp',
}

which has no saddr. So is it some 10/8 ACL on networking equipment somewhere doing this?

The ferm rules in the ntp profile (and the dnsrecursor profile) don't appear to use any "srange" limiting where a connection can come from:

ferm::service { 'ntp':
    proto => 'udp',
    port  => 'ntp',
}

Also on dns2001 there is:

ACCEPT     udp  --  anywhere             anywhere             udp dpt:ntp

Therefore I would think it's ACLs on network gear, and I recommend pinging netops to take a look.

EDIT: I made my comment before reading the previous one. ACK to that.

Dzahn added a project: netops. Oct 29 2018, 5:48 PM

Restricted Application added a project: SRE. Oct 29 2018, 5:48 PM

@ayounsi @faidon are there router ACLs allowing udp/123 (NTP) that include the old cloud IP ranges but not the new cloud IP ranges?

Change 470445 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudservices: include ntp servers.

https://gerrit.wikimedia.org/r/470445

Change 470446 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] ntp: use cloud-specific ntp servers for cloud VMS

https://gerrit.wikimedia.org/r/470446

I've attached patches that propose running a cloud-specific NTP server. I'd also be OK with changing the network ACLs to allow the new region to access the standard NTP servers (which were being used by the old region).

Can we set up a couple of NTP servers within VPS e.g. in the cloudinfra project instead? Should be just a couple of generic instances with role::ntp applied, right?

I just noticed profile::ntp also contains some ACLs of its own, which restrict source addresses and don't use ferm::service.

@faidon: Well, standard::ntp and profile::ntp would need some changes (they're very prod-specific with no hiera), but fundamentally that sort of setup sounds like it should work.

Yeah I'd agree that's the direction we should go. We don't offer our ntp servers to the globe for good reasons, and we similarly probably shouldn't be offering them to WMCS, plus it's pretty easy to build new NTP servers there. There's a whole bunch of prod-specific stuff in modules/profile/manifests/ntp.pp that needs hiera-izing to override it for WMCS use-case. The structural stuff there is far more complex than the WMCS NTP servers would need, but I think it could be adapted to that case.

(or alternatively, we could look at this as one of the clear examples where a separate WMCS puppetization would be far simpler).

In T208244#4703730, @Krenair wrote:

I just noticed profile::ntp also contains some ACLs of its own, which restrict source addresses and don't use ferm::service.

Good find!

In T208244#4703694, @Dzahn wrote:

@ayounsi @faidon are there router ACLs allowing udp/123 (NTP) that include the old cloud IP ranges but not the new cloud IP ranges?

Not that I'm aware of.

Due to T174596, the source IP visible to the NTP server is the internal 172.16/12 one, which is not in the ACLs listed on https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/ntp.pp#L37

Adding it might be the best short term fix until NTP runs in OpenStack.
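The effect of the missing range can be sketched with Python's ipaddress module. This is only an illustration: EXAMPLE_ACL and is_allowed are hypothetical, and the networks listed are placeholders, not the actual ACL from profile::ntp.

```python
import ipaddress

# Hypothetical allow-list for illustration only; NOT the real ACL in ntp.pp.
EXAMPLE_ACL = ["10.0.0.0/8", "192.0.2.0/24"]


def is_allowed(source: str, acl: list) -> bool:
    """Return True if `source` falls inside any network listed in `acl`."""
    addr = ipaddress.ip_address(source)
    return any(addr in ipaddress.ip_network(net) for net in acl)


# A VM in the new region shows up with its internal 172.16/12 address.
print(is_allowed("172.16.0.5", EXAMPLE_ACL))                      # rejected
print(is_allowed("172.16.0.5", EXAMPLE_ACL + ["172.16.0.0/12"]))  # accepted
```

With an allow-list of this shape, the short-term fix amounts to appending 172.16.0.0/12, as the second call shows.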

Running an ntp server or two on a cloud VM is probably not a big deal. But, before I go down that road... does anyone want to argue against us just using pool.ntp.org for VMs? And, what is the external source of ntp authority that the production NTP servers use?

So, a few things I can say along those lines:

Production ultimately derives its clock sources from various pool.ntp.org sources (mediated by our per-DC server pools). pool.ntp.org is a pretty decent and reliable and free resource (as opposed to manual server lists, or installing hardware GPS clocks, etc).
However, using pool.ntp.org directly from a bunch of local clients probably isn't a great idea, vs having your own set of 2-3 servers mediating the synchronization (your 2-3 hit pool.ntp.org and peer with each other, then your bulk clients act as simple clients of those). Going direct to pool.ntp.org is abusive of pool resources, and wide-area sync isn't as tight as local anyways (meaning your local machines would have more random time diff between them than is necessary).

But then this all reminds me of something I should've thought about earlier: NTP and VMs haven't historically blended well anyways, so there might be other gremlins to look out for while you're staring at all of this. The issues have surely evolved since I last looked, and can vary a lot by what hypervisor you're using (KVM?). There's all kinds of conflicting advice on a quick google search right now, and it all looks more complicated than I can wade into at the moment. It sounds like to get really accurate time everywhere, you do want NTP inside your guests, but that there may be some (non-standard?) configuration tricks to make it work well (the various bits about pvclock and kvm-clock and whether you have a constant-time TSC in your host hardware, etc):

https://access.redhat.com/discussions/686523
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/chap-kvm_guest_timing_management
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/5/html/virtualization/chap-virtualization-kvm_guest_timing_management

In T208244#4704231, @Andrew wrote:

what is the external source of ntp authority that the production NTP servers use?

Seems it's all 0.*.pool.ntp.org; for eqiad it's 0.us.pool.ntp.org: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/ntp.pp#49

In T208244#4704313, @BBlack wrote:

But then this all reminds me of something I should've thought about earlier: NTP and VMs haven't historically blended well anyways, so there might be other gremlins to look out for while you're staring at all of this. The issues have surely evolved since I last looked, and can vary a lot by what hypervisor you're using (KVM?). There's all kinds of conflicting advice on a quick google search right now, and it all looks more complicated than I can wade into at the moment. It sounds like to get really accurate time everywhere, you do want NTP inside your guests, but that there may be some (non-standard?) configuration tricks to make it work well (the various bits about pvclock and kvm-clock and whether you have a constant-time TSC in your host hardware, etc):

I think things have improved considerably in the past few years, but in any case... the only clients of said servers will be VMs anyway, so I don't think it will make a difference. If the amount of effort to set up NTP servers isn't huge, I'd give it a shot :)

Andrew claimed this task. Oct 30 2018, 8:14 PM

Andrew added a project: cloud-services-team (Kanban).

Andrew moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Change 470751 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Add role/profile for a set of in-cloud ntp servers

https://gerrit.wikimedia.org/r/470751

Change 470751 merged by Andrew Bogott:
[operations/puppet@production] Add role/profile for a set of in-cloud ntp servers

https://gerrit.wikimedia.org/r/470751

Change 471343 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloud ntp: allow eqiad VMs to access these servers

https://gerrit.wikimedia.org/r/471343

Change 471343 merged by Andrew Bogott:
[operations/puppet@production] cloud ntp: allow eqiad VMs to access these servers

https://gerrit.wikimedia.org/r/471343

Change 470445 abandoned by Andrew Bogott:
cloudservices: include ntp servers.

https://gerrit.wikimedia.org/r/470445

Change 470446 merged by Andrew Bogott:
[operations/puppet@production] ntp: use cloud-specific ntp servers for cloud VMS

https://gerrit.wikimedia.org/r/470446

I built a couple of ntp servers in the cloudinfra project and we pointed all VMs at those servers.

root@xtools-prod03:/var/log# timedatectl status
      Local time: Sun 2018-11-04 18:02:25 UTC
  Universal time: Sun 2018-11-04 18:02:25 UTC
        RTC time: Sun 2018-11-04 18:02:25
       Time zone: Etc/UTC (UTC, +0000)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: n/a

Seems promising! Let me know if you see new problems (or if you think this isn't really fixed... I don't know for sure that this is an adequate fix.)
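For anyone who wants to check other VMs the same way, the timedatectl output can be parsed mechanically. A minimal sketch, assuming the field label "NTP synchronized" as printed above (other systemd versions may label this field differently); parse_timedatectl and is_ntp_synchronized are hypothetical helper names.

```python
def parse_timedatectl(output: str) -> dict:
    """Turn 'key: value' lines from `timedatectl status` into a dict."""
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only, so timestamps in the value survive.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_ntp_synchronized(output: str) -> bool:
    """True if the 'NTP synchronized' field (label assumed) reads 'yes'."""
    return parse_timedatectl(output).get("NTP synchronized") == "yes"
```

The output string could come from something like subprocess.run(["timedatectl", "status"], capture_output=True, text=True).stdout on each instance.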
