
Determine how to monitor services in cloud-private / cloudlb
Open, Needs Triage, Public

Description

The Cloud VPS infrastructure now hosts certain services (including the OpenStack API and our DNS servers) on cloud-realm IP addresses, from servers that have both wikiprod and cloud-realm connectivity via our cloudsw switches. This setup follows case 4 as documented at https://wikitech.wikimedia.org/wiki/Cross-Realm_traffic_guidelines#Case_4:_cloud-dedicated_hardware.

An open question is how to implement (Prometheus-based) monitoring for those services. The existing prometheus hardware has no connectivity to the cloud-realm addresses. We do have cloudmetrics hardware running the Cloud VPS Prometheus instance, but those hosts are not in WMCS-dedicated racks, so they currently rely on a hack to reach the cloud-realm addresses.

The simplest solution for now would be to relocate the cloudmetrics hosts to WMCS racks and give them addresses in cloud-private. However, T336854: Move labs/wmcs (OpenStack) Prometheus instance off cloudmetrics hosts to prometheus* hosts plans to consolidate the Cloud VPS Prometheus instance onto the existing prometheus hosts, which would need a different solution.

Event Timeline

@taavi thanks for the task.

Firstly, moving the cloudmetrics hosts into the WMCS racks and connecting them to the cloud-private network makes sense, if those hosts are to remain longer term.

If the plan is instead to decommission them and move all monitoring to the WMF prometheusXXXX hosts, that can probably also be supported. The prometheus hosts can (see the sketch after this list):

  1. Poll the cloud hosts' 10.x IPs to pull all relevant statistics for the host
  2. Connect to cloud hosts' public VIPs to monitor actual service state / connectivity to those endpoints (blackbox exporter etc)
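
A rough sketch of both, as seen from a prometheus host (the 10.x address and the public VIP below are hypothetical placeholders; 9100 and 9115 are the default node_exporter and blackbox exporter ports):

$ # 1. host-level stats via the cloud host's 10.x address (node_exporter)
$ curl -s http://10.64.20.10:9100/metrics | head -5
$ # 2. service check against a public VIP via a local blackbox exporter,
$ #    assuming an "http_2xx" module is defined in its config
$ curl -s 'http://localhost:9115/probe?module=http_2xx&target=https://<wmcs-public-vip>'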

There is a slight problem with number 2, however. This would be a connection from private WMF space (10.x network) to public internet addresses, which ordinarily isn't possible from an RFC1918 address without NAT. However, we are already doing exactly that from the prometheus hosts to public WMF endpoints:

cmooney@prometheus1005:~$ ss -tulpna4 | grep 208\.80 | column -t  | head -10 
tcp  TIME-WAIT  0  0  10.64.0.82:40026  208.80.154.252:389
tcp  ESTAB      0  0  10.64.0.82:43294  208.80.154.152:9105
tcp  ESTAB      0  0  10.64.0.82:37622  208.80.154.142:4194
tcp  ESTAB      0  0  10.64.0.82:52424  208.80.154.78:4194
tcp  TIME-WAIT  0  0  10.64.0.82:47856  208.80.153.110:22
tcp  ESTAB      0  0  10.64.0.82:59694  208.80.154.30:9105
tcp  ESTAB      0  0  10.64.0.82:39348  208.80.154.146:9631
tcp  ESTAB      0  0  10.64.0.82:39540  208.80.154.132:9290
tcp  ESTAB      0  0  10.64.0.82:53364  208.80.154.31:9105
tcp  ESTAB      0  0  10.64.0.82:46048  208.80.154.6:9105

As the public IPs in question (WMF or Cloud) are internal to our network, the normal "impossibility" of a private IP connecting to a public one is not actually a technical constraint. Any restriction on the prometheus hosts connecting to public WMCS IPs is thus purely a matter of policy.

I see two ways forward in that case:

  1. Allow the prometheus hosts to poll WMCS public IPs
  2. Give the prometheus hosts public IPs, so they can poll internet endpoints (including WMCS public IPs)

Personally I've no objection to the first option, simply allowing it. But as you mention, the policy and overall shape of things in terms of the "cross-realm guidelines" need to be considered. @ayounsi, have you any thoughts here?

> There is a slight problem with number 2, however. This would be a connection from private WMF space (10.x network) to public internet addresses, which ordinarily isn't possible from an RFC1918 address without NAT. However, we are already doing exactly that from the prometheus hosts to public WMF endpoints:

What about services that were previously using WMF public IPs but are now using WMCS private IPs, for example the ns-recursor VIP? I don't think we can move the monitoring for that to Cloud VPS, since an outage might also prevent cloud vps hosted monitoring systems from notifying us about said outage.

The other complication I foresee here is that the cloudmetrics hosts currently have some connections from VM instances? For example:

cmooney@cloudmetrics1003:~$ ss -tulpna | grep 172.16 | column -t 
tcp  ESTAB  0  0  10.64.4.6:2003  172.16.6.142:53712

The above IP resolves to traffic-cptext.traffic.eqiad1.wikimedia.cloud. TBH such traffic patterns have long been allowed, but our goal is to reduce them longer term. As the above connection seems to be initiated from inside the cloud, could it potentially be NAT'd by the cloudgw to 185.15.56.1?
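
If we did NAT it, the cloudgw rule would be something like this nftables sketch (purely illustrative; the VM source range and the egress interface name are assumptions):

$ # on cloudgw: source-NAT VM-initiated flows leaving the cloud realm
$ nft add rule ip nat postrouting ip saddr 172.16.0.0/21 oifname "eno1" snat to 185.15.56.1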

But yeah, the other consideration is whether there are things that need to be monitored / connected to that aren't physical servers.

> The other complication I foresee here is that the cloudmetrics hosts currently have some connections from VM instances? For example:

That was a long-lived connection that would have been blocked by https://gerrit.wikimedia.org/r/c/operations/puppet/+/942691/ today. That VM seems to be having Puppet issues, but I fixed the service that was causing it.

> What about services that were previously using WMF public IPs but are now using WMCS private IPs, for example the ns-recursor VIP? I don't think we can move the monitoring for that to Cloud VPS, since an outage might also prevent cloud vps hosted monitoring systems from notifying us about said outage.

That's a good point. I don't think it makes sense to allow WMF servers to reach the cloud-private address space; the entire point of the project was to create a new, private network between cloud nodes to make internal services available (rather than have those services sit on public WMF space, and have to route through our core routers on low-bandwidth links to get there).

Moving, for instance, the recursor IP to a public one is an obvious answer. But you lose certain security benefits by putting something that doesn't need to be reachable from outside "on the internet", just to allow monitoring.

Overall this requirement makes me think that something within the cloud-network, connected to a cloud-private subnet, should monitor those.

> Personally I've no objection to the first option, simply allowing it. But as you mention, the policy and overall shape of things in terms of the "cross-realm guidelines" need to be considered. @ayounsi, have you any thoughts here?

Prometheus monitors endpoints outside of WMF's network through the proxies, see T303803: Prometheus use of Squid proxies. Would that work for this use case?
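
For reference, Prometheus supports this via the proxy_url setting on a scrape config, and the path is easy to test with curl (the proxy address below is a placeholder for whichever Squid instance is used):

$ # HTTP(S) goes through the proxy fine, which covers blackbox-style HTTP checks
$ curl -sI -x http://<squid-proxy>:8080 https://www.wikimedia.org/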

> What about services that were previously using WMF public IPs but are now using WMCS private IPs, for example the ns-recursor VIP?

Do we have a full list (current and future) of such services? I'm not sure that was taken into consideration while evaluating T336854 (and its parent).

I agree that exposing a service from one realm to the internet just to be able to monitor it from a different realm seems overkill and might cause more issues than it solves. It could also produce false positives/negatives if the service is reachable from outside without being reachable from its legitimate clients.

Maybe an idea is to monitor it from inside WMCS using a lightweight process and only expose the monitoring status through the internet or through a cloud host (like old Icinga's NRPE, or a Prometheus exporter). Probably something to evaluate on a case-by-case basis.

Thank you for raising this, @taavi!

I can confirm that the plan is to ditch cloudmetrics completely and consolidate cloud monitoring to prometheus hosts.

In terms of addressing I'm +1 on keeping prometheus on internal IPs and allowing those hosts to reach cloud hosts (I mean the cloud hosts that run OpenStack, not Cloud VPS addresses/hosts; not sure if we have an established name for either?).

Being able to reach cloud public IPs too is handy, I think, and mirrors what's already possible in production, as you pointed out @cmooney, so I'm +1 on that too.

> There is a slight problem with number 2, however. This would be a connection from private WMF space (10.x network) to public internet addresses, which ordinarily isn't possible from an RFC1918 address without NAT. However, we are already doing exactly that from the prometheus hosts to public WMF endpoints:

> What about services that were previously using WMF public IPs but are now using WMCS private IPs, for example the ns-recursor VIP? I don't think we can move the monitoring for that to Cloud VPS, since an outage might also prevent cloud vps hosted monitoring systems from notifying us about said outage.

The ns-recursor VIP uses Cloud VPS addressing, correct? If so, I think we should keep Cloud VPS monitoring within Cloud VPS, and then run a few high-level checks from prometheus* against public WMCS / Cloud VPS addresses, effectively as an "external monitor" for general Cloud VPS operation.

HTH

> What about services that were previously using WMF public IPs but are now using WMCS private IPs, for example the ns-recursor VIP?

> Do we have a full list (current and future) of such services? I'm not sure that was taken into consideration while evaluating T336854 (and its parent).

Currently:

  • ns-recursor
    • formerly a host alias for cloudservices hosts in a public VLAN

In the future:

  • wiki replicas (T346947)
  • cloudelastic (T346946)
    • both currently behind LVS

Dumps (NFS and HTTPS) are still a question mark, but they might fall into this category.

ns-recursor is the most critical one of those.

> The ns-recursor VIP uses Cloud VPS addressing, correct? If so, I think we should keep Cloud VPS monitoring within Cloud VPS, and then run a few high-level checks from prometheus* against public WMCS / Cloud VPS addresses, effectively as an "external monitor" for general Cloud VPS operation.

It's in "WMCS private service VIPs" range (172.20.255.0/24), which are considered cloud-realm addresses for policy purposes but hosted on dual-homed hardware (cloudservices in this case).

Monitoring those via metricsinfra and monitoring the monitoring setup from production might be the way to go here indeed. T288053: Add external meta-monitoring for metricsinfra is relevant, but I haven't given it much thought yet.

> The ns-recursor VIP uses Cloud VPS addressing, correct?

No, it's a service hosted by the cloudservices bare-metal nodes. These were previously using public IP addressing in the WMF production realm for all services, and thus reachable from everywhere.

Now the cloudservices nodes are connected to 10.x addressing (same as any other wmf private host basically), as well as having a leg in a new cloud-only network using 172.20.x.x. Their 10.x IPs can be polled by prometheus no problem. They also announce some public IPs in BGP, for services that need to be available from the internet, which are currently reachable from private WMF space (unless we put ACLs in to block it). They use other, 172.20.x private IPs to host services that are internal to cloud and don't need to be exposed to the outside.

It's those IPs, like 172.20.255.1 (ns-recursor.openstack.eqiad1.wikimediacloud.org), that are not reachable from WMF prod and that we don't really want to make directly routable.
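
To illustrate the asymmetry (the query is just an example): anything with a leg in cloud-private gets an answer from the recursor directly, while a prod-only 10.x host has no route to 172.20.x at all:

$ # from a cloud-realm host: the VIP answers
$ dig +short @172.20.255.1 wikitech.wikimedia.org
$ # from a prod-only 10.x host, the same query simply times out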

> Prometheus monitors endpoints outside of WMF's network through the proxies, see T303803: Prometheus use of Squid proxies. Would that work for this use case?

Not 100%; it depends on the protocols used, I guess. They do run DNS and some other non-TCP stuff, so an HTTP or TCP-only proxy won't work.

Prometheus itself, scraping metrics, is not an issue. Any prometheus exporter can bind to the 10.x (or 0.0.0.0) address and make metrics available to our prometheus nodes, even metrics relating to services that only run in the cloud realm. The problem is direct service monitoring (say a DNS request, or a DB connection) to the cloud-private IPs.
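
For instance, even a cloud-realm-only service can have its exporter listen on the prod side (a minimal sketch; the address and exporter port are hypothetical):

$ # the recursor itself only answers on 172.20.255.x, but a metrics
$ # exporter for it can bind to the host's 10.x leg for prometheus* to scrape
$ curl -s http://10.64.20.10:9199/metrics | head -5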

> Do we have a full list (current and future) of such services? I'm not sure that was taken into consideration while evaluating T336854 (and its parent).

The only one right now is the DNS recursor:

https://netbox.wikimedia.org/ipam/prefixes/666/ip-addresses/
https://netbox.wikimedia.org/ipam/prefixes/665/ip-addresses/

However, I'd argue it makes no sense to use public IPs, and expose services to the internet, if they only need to be used internally in the cloud. The concept of "internal only" cloud services makes sense to me, and I'd be wary of making that approach less useful and causing people to put things on public space instead. The existing widespread use of WMF public vlans for cloud services happened in a similar way, I think, which is a warning.

> Maybe an idea is to monitor it from inside WMCS using a lightweight process and only expose the monitoring status through the internet or through a cloud host (like old Icinga's NRPE, or a Prometheus exporter). Probably something to evaluate on a case-by-case basis.

Yeah, a cloud host (or two), with a 10.x prod-realm IP and one in 172.20.x cloud-private (the standard cloud host setup now), seems to make sense to me. That host can monitor all the private service endpoints and export the results via its 10.x interface to anything in WMF prod, or make them available via some kind of proxy that way.
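
A minimal sketch of that shape, assuming a blackbox exporter on such a dual-homed host (addresses are hypothetical, and the "dns_udp" module is an example that would need defining in the exporter's config):

$ # on the dual-homed cloud host: expose the prober on the prod-facing 10.x leg
$ prometheus-blackbox-exporter --web.listen-address=10.64.20.10:9115 &
$ # from a prometheus* host: have it probe the cloud-private DNS VIP on our behalf
$ curl -s 'http://10.64.20.10:9115/probe?module=dns_udp&target=172.20.255.1'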

> I can confirm that the plan is to ditch cloudmetrics completely and consolidate cloud monitoring to prometheus hosts.

What kind of time frame do you have in mind for this? Can we push things directly to the end state or do we need to migrate cloudmetrics hosts to the new network layout first and move things from cloudmetrics to prometheus later?

> The ns-recursor VIP uses Cloud VPS addressing, correct?

> No, it's a service hosted by the cloudservices bare-metal nodes. These were previously using public IP addressing in the WMF production realm for all services, and thus reachable from everywhere.

> Now the cloudservices nodes are connected to 10.x addressing (same as any other wmf private host basically), as well as having a leg in a new cloud-only network using 172.20.x.x. Their 10.x IPs can be polled by prometheus no problem. They also announce some public IPs in BGP, for services that need to be available from the internet, which are currently reachable from private WMF space (unless we put ACLs in to block it). They use other, 172.20.x private IPs to host services that are internal to cloud and don't need to be exposed to the outside.

Thank you for the explanation; I said it poorly with "cloud vps addressing", though yes, I had 172.20 in mind. Therefore my comment re: keeping Cloud VPS monitoring within Cloud VPS applies. Then we can check from prod prometheus that the whole Cloud VPS stack works with just a few high-level checks, to cover the meta-monitoring case.

> Maybe an idea is to monitor it from inside WMCS using a lightweight process and only expose the monitoring status through the internet or through a cloud host (like old Icinga's NRPE, or a Prometheus exporter). Probably something to evaluate on a case-by-case basis.

> Yeah, a cloud host (or two), with a 10.x prod-realm IP and one in 172.20.x cloud-private (the standard cloud host setup now), seems to make sense to me. That host can monitor all the private service endpoints and export the results via its 10.x interface to anything in WMF prod, or make them available via some kind of proxy that way.

Indeed, @ayounsi's idea sounds similar to what I suggested above. To clarify, what I'm suggesting wouldn't require cloud-private to be involved, only VMs.

> I can confirm that the plan is to ditch cloudmetrics completely and consolidate cloud monitoring to prometheus hosts.

> What kind of time frame do you have in mind for this? Can we push things directly to the end state or do we need to migrate cloudmetrics hosts to the new network layout first and move things from cloudmetrics to prometheus later?

It is getting to the top of my TODO list; I'm aiming to start working on it in the next 2-4 weeks. I'd go with pushing things to their end state.