Page MenuHomePhabricator

Figure out how to deal with security groups when rolling out metricsinfra scraping
Closed, ResolvedPublic

Description

Currently all projects monitored by the Prometheus instance in metricsinfra have manual security group rules to allow scraping. We'll need a way to automate managing those when rolling out to projects not managed by WMCS staff or active trusted volunteers.

I don't see an option in horizon to permit traffic from a security group on a separate project, but there are a few alternative options that come to my mind:

  • The current monitoring host was given a reserved address in T250206#6056467. Expand that to a larger block of reserved addresses, say a /29 or a /28, (so that we can add redundancy and scale beyond one box) and add a rule that permits traffic from that block to all projects and the defaults for new projects. If it's restricted to the specific node-exporter port, we'd need to tell everyone to manually add rules for scraping non-default targets. This requires a production root to create new Prometheus VMs.
  • Give the configuration tooling in metricsinfra powers to manage security groups in all projects. This is dangerous, but would let us automate most actions.
  • Add the missing feature to Neutron to let us add a security group rule that permits traffic from a security group on a separate project. Add a rule using that to all existing projects and to the defaults of any new projects.

Event Timeline

Neutron sg-to-sg firewalling is very much intended not to cross tenants. Anything in another tenant is meant to represent "internet" and potentially require a router to reach it (should we build out neutron properly, that could even be true).

We currently have a set of hooks that run when a project is created that do things like set up the default security group. We could tweak that to open basic node monitoring to metricsinfra's reserved IP range. Any deeper functionality being opt-in, seems totally reasonable for CloudVPS.

The place to add a hook to include that rule would be "keystonehooks" in puppet. Backfilling things can be scripted if need be.

nskaggs lowered the priority of this task from High to Medium.
nskaggs moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

The current monitoring host was given a reserved address in T250206#6056467. Expand that to a larger block of reserved addresses, say a /29 or a /28, (so that we can add redundancy and scale beyond one box) and add a rule that permits traffic from that block to all projects and the defaults for new projects. If it's restricted to the specific node-exporter port, we'd need to tell everyone to manually add rules for scraping non-default targets. This requires a production root to create new Prometheus VMs.

I think this is the realistic path forward. We won't really be able to reserve addresses in a block via the linked method but we can set up stable IPs for the monitoring hosts, at which point we can inject those IPs into existing and future security groups.

@Majavah I realize that much of the above requires openstack admin privs; let me know when you're ready to move and what you need and I can start creating things.

@Andrew Hello! I think T310799: Upgrade metricsinfra prometheus to bullseye gives me the reason I needed to finally get this done. In short:

  • I want to create a new pair of instances, metricsinfra-prometheus-2 and metricsinfra-prometheus-3, with the g3.cores4.ram8.disk20 flavor. Those new instances should have access to the node-exporter port on all VMs on all projects.
  • In addition to Neutron security groups, the new instances also need to be added to the profile::wmcs::instance::metricsinfra_prometheus_nodes hiera key.
  • The existing instance was given a reserved address in T250206#6056467. We can re-use that for one of the instances and create a new similarly reserved port for the other one. That gives us stable IPs that don't change the next time we're upgrading these VMs.
  • I believe we can update wmfkeystonehooks to create the needed security group rules by default and write some scripts to backfill it to existing projects.

Does the above sound good? Do you want to work on it or should I? I believe I have enough access for everything except for merging the needed puppet patches.

Decided to be bold and created the second address:

taavi@cloudcontrol1004 ~ $ os port create --network 7425e328-560c-4f00-8e99-706f3fb90bb4 --description "reserved address for monitoring" "metricsinfra reserved address #2"

This means we now have the following:

taavi@cloudcontrol1004 ~ $ os port list | grep metricsinfra
| 02099d16-afab-4f27-83fc-8ffe0663e421 | metricsinfra reserved address #2                   | fa:16:3e:f8:4f:36 | ip_address='172.16.6.65', subnet_id='a69bdfad-d7d2-4cfa-8231-3d6d3e0074c9'   | DOWN   |
| 2e67a486-e840-4800-b974-d9220f5e107a | metricsinfra reserved address #1                   | fa:16:3e:c1:4d:69 | ip_address='172.16.0.229', subnet_id='a69bdfad-d7d2-4cfa-8231-3d6d3e0074c9'  | ACTIVE |

Change 806418 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] openstack::keystone: provision new security group rules for metricsinfra

https://gerrit.wikimedia.org/r/806418

In T288108#8009416, @Majavah wrote:

Does the above sound good? Do you want to work on it or should I? I believe I have enough access for everything except for merging the needed puppet patches.

This mostly sounds good. I'm a little concerned about adding a default behavior that breaks multi-tenancy... I'm tempted to say that this should remain a manual opt-in for new and old projects alike. Other than adding additional work for project admins that doesn't interfere with your overall design does it?

It does a bit, but that can be worked around if needed. Previously I'd been working with the assumption that we do want to keep basic health metrics for all projects. If that's not something we want, that's fine, I just want that explicitely documented somewhere.

Change 806439 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] metricsinfra: add metricsinfra-prometheus-2

https://gerrit.wikimedia.org/r/806439

Andrew raised the priority of this task from Medium to High.Oct 28 2022, 2:39 PM

I am re-reading this task and think i should amend my 'little bit concerned' comment with "but monitoring sounds great so let's do it."

I don't remember exactly how default service groups are handled for new projects but I'm sure it's straightforward. I see the two service IPs (172.16.6.65 and 172.16.0.229) -- which exact port(s) need to be opened?

I am re-reading this task and think i should amend my 'little bit concerned' comment with "but monitoring sounds great so let's do it."

I don't remember exactly how default service groups are handled for new projects but I'm sure it's straightforward. I see the two service IPs (172.16.6.65 and 172.16.0.229) -- which exact port(s) need to be opened?

For node-exporter (what I'd like to monitor by default) it's 9100. https://gerrit.wikimedia.org/r/806418 adds it to the defaults, but someone still needs to backfill it to all existing projects.

Change 806418 merged by Andrew Bogott:

[operations/puppet@production] openstack::keystone: provision new security group rules for metricsinfra

https://gerrit.wikimedia.org/r/806418

Change 850539 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] wmfkeystonehooks: fix a copy/paste error

https://gerrit.wikimedia.org/r/850539

Change 850540 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] wmfkeystonehooks: convert a couple of config opts to StrOpt

https://gerrit.wikimedia.org/r/850540

Change 850539 merged by Andrew Bogott:

[operations/puppet@production] wmfkeystonehooks: fix a copy/paste error

https://gerrit.wikimedia.org/r/850539

Change 850592 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] add wmcs-securitygroup-backfill

https://gerrit.wikimedia.org/r/850592

Change 850540 merged by Andrew Bogott:

[operations/puppet@production] wmfkeystonehooks: convert a couple of config opts to StrOpt

https://gerrit.wikimedia.org/r/850540

Mentioned in SAL (#wikimedia-cloud) [2022-11-08T11:17:48Z] <taavi> backfilling security groups for metricsinfra access on all projects T288108

Change 850592 merged by Andrew Bogott:

[operations/puppet@production] add wmcs-securitygroup-backfill

https://gerrit.wikimedia.org/r/850592