Figure out how to deal with security groups when rolling out metricsinfra scraping
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• taavi
	Aug 4 2021, 2:18 PM

Description

Currently all projects monitored by the Prometheus instance in metricsinfra have manual security group rules to allow scraping. We'll need a way to automate managing those when rolling out to projects not managed by WMCS staff or active trusted volunteers.

I don't see an option in horizon to permit traffic from a security group on a separate project, but there are a few alternative options that come to my mind:

The current monitoring host was given a reserved address in T250206#6056467. Expand that to a larger block of reserved addresses, say a /29 or a /28, (so that we can add redundancy and scale beyond one box) and add a rule that permits traffic from that block to all projects and the defaults for new projects. If it's restricted to the specific node-exporter port, we'd need to tell everyone to manually add rules for scraping non-default targets. This requires a production root to create new Prometheus VMs.
Give the configuration tooling in metricsinfra powers to manage security groups in all projects. This is dangerous, but would let us automate most actions.
Add the missing feature to Neutron to let us add a security group rule that permits traffic from a security group on a separate project. Add a rule using that to all existing projects and to the defaults of any new projects.

Details

Subject	Repo	Branch	Lines +/-
add wmcs-securitygroup-backfill	operations/puppet	production	+108 -0
wmfkeystonehooks: convert a couple of config opts to StrOpt	operations/puppet	production	+8 -13
wmfkeystonehooks: fix a copy/paste error	operations/puppet	production	+3 -3
openstack::keystone: provision new security group rules for metricsinfra	operations/puppet	production	+318 -149

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	fgiunchedi	T205862 Expand modern metrics infrastructure coverage (2018-19 Q2 goal)
Resolved	colewhite	T183454 Deprovision Diamond collectors no longer in use
Resolved	• MoritzMuehlenhoff	T210993 Deprecate Diamond collectors in Cloud VPS
Resolved	• taavi	T336774 Current status of cloudmetrics and its components
Resolved	• taavi	T326266 Remove the WMCS statsd/Graphite service
Open	dcaro	T313444 Streamline WMCS Alerting and Paging
Resolved	• taavi	T317032 Remove Diamond?
Resolved	• taavi	T264920 Grafana "cloud-vps-project-board" needs to be migrated from Graphite to Prometheus
Open	None	T194333 [Epic] Provide logging/metrics/monitoring SaaS for Cloud VPS tenants
Resolved	• taavi	T266050 Build Prometheus service for use by all Cloud VPS projects and their instances
Resolved	• taavi	T310799 Upgrade metricsinfra prometheus to bullseye
Resolved	Andrew	T288108 Figure out how to deal with security groups when rolling out metricsinfra scraping

Event Timeline

• taavi created this task.Aug 4 2021, 2:18 PM

• taavi updated the task description. (Show Details)

• taavi updated the task description. (Show Details)Aug 4 2021, 2:22 PM

Neutron sg-to-sg firewalling is very much intended not to cross tenants. Anything in another tenant is meant to represent "internet" and potentially require a router to reach it (should we build out neutron properly, that could even be true).

We currently have a set of hooks that run when a project is created that do things like set up the default security group. We could tweak that to open basic node monitoring to metricsinfra's reserved IP range. Any deeper functionality being opt-in, seems totally reasonable for CloudVPS.

The place to add a hook to include that rule would be "keystonehooks" in puppet. Backfilling things can be scripted if need be.

Andrew claimed this task.Aug 10 2021, 4:18 PM

• nskaggs triaged this task as High priority.Aug 10 2021, 4:18 PM

• nskaggs lowered the priority of this task from High to Medium.

• nskaggs moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

The current monitoring host was given a reserved address in T250206#6056467. Expand that to a larger block of reserved addresses, say a /29 or a /28, (so that we can add redundancy and scale beyond one box) and add a rule that permits traffic from that block to all projects and the defaults for new projects. If it's restricted to the specific node-exporter port, we'd need to tell everyone to manually add rules for scraping non-default targets. This requires a production root to create new Prometheus VMs.

I think this is the realistic path forward. We won't really be able to reserve addresses in a block via the linked method but we can set up stable IPs for the monitoring hosts, at which point we can inject those IPs into existing and future security groups.

@Majavah I realize that much of the above requires openstack admin privs; let me know when you're ready to move and what you need and I can start creating things.

Andrew reassigned this task from Andrew to • taavi.Oct 15 2021, 7:55 PM

@Andrew Hello! I think T310799: Upgrade metricsinfra prometheus to bullseye gives me the reason I needed to finally get this done. In short:

I want to create a new pair of instances, metricsinfra-prometheus-2 and metricsinfra-prometheus-3, with the g3.cores4.ram8.disk20 flavor. Those new instances should have access to the node-exporter port on all VMs on all projects.
In addition to Neutron security groups, the new instances also need to be added to the profile::wmcs::instance::metricsinfra_prometheus_nodes hiera key.
The existing instance was given a reserved address in T250206#6056467. We can re-use that for one of the instances and create a new similarly reserved port for the other one. That gives us stable IPs that don't change the next time we're upgrading these VMs.
I believe we can update wmfkeystonehooks to create the needed security group rules by default and write some scripts to backfill it to existing projects.

Does the above sound good? Do you want to work on it or should I? I believe I have enough access for everything except for merging the needed puppet patches.

• taavi added a parent task: T310799: Upgrade metricsinfra prometheus to bullseye.Jun 16 2022, 7:15 PM

Decided to be bold and created the second address:

taavi@cloudcontrol1004 ~ $ os port create --network 7425e328-560c-4f00-8e99-706f3fb90bb4 --description "reserved address for monitoring" "metricsinfra reserved address #2"

This means we now have the following:

taavi@cloudcontrol1004 ~ $ os port list | grep metricsinfra
| 02099d16-afab-4f27-83fc-8ffe0663e421 | metricsinfra reserved address #2                   | fa:16:3e:f8:4f:36 | ip_address='172.16.6.65', subnet_id='a69bdfad-d7d2-4cfa-8231-3d6d3e0074c9'   | DOWN   |
| 2e67a486-e840-4800-b974-d9220f5e107a | metricsinfra reserved address #1                   | fa:16:3e:c1:4d:69 | ip_address='172.16.0.229', subnet_id='a69bdfad-d7d2-4cfa-8231-3d6d3e0074c9'  | ACTIVE |

Change 806418 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] openstack::keystone: provision new security group rules for metricsinfra

https://gerrit.wikimedia.org/r/806418

gerritbot added a project: Patch-For-Review.Jun 17 2022, 1:35 PM

In T288108#8009416, @Majavah wrote:

Does the above sound good? Do you want to work on it or should I? I believe I have enough access for everything except for merging the needed puppet patches.

This mostly sounds good. I'm a little concerned about adding a default behavior that breaks multi-tenancy... I'm tempted to say that this should remain a manual opt-in for new and old projects alike. Other than adding additional work for project admins that doesn't interfere with your overall design does it?

It does a bit, but that can be worked around if needed. Previously I'd been working with the assumption that we do want to keep basic health metrics for all projects. If that's not something we want, that's fine, I just want that explicitely documented somewhere.

Change 806439 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] metricsinfra: add metricsinfra-prometheus-2

https://gerrit.wikimedia.org/r/806439

I am re-reading this task and think i should amend my 'little bit concerned' comment with "but monitoring sounds great so let's do it."

I don't remember exactly how default service groups are handled for new projects but I'm sure it's straightforward. I see the two service IPs (172.16.6.65 and 172.16.0.229) -- which exact port(s) need to be opened?

In T288108#8352905, @Andrew wrote:

I am re-reading this task and think i should amend my 'little bit concerned' comment with "but monitoring sounds great so let's do it."

I don't remember exactly how default service groups are handled for new projects but I'm sure it's straightforward. I see the two service IPs (172.16.6.65 and 172.16.0.229) -- which exact port(s) need to be opened?

For node-exporter (what I'd like to monitor by default) it's 9100. https://gerrit.wikimedia.org/r/806418 adds it to the defaults, but someone still needs to backfill it to all existing projects.

Change 806418 merged by Andrew Bogott:

[operations/puppet@production] openstack::keystone: provision new security group rules for metricsinfra

https://gerrit.wikimedia.org/r/806418

Change 850539 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] wmfkeystonehooks: fix a copy/paste error

https://gerrit.wikimedia.org/r/850539

Change 850540 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] wmfkeystonehooks: convert a couple of config opts to StrOpt

https://gerrit.wikimedia.org/r/850540

Change 850539 merged by Andrew Bogott:

[operations/puppet@production] wmfkeystonehooks: fix a copy/paste error

https://gerrit.wikimedia.org/r/850539

Change 850592 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] add wmcs-securitygroup-backfill

https://gerrit.wikimedia.org/r/850592

Change 850540 merged by Andrew Bogott:

[operations/puppet@production] wmfkeystonehooks: convert a couple of config opts to StrOpt

https://gerrit.wikimedia.org/r/850540

Mentioned in SAL (#wikimedia-cloud) [2022-11-08T11:17:48Z] <taavi> backfilling security groups for metricsinfra access on all projects T288108

Change 850592 merged by Andrew Bogott: