
Move mgmt SSH checks from Icinga to Prometheus/Alertmanager
Closed, Resolved · Public

Description

See the parent task for more context; this one will track moving SSH mgmt checks from Icinga to Prometheus/Alertmanager. cc @Volans

Event Timeline

From what I gathered from @fgiunchedi the question is how to get the list of all available MGMT FQDNs exposed to blackbox exporter.
Some general comments/questions:

  • At which point should we start monitoring the MGMT interface? I think after the sre.hosts.provision cookbook has been run.
  • Do we need to monitor hosts that are Decommissioning in Netbox? I.e. after they have been decommissioned but are still racked and potentially re-usable. I think so, but no strong opinion.
  • The workflow for decommissioning or offlining a host (based on the above) should include the update of the source of truth for Alertmanager to prevent false positives.

As mentioned in IRC I think that there are various options here:

  1. Call the Netbox APIs with pynetbox and generate the list of available FQDNs, filtering them based on the agreed parameters.
  2. Use the already existing auto-generated DNS repository based on Netbox data. Anything committed there is (barring potential deployment issues) available in the DNS. The only problem is that hosts being set up could have their DNS in place long before their mgmt is actually reachable (which only happens once the sre.hosts.provision cookbook has been run). So those would need to be excluded somehow, maybe by checking their status in Netbox.
  3. Use the already existing (although still partially experimental) hiera repository auto-generated from Netbox data. We could add some site data with the list of all the mgmt FQDNs, split by site. This approach would have the benefit of allowing us to filter the data based on Netbox status, preventing false positives for hosts about to be provisioned/offlined.

cc @jbond
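
To make option 1 a bit more concrete, here is a rough pynetbox sketch; the Netbox URL, token handling, status filters and the <host>.mgmt.<site>.wmnet naming convention are assumptions to be confirmed, not a final design:

```python
# Hypothetical sketch of option 1: list mgmt FQDNs straight from the Netbox API.
import pynetbox

nb = pynetbox.api("https://netbox.example.org", token="REDACTED")  # placeholder URL/token

def mgmt_fqdns(statuses=("active", "staged")):
    """Return mgmt FQDNs for devices in the given Netbox statuses."""
    fqdns = []
    for device in nb.dcim.devices.filter(status=list(statuses)):
        # Assumes mgmt interfaces resolve as <host>.mgmt.<site>.wmnet
        if device.name and device.site:
            fqdns.append(f"{device.name}.mgmt.{device.site.slug}.wmnet")
    return sorted(fqdns)
```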

From what I gathered from @fgiunchedi the question is how to get the list of all available MGMT FQDNs exposed to blackbox exporter.

That's correct, yes; more specifically we're looking to generate a list of targets for Prometheus to pick up. For each target, Prometheus will then ask the blackbox exporter (in this case running on host(s) with access to the mgmt network) to probe for the SSH banner.
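
For context, a minimal sketch (in Python, not the exporter's actual implementation) of what an SSH banner probe boils down to; the hostname below is illustrative:

```python
# Illustrative only: the blackbox exporter's ssh_banner-style module does roughly this.
import socket

def probe_ssh_banner(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            banner = sock.recv(256)
    except OSError:
        return False
    # Per RFC 4253 the server sends an identification string starting with "SSH-".
    return banner.startswith(b"SSH-")

# e.g. probe_ssh_banner("example1001.mgmt.eqiad.wmnet")  # hypothetical host
```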

Some general comments/questions:

  • At which point should we start monitoring the MGMT interface? I think after the sre.hosts.provision cookbook has been run.

SGTM

  • Do we need to monitor hosts that are Decommissioning in Netbox? I.e. after they have been decommissioned but are still racked and potentially re-usable. I think so, but no strong opinion.

Seems reasonable to me, as a first iteration anyway.

  • The workflow for decommissioning or offlining a host (based on the above) should include the update of the source of truth for Alertmanager to prevent false positives.

+1

As mentioned in IRC I think that there are various options here:

  1. Call the Netbox APIs with pynetbox and generate the list of available FQDNs, filtering them based on the agreed parameters.
  2. Use the already existing auto-generated DNS repository based on Netbox data. Anything committed there is (barring potential deployment issues) available in the DNS. The only problem is that hosts being set up could have their DNS in place long before their mgmt is actually reachable (which only happens once the sre.hosts.provision cookbook has been run). So those would need to be excluded somehow, maybe by checking their status in Netbox.
  3. Use the already existing (although still partially experimental) hiera repository auto-generated from Netbox data. We could add some site data with the list of all the mgmt FQDNs, split by site. This approach would have the benefit of allowing us to filter the data based on Netbox status, preventing false positives for hosts about to be provisioned/offlined.

Thank you for outlining these solutions. From what I know now, calling the Netbox API seems the most flexible / future-proof solution. The hiera solution I think would also work, and has the advantage of exposing reusable data, as in "the list of mgmt hosts that should be available for use". The other advantage I see in hiera is that the update of data on the Prometheus side then happens on a puppet run, and we don't need timers, scripts, etc. On balance I'm leaning a bit more towards the hiera solution (although I'm fine with either!). I'm also OK with the "partially experimental" bit and happy to give feedback/experiment.
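
To illustrate the consumption side of the hiera option, a hedged sketch of turning a "site -> list of mgmt FQDNs" mapping into a Prometheus file_sd target file; the data shape, label and path are illustrative, not the final design:

```python
# Hypothetical sketch: render per-site mgmt FQDNs into Prometheus file_sd JSON.
import json

netbox_mgmt = {  # shape the netbox-hiera export might expose (illustrative)
    "eqiad": ["example1001.mgmt.eqiad.wmnet"],
    "codfw": ["example2001.mgmt.codfw.wmnet"],
}

def write_targets(site: str, path: str) -> None:
    targets = [{
        "targets": [f"{host}:22" for host in netbox_mgmt.get(site, [])],
        "labels": {"module": "ssh_banner"},  # hypothetical label
    }]
    with open(path, "w") as fh:
        json.dump(targets, fh, indent=2)

# write_targets("eqiad", "/srv/prometheus/ops/targets/mgmt_eqiad.json")  # illustrative path
```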

As mentioned in IRC I think that there are various options here:

I'd also prefer (3) over (1), to prevent the spread of scripts that query Netbox directly: the more such scripts there are, the more difficult Netbox upgrades will become.

Depending on how frequently they run, they might also contribute to overwhelming Netbox if we have many blackbox exporters.

Do we need to monitor hosts that are Decommissioning in Netbox?

I'm a bit on the fence here. On one hand, if it's running on our network it makes sense to monitor it.
On the other hand, most of those devices will not be used, and receiving an alert (and working on it) for such a device might seem like a waste of time; it could, for example, be looked into only if the host is being re-purposed. Maybe a middle ground would be to monitor them, but not send alerts?

As a fourth option I think we could use the new wmflib::resource::import/export functions. I think the difference between using the hiera solution and relying on exported resources comes down to where (in puppet) we want to define the monitoring configuration.

We currently set up monitoring at the host level, i.e. each host defines (and exports) its own checks, which are then realised on the alerting host at the next puppet run. This has the disadvantage that changes take longer to propagate; however it has the advantage that alerts are defined at the host level, which means that operators can easily change or override settings for a specific host, profile, role etc. by setting some hiera value. Moving to one of the options above means that the monitoring configuration all becomes centralised, so changing a setting for one specific host, role etc. becomes a bit more tricky. This could of course be seen as an advantage, as it ensures consistency, but experience suggests to me that there will always be some outlier or instance where a tweak or change will be needed for a specific host or group, which could lead to a large profile with a bunch of if statements to handle the 1% of outliers. For this specific check I'm not sure there is much advantage either way, but it's worth considering as a more general point.

Specific to the hiera solution: currently this is not really used by anything in production and there is still some work required to make it a bit more of a first-class citizen. None of this is blocking, but it's worth pointing out that updates to the hiera data are manual and currently no one knows to run them; as such we should:

  • add hooks to the (de)commission/dns cookbooks to also sync hiera data
  • add alerting for when the hiera data doesn't match the git repo

We should also think about what metadata the checks will need. At the very least we will need some state to indicate whether checks should be configured, but we may also want data to indicate whether the service is paging and who to page. These are things that, from a puppet PoV, would probably be easier to do using exported resources at the host level. However, we may also decide that this information is no longer configured in puppet and is instead all configured via Alertmanager and the alerts repo.


Do we need to monitor hosts that are Decommissioning in Netbox?

Currently we configure monitoring for anything running puppet, which I believe covers at least some of the decommissioning states.

Do we need to monitor hosts that are Decommissioning in Netbox?

Currently we configure monitoring for anything running puppet, which I believe covers at least some of the decommissioning states.

One of the first things the decommission cookbook does is remove the host from puppet, though.

Ideally I think we should monitor the management system from when the host is racked and provisioned with an IP (after the sre.hosts.provision cookbook has run) until it gets off-lined and unracked (right before the offline script in Netbox is run).
I agree that failures during the out-of-production periods should have a lower severity than those in production.
Hence I don't see too many advantages in driving this via Puppet's logic of attaching it to a host, as the mgmt interface outlives the host definition in puppet both before and after.
As for easily disabling checks on a per-host basis, I think we could drive that via Netbox data (status or other), for example if using the netbox-hiera approach or the Netbox APIs one.

As a fourth option I think we could use the new wmflib::resource::import/export functions. I think the difference between using the hiera solution and relying on exported resources comes down to where (in puppet) we want to define the monitoring configuration.

We currently set up monitoring at the host level, i.e. each host defines (and exports) its own checks, which are then realised on the alerting host at the next puppet run. This has the disadvantage that changes take longer to propagate; however it has the advantage that alerts are defined at the host level, which means that operators can easily change or override settings for a specific host, profile, role etc. by setting some hiera value. Moving to one of the options above means that the monitoring configuration all becomes centralised, so changing a setting for one specific host, role etc. becomes a bit more tricky. This could of course be seen as an advantage, as it ensures consistency, but experience suggests to me that there will always be some outlier or instance where a tweak or change will be needed for a specific host or group, which could lead to a large profile with a bunch of if statements to handle the 1% of outliers. For this specific check I'm not sure there is much advantage either way, but it's worth considering as a more general point.

Agreed on the general point on ownership and "overrides" for alerts depending on the owner of the host(s) in question. As the owner of many systems using exported resources, I'm not a fan of exported resources :( Mostly because of the dependencies between puppet runs at deploy time and being unable to preview changes with PCC.

Specific to the hiera solution: currently this is not really used by anything in production and there is still some work required to make it a bit more of a first-class citizen. None of this is blocking, but it's worth pointing out that updates to the hiera data are manual and currently no one knows to run them; as such we should:

  • add hooks to the (de)commission/dns cookbooks to also sync hiera data
  • add alerting for when the hiera data doesn't match the git repo

Thank you for the context, that's super useful! Specifically, this mgmt work isn't super urgent, but I'd like to move it forward (and remove some flapping IRC noise when mgmt ssh fails temporarily). What are the tasks I can look at?

We should also think about what metadata the checks will need. At the very least we will need some state to indicate whether checks should be configured, but we may also want data to indicate whether the service is paging and who to page. These are things that, from a puppet PoV, would probably be easier to do using exported resources at the host level. However, we may also decide that this information is no longer configured in puppet and is instead all configured via Alertmanager and the alerts repo.

For the mgmt checks I think the checks and alerts are pretty uniform though (as in, mgmt ssh fails "for a little bit" -> open a task in the relevant dcops project) and the metadata can live in alerts.git with a single alert (a few at most).
Having said that, I'm +1 on having the discussions you are pointing to, perhaps not in the scope of this task

As for easily disabling checks on a per-host basis, I think we could drive that via Netbox data (status or other), for example if using the netbox-hiera approach or the Netbox APIs one.

Just an FYI: right now a host can only look up its own host data. If we want a host, e.g. Prometheus, to pull in the data for all hosts then we would need to make some changes; this is not hard to do but would explode the size of the catalogue. Of course we could add Prometheus to the ACL for puppetdb-api and let it download the netbox data (https://puppetdb-api.discovery.wmnet:8443/hiera_export.HieraExport) directly.
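
For completeness, a minimal sketch of the "download directly" idea, assuming the endpoint returns JSON and accepts the host's puppet client certificate; the certificate paths below are illustrative:

```python
# Hypothetical sketch: fetch the netbox hiera export via the puppetdb-api endpoint.
import requests

resp = requests.get(
    "https://puppetdb-api.discovery.wmnet:8443/hiera_export.HieraExport",
    cert=("/var/lib/puppet/ssl/certs/host.pem",        # illustrative client cert
          "/var/lib/puppet/ssl/private_keys/host.pem"),
    verify="/var/lib/puppet/ssl/certs/ca.pem",         # illustrative CA bundle
    timeout=30,
)
resp.raise_for_status()
netbox_data = resp.json()
```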

I'm not a fan of exported resources :( Mostly because of the dependencies between puppet runs at deploy time and being unable to preview changes with PCC.

I hear you; I flip between thinking they are good [as long as you accept the delayed convergence] and thinking they are terrible. For what it's worth, I have been trying to think about how to make this better for PCC, and once an exported resource is in use in production you should be able to see the results on the host that realises it. Getting data on the host that exports it is still a work in progress.

What are the tasks I can look at?

I just created T310639.

For the mgmt checks I think the checks and alerts are pretty uniform

Agree

Having said that, I'm +1 on having the discussions you are pointing to, perhaps not in the scope of this task

Also agree, please loop me into any discussions :)

I'll be resuming work on this (i.e. checking mgmt ssh from blackbox exporter and not icinga), so to recap:

  • There seems to be consensus that using the hiera data exported from Netbox is the way to go
  • Said hiera repository is in production and should be augmented with at least some monitoring (T310639)
  • The sre.puppet.sync-netbox-hiera cookbook should be hooked into the provisioning workflow/cookbook(s) so updates happen in a timely fashion.
  • The decom/unrack workflow should also include running the sync-netbox-hiera cookbook. Currently there's no cookbook for said workflow, but there should be one.
  • The data usable for this purpose is essentially a site -> list of mgmt hosts mapping (or equivalent, as long as we have a list of mgmt hostnames per site). Therefore some changes to the cookbook and/or the hiera_export.HieraExport custom Netbox script are needed.

@Volans @jbond does the above recap look correct to you? anything I've missed? thank you!

That sounds like a good summary; the only thing missing is to also hook the hiera cookbook in around offline time, when we get rid of hosts.
Ideally during the https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Decommissioned_-%3E_Unracked step.
We don't currently have an offline cookbook, but we could create one that runs some of those steps, like the offline script in Netbox and the dns cookbook, and also plugs in the hiera one.

That sounds like a good summary; the only thing missing is to also hook the hiera cookbook in around offline time, when we get rid of hosts.
Ideally during the https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Decommissioned_-%3E_Unracked step.
We don't currently have an offline cookbook, but we could create one that runs some of those steps, like the offline script in Netbox and the dns cookbook, and also plugs in the hiera one.

Thank you @Volans, I've amended the list above to include decom/offline workflow.

WRT the implementation, I was imagining another Script subclass inside hiera_export tailored for mgmt data, and then sync-netbox-hiera would call both HieraExport and the new class

I think the current script is all that's needed; it was already designed to later hold more info. The current output starts with:

{"hosts": ...

So we can easily add another top-level key for common (or similar name). @jbond any thoughts on this?

P.S. If you do any work there I suggest basing it on top of https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/810955 as that will be merged soon; I just need to find the right time, as it needs to go in sync with a puppet patch.
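
To make the "additional top-level key" idea concrete, a hedged sketch of the structure, expressed here as a Python dict; the "common" and "mgmt" key names are placeholders pending the actual patch:

```python
# Illustrative shape only: keep the existing per-host data under "hosts" and add a
# top-level section for site-wide data such as the per-site mgmt FQDN lists.
hiera_export = {
    "hosts": {
        "example1001": {"...": "existing per-host data"},
    },
    "common": {
        "mgmt": {
            "eqiad": ["example1001.mgmt.eqiad.wmnet"],
        },
    },
}
```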

Change 817739 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/software/netbox-extras@master] customscripts: export 'mgmt' entries from hiera_export

https://gerrit.wikimedia.org/r/817739

So we can easily add another top-level key for common (or similar name). @jbond any thoughts on this?

+1

Change 817739 merged by Filippo Giunchedi:

[operations/software/netbox-extras@master] customscripts: export 'mgmt' entries from hiera_export

https://gerrit.wikimedia.org/r/817739

Change 838139 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/cookbooks@master] sre: write netbox-hiera common.yaml with mgmt data

https://gerrit.wikimedia.org/r/838139

Change 838144 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] C:puppetmaster: drop hieradata from the netbox common path

https://gerrit.wikimedia.org/r/838144

Change 838139 merged by Filippo Giunchedi:

[operations/cookbooks@master] sre.puppet.sync-netbox-hiera: write netbox-hiera common.yaml with mgmt data

https://gerrit.wikimedia.org/r/838139

Change 838144 merged by Jbond:

[operations/puppet@production] C:puppetmaster: drop hieradata from the netbox common path

https://gerrit.wikimedia.org/r/838144

Change 838159 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:netbox::data: add profile to load common netbox data

https://gerrit.wikimedia.org/r/838159

Change 838161 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/cookbooks@master] sre.puppet.sync-netbox-hiera: bump timeout to cater for longer script timeout

https://gerrit.wikimedia.org/r/838161

Change 838161 merged by Filippo Giunchedi:

[operations/cookbooks@master] sre.puppet.sync-netbox-hiera: bump timeout to cater for longer script timeout

https://gerrit.wikimedia.org/r/838161

Change 842359 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/software/netbox-extras@master] customscripts: exclude decommissioning hosts from mgmt data

https://gerrit.wikimedia.org/r/842359

Change 845528 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: move service_catalog_targets under ::targets

https://gerrit.wikimedia.org/r/845528

Change 845529 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: probe SSH on mgmt network

https://gerrit.wikimedia.org/r/845529

Change 845528 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: move service_catalog_targets under ::targets

https://gerrit.wikimedia.org/r/845528

Change 842359 merged by Filippo Giunchedi:

[operations/software/netbox-extras@master] customscripts: exclude decommissioning hosts from mgmt data

https://gerrit.wikimedia.org/r/842359

Change 853938 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/homer/public@master] mr: allow prometheus_group SSH access to mgmt

https://gerrit.wikimedia.org/r/853938

Change 853938 merged by Filippo Giunchedi:

[operations/homer/public@master] mr: allow prometheus_group SSH access to mgmt

https://gerrit.wikimedia.org/r/853938

Change 845529 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: probe SSH on mgmt network

https://gerrit.wikimedia.org/r/845529

Change 854037 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/software/netbox-extras@master] hiera_export: add tenant information to mgmt

https://gerrit.wikimedia.org/r/854037

Change 854037 merged by Filippo Giunchedi:

[operations/software/netbox-extras@master] hiera_export: skip mgmt for non-production tenants

https://gerrit.wikimedia.org/r/854037

Change 858363 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] team-dcops: add alerts for mgmt down

https://gerrit.wikimedia.org/r/858363

Change 858363 merged by Filippo Giunchedi:

[operations/alerts@master] team-dcops: add alerts for mgmt down

https://gerrit.wikimedia.org/r/858363

We're basically ready to go and start opening tasks; however we should make sure hiera data is synced as part of the decom cookbook, or we'll end up with false positives for decom'd hosts when their mgmt interface becomes unreachable. AFAICT that's not yet the case, @Volans?

Correct, and not only in the decom one. That's https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/804575 and depends on @jbond, as he sent the email to SREs (see the one with subject "Changes to the dns cookbook") :)

Change 860525 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] dcops: switch mgmt down alerts to open tasks

https://gerrit.wikimedia.org/r/860525

With https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/804575 merged, we can start opening tasks when mgmt has been unresponsive to ssh for more than 12h. The alert will open tasks in the correct dcops project in Phab. @wiki_willy @Papaul @Cmjohnson @Jclark-ctr let me know what you think!

@fgiunchedi sounds great, but a quick question: will the ticket go directly to dcops, or would it start with the team that is responsible for that service first?

@fgiunchedi sounds great, but a quick question: will the ticket go directly to dcops, or would it start with the team that is responsible for that service first?

Yes, tasks will start at #dcops-<site>, though of course they can then be triaged/reassigned as needed.

Change 860572 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] icinga: decom mgmt monitoring

https://gerrit.wikimedia.org/r/860572

Change 860573 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] icinga: move mgmt_parents to icinga

https://gerrit.wikimedia.org/r/860573

Change 860574 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: remove mgmt_contactgroups

https://gerrit.wikimedia.org/r/860574

Change 860525 merged by Filippo Giunchedi:

[operations/alerts@master] dcops: switch mgmt down alerts to open tasks

https://gerrit.wikimedia.org/r/860525

Change is live, I'm expecting tasks to start being opened tomorrow!

Thanks @fgiunchedi !


Change 860572 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: decom mgmt monitoring

https://gerrit.wikimedia.org/r/860572

Change 860573 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: move mgmt_parents to icinga

https://gerrit.wikimedia.org/r/860573

Change 860574 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: remove mgmt_contactgroups

https://gerrit.wikimedia.org/r/860574

fgiunchedi claimed this task.

I have decom'd the Icinga mgmt SSH/DNS checks and cleaned up the now-unused variables. I'm happy to report that, after a fair few changes and some coordination, we're done with this! Thanks to all who helped. cc @Volans @ayounsi