
VRTS monitoring- re-activate new blackbox check (was: 'ProbeDown')
Closed, ResolvedPublic

Description

This is a ticket that was automatically created by a new type of monitoring.

We got these first alerts while still testing it. They were false positives: not actual service problems, but bugs in the config for the monitoring check.

We are still working on that and have agreed to deactivate it for the inspiration week. It will be re-enabled and used for production thereafter.

The fact that this ticket was auto-created is also the nice part about it, as it proves that part is working.

Now reusing this ticket as a more general VRTS monitoring ticket to finish that work.

Original automatic ticket text is below:


Common information

  • alertname: ProbeDown
  • instance: otrs1001:1443
  • job: probes/custom
  • prometheus: ops
  • severity: critical
  • site: eqiad
  • source: prometheus
  • team: serviceops-collab

Firing alerts



Event Timeline

Restricted Application added a subscriber: Aklapper.

I recently fixed the prometheus::blackbox::check::http definition to do the right thing and honor the team label. This is the result (i.e. working as expected).

At any rate, please check the alerts for gitlab/vrts. AFAICT the probes are trying to connect to addresses that refuse connections (from the logs links above).

Dzahn subscribed.

It was intentional, just to see if it works (and I was not sure if there was a way to test those beforehand).

The expectation was that we would get automatically created tickets, email and IRC notifications if it fails, though.

description: gitlab1004:443 failed

We configured the checks to test gitlab.wikimedia.org, not gitlab1004:443.

I added silences in alerts.wikimedia.org for all of these. silence feels like disabling notifications though. What I really want is "ACK" or scheduled downtime. But it seems like those concepts don't exist anymore.

description: gitlab1004:443 failed

We configured the checks to test gitlab.wikimedia.org, not gitlab1004:443.

I have clarified a bit the wording at https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown on what the labels mean, in this case the probe to check "gitlab.wikimedia.org" (the http vhost) is run against "gitlab1004:443". At any rate I believe the fix by @Jelto here should do the right thing: https://gerrit.wikimedia.org/r/c/operations/puppet/+/811882

I added silences in alerts.wikimedia.org for all of these. silence feels like disabling notifications though. What I really want is "ACK" or scheduled downtime. But it seems like those concepts don't exist anymore.

I might be missing something, but scheduled downtime also effectively disables notifications? To see the suppressed alerts you can filter by @state=suppressed in the dashboard. I have also expanded https://wikitech.wikimedia.org/wiki/Alertmanager#Silences_%26_acknowledgements with this information, feel free to expand more and I'm happy to expand more as needed!

Dzahn renamed this task from ProbeDown to monitoring / VRTS - new blackbox check reports 'ProbeDown'.Jul 7 2022, 9:29 PM

I have clarified a bit the wording at https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown on what the labels mean

ACK, thank you!

At any rate I believe the fix by @Jelto here should do the right thing: https://gerrit.wikimedia.org/r/c/operations/puppet/+/811882

Indeed, it looks good :) thanks Jelto. And there were a couple more changes that we should have linked to a "monitoring for gitlab" ticket.

I did that here just now: T275170#8066973

And about gitlab monitoring I have one more follow-up at https://gerrit.wikimedia.org/r/812427 which is about avoiding false positives from gitlab-replica.

That being said, this automatic ticket is about otrs1001 and not gitlab. Those are two different things we recently added.

This ticket is also a duplicate of T312609 which I closed with comments at T312609#8066741.

Let me also add the history for VRTS (otrs1001) monitoring:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/810087 - vrts: add promtheus blackbox monitoring

https://gerrit.wikimedia.org/r/c/operations/puppet/+/811985 - P:vrts: fix probe port - we switched the port to 1443 to listen to envoy

https://gerrit.wikimedia.org/r/c/operations/puppet/+/812144 - vrts/prometheus: set force_tls to true for check on port 1443 - we used force_tls to use https and not http even though it's not port 443 (wrong protocol)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/812152 - vrts/prometheus: fix monitored path, avoid redirect - it did not work because we now got a redirect and had not told the check to follow redirects, so I fixed the path

https://gerrit.wikimedia.org/r/c/operations/puppet/+/812158 - vrts/prometheus: comment out broken check - it still would not work because I still had the wrong path, so I decided to just disable it for now by commenting it out
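
To illustrate the port/protocol issue from the history above, here is a minimal Python sketch of how a probe URL might be derived. The function name `probe_url`, the rule that only port 443 implies HTTPS by default, and the example path are assumptions for illustration; the real logic lives in the Puppet `prometheus::blackbox::check::http` definition and may differ.

```python
def probe_url(host: str, port: int, path: str = "/", force_tls: bool = False) -> str:
    """Build the URL a probe would request (hypothetical helper).

    Assumption: without force_tls, only port 443 is treated as HTTPS,
    so a TLS listener on a non-standard port such as 1443 needs force_tls.
    """
    scheme = "https" if (port == 443 or force_tls) else "http"
    return f"{scheme}://{host}:{port}{path}"


# "/example/path" is a placeholder; the real monitored path is not given in this ticket.
print(probe_url("otrs1001", 1443, "/example/path"))
print(probe_url("otrs1001", 1443, "/example/path", force_tls=True))
```

Under these assumptions the first call yields a plain `http://` URL against the TLS port (the "wrong protocol" case), and the second yields the `https://` URL that the force_tls change was meant to produce.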

I added silences in alerts.wikimedia.org for all of these. silence feels like disabling notifications though. What I really want is "ACK" or scheduled downtime. But it seems like those concepts don't exist anymore.

I might be missing something, but scheduled downtime also effectively disables notifications?

Sorry for having just dumped this here.. I meant to just save my thoughts and get back to it. Let me explain what I meant a bit more.

Coming from Nagios and then Icinga land.. I have always seen the following things as different from each other:

  • scheduled downtime: it also sends the message to all observers that this is a planned downtime, nothing that was ad-hoc, that people can be expected to be on it AND a side effect is that notifications are disabled during that time. It automatically ends at a pre-defined time, which means it can't be forgotten to remove it and alerting starts again as normal.
  • ACK: it sends the message that this was not planned but someone saw it and reacted and it disables notifications but only until the next state change. So it will stop repeating the message but start sending them again once something changes one way or another. This is often exactly what is needed. It also includes the effect that it can't be forgotten to enable notifications again while stopping spam. It also means a check moved from "unhandled" to "handled" which helps with improving the signal-to-noise ratio on dashboards.
  • "disable notifications" - This is the one I have never liked and I always tried to preach to people that they should stop using this. Because while it does disable notifications.. it does not send any other message that tells you why that is... for how long it should stay this way, and whether it has been just forgotten to re-enable it after last maintenance or it's because there is planned maintenance or because someone wanted to actually just say "ACK" for that moment to say they will get to it later. When looking at Icinga web UI we can normally find a whole bunch of hosts/services with (seemingly random) disabled notifications.. and most of them are simply because doing it this way means there is no expiration at any time or a state change. Since we are all humans this adds up and we will miss alerts just because of that.

Anyways... I am just saying all that to explain why for me "silence" isn't just "silence" but there are nuances. I am NOT trying to say all that needs to be the same in alertmanager.. just sharing what went through my mind as a long-time Icinga user who is still new to alertmanager.

To see the suppressed alerts you can filter by @state=suppressed in the dashboard. I have also expanded https://wikitech.wikimedia.org/wiki/Alertmanager#Silences_%26_acknowledgements with this information, feel free to expand more and I'm happy to expand more as needed!

ACK, ok, great. thanks!

P.S. In tickets like this and T312609 it would be nice if we could somehow get the host name / service name in there. Currently the ticket titles are just "Probe Down" and I have started to rename them manually.

Dzahn renamed this task from monitoring / VRTS - new blackbox check reports 'ProbeDown' to VRTS monitoring- re-activate new blackbox check (was: 'ProbeDown').Jul 9 2022, 3:08 AM
Dzahn triaged this task as Medium priority.
Dzahn added a project: vrts.

Change 812142 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] vrts/blackbox: adjust monitoring back to port 80, but fix path

https://gerrit.wikimedia.org/r/812142

Change 812282 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] Revert "vrts/prometheus: comment out broken check"

https://gerrit.wikimedia.org/r/812282

Change 812326 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] vrts/prometheus: re-activate commented check after fixing path

https://gerrit.wikimedia.org/r/812326

The alert/check that automatically created this ticket is currently disabled.

We agreed to leave it that way for the special week and merge the follow-up fixes after.

Recycling this ticket as a general VRTS monitoring ticket. We should close it after we have re-activated the check AND it's all green.

not sure if duplicate or child of T303190 but highly related either way, just thought of that one

I added silences in alerts.wikimedia.org for all of these. silence feels like disabling notifications though. What I really want is "ACK" or scheduled downtime. But it seems like those concepts don't exist anymore.

I might be missing something, but scheduled downtime also effectively disables notifications?

Sorry for having just dumped this here.. I meant to just save my thoughts and get back to it. Let me explain what I meant a bit more.

Coming from Nagios and then Icinga land.. I have always seen the following things as different from each other:

  • scheduled downtime: it also sends the message to all observers that this is a planned downtime, nothing that was ad-hoc, that people can be expected to be on it AND a side effect is that notifications are disabled during that time. It automatically ends at a pre-defined time, which means it can't be forgotten to remove it and alerting starts again as normal.
  • ACK: it sends the message that this was not planned but someone saw it and reacted and it disables notifications but only until the next state change. So it will stop repeating the message but start sending them again once something changes one way or another. This is often exactly what is needed. It also includes the effect that it can't be forgotten to enable notifications again while stopping spam. It also means a check moved from "unhandled" to "handled" which helps with improving the signal-to-noise ratio on dashboards.
  • "disable notifications" - This is the one I have never liked and I always tried to preach to people that they should stop using this. Because while it does disable notifications.. it does not send any other message that tells you why that is... for how long it should stay this way, and whether it has been just forgotten to re-enable it after last maintenance or it's because there is planned maintenance or because someone wanted to actually just say "ACK" for that moment to say they will get to it later. When looking at Icinga web UI we can normally find a whole bunch of hosts/services with (seemingly random) disabled notifications.. and most of them are simply because doing it this way means there is no expiration at any time or a state change. Since we are all humans this adds up and we will miss alerts just because of that.

+1 on "disable notifications" being an anti-pattern; it has bitten us multiple times. Effectively in AM we don't have that anymore, short of removing the alert itself, so I think that's good.

Anyways... I am just saying all that to explain why for me "silence" isn't just "silence" but there are nuances. I am NOT trying to say all that needs to be the same in alertmanager.. just sharing what went through my mind as a long-time Icinga user who is still new to alertmanager.

Thank you for the feedback and the context -- that is super useful to know/hear. In my mind the current equivalences are that silence == scheduled downtime, and an "alertmanager ACK" (i.e. a silence with description starting with ACK!) == acknowledgement.
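
The equivalence described above can be sketched in Python. This is only an illustration: the field names follow the Alertmanager v2 silence payload as far as I know, while `build_silence`, `silence_kind`, the user name and the comment texts are hypothetical, and nothing is sent to any server.

```python
from datetime import datetime, timedelta, timezone


def build_silence(matchers, comment, created_by, hours, now=None):
    """Build a silence payload with a fixed expiry (hypothetical helper)."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),  # a silence always expires, like scheduled downtime
        "createdBy": created_by,
        "comment": comment,
    }


def silence_kind(comment):
    """Map a silence to the Icinga concept it resembles, per the comment above."""
    return "acknowledgement" if comment.startswith("ACK!") else "scheduled downtime"


labels = {"alertname": "ProbeDown", "instance": "otrs1001:1443"}
downtime = build_silence(labels, "check is being fixed", "example-user", hours=2)
ack = build_silence(labels, "ACK! looking into it", "example-user", hours=2)
print(silence_kind(downtime["comment"]), "/", silence_kind(ack["comment"]))
```

Either way the comment carries the "why" and the end time carries the "for how long", which is what a bare "disable notifications" lacked.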

To see the suppressed alerts you can filter by @state=suppressed in the dashboard. I have also expanded https://wikitech.wikimedia.org/wiki/Alertmanager#Silences_%26_acknowledgements with this information, feel free to expand more and I'm happy to expand more as needed!

ACK, ok, great. thanks!

I see what you did there!

Change 812282 merged by Dzahn:

[operations/puppet@production] Revert "vrts/prometheus: comment out broken check"

https://gerrit.wikimedia.org/r/812282

Change 812326 merged by Dzahn:

[operations/puppet@production] vrts/prometheus: fix path in blackbox http monitoring check

https://gerrit.wikimedia.org/r/812326

Change 812142 abandoned by Dzahn:

[operations/puppet@production] vrts/blackbox: adjust monitoring back to port 80, but fix path

Reason:

monitor envoy port instead of webserver port

https://gerrit.wikimedia.org/r/812142

Change 815386 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] vrts/prometheus: configure monitoring to use only IPv4

https://gerrit.wikimedia.org/r/815386

Change 815386 merged by Dzahn:

[operations/puppet@production] vrts/prometheus: configure monitoring to use only IPv4

https://gerrit.wikimedia.org/r/815386

Change 815390 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] vrts/prometheus: fix IP family name, ip4 not ipv4

https://gerrit.wikimedia.org/r/815390

Change 815390 merged by Dzahn:

[operations/puppet@production] vrts/prometheus: fix IP family name, ip4 not ipv4

https://gerrit.wikimedia.org/r/815390

This is now fixed after a couple of follow-ups.

The last one was to make sure we only check via IPv4, because the envoy on otrs1001 so far does not listen on IPv6.

I learned today from -observability how to confirm that a check that is _not_ alerting actually works.
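
The family-name mix-up fixed in change 815390 can be sketched as a small validation step. The accepted names `ip4` and `ip6` follow the commit title and, as far as I know, the blackbox exporter's naming; `check_ip_family` is a hypothetical helper and not part of the Puppet code.

```python
VALID_IP_FAMILIES = ("ip4", "ip6")


def check_ip_family(name: str) -> str:
    """Return the family name if it is valid, else raise with a hint."""
    if name in VALID_IP_FAMILIES:
        return name
    raise ValueError(
        f"unknown IP family {name!r}, expected one of {VALID_IP_FAMILIES}"
    )


print(check_ip_family("ip4"))
try:
    check_ip_family("ipv4")  # the spelling used before the fix
except ValueError as err:
    print(err)
```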

https://grafana-rw.wikimedia.org/explore?orgId=1&left=%7B%22datasource%22:%22thanos%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22probe_success%7Binstance%3D~%5C%22gitlab.%2B%7Cotrs.%2B%5C%22%7D%22,%22editorMode%22:%22code%22,%22range%22:false,%22instant%22:true,%22format%22:%22table%22,%22exemplar%22:false%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D

This shows the status for both the new VRTS and gitlab monitoring.
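
The query hidden in that long Grafana link can be pulled out with a few lines of Python. This is just a sketch using the standard library; `extract_queries` is a hypothetical helper name, and it assumes the link keeps its Explore state as JSON in the `left` parameter, as the one above does.

```python
import json
from urllib.parse import parse_qs, urlparse

EXPLORE_URL = (
    "https://grafana-rw.wikimedia.org/explore?orgId=1&left="
    "%7B%22datasource%22:%22thanos%22,%22queries%22:%5B%7B%22refId%22:%22A%22,"
    "%22expr%22:%22probe_success%7Binstance%3D~%5C%22gitlab.%2B%7Cotrs.%2B%5C%22%7D%22,"
    "%22editorMode%22:%22code%22,%22range%22:false,%22instant%22:true,"
    "%22format%22:%22table%22,%22exemplar%22:false%7D%5D,"
    "%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D"
)


def extract_queries(url: str) -> list:
    """Return the query expressions stored in a Grafana Explore link."""
    left = parse_qs(urlparse(url).query)["left"][0]  # percent-decoded JSON text
    state = json.loads(left)
    return [query["expr"] for query in state["queries"]]


for expr in extract_queries(EXPLORE_URL):
    print(expr)
```

The decoded expression is `probe_success{instance=~"gitlab.+|otrs.+"}`, i.e. an instant query over all probe targets whose instance label starts with gitlab or otrs.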

probe_success = 1 actually means things are working
probe_success = 0 would mean things are broken

(!) This is the opposite of how it was in Icinga and of Linux return codes in general, so it can be confusing.
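
A small Python sketch of the two conventions side by side, in case it helps avoid the mix-up. The helper names and the sample values are made up for illustration and are not real query results.

```python
def probe_ok(probe_success: float) -> bool:
    """Prometheus blackbox convention: 1 means the probe succeeded."""
    return probe_success == 1


def exit_code_ok(exit_code: int) -> bool:
    """Icinga plugin / Linux convention: 0 means success."""
    return exit_code == 0


def summarize(samples: dict) -> dict:
    """Turn {instance: probe_success} into {instance: 'working' | 'broken'}."""
    return {
        instance: "working" if probe_ok(value) else "broken"
        for instance, value in samples.items()
    }


# Hypothetical sample values, not taken from the dashboard.
print(summarize({"otrs1001:1443": 1, "gitlab1004:443": 0}))
```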