set up SSL cert monitoring for benefactorevents.wm.o
Closed, ResolvedPublic

Description

We should monitor benefactorevents.wm.o for certificate expiration, by adding it to production puppet modules/icinga/manifests/monitor/certs.pp.

Event Timeline

Change 336559 had a related patch set uploaded (by Dzahn):
icinga: add SSL cert monitoring for benefactorevents

https://gerrit.wikimedia.org/r/336559

Change 336559 merged by Dzahn:
icinga: add SSL cert monitoring for benefactorevents

https://gerrit.wikimedia.org/r/336559

Change 336564 had a related patch set uploaded (by Dzahn):
icinga: fr-tech-ops contact group for benefactorevents

https://gerrit.wikimedia.org/r/336564

Change 336564 merged by Dzahn:
icinga: fr-tech-ops contact group for benefactorevents

https://gerrit.wikimedia.org/r/336564

Check added to Icinga. It went CRIT right away because the cert expires in 22 days.

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=benefactorevents.wikimedia.org&service=HTTPS-benefactorevents
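For readers unfamiliar with how an expiry check ends up CRIT with 22 days left, here is a minimal Python sketch of the idea. It is not the actual Icinga plugin used in production. The function names and the 60/30-day thresholds are assumptions chosen for illustration; the real thresholds are not stated in this task.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the 'notAfter' field of the certificate served by host:port.

    The default context verifies the hostname, so a mismatched certificate
    raises an SSL error instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` (timezone-aware) and the certificate expiry."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def classify(days_left: int, warn_days: int = 60, crit_days: int = 30) -> str:
    """Map days left to a Nagios-style state. Thresholds are hypothetical."""
    if days_left < crit_days:
        return "CRITICAL"
    if days_left < warn_days:
        return "WARNING"
    return "OK"
```

With these assumed thresholds, any certificate with fewer than 30 days left is CRITICAL from the first check run, which matches the behaviour described above.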

The virtual host that goes with the service is CRIT for separate reasons, because it does not let Icinga ping it.

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=benefactorevents.wikimedia.org

benefactorevents.wikimedia.org is an alias for trilogytools1.azurewebsites.net.

--- waws-prod-blu-007.cloudapp.net ping statistics ---
5 packets transmitted, 0 received, 100% packet loss

This is the same long-standing issue we have with "eventdonations" (https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&style=hostdetail&hoststatustypes=4&hostprops=2097158)

Can we do anything about that? Do we just keep it permanently ACKed? Should I add the services on a different virtual host that can be reached?
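One option for hosts that drop ICMP is to treat a successful TCP connection to the HTTPS port as the "host is up" signal instead of ping. The sketch below only illustrates that idea in Python. It is not what was deployed, and `tcp_reachable` is a hypothetical helper name.

```python
import socket


def tcp_reachable(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False
```

A check like this says nothing about the certificate or the HTTP response. It only replaces the ICMP-based host check for targets that are not under our control.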

Dzahn mentioned this in Unknown Object (Task). Feb 8 2017, 1:31 AM
Dzahn removed a project: Patch-For-Review.
Dzahn claimed this task.

This thing has been alerting for 4 days, as it's apparently using the default Azure SSL cert.

I am RADICALLY AGAINST monitoring such certificates/hosts if we're not in control of their fate.

Just to determine that this thing is out of my control, I had to look at the Icinga event, try to access the site, whois the IP, and check our puppet history and Phabricator history. That is incredibly time consuming, and in the end pointless, since it seems NO ONE cared about this enough to either:

a) tell ops to remove the alerts since the campaign/event has finished
b) fix the problem over a 4-day span.

Yeah, the situation was indeed suboptimal. Trilogy shut down the site on Friday (fr-tech/fr-tech-ops didn't get advance notice) and the Icinga alert pages only Ops. As far as I know, nobody contacted fr-tech/fr-tech-ops about the alerts until Monday AM.

I posted a few new tasks related to this site, cert, and monitoring: T170140, T170139, T170143

Change 364231 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: remove monitoring of benefactorevents.wm.org

https://gerrit.wikimedia.org/r/364231

Trilogy shut down the site on Friday (fr-tech/fr-tech-ops didn't get advanced notice) and the icinga alert pages only Ops.

Nobody got paged about this. The check intentionally wasn't set to critical, because normally it serves as a reminder about expiring certs and there are many days left to renew once it starts alerting.

As far as I know nobody contacted fr-tech/fr-tech-ops about the alerts until Monday AM.

Is the expectation that ops get paged and then call fr-tech-ops? It seemed email would be enough per the above, and my expectation was that fr-tech-ops gets the same mails as ops because the contact group is added. But right now I can't even see email about it in my own inbox, nor do I see benefactorevents in the prod Icinga config. That's a bit strange. Will check that.

I just assumed I didn't get notified because fr-tech-ops wasn't included in the notification config, but maybe there's another reason? In general I would have Icinga include fr-tech-ops for any alerts that have to do with fundraising. Meanwhile, if Ops sees fundraising-related alerts and nobody from fundraising responds, it would be good to go on the assumption that we aren't receiving the alerts and ping us directly.

I just assumed I didn't get notified because fr-tech-ops wasn't included in the notification config

No, it's more like the other way around, fr-tech-ops is the only contact group:

contact_group => 'fr-tech-ops',
https://gerrit.wikimedia.org/r/#/c/364231/1/modules/icinga/manifests/monitor/certs.pp

What I'm wondering, though, is why I don't see "benefactorevents" in the Icinga config that puppet generates.
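A quick way to answer that kind of question is to scan the generated object definitions for the name. The Python sketch below parses Nagios/Icinga 1 style `define ... { ... }` blocks from a string and returns those mentioning a given term. The helper name `find_objects` is hypothetical, and where the generated config lives on the Icinga host is left open here because it is not stated in this task.

```python
import re

# Matches "define <type> { ... }" blocks in Nagios/Icinga 1 object config.
DEFINE_RE = re.compile(r"define\s+(\w+)\s*\{(.*?)\}", re.DOTALL)


def find_objects(config_text: str, needle: str) -> list:
    """Return (object_type, attributes) for each block that mentions needle."""
    hits = []
    for obj_type, body in DEFINE_RE.findall(config_text):
        if needle not in body:
            continue
        attrs = {}
        for line in body.strip().splitlines():
            parts = line.split(None, 1)  # directive name, then its value
            if len(parts) == 2:
                attrs[parts[0]] = parts[1].strip()
        hits.append((obj_type, attrs))
    return hits
```

An empty result for "benefactorevents" would confirm that the service never made it into the generated config, as opposed to being present with the wrong contact group.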

Change 364231 merged by Dzahn:
[operations/puppet@production] icinga: remove monitoring of benefactorevents.wm.org

https://gerrit.wikimedia.org/r/364231