We should monitor benefactorevents.wikimedia.org for certificate expiration by adding it to modules/icinga/manifests/monitor/certs.pp in the production puppet repo.
Change 336559 had a related patch set uploaded (by Dzahn):
icinga: add SSL cert monitoring for benefactorevents
Change 336564 had a related patch set uploaded (by Dzahn):
icinga: fr-tech-ops contact group for benefactorevents
Change 336564 merged by Dzahn:
icinga: fr-tech-ops contact group for benefactorevents
Check added to Icinga. It went CRIT right away because the cert expires in 22 days.
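For illustration, here is a minimal sketch of how a days-until-expiry check can end up CRIT at 22 days. It is not the actual Icinga plugin: the 60/30-day thresholds, the function names, and the example dates are all assumptions made up for this sketch, since the task does not state the real check's parameters.

```python
import socket
import ssl
from datetime import datetime, timezone

# Hypothetical thresholds; the real check's values are not given in this task.
WARN_DAYS = 60
CRIT_DAYS = 30


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format Python's ssl module reports,
    e.g. "Mar 01 00:00:00 2017 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def classify(days_left: int, warn: int = WARN_DAYS, crit: int = CRIT_DAYS) -> str:
    """Map remaining days to an Icinga-style state name."""
    if days_left <= crit:
        return "CRITICAL"
    if days_left <= warn:
        return "WARNING"
    return "OK"


def fetch_not_after(host: str, port: int = 443) -> str:
    """Fetch notAfter from a live server (needs network access).

    The default context verifies the certificate, so an already expired
    or mismatched cert raises ssl.SSLCertVerificationError instead.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Made-up dates chosen so that exactly 22 days remain.
now = datetime(2017, 2, 7, tzinfo=timezone.utc)
days_left = days_until_expiry("Mar 01 00:00:00 2017 GMT", now)
print(days_left, classify(days_left))
```

With thresholds like these, any cert already inside the critical window alerts as soon as the check is added, which matches what happened here.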
The virtual host that goes with the service is CRIT for separate reasons, because it does not let Icinga ping it.
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=benefactorevents.wikimedia.org
benefactorevents.wikimedia.org is an alias for trilogytools1.azurewebsites.net.
--- waws-prod-blu-007.cloudapp.net ping statistics ---
5 packets transmitted, 0 received, 100% packet loss,
This is the same long-standing issue we have with "eventdonations" (https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&style=hostdetail&hoststatustypes=4&hostprops=2097158)
Can we do anything about that? Do we just keep it permanently ACKed? Should I add the services to a different virtual host that can be reached?
This check has been alerting for 4 days, as the site is apparently using the default Azure SSL cert.
I am RADICALLY AGAINST monitoring such certificates/hosts if we're not in control of their fate.
Having to look at the Icinga event, try to access the site, whois the IP, and check our puppet and Phabricator history just to determine that this is out of my control is incredibly time-consuming, and in the end pointless, since it seems NO ONE cared enough about this to either:
a) tell ops to remove the alerts since the campaign/event has finished, or
b) fix the problem over a 4-day span.
Yeah, the situation was indeed suboptimal. Trilogy shut down the site on Friday (fr-tech/fr-tech-ops didn't get advance notice) and the Icinga alert pages only Ops. As far as I know nobody contacted fr-tech/fr-tech-ops about the alerts until Monday AM.
I posted a few new tasks related to this site, cert, and monitoring: T170140, T170139, T170143
Change 364231 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: remove monitoring of benefactorevents.wm.org
Nobody got paged about this. This check was intentionally not set to critical: normally it alerts as a reminder about expiring certs, and there are many days left to renew once it starts alerting.
As far as I know nobody contacted fr-tech/fr-tech-ops about the alerts until Monday AM.
Is the expectation that ops get paged and then call fr-tech-ops? It seemed email would be enough per the above, and my expectation was that fr-tech-ops gets the same mails as ops because the contact group is added. But right now I can't even see email about it in my own inbox, nor do I see benefactorevents in the prod Icinga config. That's a bit strange. Will check that.
I just assumed I didn't get notified because fr-tech-ops wasn't included in the notification config, but maybe there's another reason? In general I would have icinga include fr-tech-ops for any alerts that have to do with fundraising. Meanwhile if Ops sees fundraising-related alerts and nobody from fundraising responds, it would be good to go on the assumption that we aren't receiving the alerts and ping us directly.
No, it's more like the other way around, fr-tech-ops is the only contact group:
contact_group => 'fr-tech-ops', https://gerrit.wikimedia.org/r/#/c/364231/1/modules/icinga/manifests/monitor/certs.pp
What I'm wondering is why I don't see "benefactorevents" in the Icinga config that puppet generates, though.
Change 364231 merged by Dzahn:
[operations/puppet@production] icinga: remove monitoring of benefactorevents.wm.org