Description
Please assist in adding catalyst to this alert manager. We get alerts from catalyst-qte but not from the main catalyst project, and would like to receive them via our group email (catalyst@wikimedia.org). Thanks!
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | dcaro | T386416 metricsinfra: send alerts for the catalyst project to catalyst@w.o email |
| Invalid | | EBomani | T385330 Grafana.wmcloud.org has project alerts for catalyst, route alerts catalyst/patchdemo maintainers |
Event Timeline
The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task. Thanks!
Unfortunately we don't have a self-service alerting setup yet, but we can help :)
I don't see any alerts triggering right now for the catalyst project. There are some for catalyst-qte, but alerts for both projects should already be shown in both the production karma (if you monitor that one, https://alerts.wikimedia.org/?q=project%3Dcatalyst-qte) and the wmcloud one (https://prometheus-alerts.wmcloud.org/?q=project%3Dcatalyst-qte).
For email notifications, how is it being done for catalyst-qte right now? I don't see any special config on the metricsinfra side.
I can add it for the catalyst related alerts to get sent to that email, though probably better to do the same for both projects, that's why I ask
Note also that any custom alerts will need adding some stuff to the metricsinfra DB for now, so let me know if you have any requests there too.
Hello David, I'm not sure how the catalyst-qte email alerts are configured right now. I will ask other people on my team whether they remember anything about setting the alerts up, but as far as I know, none of us knows how those were set up. I will update this as soon as I find out.
> I can add it for the catalyst related alerts to get sent to that email, though probably better to do the same for both projects, that's why I ask
This sounds good - provided it will not cause any functionality issues, can we give adding the special config for email alerts for both catalyst and catalyst-qte a try? Also, what would need to be added to the metricsinfra DB? Let me know if you need any information from our end.
Hi @dcaro, just to add some clarification if you need it: we do get emails for puppet failures from the catalyst project, but would like to get email alerts for things like InstanceDown or other alerts that may be firing.
> This sounds good - provided it will not cause any functionality issues, can we give adding the special config for email alerts for both catalyst and catalyst-qte a try?
Sure, you might get duplicated emails for catalyst-qte, but I'll add both.
> Also, what would need to be added to the metricsinfra DB?
If you have any custom prometheus query you want to alert on, it has to be entered manually in the DB for now (you can play with prometheus to find the query you want https://prometheus.wmcloud.org/graph).
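As a rough illustration of trying out a candidate query before asking for it to be added to the DB, the sketch below builds a URL for the standard Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The expression `up{project="catalyst"} == 0`, and the idea that the `up` series carries a `project` label on this server, are assumptions for illustration and are not confirmed anywhere in this task; `build_query_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base URL taken from the graph link in this thread; the API path is the
# standard Prometheus instant-query endpoint.
PROMETHEUS_BASE = "https://prometheus.wmcloud.org"


def build_query_url(expr: str, base: str = PROMETHEUS_BASE) -> str:
    """Return the instant-query URL for a PromQL expression (hypothetical helper)."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


# Assumed example expression: catalyst instances whose scrape is failing.
# Whether `up` carries a `project` label on this server is an assumption.
candidate = 'up{project="catalyst"} == 0'
url = build_query_url(candidate)
print(url)

# Round-trip check: decoding the URL gives back the original expression.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

Once a query returns the series you expect in the graph UI, the same expression is what would be handed over for entry in the metricsinfra DB.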
> Hi @dcaro, just to add some clarification if you need it: we do get emails for puppet failures from the catalyst project, but would like to get email alerts for things like InstanceDown or other alerts that may be firing.
ack, thanks for the clarification!
Yep, the puppet ones are sent outside the alerting system by the script directly (they are way older than the current setup)
This should kinda do it:
MariaDB [prometheusconfig]> select * from projects as p join contact_groups as cg join contact_group_members as cgm on p.default_contact_group_id=cg.id and cg.id=cgm.contact_group_id where p.name like '%catalyst%';
+-----+--------------+--------------+--------------------------+-----------+--------------+----+----------+------------+----+------------------+-------+------------------------+
| id  | openstack_id | name         | default_contact_group_id | acl_group | extra_labels | id | name     | project_id | id | contact_group_id | type  | value                  |
+-----+--------------+--------------+--------------------------+-----------+--------------+----+----------+------------+----+------------------+-------+------------------------+
| 211 | catalyst-qte | catalyst-qte |                        5 | NULL      | {}           |  5 | catalyst |        216 |  6 |                5 | EMAIL | catalyst@wikimedia.org |
| 216 | catalyst     | catalyst     |                        5 | NULL      | {}           |  5 | catalyst |        216 |  6 |                5 | EMAIL | catalyst@wikimedia.org |
+-----+--------------+--------------+--------------------------+-----------+--------------+----+----------+------------+----+------------------+-------+------------------------+
2 rows in set (0.003 sec)
Now, you won't be able to silence alerts unless I add an acl_group (or if you do it from production alertmanager if you have access to it), would https://ldap.toolforge.org/group/releng be the right LDAP group?
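To make the join in the query output above easier to follow, here is a minimal sketch that recreates the three tables in an in-memory SQLite database and runs an equivalent query. Table and column names and the row values are copied from that output; the column types, and the use of SQLite instead of MariaDB, are assumptions made only so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE projects (
    id INTEGER PRIMARY KEY,
    openstack_id TEXT,
    name TEXT,
    default_contact_group_id INTEGER,
    acl_group TEXT,
    extra_labels TEXT
);
CREATE TABLE contact_groups (id INTEGER PRIMARY KEY, name TEXT, project_id INTEGER);
CREATE TABLE contact_group_members (
    id INTEGER PRIMARY KEY, contact_group_id INTEGER, type TEXT, value TEXT
);
-- Rows as shown in the query output (column types are assumed).
INSERT INTO projects VALUES
    (211, 'catalyst-qte', 'catalyst-qte', 5, NULL, '{}'),
    (216, 'catalyst', 'catalyst', 5, NULL, '{}');
INSERT INTO contact_groups VALUES (5, 'catalyst', 216);
INSERT INTO contact_group_members VALUES (6, 5, 'EMAIL', 'catalyst@wikimedia.org');
""")

# Project -> default contact group -> members of that group.
rows = conn.execute("""
    SELECT p.name, cg.name, cgm.type, cgm.value
    FROM projects AS p
    JOIN contact_groups AS cg ON p.default_contact_group_id = cg.id
    JOIN contact_group_members AS cgm ON cg.id = cgm.contact_group_id
    WHERE p.name LIKE '%catalyst%'
    ORDER BY p.id
""").fetchall()

for project, group, kind, value in rows:
    print(f"{project}: {kind} -> {value} (contact group {group!r})")
```

Both projects point at the same contact group (id 5), which is why a single EMAIL member row is enough to cover catalyst and catalyst-qte.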
> Now, you won't be able to silence alerts unless I add an acl_group (or if you do it from production alertmanager if you have access to it)
As documented on https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_managed_monitoring#Alerts and https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Silencing_alerts, by default any member/viewer of the project can silence alerts for that project. The acl_group field is for Toolforge-like projects where we want to restrict that further.
Awesome :), then the job is done; you should now be getting the alerts for both catalyst and catalyst-qte by e-mail.
I'll close this task, but please test (e.g. by creating a VM and stopping it) and reopen if you find any issues.
@dcaro Thank you for making the changes. I made an instance to test whether we'd get emails when it is in InstanceDown status but found that it is not happening. The alert surfaces momentarily in the Pending state on Grafana but after some time (about 5 minutes), it seems to disappear instead of progressing into Firing, so the email is never triggered. Do you happen to know how to resolve this?
I have attached screenshots showing this. As you can see in image 1 panel 1, the instance is paused and triggers the InstanceDown alerts on panel 3. In panel 2, however, we see the query results are empty since the alert is still Pending and not Firing like the others. Image 2 shows how the Grafana alert shows up for the project on its dashboard.
Instances that have been manually shut down will automatically be removed from the list of Prometheus scrape targets, which means that they won't send out any alerts (or only a brief one before the target list is updated).
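A small simulation may help show why the alert never left Pending. It models, in simplified form, how a Prometheus alerting rule with a `for:` duration behaves: the alert fires only if the expression keeps matching for the whole duration, and it resets as soon as the series disappears. The 5-step `for:` value, the one-evaluation-per-step timing, and the point at which the target is dropped are assumptions for illustration, not the actual metricsinfra settings; `alert_states` is a hypothetical helper.

```python
from typing import List, Optional


def alert_states(samples: List[Optional[int]], for_steps: int) -> List[str]:
    """Simplified model of an `up == 0` rule with a `for:` clause.

    Each entry in `samples` is one rule evaluation: 1 (target up), 0 (target
    down), or None (target no longer scraped, so the series is absent).
    """
    states = []
    down_steps = 0  # consecutive evaluations where `up == 0` matched
    for value in samples:
        if value == 0:
            down_steps += 1
            # Fires only once the condition has held for more than `for_steps`.
            states.append("firing" if down_steps > for_steps else "pending")
        else:
            # Either the target is up or the series is gone: the alert resets.
            down_steps = 0
            states.append("inactive")
    return states


# Instance shut down, then removed from the scrape targets after 3 evaluations.
stopped_vm = [1, 0, 0, 0, None, None, None, None]
# Instance that stays in the target list while unreachable.
crashed_vm = [1, 0, 0, 0, 0, 0, 0, 0]

print(alert_states(stopped_vm, for_steps=5))
print(alert_states(crashed_vm, for_steps=5))
```

In this model the manually stopped instance goes back to inactive before the `for:` window elapses, so no notification is sent, while an instance that stays in the target list eventually reaches firing.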
I see. Is there anything we can do to test this, then? If not, I am alright with closing this task and will reopen it if/when we find a case of an instance alert that we do not get an email for.
@EBomani Maybe the easiest way would be to stop puppet; after 24h you should see an alert pop up (plus an email). The full list of alerts can be seen here:
https://prometheus.wmcloud.org/alerts
Though only a few apply to all instances (PuppetAgent*, InstanceDown and that might be it).
@dcaro Thank you! Since we already used to get emails for Puppet failures on all our instances and it seems that manually done InstanceDown events do not trigger active alerts for e-mails, I think the only way we can assert this works is when the instances experience issues organically in the future. I am happy to close out this issue for now and if we do notice an issue that occurs but does not provide e-mail alerts, we can reopen it. Thanks so much again for your help with this!
This doesn't seem to have ever worked; the notification emails are being bounced back due to a permission issue on the Google group end. Please fix the settings to allow root@wmcloud.org and root@wmflabs.org to post to the list.

