[alerting] Create alerts for cloud-vps/VideoCutTool app
Closed, Resolved, Public

Description

For the project (hosted under cloud-vps): videocuttool, there are currently no alerts defined on prometheus (https://prometheus-alerts.wmcloud.org/?q=project%3D~videocuttool)

Please help us in creating alerts for the metrics (as defined in the attached grafana dashboard)
https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=videocuttool&var-instance=$__all&from=now-24h&to=now&timezone=utc

P0 alerts we require

  1. CPU usage
  2. Free Disk space left
  3. RAM usage

We can go with default threshold values for now (as other cloud-vps projects)

Event Timeline

fnegri triaged this task as Medium priority. Nov 10 2025, 9:17 AM

Hey @fnegri!
Thanks for picking up the ticket. Please let me know if you need to know anything about the tool. I'm active on both Phabricator and Telegram.

there are currently no alerts defined on prometheus (https://prometheus-alerts.wmcloud.org/?q=project%3D~videocuttool)

Please note that that dashboard only shows active alerts. The full list of alerts can be seen at https://prometheus.wmcloud.org/alerts. There are no specific ones for videocuttool at the moment, but there are some global ones (e.g. WidespreadInstanceDown) that, when active, will be visible at https://prometheus-alerts.wmcloud.org/?q=project%3D~videocuttool.

  1. CPU usage
  2. Free Disk space left
  3. RAM usage

We can go with default threshold values for now (as other cloud-vps projects)

@Reputation22 if you could provide precise definitions, I can then set up the alerts. We don't have "default thresholds" at the moment, so you can pick something that makes sense to you and they can be adjusted in the future if needed.

You can test the alert expressions in https://prometheus.wmcloud.org/graph, for example an alert expression for "Free disk space left" could be something like:

100 * node_filesystem_avail_bytes{project="videocuttool"} / node_filesystem_size_bytes{project="videocuttool"} < 20

there are currently no alerts defined on prometheus (https://prometheus-alerts.wmcloud.org/?q=project%3D~videocuttool)

Please note that that dashboard only shows active alerts. The full list of alerts can be seen at https://prometheus.wmcloud.org/alerts. There are no specific ones for videocuttool at the moment, but there are some global ones (e.g. WidespreadInstanceDown) that, when active, will be visible at https://prometheus-alerts.wmcloud.org/?q=project%3D~videocuttool.

For this, can you please confirm: once an alert is triggered and visible at https://prometheus-alerts.wmcloud.org/?q=project%3D~videocuttool, does it send us (the maintainers) a notification (by email, IRC, etc.)? If not, can we configure email notifications?

Can you please explain what exactly the above-mentioned alert expression is calculating? Or, put differently, what would the expression look like if I want to trigger an alert when the free disk space (/ disk space) on this Grafana dashboard (https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=videocuttool&var-instance=$__all&from=now-24h&to=now&timezone=utc&viewPanel=panel-6-clone-1%2Fgrid-item-41%2Fpanel-41) reaches 15%?

Based on this, I can provide you the threshold values for the expressions.
Here's what we want based on the above mentioned grafana dashboard

  1. CPU usage: 80% or above (>= 80) [for both vct and beta-vct]
  2. Free disk space left: 15% or below (<= 15) [for both vct and beta-vct]
  3. RAM usage: used memory above 12 GB on vct-bookworm-new and above 3 GB on beta-vct

ps: we need alerts for both instances

  1. videocuttool-bookworm-new
  2. beta-videocuttool-bookworm

I can provide you the alert expression itself, once I understand how and what exactly it calculates
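The RAM thresholds listed above can be read as a simple per-instance check. A minimal sketch, under these assumptions: "used" means total memory minus available memory, "gb" means GiB (2^30 bytes), and all sample values are invented for illustration. The helper name is hypothetical, and no RAM expression appears anywhere in this task.

```python
GIB = 1024 ** 3

# Thresholds taken from the list above; interpreting "gb" as GiB is an assumption.
RAM_USED_LIMITS = {
    "videocuttool-bookworm-new": 12 * GIB,
    "beta-videocuttool-bookworm": 3 * GIB,
}


def ram_alerts(samples, limits=RAM_USED_LIMITS):
    """Return the instances whose used memory exceeds their limit."""
    firing = []
    for instance, mem in samples.items():
        # Assumption: "used" is total minus available memory.
        used = mem["total"] - mem["available"]
        if used > limits[instance]:
            firing.append(instance)
    return firing


# Invented sample readings, in bytes.
samples = {
    "videocuttool-bookworm-new": {"total": 16 * GIB, "available": 3 * GIB},
    "beta-videocuttool-bookworm": {"total": 4 * GIB, "available": 2 * GIB},
}

firing_instances = ram_alerts(samples)
print(firing_instances)  # ['videocuttool-bookworm-new']
```

Here the production instance uses 13 GiB (above its 12 GiB limit), while the beta instance uses 2 GiB (below its 3 GiB limit), so only the first one is reported.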

@Reputation22 I don't think we need to set alerts for the beta version, since it's mostly used for our internal testing. Please update the task request accordingly.

In the grand scheme of things, the beta version will be made available for new contributors to test their changes before they go live / are pushed to master.
Hence we do need the alerts, but yes, we can keep their severity lower than that of the production environment.

For this, can you please confirm: once an alert is triggered and visible at https://prometheus-alerts.wmcloud.org/?q=project%3D~videocuttool, does it send us (the maintainers) a notification (by email, IRC, etc.)? If not, can we configure email notifications?

No notifications are sent by default, but once the alerts are working, we can set up notifications to either email, IRC or Phabricator.

Can you please explain what exactly the above-mentioned alert expression is calculating?

What exactly is not clear in the expression? It's calculating the available disk space by dividing node_filesystem_avail_bytes by node_filesystem_size_bytes, then multiplying the ratio by 100 to get the percentage, and checking if that is less than 20. It's doing the same calculation independently for all drives in all instances, but filtering for project="videocuttool".
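The calculation described above can be mirrored in a few lines of Python. This is only an illustrative sketch of the arithmetic, not how Prometheus evaluates the query: each dictionary stands for one filesystem time series, the sizes are invented, and the function names are hypothetical.

```python
GIB = 1024 ** 3

# One entry per filesystem series; the byte values are made up for illustration.
filesystems = [
    {"mountpoint": "/", "avail": 12 * GIB, "size": 20 * GIB},
    {"mountpoint": "/var/lib/docker", "avail": 3 * GIB, "size": 40 * GIB},
    {"mountpoint": "/VideoCutTool/app/data", "avail": 50 * GIB, "size": 100 * GIB},
]


def free_percent(fs):
    """100 * avail / size, the same ratio the PromQL expression computes."""
    return 100 * fs["avail"] / fs["size"]


def low_disk_alerts(series, threshold=20):
    """Apply the '< threshold' comparison to every series independently."""
    return [fs["mountpoint"] for fs in series if free_percent(fs) < threshold]


alerting = low_disk_alerts(filesystems)
print(alerting)  # ['/var/lib/docker']
```

Because the comparison is applied to each series on its own, a single filesystem dropping below the threshold is enough to produce a result, regardless of how much space the others have.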

The PromQL syntax can be tricky to master, but there are plenty of guides and examples online. You can experiment freely on https://prometheus.wmcloud.org/graph, where you can enter different expressions and see what they return until you find one that you are happy with; then I can create an alert based on that expression.

If you want to start from the existing Grafana board, you can click on the "three dots" menu icon next to any graph, then click "inspect" and go to the "Query" tab to see the PromQL definitions of each graph.

fnegri changed the task status from Open to In Progress. Nov 21 2025, 3:13 PM
fnegri changed the task status from In Progress to Stalled.
fnegri changed the task status from Stalled to In Progress.

Oh sure, let me check and provide you with the relevant expressions right away.

For beta-vct

  1. CPU usage
100 * (sum by (mode) (irate(node_cpu_seconds_total{project="videocuttool",instance="beta-videocuttool-bookworm",mode!~"idle|guest|guest_nice"}[6m0s]))) / scalar(count(node_cpu_seconds_total{project="videocuttool",instance="beta-videocuttool-bookworm",mode="idle"})) >= 80
  2. Free disk space left
100 * (node_filesystem_avail_bytes{fstype=~"ext4|xfs",job="node",project="videocuttool",instance="beta-videocuttool-bookworm"} / node_filesystem_size_bytes{fstype=~"ext4|xfs",job="node",project="videocuttool",instance="beta-videocuttool-bookworm"}) <= 15

For vct

  1. CPU usage
100 * (sum by (mode) (irate(node_cpu_seconds_total{project="videocuttool",instance="videocuttool-bookworm-new",mode!~"idle|guest|guest_nice"}[6m0s]))) / scalar(count(node_cpu_seconds_total{project="videocuttool",instance="videocuttool-bookworm-new",mode="idle"})) >= 80
  2. Free disk space left
100 * (node_filesystem_avail_bytes{fstype=~"ext4|xfs",job="node",project="videocuttool",instance="videocuttool-bookworm-new"} / node_filesystem_size_bytes{fstype=~"ext4|xfs",job="node",project="videocuttool",instance="videocuttool-bookworm-new"}) <= 15
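For readers unsure what the CPU expressions above compute, here is a rough Python sketch of the arithmetic. It assumes invented per-CPU rates (the values `irate` would return, in CPU-seconds per second) for a hypothetical 2-CPU instance, and the function name is made up. Since the expression groups with `sum by (mode)`, the `>= 80` comparison applies to each mode separately, not to the combined CPU usage, which may be worth checking in the Prometheus graph UI.

```python
from collections import defaultdict

# Invented per-CPU, per-mode rates; idle and guest modes are already excluded.
rates = {
    ("0", "user"): 0.5, ("0", "system"): 0.25, ("0", "iowait"): 0.125,
    ("1", "user"): 0.5, ("1", "system"): 0.25, ("1", "iowait"): 0.125,
}
# The expression counts the mode="idle" series, i.e. one per CPU.
cpu_count = len({cpu for cpu, _mode in rates})


def percent_by_mode(rates, cpu_count):
    """Sum the rates across CPUs for each mode, then scale to a percentage."""
    totals = defaultdict(float)
    for (_cpu, mode), rate in rates.items():
        totals[mode] += rate
    return {mode: 100 * total / cpu_count for mode, total in totals.items()}


by_mode = percent_by_mode(rates, cpu_count)
firing_modes = [mode for mode, pct in by_mode.items() if pct >= 80]
total_busy = sum(by_mode.values())

print(by_mode)       # {'user': 50.0, 'system': 25.0, 'iowait': 12.5}
print(firing_modes)  # []
print(total_busy)    # 87.5
```

In this made-up case the machine is 87.5% busy overall, yet no single mode reaches 80%, so nothing would match. An expression that sums over all modes would behave differently.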

I've derived these expressions using the Grafana > Inspect method you described above. We can proceed with these. Also, a quick question: as you can see in the image attached below (for disk space usage), we're monitoring 3 disks [/, /var/lib/docker, /VideoCutTool/app/data]. Will the alert fire when any one of the disks goes below the threshold (ideally it should, and that's what we're looking for)? If not, will we need to create a separate alert for each disk?

Screenshot 2025-11-21 at 21.04.56.png (1×2 px, 177 KB)

Also, do we have different severity levels defined in alert-manager (something like SEV-0, SEV-1, SEV-2, etc.) that we can attach to the alerts?

Thanks, I will create those next week!

will the alert fire when any one of the disks goes below the threshold

Yes.

do we have different severity levels defined in alert-manager?

You can create custom ones for your project, but for shared alerts we use critical and warning, so I would recommend using those two.

Yes sure, maybe we can revisit creating custom ones later on.
For now, let's mark the alerts for vct as critical and those for beta-vct as warning.

also please enable email alerts for maintainers (if required we can share email ids here as well)

thanks

Apologies for the delay, I have just created the alerts and you can see them at https://prometheus.wmcloud.org/rules#videocuttool

also please enable email alerts for maintainers (if required we can share email ids here as well)

Please let me know which emails should receive the alerts and I'll set that up. You can also message me privately if you don't want to write the email addresses in a public task.

MariaDB [prometheusconfig]> SELECT * FROM alerts WHERE project_id = 139\G
*************************** 1. row ***************************
         id: 32
 project_id: 139
       name: HighCPUUsage
       expr: 100 * (sum by (mode) (irate(node_cpu_seconds_total{project="videocuttool",instance="beta-videocuttool-bookworm",mode!~"idle|guest|guest_nice"}[6m0s]))) / scalar(count(node_cpu_seconds_total{project="videocuttool",instance="beta-videocuttool-bookworm",mode="idle"})) >= 80
   duration: 5m
   severity: warning
annotations: {"summary": "High CPU usage in videocuttool beta"}
*************************** 2. row ***************************
         id: 33
 project_id: 139
       name: LowDiskSpace
       expr: 100 * (node_filesystem_avail_bytes{fstype=~"ext4|xfs",job="node",project="videocuttool",instance="beta-videocuttool-bookworm"} / node_filesystem_size_bytes{fstype=~"ext4|xfs",job="node",project="videocuttool",instance="beta-videocuttool-bookworm"}) <= 15
   duration: 5m
   severity: warning
annotations: {"summary": "Low disk space in videocuttool beta"}
*************************** 3. row ***************************
         id: 34
 project_id: 139
       name: HighCPUUsage
       expr: 100 * (sum by (mode) (irate(node_cpu_seconds_total{project="videocuttool",instance="videocuttool-bookworm-new",mode!~"idle|guest|guest_nice"}[6m0s]))) / scalar(count(node_cpu_seconds_total{project="videocuttool",instance="videocuttool-bookworm-new",mode="idle"})) >= 80
   duration: 5m
   severity: critical
annotations: {"summary": "High CPU usage in videocuttool"}
*************************** 4. row ***************************
         id: 36
 project_id: 139
       name: LowDiskSpace
       expr: 100 * (node_filesystem_avail_bytes{fstype=~"ext4|xfs",job="node",project="videocuttool",instance="videocuttool-bookworm-new"} / node_filesystem_size_bytes{fstype=~"ext4|xfs",job="node",project="videocuttool",instance="videocuttool-bookworm-new"}) <= 15
   duration: 5m
   severity: critical
annotations: {"summary": "Low disk space in videocuttool"}
4 rows in set (0.002 sec)
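Each row above carries a duration of 5m. A minimal sketch of what such a hold period usually means, assuming the column maps to the `for` clause of a Prometheus alerting rule (the thread does not confirm this) and assuming a 60-second evaluation interval. The function name and the sample sequence are hypothetical.

```python
def alert_states(condition_samples, interval_s=60, hold_s=300):
    """Return 'inactive', 'pending' or 'firing' for each evaluation.

    The alert only fires once the condition has held continuously
    for at least hold_s seconds; any false sample resets the timer.
    """
    states = []
    true_since = None
    for i, condition in enumerate(condition_samples):
        now = i * interval_s
        if not condition:
            true_since = None
            states.append("inactive")
            continue
        if true_since is None:
            true_since = now
        states.append("firing" if now - true_since >= hold_s else "pending")
    return states


# Condition is true for 3 minutes, drops once, then stays true for 7 evaluations.
samples = [True] * 3 + [False] + [True] * 7
states = alert_states(samples)
print(states)
```

With these inputs the first short spike never fires, and the second run only switches from pending to firing at its sixth consecutive true evaluation, 300 seconds after it began. The point of the hold period is to filter out brief spikes.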

Aren't those accessible to you via Horizon / the VPS portal, etc.?
We need to send alerts only to owners/maintainers.

If they're not accessible, I'll ping you privately.

They are not accessible via Horizon but they are accessible from the following public links:

The SQL command I pasted above was only as a reference, to show the alert definitions in the DB, not something that anybody is expected to use.

No, I meant: can the emails for sending alerts be taken from Horizon / the VPS portal, etc.?

Ah sorry, I completely misunderstood! Yes I can use the emails that are associated to the project members in Cloud VPS. Should I include all the current project members? Should I include "viewers" as well (users with "reader" role on the project)?

Please note that at the moment we don't have a way to fetch those automatically, so I can copy-paste the emails from there, but if you modify the project members, the list of emails that will receive the alerts will not update automatically.

Aah, so what would be the process for adding/removing people who receive alerts?
Raise another Phab ticket?

as of now you can go ahead with the following users

  1. Gopavasanth
  2. Sohom Dutta
  3. Reputation22
  4. Punith.nyk

these are the active maintainers/owners as of now.

(ideally all this should be configurable by maintainers/owners, will help in reducing redundant work, what do you say)

T128715: Add all Cloud VPS project administrators to the Prometheus notification group for each project and T47828: Implement mail aliases for Cloud-VPS projects (<novaproject>@wmcloud.org) are one way that Cloud VPS projects could get a project-manageable email list function for things like this.

These are still open, so I guess this is not in place currently?
But I do get the idea: basically we can create something like a group/email subscription, and those willing to get alerts can subscribe to it?

@Reputation22 yes the current system is very manual and definitely sub-optimal, implementing the tasks mentioned by @bd808 would make things much easier but we don't have a timeline for those...

as of now you can go ahead with the following users

Gopavasanth
Sohom Dutta
Reputation22
Punith.nyk

I will proceed with adding these users manually to the "contact group" that will receive emails for alerts in the "videocuttool" project.

Done:

MariaDB [prometheusconfig]> SELECT * FROM contact_group_members WHERE contact_group_id=7;
+----+------------------+-------+------------------------------+
| id | contact_group_id | type  | value                        |
+----+------------------+-------+------------------------------+
| 12 |                7 | EMAIL | gop****                      |
| 13 |                7 | EMAIL | pun****                      |
| 14 |                7 | EMAIL | var***                       |
| 15 |                7 | EMAIL | soh***                       |
+----+------------------+-------+------------------------------+
4 rows in set (0.003 sec)

cool thanks a lot!!

also on a side note.. can i add a custom message in these alerts? it would be great if i can attach a runbook here on how to resolve these alerts

@Reputation22 yes the current system is very manual and definitely sub-optimal, implementing the tasks mentioned by @bd808 would make things much easier but we don't have a timeline for those...

Oh, I see.
Does this contain some wiki-internal data? AFAIK all wiki projects are open source, and if they are I'd be happy to contribute/work on this (if it's compliant enough that wiki contributors can work on it).

as per this.. yes we did get an alert
thanks
i think we can close this ticket now

also on a side note.. can i add a custom message in these alerts? it would be great if i can attach a runbook here on how to resolve these alerts

Yes, you can add a "description" field with a custom string, a "runbook" field with a link to a runbook, and a "dashboard" field with a link to a Grafana dashboard.

Please paste the messages and links here and I'll add them.
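The three optional fields named above could sit next to the existing "summary" of an alert. A minimal sketch, assuming they are stored in the same JSON object as the annotations column shown in the SQL output earlier in this task (the thread does not confirm the storage format). The helper name and the example.org URLs are hypothetical placeholders.

```python
import json


def build_annotations(summary, description=None, runbook=None, dashboard=None):
    """Build an annotations JSON string, leaving out fields that are not set."""
    fields = {
        "summary": summary,
        "description": description,
        "runbook": runbook,
        "dashboard": dashboard,
    }
    return json.dumps({key: value for key, value in fields.items() if value is not None})


annotations = build_annotations(
    "Low disk space in videocuttool",
    runbook="https://example.org/runbook",      # placeholder URL
    dashboard="https://example.org/dashboard",  # placeholder URL
)
print(annotations)
```

Unset fields are simply omitted, so an alert with only a summary keeps the same shape as the rows already in the table.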

Does this contain some wiki-internal data? AFAIK all wiki projects are open source, and if they are I'd be happy to contribute/work on this (if it's compliant enough that wiki contributors can work on it).

Yep all the code is open source and you're very welcome to send a patch, although it might take a while to understand how all the related components are working... I'd recommend asking questions directly in the tasks if you need help.

I don't have a runbook ready as such. Maybe I'll provide you with a wiki link for it (which I'll update later) that you can attach as the runbook.
For Grafana, I think we can add this link to all the alerts:

https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=videocuttool&var-instance=$__all&from=now-24h&to=now&timezone=utc

i'll definitely take a look over there, and will ping if i'm stuck anywhere.
thanks

For the alert runbook, let's use this link for all alerts:

https://commons.wikimedia.org/w/index.php?title=Commons%3AVideoCutTool%2FAlerts_Runbook#

i'll update this (maybe tomorrow or day after)

thanks

fnegri moved this task from In progress to Done on the cloud-services-team (FY2025/2026-Q1-Q2) board.

Added both links, you can see them here and they should be visible in the alert emails.

I'm marking this as Resolved, please reopen it if you need further tweaks!

For the alert runbook, let's use this link for all alerts:

https://commons.wikimedia.org/w/index.php?title=Commons%3AVideoCutTool%2FAlerts_Runbook#

Make it a page on wikitech; you are generally not supposed to put this on content wikis.

You could maybe use https://wikitech.wikimedia.org/wiki/Nova_Resource:Videocuttool

does this work? @Soda

https://wikitech.wikimedia.org/wiki/Nova_Resource:Videocuttool/Alert_Runbook

sure @fnegri can you please update the runbook url with this

fnegri reopened this task as In Progress. Dec 3 2025, 9:50 AM

Done!