Implementing alert system to notify maintainers of downtime
Closed, ResolvedPublicFeature

Assigned To
Authored By
Gopavasanth
Jun 29 2024, 12:21 PM
Referenced Files
Restricted File
Nov 2 2024, 10:45 PM
F57604846: Screenshot 2024-10-10 at 11.53.37 PM.png
Oct 10 2024, 6:26 PM
F57588190: image.png
Oct 4 2024, 7:11 AM

Description

Tool Watch is now live and offers comprehensive data on the status and uptime of web-based tools through graphs.

Our task is to find and implement a system that sends alerts to tool maintainers if their tools go down, prompting them to take action.

Related References:

Eg: https://tool-watch.toolforge.org/tools/1

Event Timeline

image.png (1×2 px, 533 KB)

Source: https://tool-watch.toolforge.org/

It has been observed that out of a total of 1925 tools, only 543 are currently up and running, while 1382 tools have been down for some time.

To address this, I have come up with a solution to notify tool maintainers via email if their tool has been down for more than 7 days. The idea is to automatically send a reminder to the tool's maintainers at the address tools.<TOOL_NAME>@toolforge.org, prompting them to take the necessary actions to bring their tool back online. This approach will help keep the tools on Toolforge active, reduce downtime, and potentially lower infrastructure costs by ensuring unused or inactive tools are properly managed.
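A minimal sketch of the reminder described above. The helper names and the sender address are hypothetical placeholders, and the 7-day threshold is only the value proposed here, not a settled choice. The code builds the message and does not send it.

```python
from datetime import datetime, timedelta
from email.message import EmailMessage
from typing import Optional

# Threshold proposed in this comment; still open to discussion.
DOWNTIME_THRESHOLD = timedelta(days=7)

# Placeholder sender address (an assumption, not an existing mailbox).
SENDER = "noreply@example.org"


def maintainer_address(tool_name: str) -> str:
    """Build the per-tool address named in the proposal."""
    return f"tools.{tool_name}@toolforge.org"


def build_reminder(
    tool_name: str, down_since: datetime, now: datetime
) -> Optional[EmailMessage]:
    """Return a reminder if the tool has been down longer than the threshold."""
    downtime = now - down_since
    if downtime <= DOWNTIME_THRESHOLD:
        return None  # not down long enough to warrant a reminder
    msg = EmailMessage()
    msg["From"] = SENDER
    msg["To"] = maintainer_address(tool_name)
    msg["Subject"] = (
        f"[Tool Watch] {tool_name} has been down for {downtime.days} days"
    )
    msg.set_content(
        f"Tool Watch has not been able to reach {tool_name} since "
        f"{down_since:%Y-%m-%d}. Please check whether it needs attention."
    )
    return msg
```

Passing `now` explicitly rather than reading the clock inside the function keeps the threshold logic easy to test with whichever period is eventually chosen.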

However, I also wonder if this could result in maintainers receiving too many emails, to the point where the reminders feel like spam. Do you think this would be a concern, or would the benefit of keeping tools alive outweigh the risk of email overload?

Also, what would be a reasonable time period before sending notifications—3 days, 7 days, 30 days, or perhaps another number? I’m open to suggestions on this.

@bd808/@Slst2020, I would love to hear your thoughts and suggestions on this idea.

JJMC89 changed the subtype of this task from "Bug Report" to "Feature Request".Oct 4 2024, 8:05 AM
JJMC89 subscribed.

T376451 ("Toolwatch incorrectly reports non-web tools as unavailable") is a blocker to doing any kind of notifications.

There would need to be a way to differentiate between tools that are down because they are "unhealthy" vs. tools that are "intentionally" down.

As for notifications, I would personally not like to receive any unless I had explicitly opted in. If I do opt in, I'd ideally be able to choose when to receive an alert, e.g. "Send me an alert after the tool has been reported down for X time." If setting the time interval is not possible, then a reasonable time period would be in the range of minutes (say 15), not days.

Hi! I very much like the idea, and wouldn't mind receiving an email warning me if one of my tools is down.

That being said, when I opened the tool, first thing I did was search for my username (Sophivorus), but no results came up. So I searched for my only tool listed at toolhub (Synchronizer) and found it, but was surprised to see it listed as Unavailable. So I clicked it and went to https://toolsadmin.wikimedia.org/tools/id/synchronizer, but from there I couldn't find a link to the toolhub entry (https://toolhub.wikimedia.org/tools/synchronizer) nor to the tool itself (https://www.mediawiki.org/wiki/Synchronizer). As you can see from there, the tool is working ok. Granted it's not a very standard tool in that it's embedded in a wiki, but how are you checking for tool health?

Hi, This is an awesome idea/tool! I recently started mapping some tools myself and encountered quite a few that were no longer available, so this will definitely help. Just a quick question: is the data currently up to date? I noticed that my tool (https://toolsadmin.wikimedia.org/tools/id/wikicurricula-cl) isn’t appearing. Thanks!

I am running Ordia at https://ordia.toolforge.org/. When I view the status at https://tool-watch.toolforge.org/search?search=ordia it says Unavailable, but the tool is available. For some reason the tool URL is https://github.com/fnielsen/ordia/wiki in tool-watch. Am I doing something wrong?

I wouldn't mind getting notifications, but keep in mind that this might also be a per-tool preference (not only per-dev).

  1. Some of the tools I'm assigned to are used daily, also by anons. I would prefer to know if they are down for more than 24 hours.
  2. I have a tool that is not very popular, and I don't mind if it is down for a week.
  3. I also have a tool that is very popular, but only during WLM, so I might want to be able to change monitoring or notification intensity.

Also, as some of the others mentioned, it is not clear what is used for monitoring. The Tool Watch service says my very simple Authors tool is down, but the webpage is working and the main page responds with a 200 HTTP code.
https://authors.toolforge.org/toolinfo.json

So maybe some simple config like:

https://authors.toolforge.org/tool-monitoring.json
{
  "test-url" : "https://authors.toolforge.org/",
  "notify-delay" : 7
}

The test URL could be distinct from the main page to, for example, test database availability or something similar. Not sure what you would consider a valid response, but I guess a non-zero length response with an HTTP 200 code should be good.
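A rough sketch of how such a config and the "valid response" rule could be handled. It assumes the two keys from the example above and the rule suggested in this comment (HTTP 200 with a non-empty body). `MonitoringConfig`, `parse_config` and `is_valid_response` are hypothetical names, and the unit of `notify-delay` is not specified in the proposal.

```python
import json
from dataclasses import dataclass


@dataclass
class MonitoringConfig:
    test_url: str      # URL to probe; may differ from the tool's main page
    notify_delay: int  # delay before notifying; unit left unspecified


def parse_config(text: str) -> MonitoringConfig:
    """Parse the contents of a tool-monitoring.json file."""
    data = json.loads(text)
    return MonitoringConfig(
        test_url=data["test-url"],
        notify_delay=int(data["notify-delay"]),
    )


def is_valid_response(status_code: int, body: bytes) -> bool:
    """Healthy means HTTP 200 with a non-zero length response."""
    return status_code == 200 and len(body) > 0
```

Keeping the check as a pure function of status code and body means the same rule can be applied regardless of which HTTP client performs the request.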

Who will receive the notification emails? The Author column contains the first (why only the first?) author of the tool – which may not be the same as the maintainer (for example, Átíró, of which I’m the only maintainer, lists Chery, who has never had Toolforge access), and in fact may not even be an existing SUL user name. If you send emails, you should send them to the maintainers, not the authors. (I don’t know how they can be queried programmatically, but https://toolsadmin.wikimedia.org/tools/id/atiro clearly lists me as the sole maintainer, so the data is available somewhere.)

If the mails go to the right people, I’d be happy to receive notifications about tools I maintain being down ASAP (i.e. at the next check). I think there should be a management interface where I can

  • Set planned downtime, so that I don’t get emails if I manually and intentionally stopped the service.
  • Mark the tool fixed, so that if it’s down at the next check, it’s regarded as a new downtime. However, if I don’t mark it as fixed, the next successful check should also mark it fixed. (Not manually marking it as fixed risks that if I believe to have fixed it, but it goes down by the next time it’s checked, I’m not alerted.)
  • Maybe also acknowledge the alert without fixing it – this would allow sending repeated mails, in case I missed the previous ones, without bothering me about something I very well know.
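The three controls listed above could be modelled roughly as follows. This is only a sketch under assumptions: `ToolAlertState` and its fields are hypothetical, and it sends a mail on every failed check unless the downtime is planned or acknowledged, which is one possible reading of the behaviour described here.

```python
from dataclasses import dataclass


@dataclass
class ToolAlertState:
    in_downtime: bool = False       # an unresolved downtime episode exists
    acknowledged: bool = False      # maintainer acknowledged this episode
    planned_downtime: bool = False  # maintainer stopped the tool on purpose

    def mark_fixed(self) -> None:
        """Close the current episode so a later failure counts as new."""
        self.in_downtime = False
        self.acknowledged = False

    def acknowledge(self) -> None:
        """Silence repeated mails for the current episode only."""
        if self.in_downtime:
            self.acknowledged = True

    def record_check(self, is_up: bool) -> bool:
        """Record one health check; return True if a mail should be sent."""
        if is_up:
            self.mark_fixed()  # a successful check also marks it fixed
            return False
        if self.planned_downtime:
            return False       # intentional stop: stay quiet
        self.in_downtime = True
        return not self.acknowledged
```

Because `mark_fixed` clears the acknowledgement, a tool that goes down again right after a supposed fix triggers a fresh alert instead of staying silenced.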

@bd808/@Slst2020, I would love to hear your thoughts and suggestions on this idea.

My main feedback is that the system should be opt-in rather than opt-out. Others seem to be pointing out a number of conditions they would like to see after opting into the service.

Who will receive the notification emails? The Author column contains the first (why only the first?) author of the tool – which may not be the same as the maintainer (for example, Átíró, of which I’m the only maintainer, lists Chery, who has never had Toolforge access), and in fact may not even be an existing SUL user name. If you send emails, you should send them to the maintainers, not the authors. (I don’t know how they can be queried programmatically, but https://toolsadmin.wikimedia.org/tools/id/atiro clearly lists me as the sole maintainer, so the data is available somewhere.)

If the mails go to the right people, I’d be happy to receive notifications about tools I maintain being down ASAP (i.e. at the next check). I think there should be a management interface where I can

  • Set planned downtime, so that I don’t get emails if I manually and intentionally stopped the service.
  • Mark the tool fixed, so that if it’s down at the next check, it’s regarded as a new downtime. However, if I don’t mark it as fixed, the next successful check should also mark it fixed. (Not manually marking it as fixed risks that if I believe to have fixed it, but it goes down by the next time it’s checked, I’m not alerted.)
  • Maybe also acknowledge the alert without fixing it – this would allow sending repeated mails, in case I missed the previous ones, without bothering me about something I very well know.

In that case one more column could be created which lists maintainers of the tools, and then the mail could be sent to maintainers.

Screenshot 2024-10-10 at 11.53.37 PM.png (1×2 px, 341 KB)

@Gopavasanth If this looks fine to you, I can create a pull request for this feature.
Basically just added one more column which lists maintainers of the tool.

In that case one more column could be created which lists maintainers of the tools, and then the mail could be sent to maintainers.

That might be cosmetically nice, but it is not needed for the email functionality of the tool.

The idea is to automatically send a reminder to the tool's maintainers at the address tools.<TOOL_NAME>@toolforge.org

The plan to send the notifications to the given tool's Toolforge email address would automatically email all registered maintainers of the tool. The backend logic for handling that address uses information from the LDAP directory that is the canonical data store for Toolforge tool maintainer information. https://wikitech.wikimedia.org/wiki/Help:Toolforge/Email#Mail_to_a_Tool

How are you going to handle planned work on underlying infrastructure? Will you send out alarms or will you correlate it to the planned work so people know what is going on?

Hi @MahimaSinghal, let's not jump into this, and besides, if we add another column saying Send Email, anyone can click on it and send email to the maintainers (essentially spamming them).

The idea is to automatically send a reminder to the tool's maintainers at the address tools.<TOOL_NAME>@toolforge.org

The plan to send the notifications to the given tool's Toolforge email address would automatically email all registered maintainers of the tool. The backend logic for handling that address uses information from the LDAP directory that is the canonical data store for Toolforge tool maintainer information. https://wikitech.wikimedia.org/wiki/Help:Toolforge/Email#Mail_to_a_Tool

I see. I seem to have skipped through that comment (even though I tried to read all comments, not only the description, before replying).

Hi @MahimaSinghal lets not jump into this

I think it actually makes much more sense than the Author column, and therefore I’d do this independently of the emailing feature: this tool is about reporting unhealthy tools, and fixing them is the job of the maintainers, not that of the authors (who may have long left the project).

and besides if we add another column saying Send Email, anyone can click on it and send the email to the maintainers (essentially spamming them)

Sorry, but this is just a slippery slope. Adding a Maintainers column doesn’t mean we have to add a Send Email column as well.

From the above comment it seemed that we would be adding a Send Email button along with the authors, hence the spam concern.

I see. I don’t feel that. Sorry for accusing you of arguing with a slippery slope; however, next time please be more explicit about your thinking to avoid such misunderstandings.

@Soylacarli, @Fnielsen, @Nux, thank you for highlighting the bugs and gaps in the tool! The issues have been identified, fixed, and deployed. Please let us know if the behavior has improved or if it continues to be the same moving forward.

The tool now exclusively tracks web-based tools; a screenshot with the latest stats is attached:
{F57671989}

Tool: https://tool-watch.toolforge.org/

As for the alerts on downtime, I’ll gather all the feedback mentioned above and develop a proper plan soon. Thank you for your advice and patience.

The tool now exclusively tracks web-based tools, screenshot attached with new latest stats:

Actually, it isn’t attached, only referenced, so we don’t see it. Please attach it.


I still miss the maintainers information: I wanted to quickly see if all tools I maintain are up, but searching for my user name hasn’t given any results.

@Soylacarli, @Fnielsen, @Nux, thank you for highlighting the bugs and gaps in the tool! The issues have been identified, fixed, and deployed. Please let us know if the behavior has improved or if it continues to be the same moving forward.

My tool is still reported as unavailable:
https://tool-watch.toolforge.org/tools/238

Seems like maybe the tool-watch doesn't know the URL of the tools. Not sure why. My toolinfo seems fine:
https://authors.toolforge.org/toolinfo.json

It has not improved. Go to https://tool-watch.toolforge.org/search?search=ordia and see that it says Unavailable. The URL is wrong. And the tool is ok at https://ordia.toolforge.org/

https://toolsadmin.wikimedia.org/tools/id/ordia/info/id/538/history?version_id2=4258&version_id1=775 should help fix that bug by making the toolinfo.json record for the tool treat it as a web service and publish the correct URL.

I am not sure what that means. Is it something that I should do? https://tool-watch.toolforge.org/tools/1117 still shows unavailability.

My note was showing that I had edited the tool's toolinfo configuration to mark the tool as a webservice in toolsadmin. This changed the generated toolinfo.json data that is loaded by Toolhub so that now https://toolhub.wikimedia.org/tools/toolforge-ordia shows the tool's URL correctly as https://ordia.toolforge.org/. I don't know when tool-watch will pick up this change.

I also don't know why tool-watch was attempting to monitor the tool at all, as the prior https://github.com/fnielsen/ordia/wiki URL is not hosted on a toolforge.org subdomain. That seems to contradict the "Note: URLs that do not have a parsed_url.netloc of *.toolforge.org are not shown." information on the tool-watch landing page.
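For reference, a filter matching that note could look roughly like the sketch below. This is not tool-watch's actual code; `is_toolforge_url` is a hypothetical helper, and it uses `hostname` rather than `netloc` so that a port or user info in the URL does not affect the match.

```python
from urllib.parse import urlparse


def is_toolforge_url(url: str) -> bool:
    """True if the URL's host is a subdomain of toolforge.org."""
    # hostname is lower-cased and excludes any port or credentials.
    host = urlparse(url).hostname or ""
    return host.endswith(".toolforge.org")
```

With such a check, https://ordia.toolforge.org/ would pass and the earlier GitHub wiki URL would be filtered out.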

Thanks. I did not seem to have permission to edit the "Tool type" field.

I've added "tool_type": "web app" to my toolinfo.json. That seems to have fixed toolhub too.

Example for those that want to fix in JSON:
https://authors.toolforge.org/toolinfo.json
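As an illustration of why the field matters, a consumer of toolinfo.json could select web tools roughly like this. It is a sketch that assumes records carry the "tool_type" field mentioned above and that a file holds either one record or a list of records; `web_tools` is a hypothetical helper, not part of tool-watch or Toolhub.

```python
import json
from typing import Any


def web_tools(toolinfo_text: str) -> list[dict[str, Any]]:
    """Return only the records whose tool_type is "web app"."""
    data = json.loads(toolinfo_text)
    # Accept both a single record and a list of records.
    records = data if isinstance(data, list) else [data]
    return [r for r in records if r.get("tool_type") == "web app"]
```

A record without "tool_type" is skipped by this filter, which is consistent with tools being left out until the field is added.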