
Tool Request: ToolForge Health Dashboard Tool (ToolWatch)
Closed, ResolvedPublic

Authored By
Gopavasanth
Jul 7 2023, 8:13 PM

Description

ToolWatch: the purpose of this tool is to check the health status of tools deployed on ToolForge and present the information in a visually appealing dashboard.

Description:
After exploring the list of tools on https://toolhub.wikimedia.org/ and noticing that many tools on https://admin.toolforge.org/tools are inactive, I believe it would be valuable to have a centralized dashboard that displays the health status of all deployed tools on ToolForge.

ToolWatch would perform regular checks on the status of individual tools and present the information in a user-friendly manner. The dashboard should include the following features:

Tool Status: Display the current status of each tool, indicating whether it is active or inactive based on the ping check.
Health Metrics: Show relevant metrics for each tool, such as the tool's health status vs DateTime.
Notifications: Alert users when a tool becomes inactive or experiences significant issues.
Search and Filtering: Allow users to search for specific tools and filter tools based on their status, last update, or other criteria.

Having such a dashboard would greatly enhance the management and monitoring of tools deployed on ToolForge. It would enable developers and administrators to quickly identify inactive tools, monitor their performance, and take necessary actions to ensure their functionality.

I would love to hear your thoughts on the same :)

Event Timeline

Back when Toolhunt was still in active development, we explored an adjacent topic: could we display a set of metrics for each tool that would together serve as a proxy for "quality"? This considered all tool types, not just the ones deployed on Toolforge. In the end, we let go of this idea because (among other things) it was just too complex to implement. But if you constrain the problem to just tools hosted on Toolforge and a few metrics, it should be doable.

https://meta.wikimedia.org/wiki/Toolhub/The_Quality_Signal_Sessions
T296434
T299152

Another effort in this direction that comes to mind is an internal experiment that the Technical Engagement team did last year. The aim was to identify tools hosted on our infrastructure that "might need some extra love", where the criterion for needing that love was a combination of the tool's health, in terms similar to the ones you list above, and its impact, as defined by metrics such as unique page views, number of edits, etc.

T313546
https://www.mediawiki.org/wiki/Wikimedia_Technical_Engagement/Practices_to_support_maintainers_and_high-impact_tools

As an initial approach, I would consider building this app on top of Toolhub. After all, it's a catalog of tools with search capabilities and I'd love to see it live up to its name and become a true hub for all things related to tool information. Any metrics or info that aren't currently available could be added to the API. You could then develop your user interface as an independent tool or, alternatively (and this is just a wild idea at this stage), we might consider integrating the dashboard directly into Toolhub. I'm confident that @bd808 would have insightful viewpoints to contribute to this discussion. :)

Tool Status: Display the current status of each tool, indicating whether it is active or inactive.

What do you mean by "active"? Are you thinking about webservices with live HTTP endpoints or something else?

Last Update: Provide the date and time of the most recent update or activity for each tool.

Are you thinking about source code changes when you say "update" or something else? @sdkim was interested in this as a metric, but I have always been skeptical of it. In my experience a very large number of the tools that I personally build need source code updates very infrequently. This can be true even for library code.

Health Metrics: Show relevant metrics for each tool, such as uptime, response time, and resource usage.

The example metrics here seem biased towards web services. That's fine, but it does leave out a lot of tools.

Notifications: Alert users when a tool becomes inactive or experiences significant issues.

How would a monitoring system know who uses a given tool? What would a user do when alerted that a given tool is broken?

Hi @bd808/@Slst2020, thanks for providing your thoughts.

What do you mean by "active"? Are you thinking about webservices with live HTTP endpoints or something else?

As an initial step, I was thinking of performing a basic Pingdom check on the tools and adding further metrics later.

Are you thinking about source code changes when you say "update" or something else? @sdkim was interested in this as a metric, but I have always been skeptical of it. In my experience a very large number of the tools that I personally build need source code updates very infrequently. This can be true even for library code.

Ah okay, how about performing these Pingdom checks, storing the results in a DB, and plotting a graph of each tool's availability over time (monthly, yearly, ...), since some of the tools have been down for ages?
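To make the store-and-graph idea concrete, here is a minimal sketch of recording check results and computing availability over a period. It is not ToolWatch's actual schema: the `checks` table, its columns, and the helper names `open_db`, `record_check`, and `availability` are all assumptions made for illustration, and an in-memory SQLite database stands in for whatever DB the tool would really use.

```python
import sqlite3
from typing import Optional


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical 'checks' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE checks ("
        "tool TEXT NOT NULL, checked_at TEXT NOT NULL, is_up INTEGER NOT NULL)"
    )
    return conn


def record_check(conn: sqlite3.Connection, tool: str,
                 checked_at: str, is_up: bool) -> None:
    """Store one check result; checked_at is an ISO 8601 UTC timestamp."""
    conn.execute(
        "INSERT INTO checks VALUES (?, ?, ?)", (tool, checked_at, int(is_up))
    )


def availability(conn: sqlite3.Connection, tool: str,
                 since: str, until: str) -> Optional[float]:
    """Fraction of successful checks in [since, until), or None if no data."""
    row = conn.execute(
        "SELECT AVG(is_up) FROM checks "
        "WHERE tool = ? AND checked_at >= ? AND checked_at < ?",
        (tool, since, until),
    ).fetchone()
    return row[0]
```

Calling `availability` once per month (or per year) for a tool would give the points for an availability graph. ISO 8601 timestamps are used because they sort correctly as plain text.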

The example metrics here seem biased towards web services. That's fine, but it does leave out a lot of tools.

Yes, you got it right, it works for web services. Do you have any alternative plans/ideas to make this work for other kinds of tools as well?

How would a monitoring system know who uses a given tool? What would a user do when alerted that a given tool is broken?

These alerts would be sent to the maintainers of the tool, so that they can visit the tool and take the necessary actions to bring it back online.

Something like this:

image.png (706×1 px, 61 KB)
{F37333533}

Thanks for the presentation! For others who may be looking for a link, the tool is live here: https://tool-watch.toolforge.org/

I want to report a bug. It shows this screen at present:

image.png (838×2 px, 191 KB)

which indicates https://bash.toolforge.org is down when it is in fact working fine.

The application's source is at https://github.com/gopavasanth/ToolWatch. It would be nice if this link and the license of the project were present in the web UI somewhere.

I don't think the currently deployed application actually works as intended. Here are a few issues that I have noticed so far:

  • As @Niharika pointed out in T341379#9103579 it is reporting bash.toolforge.org as down when that tool appears to be working fine.
  • The tool currently is reporting nearly everything as "unavailable" with last check timestamps from the early UTC hours of 2023-08-19. Timestamps also appear to be microseconds apart. The checker likely would have hit rate limits that we apply to all *.toolforge.org visitors. Currently visitors are rate limited to 100 requests per second per IP address across all Toolforge tools.
  • The status checking code in the tool is very simplistic and only checking to see if the response to a HEAD request returns a 200 status or not. It should at least check for 429 responses from the backend, and ideally would have some plan for dealing with any status from the 4xx client error block. It probably should consider any 2xx or 3xx response status to be a success as well.
  • For display purposes the tool is using each toolinfo.json record's name property as a Toolforge tool name and is templating that name into a https://{{ tool.name }}.toolforge.org URL. The naming assumption is incorrect. The name in a toolinfo.json record is an opaque identifier. For the vast majority of records emitted by https://toolsadmin.wikimedia.org/tools/toolinfo/v1.2/toolinfo.json, name will start with 'toolforge-' in an effort to prevent global key collisions.
  • The tool does not seem to filter out tools with a toolinfo.json URL that starts with https://toolsadmin.wikimedia.org/. URL is a required field in a toolinfo.json record, so Striker fills in a URL pointing back to itself for tools which have not been marked as being webservices by their maintainers.
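As a rough illustration of the status handling suggested in the list above, the sketch below maps an HTTP status code to a check outcome. The names `CheckResult` and `classify_status` are hypothetical and are not taken from the ToolWatch codebase. Treating non-429 4xx responses as a separate "client error" bucket is one possible plan, not the only one.

```python
from enum import Enum


class CheckResult(Enum):
    UP = "up"
    RATE_LIMITED = "rate_limited"  # 429: says nothing about the tool's health
    CLIENT_ERROR = "client_error"  # other 4xx: needs a per-tool decision
    DOWN = "down"                  # 5xx or anything unexpected


def classify_status(status_code: int) -> CheckResult:
    """Map an HTTP status code to a check outcome."""
    if 200 <= status_code < 400:
        # Any 2xx or 3xx response counts as a success.
        return CheckResult.UP
    if status_code == 429:
        # The checker was throttled; retry later instead of marking the tool down.
        return CheckResult.RATE_LIMITED
    if 400 <= status_code < 500:
        return CheckResult.CLIENT_ERROR
    return CheckResult.DOWN
```

Keeping 429 apart from "down" matters here because a rate-limited checker would otherwise report nearly every tool as unavailable.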

The application's source is at https://github.com/gopavasanth/ToolWatch. It would be nice if this link and the license of the project were present in the web UI somewhere.

That's a good point, will update :)

I don't think the currently deployed application actually works as intended. Here are a few issues that I have noticed so far:

  • As @Niharika pointed out in T341379#9103579 it is reporting bash.toolforge.org as down when that tool appears to be working fine.

Yeah good point, I've updated the tool to hopefully catch that case.

  • The tool currently is reporting nearly everything as "unavailable" with last check timestamps from the early UTC hours of 2023-08-19. Timestamps also appear to be microseconds apart. The checker likely would have hit rate limits that we apply to all *.toolforge.org visitors. Currently visitors are rate limited to 100 requests per second per IP address across all Toolforge tools.

I've added a wait time of 0.01 secs to be on the safe side. It turns out the periodic job (daily) to update the status was not running as intended, I'll do a special run to update the DB now :)
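One thing to keep in mind: a fixed 0.01 second wait allows up to roughly 100 requests per second, which is the limit itself and leaves no margin. A minimal pacing sketch follows. The helper name `paced` and the idea of passing in `sleep` and `clock` (to make it testable) are assumptions for illustration, not part of ToolWatch.

```python
import time
from typing import Callable, Iterable, Iterator


def paced(items: Iterable[str], max_per_second: float,
          sleep: Callable[[float], None] = time.sleep,
          clock: Callable[[], float] = time.monotonic) -> Iterator[str]:
    """Yield items no faster than max_per_second."""
    min_interval = 1.0 / max_per_second
    next_allowed = clock()
    for item in items:
        now = clock()
        if now < next_allowed:
            # Too early for the next request: wait out the remainder.
            sleep(next_allowed - now)
        # Schedule the following item one interval after this one.
        next_allowed = max(now, next_allowed) + min_interval
        yield item
```

A checker could then loop with something like `for url in paced(urls, max_per_second=20): ...`, where 20 is an arbitrary illustrative rate well under the limit.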

  • The status checking code in the tool is very simplistic and only checking to see if the response to a HEAD request returns a 200 status or not. It should at least check for 429 responses from the backend, and ideally would have some plan for dealing with any status from the 4xx client error block. It probably should consider any 2xx or 3xx response status to be a success as well.

I've included 2xx and 3xx responses in the successes. I'm assuming the wait time above should head off any 4xx-related issues?

  • For display purposes the tool is using each toolinfo.json record's name property as a Toolforge tool name and is templating that name into a https://{{ tool.name }}.toolforge.org URL. The naming assumption is incorrect. The name in a toolinfo.json record is an opaque identifier. For the vast majority of records emitted by https://toolsadmin.wikimedia.org/tools/toolinfo/v1.2/toolinfo.json, name will start with 'toolforge-' in an effort to prevent global key collisions.

I've updated this to use the .url field. :)

  • The tool does not seem to filter out tools with a toolinfo.json URL that starts with https://toolsadmin.wikimedia.org/. URL is a required field in a toolinfo.json record, so Striker fills in a URL pointing back to itself for tools which have not been marked as being webservices by their maintainers.

Done :)
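For readers following along, the two toolinfo.json fixes discussed above (use the record's url rather than building one from name, and skip records whose url points back at Striker) could look roughly like this. It is a sketch under assumptions: `checkable_urls` is a hypothetical helper, and records are assumed to be plain dicts with `name` and `url` keys as described in this thread.

```python
# Striker fills in a URL under this prefix for tools not marked as webservices.
STRIKER_PREFIX = "https://toolsadmin.wikimedia.org/"


def checkable_urls(records: list) -> list:
    """Return the URLs worth checking from a list of toolinfo records."""
    urls = []
    for record in records:
        # Use the record's own url; 'name' is an opaque identifier and
        # must not be templated into a *.toolforge.org hostname.
        url = record.get("url", "")
        if not url or url.startswith(STRIKER_PREFIX):
            continue
        urls.append(url)
    return urls
```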

Loading resources (css, js, images) from non-Wikimedia controlled websites is not generally desired (T133919: [EPIC] Protect end-user privacy by restricting non-consensual third-party browser interactions). https://csp-report.toolforge.org/search?ft=tool-watch shows a number of public CDNs being used by the current codebase. https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web#Load_external_assets_using_our_CDN_services provides links to privacy preserving proxies for cdnjs and google fonts that may be useful in updating the application.

Loading resources (css, js, images) from non-Wikimedia controlled websites is not generally desired (T133919: [EPIC] Protect end-user privacy by restricting non-consensual third-party browser interactions). https://csp-report.toolforge.org/search?ft=tool-watch shows a number of public CDNs being used by the current codebase. https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web#Load_external_assets_using_our_CDN_services provides links to privacy preserving proxies for cdnjs and google fonts that may be useful in updating the application.

Fixed, thanks for the heads up.

Also I notice that certain URLs in the URL field are concatenations of two or more URLs, is this expected or should I file a bug for this?

Also I notice that certain URLs in the URL field are concatenations of two or more URLs, is this expected or should I file a bug for this?

That sounds unexpected to me. Please do create a bug report and hopefully someone can figure out what is going on. It may have something to do with folks misunderstanding the intent/function of the "Path to tool below main webservice." field when using Striker to create a toolinfo record.
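Until the cause is known, a checker could flag such records rather than request them. The heuristic below is only a sketch, and `looks_concatenated` is a hypothetical name: it counts URL schemes in the string, so it will also flag legitimate URLs that carry another URL unencoded in their query string.

```python
import re

# Matches the start of an http or https URL anywhere in the string.
SCHEME = re.compile(r"https?://", re.IGNORECASE)


def looks_concatenated(url: str) -> bool:
    """Return True if the string appears to contain more than one URL."""
    return len(SCHEME.findall(url)) > 1
```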

@Soda: Could you share the task ID please, per last two comments? Thanks in advance!

Given that https://tool-watch.toolforge.org/ is up and running, and there is an associated Phab tag Tool-toolwatch, can we mark this task as Resolved, and open new tasks for bugs/feature requests?

I'm marking this as resolved (see my previous comment). Feel free to reopen if you think there is still work to do as part of this task.