
Adopt service status dashboard
Open, Medium, Public

Description

As a tool maintainer, when Toolforge has issues, I get signals from lots of different sources: people contact me directly or file bug reports in various ways and places (on-wiki talk pages, e-mail, Phabricator, IRC, GitHub) because they have nowhere else to go.

I should be able to point them to something akin to the old status.toolserver.org where it would e.g. say: "NFS is having issues. Tool Labs web services may be slow or unresponsive."

Or something more elaborate like:

Status: "NFS is having issues" – 5 minutes ago
             (status message history »)


Wikimedia Cloud Services Infrastructure:

  • Network: OK
  • NFS: Failure
  • ToolsDB: OK

Toolforge:

  • Web server: Failure (indirect failure due to NFS issues)
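
The mock-up above could be backed by a very small data model. The following is only a sketch: the names `Service`, `Group`, and `render` are hypothetical and do not refer to any existing tool, and the text layout simply mirrors the example in this description.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Service:
    """One monitored service (hypothetical model, not an existing API)."""
    name: str
    ok: bool = True
    note: Optional[str] = None  # e.g. "indirect failure due to NFS issues"


@dataclass
class Group:
    """A titled group of services, e.g. "Toolforge"."""
    title: str
    services: List[Service] = field(default_factory=list)


def render(groups: List[Group]) -> str:
    """Render groups in the same plain-text layout as the mock-up."""
    lines = []
    for group in groups:
        lines.append(f"{group.title}:")
        lines.append("")
        for svc in group.services:
            state = "OK" if svc.ok else "Failure"
            suffix = f" ({svc.note})" if svc.note else ""
            lines.append(f"  • {svc.name}: {state}{suffix}")
        lines.append("")
    # Drop the trailing blank line left by the last group.
    return "\n".join(lines).rstrip()
```

Recording the note on the dependent service keeps the root cause (here, NFS) visible next to the indirect failure it triggers.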

Event Timeline

Krinkle raised the priority of this task from to Needs Triage.
Krinkle updated the task description.
Krinkle added projects: Cloud-VPS, Cloud-Services.
Krinkle subscribed.
chasemp set Security to None.

This is critical for proper communication with end users.

Examples of open source status dashboards:

GTirloni renamed this task from Labs needs a reliable and communicative status dashboard to Adopt service status dashboard. Feb 21 2019, 12:01 PM
GTirloni moved this task from Inbox to Blocked on the cloud-services-team (Kanban) board.
aborrero subscribed.

Sorry to enter this empty comment, but I couldn't agree more with this phab task.

Said status page should probably not exist as a normal CloudVPS project/Toolforge tool to avoid service outages making the status page inaccessible. Any ideas where? Separate production host/vm/service? Somewhere external like wikitech-static? Just have it hosted on WMCS systems and treat "status page not available" as "something is definitely horribly wrong"?

Another question is that the data needs to come from somewhere, preferably automatically. How would that be done? I'd guess Prometheus alerts (T266050) would be the way to go. Detailed information, status updates and planned maintenance would still probably have to be added manually.
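
As a rough sketch of the Prometheus idea: the Prometheus HTTP API exposes active alerts at `/api/v1/alerts`, and a status page could map firing alerts onto service names. Everything specific here is an assumption: the `service` label, the list of service names, and the `base_url` argument are hypothetical, and the real alert rules may be labelled quite differently.

```python
import json
import urllib.request
from typing import Dict, List


def fetch_alerts(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch active alerts from a Prometheus server (base_url is an assumption)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/alerts", timeout=timeout) as resp:
        return json.load(resp)


def statuses_from_alerts(payload: dict, services: List[str]) -> Dict[str, str]:
    """Mark a service as "Failure" if any firing alert carries its name.

    Assumes each alert has a hypothetical `service` label; alerts without
    it, or for unknown services, are ignored.
    """
    status = {name: "OK" for name in services}
    for alert in payload.get("data", {}).get("alerts", []):
        if alert.get("state") != "firing":
            continue  # skip pending alerts
        service = alert.get("labels", {}).get("service")
        if service in status:
            status[service] = "Failure"
    return status
```

Detailed information and planned maintenance would still need a manual channel, as noted above; this only covers the automatic part.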

See also potentially related tasks:

In T95922#6811861, @Majavah wrote:

Said status page should probably not exist as a normal CloudVPS project/Toolforge tool to avoid service outages making the status page inaccessible. Any ideas where? Separate production host/vm/service? Somewhere external like wikitech-static? Just have it hosted on WMCS systems and treat "status page not available" as "something is definitely horribly wrong"?

I think starting with something basic on Toolforge would be a good idea (don't let perfect be the enemy of the good, etc.). It could check connecting to the dbs, test outbound network to the wikis, etc.

Another question is that the data needs to come from somewhere, preferably automatically. How would that be done? I'd guess Prometheus alerts (T266050) would be the way to go. Detailed information, status updates and planned maintenance would still probably have to be added manually.

In theory everything in Prometheus should be accessible by Grafana (I think...), in which case maybe making a Grafana dashboard might be the way to go?

I created a Toolforge tool named cloud-status (since status was taken) and tried installing a few different still-maintained OSS status pages and didn't manage to get any of them properly working. I think I'll just write a very basic one myself instead of dealing with the hassle of trying to get one working.

In T95922#6812561, @Majavah wrote:

I think I'll just write a very basic one myself instead of dealing with the hassle of trying to get one working.

What are you going to monitor? How are you going to monitor it? What will make this monitoring canonical?

The hard part of this is not the software that shows status, it is defining what the services are and how to measure their "up" status.

What are you going to monitor? How are you going to monitor it? What will make this monitoring canonical?

I think there's a lot of low hanging fruit, such as:

  • connect to mysql replicas
  • connect to tools-db
  • connect to tools-redis
  • NFS, maybe stat a known file with a timeout?
  • networking to wikis, maybe an API request?
  • grid engine, maybe display the status of the existing monitoring check? (not sure where that is)

Plus a link to the cloud-announce archives, I think that would cover what would take down a majority of tools and bots (but leave an isolated k8s tool up), though obviously not 100% or even 90% complete.
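
To make the list above concrete, here is a minimal sketch of two of the probes using only the Python standard library: a TCP connect with a timeout (usable for the database and Redis checks) and a `stat` of a known file with a timeout (for NFS). The function names are hypothetical, and a successful TCP connect only shows that the port is reachable, not that the service behind it is healthy.

```python
import os
import socket
import threading


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, DNS failures, and timeouts
        return False


def can_stat(path: str, timeout: float = 3.0) -> bool:
    """Return True if os.stat(path) succeeds within the timeout.

    The stat runs in a daemon thread because a hung NFS mount can block
    indefinitely; on timeout the thread is abandoned rather than cancelled.
    """
    result = []

    def worker() -> None:
        try:
            os.stat(path)
            result.append(True)
        except OSError:
            result.append(False)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    # An empty result means the stat did not finish in time.
    return bool(result and result[0])
```

A call might look like `can_connect("tools-redis", 6379)`, where the hostname is a placeholder taken from the list above and 6379 is the default Redis port.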

This is sounding a lot like a frontend to toolschecker that people can actually read, as far as low-hanging fruit goes. It does most of that, but it is only read by Icinga, which external users cannot see.

This topic came up again during the recent WMCS offsite in Berlin. An additional idea that was discussed is having some form of IRC integration, i.e. the ability to update the status dashboard by writing something in IRC. Currently we are manually updating the IRC channel topic to signal any ongoing maintenance or incident, but many users are not checking IRC. If we could replicate that message to a web status page, it might be easier for users to find it.

As an example, see this recent comment from a user who struggled to find the information about a maintenance operation.
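
One possible shape for the IRC idea, sketched under heavy assumptions: a bot (not shown) passes the current channel topic to a function that turns it into a small JSON record for the status page to display. The function name `status_record`, the JSON field names, and the fallback text are all hypothetical.

```python
import json
from datetime import datetime, timezone


def status_record(topic: str, updated: datetime) -> str:
    """Build a JSON status record from an IRC channel topic.

    `updated` must be timezone-aware; it is stored as a UTC ISO 8601 string.
    """
    message = " ".join(topic.split())  # collapse runs of whitespace
    record = {
        "message": message or "No status set",
        "updated": updated.astimezone(timezone.utc).isoformat(timespec="seconds"),
    }
    return json.dumps(record, ensure_ascii=False)
```

Storing the update time alongside the message would let the page show how old the status is, as in the "5 minutes ago" mock-up in the task description.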