
Implement an accurate and easy to understand status page for all wikis
Open, Medium, Public

Description

In T199816 we deprecated status.wikimedia.org, and replaced it with a static HTML page that points users to a grafana dashboard. This is a reasonable intermediate step, but is not a particularly user-friendly way to indicate when a particular wiki, group of wikis, or partial functionality is unavailable.

We should implement a status page that tells users when some or all functionality is impacted, and that displays this information in a user-friendly, easy-to-understand way. Ease of use and accuracy are very important and should be taken into account by any implementation. (These were the biggest issues with the prior Watchmouse implementation, and resulted in that implementation being removed.)

For reference, the previous site

Event Timeline

I think a good example we could build off of would be https://status.discordapp.com/ as it has the basics and explains why an issue happens. We could easily expand something like this to fit the needs of Wikimedia. Discord uses an external provider; however, the look of the page itself is quite easy to understand.

I just stumbled upon https://www.githubstatus.com/ (GitHub had an outage) and I quite liked the per-day "green, yellow, red" timeline covering the past ninety days (green = the whole day was okay, yellow = degradation below one hour, red = degradation above one hour).
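To make the colour rule above concrete, here is a minimal sketch of how such a timeline could be computed. The names `DayStatus`, `classify_day` and `build_timeline` are hypothetical, the input is assumed to be a list of degraded minutes per day (oldest first), and treating exactly 60 minutes as red is an assumption, since the description above leaves that boundary open.

```python
from enum import Enum
from typing import List


class DayStatus(Enum):
    GREEN = "green"    # the whole day was okay
    YELLOW = "yellow"  # degradation below one hour
    RED = "red"        # degradation of one hour or more (boundary is an assumption)


def classify_day(degraded_minutes: int) -> DayStatus:
    """Map the total degraded minutes in one day to a timeline colour."""
    if degraded_minutes < 0:
        raise ValueError("degraded_minutes must not be negative")
    if degraded_minutes == 0:
        return DayStatus.GREEN
    if degraded_minutes < 60:
        return DayStatus.YELLOW
    return DayStatus.RED


def build_timeline(degraded_minutes_by_day: List[int], days: int = 90) -> List[DayStatus]:
    """Classify the most recent `days` entries (input is ordered oldest first)."""
    if days <= 0:
        raise ValueError("days must be positive")
    recent = degraded_minutes_by_day[-days:]
    return [classify_day(minutes) for minutes in recent]
```

For example, `build_timeline([0, 15, 90])` would yield one green, one yellow and one red day under these assumptions.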

Technically in progress between @CDanis and me.

lmata mentioned this in Unknown Object (Task). Apr 29 2021, 9:02 PM

Quote opened in T281530

https://wikimedia.statuspage.io/ is live and will continue to see more development, usage and integration. Further efforts to be tracked in the Observability Workboard.

> https://wikimedia.statuspage.io/ is live and will continue to see more development, usage and integration. Further efforts to be tracked in the Observability Workboard.

Is there documentation about this somewhere? E.g. what does "Editing" mean, which API is "API" about, what developer tools are being watched, etc.?

Will this end up under a *.wikimedia.org domain or something like wikimediastatus.org? And is this link ready to be spread to wiki users on various technical help pages?

lmata reopened this task as Open. Jun 24 2021, 7:43 PM

> Is there documentation about this somewhere? E.g. what does "Editing" mean, which API is "API" about, what developer tools are being watched, etc.?

Specific implementation (non-vendor) documentation has not been created. Some context for the Atlassian StatusPage tool selection can be found in T281530. I will update this task with the metric details in a follow-up.

> Will this end up under a *.wikimedia.org domain or something like wikimediastatus.org?

This is not in the immediate plan yet. The ONFIRE group discussed this internally and thought that keeping the status page decoupled from all internal and production infrastructure (including DNS) could provide more resiliency in a catastrophic event. Also, this implementation is in its early stages; the goal is to adopt the service on the observability side and continue to expand on what has been built.

> And is this link ready to be spread to wiki users on various technical help pages?

Not quite but we are planning an update in the SRE staff meeting when it is fully ready for public consumption.

Will reopen the task to update documentation and to flag when ready for public consumption.

> https://wikimedia.statuspage.io/ is live and will continue to see more development, usage and integration. Further efforts to be tracked in the Observability Workboard.

> Is there documentation about this somewhere? E.g. what does "Editing" mean, which API is "API" about, what developer tools are being watched, etc.?

The section headers are not finalized -- honestly I'm not happy with them, despite having chosen them. These were just the ideas that we hated the least. Documentation to be written, likely after some more discussion and some experience.

"Editing" is intended to reflect "actions that affect logged-in users or users making edits". Everything from database read-only time for maintenance to issues that predominantly affect logged-in users (for example, high appserver latency) would go here.

"API" is intended to reflect any WMF-provided API (but not Wikimedia Enterprise). "Apps & APIs" would probably be clearer to users -- the mobile apps are highly dependent on the API servers of course.

> Will this end up under a *.wikimedia.org domain or something like wikimediastatus.org?

I was thinking we would have a status.wikimedia.org that serves an HTTP 302 to the other domain.
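As a rough illustration of the redirect idea, here is a minimal sketch using only the Python standard library. It is not how the redirect would necessarily be deployed; the handler name, the `make_server` helper, the bind address and the port are all assumptions made for the example, and the target is the statuspage.io URL mentioned earlier in this task.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Target taken from this task; everything else below is illustrative.
STATUS_PAGE_URL = "https://wikimedia.statuspage.io/"


class RedirectHandler(BaseHTTPRequestHandler):
    """Answer every GET or HEAD request with a temporary (302) redirect."""

    def _redirect(self) -> None:
        self.send_response(302)
        self.send_header("Location", STATUS_PAGE_URL)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self) -> None:
        self._redirect()

    def do_HEAD(self) -> None:
        self._redirect()

    def log_message(self, format, *args) -> None:
        # Keep the sketch quiet; a real service would log requests.
        pass


def make_server(host: str = "127.0.0.1", port: int = 8080) -> HTTPServer:
    """Build (but do not start) a server that redirects to the status page."""
    return HTTPServer((host, port), RedirectHandler)
```

Calling `make_server().serve_forever()` would start it locally. A 302 rather than a 301 keeps the redirect temporary, so clients should not cache it permanently if the target changes later.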

We've also talked about registering other dedicated domains for this purpose, but haven't done that yet. As Leo said we do really want this to be entirely hosted off-infra so that it is accessible even in a true disaster scenario.

> And is this link ready to be spread to wiki users on various technical help pages?

I'd say not yet, but soon. Before we popularize it, I would want to make sure we've updated the IR procedure with instructions on how and when to update it, and done some SRE training.

> https://wikimedia.statuspage.io/ is live and will continue to see more development, usage and integration. Further efforts to be tracked in the Observability Workboard.

> Is there documentation about this somewhere? E.g. what does "Editing" mean, which API is "API" about, what developer tools are being watched, etc.?

> The section headers are not finalized -- honestly I'm not happy with them, despite having chosen them. These were just the ideas that we hated the least. Documentation to be written, likely after some more discussion and some experience.

After some more thought, I've removed "Developer tools" for now. Not only is there a lot to potentially cover there, but also, my thinking is that technical Wikimedians who are engaged with the developer community already know where to go to check the status of things in a way that works for them -- whether that be Grafana, or IRC, etc.

In the event that there's broad disagreement with this, we can do some better scoping of what to include and add it back.

> After some more thought, I've removed "Developer tools" for now. Not only is there a lot to potentially cover there, but also, my thinking is that technical Wikimedians who are engaged with the developer community already know where to go to check the status of things in a way that works for them -- whether that be Grafana, or IRC, etc.

> In the event that there's broad disagreement with this, we can do some better scoping of what to include and add it back.

Makes sense to me. Can we include a link to Grafana on that page? "Details about other services may be found in our Grafana dashboards" or something.

> I was thinking we would have a status.wikimedia.org that serves an HTTP 302 to the other domain.

I think we want to avoid offsite redirects when possible (cf. T284222), but it makes sense in this case. Before doing so, though, we should define what privacy policy applies to the site (I noticed Cloudfront and Google requests), because people will ask.

> We've also talked about registering other dedicated domains for this purpose, but haven't done that yet. As Leo said we do really want this to be entirely hosted off-infra so that it is accessible even in a true disaster scenario.

+1 to having a memorable domain that's independent of any service provider.

> I was thinking we would have a status.wikimedia.org that serves an HTTP 302 to the other domain.

Came here to say this as well, tried http://status.wikimedia.org today and was surprised to find a deprecation notice there.

>>! In T202061#7184588, @Legoktm wrote:
> I think we want to avoid offsite redirects when possible (cf. T284222), but it makes sense in this case. Before doing so though we should define what privacy policy applies to the site (I noticed Cloudfront and Google requests) because people will ask.

We could also meet in the middle by serving a link and information about the Atlassian privacy policy. That would be an improvement over the current "status.wm.o is deprecated" notice. AFAIK the current page is managed manually, so I'm proposing some wording here.

https://status.wikimedia.org has been moved to https://wikimedia.statuspage.io

To help ensure availability during Wikimedia infrastructure outages, the Wikimedia status page is now hosted with Atlassian Statuspage. Be advised that the Atlassian privacy policy applies when visiting wikimedia.statuspage.io; please see https://www.atlassian.com/legal/privacy-policy for further detail.

>>! In T202061#7176114, @CDanis wrote:
> We've also talked about registering other dedicated domains for this purpose, but haven't done that yet. As Leo said we do really want this to be entirely hosted off-infra so that it is accessible even in a true disaster scenario.

Along with a dedicated domain such as "wikipediastatus.org", I think we should consider responding to status.wikipedia.org and other domains. It's fairly commonplace and intuitive to try prefixing a domain with "status" when searching for its status page, and we could simply use the same vhost on wikitech-static to serve a static "hint" page, as outlined above, to point users towards the canonical status page address.
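A rough sketch of that idea follows, assuming a single handler that sees the request's Host header. The function names are hypothetical, the matching rule (any hostname beginning with "status.") is an assumption, and the page text only paraphrases the wording proposed above. It says nothing about how the wikitech-static vhost is actually configured.

```python
CANONICAL_STATUS_URL = "https://wikimedia.statuspage.io"
PRIVACY_POLICY_URL = "https://www.atlassian.com/legal/privacy-policy"


def is_status_host(host_header: str) -> bool:
    """Return True for hostnames such as status.wikipedia.org (hypothetical rule)."""
    # Drop an optional ":port" suffix, normalise case, ignore a trailing dot.
    hostname = host_header.split(":", 1)[0].strip().lower().rstrip(".")
    return hostname.startswith("status.") and len(hostname) > len("status.")


def render_hint_page() -> str:
    """Build the static hint page pointing at the canonical status page."""
    return (
        "<!DOCTYPE html>\n"
        "<html><head><title>Wikimedia status</title></head><body>\n"
        f'<p>The Wikimedia status page has moved to '
        f'<a href="{CANONICAL_STATUS_URL}">{CANONICAL_STATUS_URL}</a>.</p>\n'
        f'<p>The Atlassian privacy policy applies there; see '
        f'<a href="{PRIVACY_POLICY_URL}">{PRIVACY_POLICY_URL}</a>.</p>\n'
        "</body></html>\n"
    )
```

Serving a page rather than redirecting means the visitor sees the privacy notice before leaving for the externally hosted site, which is the "meet in the middle" option discussed earlier.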