Implement an accurate and easy to understand status page for all wikis
Open, Medium, Public

Assigned To
None
Authored By
Imarlier
Aug 16 2018, 2:50 PM
Referenced Files
F31755516: image.png
Apr 13 2020, 8:53 PM

Description

In T199816 we deprecated status.wikimedia.org, and replaced it with a static HTML page that points users to a grafana dashboard. This is a reasonable intermediate step, but is not a particularly user-friendly way to indicate when a particular wiki, group of wikis, or partial functionality is unavailable.

We should implement a status page that gives users information about when some/all functionality is impacted; and displays that information in a user-friendly and easy to understand way. Ease of use and accuracy are very important and should be taken into account by any implementation. (These were the biggest issues with the prior Watchmouse implementation, and resulted in that implementation being removed.)

For reference, the previous site

Event Timeline

I think a good example we could build off of would be https://status.discordapp.com/ as it has the basics and explains why an issue happens. We could easily expand something like this to fit the needs of Wikimedia. Discord uses an external provider, however the look of the page itself is quite easy to understand.

I just stumbled upon https://www.githubstatus.com/ (GitHub had an outage) and I quite liked the per-day timeline of "green, yellow, red" covering the past ninety days (green = the whole day was okay, yellow = degradation below one hour, red = degradation above one hour).
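To make that coloring rule concrete, here is a minimal sketch. The function names are hypothetical, and treating exactly one hour as "red" is an assumption, since the rule above only says "below" and "above" one hour. This is not GitHub's actual logic.

```python
from datetime import timedelta


def day_color(degraded: timedelta) -> str:
    """Map one day's total degraded time to a timeline color."""
    if degraded <= timedelta(0):
        return "green"   # the whole day was okay
    if degraded < timedelta(hours=1):
        return "yellow"  # degraded for less than one hour
    return "red"         # one hour or more (boundary choice is an assumption)


def timeline(degraded_per_day):
    """Color a sequence of per-day degraded durations, oldest first."""
    return [day_color(d) for d in degraded_per_day]
```

A ninety-day view would then just be `timeline(...)` applied to ninety per-day totals.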

Technically in progress between @CDanis and me.

lmata mentioned this in Unknown Object (Task). Apr 29 2021, 9:02 PM

Quote opened in T281530

https://wikimedia.statuspage.io/ is live and will continue to see more development, usage and integration. Further efforts to be tracked in the Observability Workboard.

https://wikimedia.statuspage.io/ is live and will continue to see more development, usage and integration. Further efforts to be tracked in the Observability Workboard.

Is there documentation about this somewhere? E.g. what does "Editing" mean, which API is "API" about, what developer tools are being watched, etc.?

Will this end up under a *.wikimedia.org domain or something like wikimediastatus.org? And is this link ready to be spread to wiki users on various technical help pages?

lmata reopened this task as Open. Jun 24 2021, 7:43 PM

Is there documentation about this somewhere? E.g. what does "Editing" mean, which API is "API" about, what developer tools are being watched, etc.?

Specific implementation (non vendor) documentation has not been created. Some context for the Atlassian StatusPage tool selection can be found in T281530. I will update this task with the metric details in a follow up.

Will this end up under a *.wikimedia.org domain or something like wikimediastatus.org?

This is not in the immediate plan yet. The ONFIRE group discussed internally and thought that keeping this status page separate from all infrastructure (including DNS) could provide more resiliency in the event of a catastrophic event, in the sense that decoupling the status page from all internal and production infrastructure might be the better option. Also this implementation is in its early stages and the goal is to adopt the service on the observability side and continue to expand on what has been built.

And is this link ready to be spread to wiki users on various technical help pages?

Not quite but we are planning an update in the SRE staff meeting when it is fully ready for public consumption.

Will reopen the task to update documentation and to flag when ready for public consumption.

https://wikimedia.statuspage.io/ is live and will continue to see more development, usage and integration. Further efforts to be tracked in the Observability Workboard.

Is there documentation about this somewhere? E.g. what does "Editing" mean, which API is "API" about, what developer tools are being watched, etc.?

The section headers are not finalized -- honestly I'm not happy with them, despite having chosen them. These were just the ideas that we hated the least. Documentation to be written, likely after some more discussion and some experience.

"Editing" is intended to reflect "actions that affect logged-in users or users making edits". Everything from database read-only time for maintenance, to issues that predominantly affect logged-in users (for example high appserver latency) would go here.

"API" is intended to reflect any WMF-provided API (but not Wikimedia Enterprise). "Apps & APIs" would probably be clearer to users -- the mobile apps are highly dependent on the API servers of course.

Will this end up under a *.wikimedia.org domain or something like wikimediastatus.org?

I was thinking we would have a status.wikimedia.org that serves an HTTP 302 to the other domain.

We've also talked about registering other dedicated domains for this purpose, but haven't done that yet. As Leo said we do really want this to be entirely hosted off-infra so that it is accessible even in a true disaster scenario.
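For illustration only, the redirect idea could look roughly like the sketch below, using Python's standard library. In practice this would more likely be a web server config rule; `RedirectHandler` and `make_server` are hypothetical names, and the target is simply the status page URL mentioned earlier in this task.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Target taken from this task; any externally hosted status page would do.
TARGET = "https://wikimedia.statuspage.io/"


class RedirectHandler(BaseHTTPRequestHandler):
    """Answer every GET/HEAD with a temporary (302) redirect to TARGET."""

    def do_GET(self):
        self.send_response(302)
        self.send_header("Location", TARGET)
        self.send_header("Content-Length", "0")
        self.end_headers()

    # HEAD requests get the same headers and no body.
    do_HEAD = do_GET


def make_server(port=0, host="127.0.0.1"):
    """Create (but do not start) the redirect server; port 0 picks a free port."""
    return HTTPServer((host, port), RedirectHandler)
```

Calling `make_server(8080).serve_forever()` would run it locally. A 302 rather than a 301 keeps the redirect easy to change later, since clients should not cache it permanently.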

And is this link ready to be spread to wiki users on various technical help pages?

I'd say not yet, but soon. Before we popularize it I would want to make sure we've updated the IR procedure with instructions on how and when to update, and done some SRE training.

https://wikimedia.statuspage.io/ is live and will continue to see more development, usage and integration. Further efforts to be tracked in the Observability Workboard.

Is there documentation about this somewhere? E.g. what does "Editing" mean, which API is "API" about, what developer tools are being watched, etc.?

The section headers are not finalized -- honestly I'm not happy with them, despite having chosen them. These were just the ideas that we hated the least. Documentation to be written, likely after some more discussion and some experience.

After some more thought, I've removed "Developer tools" for now. Not only is there a lot to potentially cover there, but also, my thinking is that technical Wikimedians who are engaged with the developer community already know where to go to check the status of things in a way that works for them -- whether that be Grafana, or IRC, etc.

In the event that there's broad disagreement with this, we can do some better scoping of what to include and add it back.

After some more thought, I've removed "Developer tools" for now. Not only is there a lot to potentially cover there, but also, my thinking is that technical Wikimedians who are engaged with the developer community already know where to go to check the status of things in a way that works for them -- whether that be Grafana, or IRC, etc.

In the event that there's broad disagreement with this, we can do some better scoping of what to include and add it back.

Makes sense to me. Can we include a link to Grafana on that page? "Details about other services may be found in our Grafana dashboards" or something.

I was thinking we would have a status.wikimedia.org that serves an HTTP 302 to the other domain.

I think we want to avoid offsite redirects when possible (c.f. T284222), but it makes sense in this case. Before doing so though we should define what privacy policy applies to the site (I noticed Cloudfront and Google requests) because people will ask.

We've also talked about registering other dedicated domains for this purpose, but haven't done that yet. As Leo said we do really want this to be entirely hosted off-infra so that it is accessible even in a true disaster scenario.

+1 to having a memorable domain that's independent of any service provider.

I was thinking we would have a status.wikimedia.org that serves an HTTP 302 to the other domain.

Came here to say this as well, tried http://status.wikimedia.org today and was surprised to find a deprecation notice there.

In T202061#7184588, @Legoktm wrote:
I think we want to avoid offsite redirects when possible (c.f. T284222), but it makes sense in this case. Before doing so though we should define what privacy policy applies to the site (I noticed Cloudfront and Google requests) because people will ask.

We could also meet in the middle by serving a link along with information about the Atlassian privacy policy. That would be an improvement over the current "status.wm.o is deprecated" notice. AFAIK the current page is managed manually, so I'm proposing some wording here:

https://status.wikimedia.org has been moved to https://wikimedia.statuspage.io

To help ensure availability during Wikimedia infrastructure outages, the Wikimedia status page is now hosted with Atlassian Statuspage. Be advised that the Atlassian privacy policy applies when visiting wikimedia.statuspage.io. Please see https://www.atlassian.com/legal/privacy-policy for further detail.

In T202061#7176114, @CDanis wrote:
We've also talked about registering other dedicated domains for this purpose, but haven't done that yet. As Leo said we do really want this to be entirely hosted off-infra so that it is accessible even in a true disaster scenario.

Along with e.g. "wikipediastatus.org", I think we should consider responding to status.wikipedia.org and other domains. I think it's fairly commonplace and intuitive to try prefixing a domain with "status" when searching for its status page, and we could simply use the same vhost on wikitech-static to serve a static "hint" page as outlined above to point users towards the canonical status page address.

After some more experience with actually updating the page during incidents, and some more thought, for the sake of simplicity I think it would be best to remove the "Content Delivery Network" component and probably also the "Apps & API" component.

This would leave just two components:

  • "Reading": is the site accessible by logged-out/anonymous users? Is a read-only experience mostly working? From a user's perspective, issues with the CDN manifest here (and users shouldn't need to understand what a CDN is). Some API issues could manifest here. And while we haven't yet had an outage I'm aware of that only affected functionality on mobile apps, it seems reasonable to have them manifest under 'reading'.
  • "Editing": can changes be saved? Do logged-in users have good performance? etc. Some API issues will manifest here.

I think this would make things clearer for the average user, and also would reduce cognitive load on SREs posting updates.

Is there any public documentation regarding the selection of the statuspage.io service and which users this is supposed to serve?

And while we haven't yet had an outage I'm aware of that only affected functionality on mobile apps, it seems reasonable to have them manifest under 'reading'.

We did have an outage "recently" that only affected mobile apps. See https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-09-06_Wikifeeds

However, no disagreement on categorizing mobile apps under "Reading" for now. If we end up having a significant number of outages that affect just mobile applications we can always revisit and split them off from "Reading".

"Editing": can changes be saved? Do logged-in users have good performance? etc. Some API issues will manifest here.

Issues when editing via an app will also manifest here.

I think this would make things clearer for the average user, and also would reduce cognitive load on SREs posting updates.

Since we are targeting the average user, I agree that talking about the CDN or the APIs doesn't make much sense. And with our current rate of app-specific outages, it's not worth dedicating an entire section to them.

Is there any public documentation regarding the selection of the statuspage.io service and which users this is supposed to serve?

It's primarily designed to serve the general public and the news media. Of course we expect community members to also use it as a resource, but we certainly don't mean to replace, for example, on-wiki technical village pumps. The focus is on very visible/widespread outages.

We selected statuspage.io with the following considerations:

  • Because we want the site to be working even in a widespread failure of Wikimedia infrastructure, any solution needs to be hosted externally
  • We decided we did not want to take on the engineering effort needed to run scalable external hosting + separate CDN
  • There are very few FLOSS status page projects that are more than just "toy" projects, and of those which aren't, even fewer are actively maintained
  • statuspage.io had some distinguishing features: not just the basic manually-posted up/down functionality, but also support for automated uploads of timeseries metrics, and SLO-like uptime history on each component

I'll put the above in a doc page on Wikitech Soon™
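To illustrate what an "SLO-like uptime history" figure boils down to, here is a minimal sketch of an uptime percentage computed from a list of outage intervals. This is not how statuspage.io computes its numbers. `uptime_percent` is a hypothetical helper, and it assumes the outage intervals do not overlap each other.

```python
from datetime import datetime, timedelta


def uptime_percent(window_start, window_end, outages):
    """Percentage of the window not covered by outages.

    `outages` is an iterable of (start, end) datetime pairs, assumed
    non-overlapping; parts outside the window are clipped away.
    """
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window_end must be after window_start")
    down = 0.0
    for start, end in outages:
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (total - down) / total
```

For scale: over a 30-day window, a single outage of 43.2 minutes works out to 99.9% uptime.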

Hi @CDanis!

Would it be possible to also update status.wikimedia.org to redirect to wikimediastatus.net? Having the older deprecated link might be a bit confusing.

@lmata yeah, sorry, that's been on my backlog but I had been putting it off in the hopes that T292347 would get resolved first. For now I'll just plunge on and we'll clean it up later.

@lmata yeah, sorry, that's been on my backlog but I had been putting it off in the hopes that T292347 would get resolved first. For now I'll just plunge on and we'll clean it up later.

A reminder to maybe include the info currently on that page in a footer or somewhere, if people still need to find those resources?

status.wikimedia.org is now up-to-date.

One other thing I'm considering is asking ITS for a WMF Slack webhook URL to use to automatically post status updates to #general or a similar channel.

Thanks @CDanis! also +1 to the webhook/slack proposal

New incidents and other posts on the status page will now automatically be posted to #talk-to-sre on WMF Slack.
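The task does not say how this integration is wired up, so the following is only a sketch of the general approach: posting a JSON body with a `text` field to a Slack incoming webhook. The helper names, the message format, and the webhook URL passed in are all hypothetical.

```python
import json
import urllib.request


def build_slack_request(webhook_url, incident_name, status, link):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = {"text": f"[{status}] {incident_name}: {link}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_update(webhook_url, incident_name, status, link):
    """Send the update and return the HTTP status code of the response."""
    req = build_slack_request(webhook_url, incident_name, status, link)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Keeping request construction separate from sending makes the message format easy to check without touching the network.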

That is very cool, thanks! Would it be interesting to replicate similar behavior for #wikimedia-operations or #wikimedia-sre, or would that be too redundant? More of a question of whether we should or not as opposed to a feature request. ツ

I'm aware that the Atlassian privacy policy applies (and I'm sure we have investigated it for at least some minimal level of compliance?). I'd just like to mention that we can provide our own link for a privacy policy (Manage -> Your page -> Page info), where perhaps we can explain the concerns more clearly and then link to Atlassian's privacy policy. The downside of course is that anything we link to is likely to be offline during an outage, unlike Atlassian's direct page.

An alternate idea is to add a small note via "About this page" (Manage -> Your page -> Customise page and email), to make it clear that the standard WMF privacy policy does not apply, which would get added at the top of the page. Just a thought.

Also, a feature request for them, now that we have a contract: reimplement their site to work without depending on polyfill.io. That removes one more domain dependency and HTTP connection, which is better for privacy, performance and reliability.

I'm assuming the reCAPTCHA dependency is for the subscribe button (even though we don't provide email/SMS updates, probably because of privacy concerns?).

lmata moved this task from In progress to Done on the SRE Observability (FY2021/2022-Q4) board.

StatusPage is now officially launched and in service. While relevant and still needed, the open items here are not on the critical path for continued use and are probably good to backlog in place of other priorities for a few quarters. So I'll be bumping this task back into the backlog.

lmata removed lmata as the assignee of this task. Sep 18 2022, 3:44 PM