
Increase trusted volunteer's visibility into production incidents
Open, Needs Triage | Public | Feature

Description

Steps to replicate the issue (include links if applicable):

  • Become a trusted volunteer
  • See reports for X happening (or happened yesterday)
  • Be unable to have any visible insight into the production incident process

What happens?:
https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/GUGCSE7JKQNSXWKCFHPCXEL7BLAVRTD6/

What should have happened instead?:

Not sure what the solution is here, but we need to be more open, especially to trusted volunteers who clearly are not adversaries here. At the very minimum:

  • Volunteers should be able to confirm or deny the presence of production incidents (to give an example, I've had reports of people seeing a "too many requests" screen; is that collateral damage from a misconfiguration or just anti-abuse working as intended? Should I file a bug?)
  • Be able to see why a specific incident occurred and whether it was resolved. We (the community) are stakeholders in the process and do deserve to know if major outages are being caused by scrapers, by faulty code, etc.

Besides transparency, there are sometimes significant upsides to talking to volunteers about incidents, since a community perspective can bring up problems before they manifest. To give a recent example, T261752 was discovered during a discussion about how anti-abuse measures on Discord were affecting users. This kind of discussion should have occurred on Phabricator when/before the rate limits were finalized.

Event Timeline

Peachey88 changed the subtype of this task from "Bug Report" to "Feature Request".

I agree with the sentiments of this ticket. I am a trusted volunteer (I'm in the Phabricator groups acl*security and WMF-NDA), and I can't see much about public or private incidents. It seems like most of this has moved to Google Docs, which is out of sight of most volunteers. Maybe someday I will apply for access to the IRC channel #mediawiki_security, but it would be nice to be able to read more about incidents via Phabricator (private incidents) and via Wikitech (public incidents).

Any changes that publish more details about incidents to Phabricator and to Wikitech would be helpful to trusted volunteers.

One thing that might be actionable right now: For public incidents, maybe we can start adding links to phab tickets to https://wikitech.wikimedia.org/wiki/Incident_status. That way for public incidents we still have a place to find them.

Although according to https://wikitech.wikimedia.org/wiki/Corto#Access_control, CortoBot defaults to private tasks, so that probably builds in a major delay for the public seeing public incidents, since these private tickets need to be made public first.

Maybe the Corto IRC bot's create command should be split into create public and create private, and if create public is selected, the Phab ticket starts its life off public? create public could also write the Phab ticket link to https://wikitech.wikimedia.org/wiki/Incident_status.

That wouldn't solve the google doc being restricted to WMF staff, but would let folks know that public incidents exist in a timely manner.
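
As a rough illustration of the create public / create private split suggested above, here is a minimal sketch in Python. Everything in it is hypothetical: the function names, the IncidentRequest type, and the exact command syntax are assumptions for discussion, not Corto's actual code or API, and it does not talk to IRC, Phabricator, or Wikitech.

```python
from dataclasses import dataclass

# Hypothetical visibility labels; the real Corto configuration may differ.
PUBLIC = "public"
PRIVATE = "private"


@dataclass
class IncidentRequest:
    """What the bot would need to know before creating a Phab task."""
    title: str
    visibility: str


def parse_create_command(message: str, default_visibility: str = PRIVATE) -> IncidentRequest:
    """Parse 'create [public|private] <title>' into an IncidentRequest.

    A bare 'create <title>' falls back to default_visibility, so the
    current behaviour stays the same unless someone opts in.
    Caveat: a title whose first word is 'public' or 'private' would be
    read as the visibility keyword.
    """
    words = message.split()
    if not words or words[0] != "create":
        raise ValueError("expected a message starting with 'create'")
    rest = words[1:]
    visibility = default_visibility
    if rest and rest[0] in (PUBLIC, PRIVATE):
        visibility = rest[0]
        rest = rest[1:]
    if not rest:
        raise ValueError("an incident title is required")
    return IncidentRequest(title=" ".join(rest), visibility=visibility)


def should_list_on_status_page(request: IncidentRequest) -> bool:
    """Only incidents created as public would be linked from Incident_status."""
    return request.visibility == PUBLIC
```

With this shape, "create public upload errors" would produce a public request that qualifies for the Incident_status link, while a plain "create upload errors" keeps the private default.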

Change #1287424 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] corto: set default visibility to WMF-NDA

https://gerrit.wikimedia.org/r/1287424

(cross-referencing to T389664: Reconsider default incident visibility, where the default visibility of incident tasks was previously discussed FWICS)

Some initial context: The kinds of issues SRE are dealing with have changed significantly in the last ~year. Historically, many incidents weren't ever documented on wikitech due to DENY reasons, so there isn't really any major change for that aspect of open visibility. The only change over the last few years is the increased quantity of these DENY-related tasks, related to both scrapers and attackers (which there are a few Diff posts about). The majority of these incidents aren't user-facing, thanks to the protections we have in place and to SREs following our incident response processes, so there is little to report outside of sensitive actions to protect the projects. I do think that we can do more on this, and there has been some work done on standardising how we communicate events like this that I'm hoping we can move forward soon.

For live updates of ongoing incidents that are user-facing, SRE tries to use https://www.wikimediastatus.net/ whenever possible, and it also forms part of our response process. I think we could stand to use this tool more proactively fwiw. For the most part, the wikitech reports were written between 1 and 3 weeks after the incident happened as the output of a longer postmortem discussion. That of course has value of its own, but it wasn't really useful for tracking live incidents. imo iff we handle the Phabricator tasks properly and resolve the visibility issue (see below) we actually get better up-to-date visibility on incidents because the tasks get created at the same time as SRE are initially responding.

More generally I absolutely hear you both on the need for a quick view of current and historical incidents. One of the benefits of keeping the tasks in Phabricator is that we can write more complex queries and create dashboards - I've created an initial incident overview dashboard as a first attempt to get a chronological view of created incidents. It's not perfect however, and making this properly useful is down to SRE and others being more disciplined in keeping tasks up to date and resolved promptly. This doesn't solve the issue of access disparity between tasks.

Currently Corto defaults to creating issues with acl*security. I think the first course of action is to set this to NDA by default rather than this stricter level, with escalating to acl*security as an option if needed (see the linked issue in this task). imo creating public incidents for ongoing incidents is a little risky, but I think after an incident we should have to make a very good argument for keeping an incident task private. For example, I have just added a summary of this recent issue where ulsfo was depooled and content from Commons was impacted for some users. Ideally we'd see this kind of thing happening more promptly.

Just a final note on Google docs: they have been a part of the incident response process (https://wikitech.wikimedia.org/wiki/Incident_response/Runbook) since at least 2020 - the ability to edit quickly, in parallel, with large amounts of info being added is pretty critical to addressing issues quickly. We haven't changed anything about how these are used. What has changed is that incident reports are not being transferred to wikitech after incidents have happened - the proposal is to use the phabricator task as an incident report object as opposed to wikitech. Similar info will be shared; we absolutely want to ensure that we're being as transparent as we can be with information about our incidents.

Thanks for working on this and for the quick and thorough replies. I appreciate it.

For the most part, the wikitech reports were written between 1 and 3 weeks after the incident happened as the output of a longer postmortem discussion.

If I recall correctly, the report page on Wikitech would usually be created during the incident using a mostly empty template, then filled out later. The important part is that a public page for each incident was created pretty quickly, allowing the public to know about recent incidents quickly. Corto tickets defaulting to private visibility probably slows down the public knowing about public incidents, which is why I suggested a create public CortoBot command in a previous comment.

I've created an initial incident overview dashboard

Thanks! This looks pretty good and could perhaps replace the Wikitech page. I've gone ahead and linked to your Dashboard on the Wikitech page. Diff.

Just a final note on Google docs: they have been a part of the incident response process (https://wikitech.wikimedia.org/wiki/Incident_response/Runbook) since at least 2020 - the ability to edit quickly, in parallel, with large amounts of info being added is pretty critical to addressing issues quickly.

Yeap, those are good advantages for the SRE team. It sounds like Google Docs are like a fancier Etherpad and allow lots of people to edit at the same time. It's too bad that platform excludes volunteers though. We should be careful of this trend of moving information into volunteer-restricted and closed-source software such as Google Docs and Slack.

Perhaps a bot could be written to copy paste the contents of these Google Docs into a comment in the Phab ticket after a week or two or however much time it typically takes to write an incident report. And/or perhaps trusted volunteers could be granted read only access to the Google Docs.

[...] but I think after an incident we should have to make a very good argument for keeping an incident task private.

To be clear, I don't in any way doubt the sincerity of your statement here, but I suppose I am just somewhat concerned about the possibility that this ideal might end up slipping somewhat in the (either near or further) future, and that many incident tasks might remain private when they don't have to be (not necessarily through it being anyone's active intention to keep them private, but e.g. potentially just because nobody has actually gone through them to review/make them public).

For the most part, the wikitech reports were written between 1 and 3 weeks after the incident happened as the output of a longer postmortem discussion. That of course has value of its own, but it wasn't really useful for tracking live incidents.

To be clear, at least personally, I think there is a lot of value in being public/transparent with these sorts of postmortem reports. While the task description of (now-public) T425693: "upload at ulsfo depooled due to tcp timeout" does contain some information on what caused that incident & how it was dealt with, it seems like it's more in the form of (somewhat low-level) bullet-points (as is understandable, given that I imagine it might've been typed up during/soon after the incident itself), rather than the sort of (IMO insightful) post-incident analysis that has previously been published as part of incident reports (e.g. what went well, what went poorly, where did we get lucky; etc.).

Just a final note on Google docs: they have been a part of the incident response process (https://wikitech.wikimedia.org/wiki/Incident_response/Runbook) since at least 2020 - the ability to edit quickly, in parallel, with large amounts of info being added is pretty critical to addressing issues quickly. We haven't changed anything about how these are used. What has changed is that incident reports are not being transferred to wikitech after incidents have happened - the proposal is to use the phabricator task as an incident report object as opposed to wikitech. Similar info will be shared; we absolutely want to ensure that we're being as transparent as we can be with information about our incidents.

As a random idea (potentially in addition to what's been written so far), maybe incident Google Docs could also be (considered to be) made public after the conclusion of each incident, if they don't contain any information that can't be public?

It's too bad [Google Docs] excludes volunteers though. We should be careful of this trend of moving information into volunteer-restricted and closed-source software such as Google Docs and Slack.

+1

Change #1287424 merged by Hnowlan:

[operations/puppet@production] corto: set default visibility to WMF-NDA

https://gerrit.wikimedia.org/r/1287424

As of a few minutes ago, all new incidents will be created as WMF-NDA by default. I'm working to move some historical events to WMF-NDA also.

Thanks for working on this and for the quick and thorough replies. I appreciate it.

For the most part, the wikitech reports were written between 1 and 3 weeks after the incident happened as the output of a longer postmortem discussion.

If I recall correctly, the report page on Wikitech would usually be created during the incident using a mostly empty template, then filled out later. The important part is that a public page for each incident was created pretty quickly, allowing the public to know about recent incidents quickly. Corto tickets defaulting to private visibility probably slows down the public knowing about public incidents, which is why I suggested a create public CortoBot command in a previous comment.

In my experience as a historical member of the ONFIRE group, the wikitech reports were created by the person leading the incident review just before the review ritual, which happens every 2 weeks or so. It's worth noting that historically not all incidents were deemed worthy of a postmortem - I'd hazard a guess that less than a third of incidents end up warranting real discussion (which is why I think the corto incident tickets actually offer more insight into current incidents).

I've created an initial incident overview dashboard

Thanks! This looks pretty good and could perhaps replace the Wikitech page. I've gone ahead and linked to your Dashboard on the Wikitech page. Diff.

Thank you!

Just a final note on Google docs: they have been a part of the incident response process (https://wikitech.wikimedia.org/wiki/Incident_response/Runbook) since at least 2020 - the ability to edit quickly, in parallel, with large amounts of info being added is pretty critical to addressing issues quickly.

Perhaps a bot could be written to copy paste the contents of these Google Docs into a comment in the Phab ticket after a week or two or however much time it typically takes to write an incident report. And/or perhaps trusted volunteers could be granted read only access to the Google Docs.

Unfortunately these docs contain highly sensitive information and so publishing them outright is kinda out of the question - but for example the summary that is contained in a Phabricator issue before any incident reviews would be a (carefully redacted) copy and paste of the meat of the gdoc for tracking incidents, which will hopefully lead to more information being available than previously.

[...] but I think after an incident we should have to make a very good argument for keeping an incident task private.

To be clear, I don't in any way doubt the sincerity of your statement here, but I suppose I am just somewhat concerned about the possibility that this ideal might end up slipping somewhat in the (either near or further) future, and that many incident tasks might remain private when they don't have to be (not necessarily through it being anyone's active intention to keep them private, but e.g. potentially just because nobody has actually gone through them to review/make them public).

This is a fair concern. Historically incident follow-up has not been our strong suit outside of the incident review rituals. That said, all of these changes are a result of trying to formalise our incident response and to improve how we follow up on incidents, so this kind of thing will hopefully not slip: visibility and content updates are now part of our post-incident checklist for incident coordinators. A visible facet of our changes is more consistently tracking our action items from incidents.

For the most part, the wikitech reports were written between 1 and 3 weeks after the incident happened as the output of a longer postmortem discussion. That of course has value of its own, but it wasn't really useful for tracking live incidents.

To be clear, at least personally, I think there is a lot of value in being public/transparent with these sorts of postmortem reports. While the task description of (now-public) T425693: "upload at ulsfo depooled due to tcp timeout" does contain some information on what caused that incident & how it was dealt with, it seems like it's more in the form of (somewhat low-level) bullet-points (as is understandable, given that I imagine it might've been typed up during/soon after the incident itself), rather than the sort of (IMO insightful) post-incident analysis that has previously been published as part of incident reports (e.g. what went well, what went poorly, where did we get lucky; etc.).

I totally agree - the Phab tasks for an incident should contain the content of the postmortem/review ritual if it happens. As mentioned though, not all incidents get a postmortem. If it's decided that there aren't enough worthwhile takeaways or nuances to an incident, the discussion just never happens. In the case of T425693, the responding engineers agreed that there wasn't enough to talk about in a ritual, so there isn't more to add.