
Incident: 2022-03-4 Banner sampling leading to a relatively wide site outage (mostly esams)
Closed, Resolved · Public

Description

Configuring a particular banner for all wikis and all users with 100% sampling caused instability in the traffic layer because of the large amount of traffic received. Wikis became unreachable in the impacted region (initially users of the esams datacenter, mostly in Europe, Africa and the Middle East), with some temporary issues at other datacenters (eqiad). The cause was that connections piled up all the way to the applayer (eventgate-analytics-external), and varnish wasn't able to handle it.

This ticket has been created to track the followups and make sure the post mortem is documented on wikitech in the usual places and the incident is scored appropriately.

Event Timeline

Restricted Application added a subscriber: Aklapper.
jbond triaged this task as Medium priority. Mar 21 2022, 11:35 AM
lmata renamed this task from Banner sampling leading to a relatively wide site outage (mostly esams) to Incident: Banner sampling leading to a relatively wide site outage (mostly esams). Apr 28 2022, 10:20 PM
lmata renamed this task from Incident: Banner sampling leading to a relatively wide site outage (mostly esams) to Incident: 2022-03-4 Banner sampling leading to a relatively wide site outage (mostly esams). Apr 28 2022, 10:28 PM

Are there actionables on this task? I'm considering removing the Event Platform tag.

@Ottomata: What remains on this task is to identify the actionables for the incident mentioned in the header (https://docs.google.com/document/d/1xYYzFlJcAP9pckqBWyiXUbs7HN5iThg85lkjv_RUh_o/edit), since they still need to be documented at https://wikitech.wikimedia.org/wiki/Incidents/2022-03-04_esams_availability_banner_sampling . After that, this task can be closed (the actionables only need to be documented, not completed).

I believe there is T303155 already, but I'm unsure whether there are others. Do you know which teams or individual people could help us with that? I believe it wouldn't be DE, but as your layer's availability was affected, maybe your team knows or could help route the task?

@lmata what should we do with this follow up task?

I'm not sure that there's much more to do, is there? From a technical perspective, {T303326} now prevents this from happening again accidentally, so we can no longer configure sampling from CN at a rate of more than 1%.

Do we just need to update https://wikitech.wikimedia.org/wiki/Incidents/2022-03-04_esams_availability_banner_sampling#Actionables and add a link to T303326, before closing this ticket and T303155?

Apologies if I'm missing something about the process.

lmata claimed this task.

Thank you @BTullis for T303036#8423773. I think that note should be enough documentation to close this task; however, if you could tie the update back to the incident report, that would prove most helpful. I'll be resolving this task for completion.