Incident: 2021-12-03 mx2001->Gmail delivery issues
Closed, Resolved (Public)

Description

Tracking task for the 2021-12-03 mx2001 incident, where mail queued up on mx2001 due to timeouts delivering to downstream mail systems, e.g. Gmail.

In the Google doc, the name of this incident is: "2021-12-03 mx2001->Gmail delivery issues"


https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-12-03_mx

Event Timeline

herron triaged this task as Medium priority. Dec 6 2021, 4:27 PM
herron created this task.

This is basically T297017, but I'll take it because I was IC and I interpret this ticket as the doc part: writing the public incident report and putting it on Wikitech (in addition to the existing internal Google doc).

Sure, the doc is part of it. This task is also meant to link together work relating to the incident as a whole and track it through the SRE-OnFire review workflow, e.g. the bug report T297017, cleanup work in T297128, monitoring/alerting improvement actions (tasks TBD), etc.

Dzahn renamed this task from Incident - 2021-12-03: mx2001 to Incident: 2021-12-03 mx2001->Gmail delivery issues. Dec 6 2021, 5:55 PM
Dzahn removed Dzahn as the assignee of this task. Dec 6 2021, 11:49 PM
Dzahn added a subscriber: Dzahn.

Created the public doc, and I'm also done with the private Google doc from my end.

@herron How do you see this as the task creator: should this stay open until all subtasks are resolved? That would mean that even though the actual incident is long over and the server is back in production, we keep this open because, for example, we don't have the Grafana graph yet. Fair enough, but was that the intention, or was it more like "keep it on high prio until we are back to normal and the host is back in prod"?

I think I just want to lower the priority, though, if we keep it as a tracking ticket. Edit: I see it is already Medium and not High, so disregard that.

Personally, I like the approach of keeping it open while there are follow-up actions to do, and adjusting the priority.

I would have expected the Wikitech timeline to contain a final entry for "mx2001" going back into production (i.e. T297128).

Plus, it doesn't list what the long-term fix was. That was to uninstall 5.10.70 and roll back to an earlier version (T297180), but that seems to have been only a short-term fix, since apparently the problem isn't fixed in 5.10.84 either (producing T299107).

Fair enough! Added both items to the public report on Wikitech and the private report on Google Docs. Done.

lmata added a subscriber: lmata.

Docs metadata and scorecard filled in. Resolving.