Essentially:
- If we deploy and the log errors increase, revert immediately.
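A minimal sketch of that rule, assuming the error counts are read from whatever log dashboard is in use. The function name `should_revert` and the `tolerance` parameter are hypothetical and only illustrate the comparison; they are not part of any existing deployment tooling.

```python
def should_revert(baseline_errors: int, post_deploy_errors: int, tolerance: int = 0) -> bool:
    """Return True when the post-deploy error count exceeds the baseline.

    Both counts are assumed to cover time windows of equal length.
    `tolerance` is a hypothetical allowance for noise; 0 means any
    increase at all triggers a revert.
    """
    return post_deploy_errors > baseline_errors + tolerance


if __name__ == "__main__":
    print(should_revert(baseline_errors=12, post_deploy_errors=12))  # no increase
    print(should_revert(baseline_errors=12, post_deploy_errors=15))  # increase
```

Comparing against a baseline taken just before the deploy, rather than against zero, keeps the rule usable while pre-existing errors are still being cleaned up.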
Status | Assigned | Task
---|---|---
Resolved | None | T115629 [EPIC] Enforce "no increase in log errors" during deployments
Resolved | None | T115630 [EPIC] Reduce production log errors to zero*
Resolved | demon | T108749 Implement "WMF Log Errors count" KPI
Declined | None | T115633 Proposal: Force any WARNINGs on Beta Cluster to fail completely
Resolved | demon | T81030 gdash reports for php/apache errors
@greg: The problem is the new branch cut on Tuesday. Since it incorporates a lot of new code, it's difficult to avoid new errors sneaking in.
The errors need to be a lot more visible, honestly. If mediawiki-vagrant and the beta cluster surfaced the errors in a way that's not easily ignored, they would be a lot more likely to be fixed before they hold up a deployment.
I'd like to see something like http://phpdebugbar.com/ enabled by default on vagrant and the beta cluster. Perhaps it could even be offered as a per-user preference on production.
The task was filed back in 2015, when release engineering had plans to improve the overall quality of deployments. After several years of effort, we have collectively improved our logging system (Monolog, ELK), we have logging dashboards that we track closely, and we have a process to triage all those errors (e.g. Wikimedia-production-error).
We now enforce zero* log errors by blocking the train whenever new errors show up in the logs, so I am claiming this goal as a success.
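As a rough illustration of "block the train on new errors", the check can be thought of as a set difference between the error signatures seen before and after a deploy. This is only a sketch: the function names and the idea of a normalized "signature" string are assumptions made for the example, not a description of the actual triage tooling.

```python
from typing import Iterable, Set


def new_error_signatures(before: Iterable[str], after: Iterable[str]) -> Set[str]:
    """Return signatures present after the deploy but not before it.

    A "signature" is assumed to be a normalized error message,
    e.g. with request-specific values stripped out.
    """
    return set(after) - set(before)


def train_is_blocked(before: Iterable[str], after: Iterable[str]) -> bool:
    """The train is blocked as soon as at least one new signature appears."""
    return bool(new_error_signatures(before, after))


if __name__ == "__main__":
    known = ["Undefined index: foo", "Call to a member function on null"]
    seen = known + ["Undefined variable: bar"]
    print(sorted(new_error_signatures(known, seen)))
    print(train_is_blocked(known, seen))
```

Tracking signatures rather than raw counts means a known, already-triaged error getting noisier does not block the train, while a single occurrence of a new error does.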