Reasoning: This will force developers to fix their log-spamming code before it hits production (or at least as much of it as Beta Cluster can catch).
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | None | T115629 [EPIC] Enforce "no increase in log errors" during deployments
Resolved | | None | T115630 [EPIC] Reduce production log errors to zero*
Declined | | None | T115633 Proposal: Force any WARNINGs on Beta Cluster to fail completely
Event Timeline
I had an idea to introduce a config setting for that into MW proper, so that developers can easily see problems regardless of the environment they're developing in.
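As a rough illustration of what such a setting could gate, here is a minimal plain-PHP sketch that turns warnings and notices into exceptions. `$wgFailOnWarnings` is a hypothetical name used only for this example, not an existing MediaWiki setting, and MediaWiki installs its own error handling, so this shows the general technique rather than how it would be wired into core:

```php
<?php
// Hypothetical flag for illustration; this is not an existing MediaWiki setting.
$wgFailOnWarnings = true;

if ( $wgFailOnWarnings ) {
	set_error_handler( static function ( $severity, $message, $file, $line ) {
		// Skip errors excluded by the current error_reporting level
		// (this also covers warnings silenced with the @ operator).
		if ( !( error_reporting() & $severity ) ) {
			return false;
		}
		// Convert the warning or notice into an exception so the request fails loudly.
		throw new ErrorException( $message, 0, $severity, $file, $line );
	} );
}
```

With the flag off, nothing changes; with it on, any warning that would otherwise only reach the logs aborts the request instead.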
We should build an error console that integrates into the page so that errors surface instead of getting buried in the logs. Breaking Beta Cluster isn't really the solution; that will just lead to more interruptions in everyone's workflows.
MediaWiki has an integrated debug toolbar that does contain a bunch of logs: https://www.mediawiki.org/wiki/Debugging_toolbar
$wgDebugToolbar = true;
But IIRC that prevents the page from being cached.
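For context, the toolbar setting would usually sit next to a few other debugging options in `LocalSettings.php`. A minimal sketch, assuming a development-only wiki; the particular combination is an illustration, not a description of how Beta Cluster is configured:

```php
// LocalSettings.php fragment (development only; do not enable on a production wiki)

// Report all PHP errors, warnings and notices, and print them in the response.
error_reporting( -1 );
ini_set( 'display_errors', '1' );

// Show the in-page debugging toolbar.
$wgDebugToolbar = true;

// Show exception details and a backtrace on error pages.
$wgShowExceptionDetails = true;
```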
In order to test anything effectively we would need to bypass the cache, right? You won't see an error, even a fatal, if you're just loading a cached page from Varnish.
The thing I would like to see is a floating red box on the page that is always visible when there are errors in the console. In Phabricator, they change the header color to red and you type ` to bring up an error console. I think we should do something similar to surface errors on testwiki and beta. See the Phabricator "Dark Console" documentation for more details about how Phabricator does it. We had something similar at deviantART, and we had a social contract that we would not sync anything that significantly increased error counts or page-generation times on staging. We had tech debt weeks where everyone concentrated on killing warnings and notices until eventually there were none.
Beta-Cluster-Infrastructure is now maintained on a best-effort basis. The logging stack does not really work, and we do not actively triage error logs there. The intent was to block potential issues ahead of time; nowadays that is done by blocking the train whenever an alarm fires, notably in the early stages (testwiki or group0).