From Ori in the MW Core weekly meeting:
exception.log and fatal.log teem with sludge like New York sewer pipes. It’s crazy that no one “owns” them. It also means we need more disk space, more logstash hardware... No amount of structured logging can compensate for ignoring errors and fatals. During the HHVM migration, Brad was in charge of reviewing and triaging HHVM exceptions and fatals, filing bugs for anything that had not been encountered previously. We should do this on an ongoing basis. By “we” I mean “RelEng.” And by “RelEng” I mean Chad, since he has the most experience by far. (Please.)
See also Ori's email to engineering@ (WMF staff only) "[Engineering] Log ownership and deployment process".
I (Greg) agree.
Proposed parts of a solution/next steps:
- daily triage
- weekly summary for the deployment/roadmap meeting (naming the culprits)
- a sprint to squash them down (if we can't get the volume down through other means)