
Make log responsibilities changes
Closed, Resolved, Public

Description

From Ori in the MW Core weekly meeting:

exception.log and fatal.log teem with sludge like New York sewer pipes.
It’s crazy that no one “owns” them.
It also means we need more disk space, more logstash hardware...
No amount of structured logging can compensate for ignoring errors and fatals.
During the HHVM migration, Brad was in charge of reviewing and triaging HHVM exceptions and fatals, filing bugs for anything that had not been encountered previously.
We should do this on an ongoing basis.
By “we” I mean “RelEng”
And by “RelEng” I mean Chad, since he has the most experience by far.
(Please.)

See also Ori's email to engineering@ (WMF staff only) "[Engineering] Log ownership and deployment process".

I (Greg) agree.

Proposed parts of a solution/next steps:

  • daily triage
  • weekly summary for the deployment/roadmap meeting (with naming culprits)
  • a sprint to squash them down (if we can't get it there through other means)
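A minimal sketch of what the daily triage and weekly summary could start from, assuming the log lines are available as plain-text messages. The `normalize` and `summarize` helpers and the sample messages are hypothetical and are not part of any existing WMF tooling; the idea is only to collapse variable parts (numbers, hex addresses) so that repeats of one error are counted together and the noisiest ones surface first.

```python
import re
from collections import Counter


def normalize(message: str) -> str:
    """Collapse variable parts so repeats of one error group together.

    Hypothetical helper: real logs would likely need more rules
    (URLs, request IDs, file paths, and so on).
    """
    # Replace hex addresses first, so the digit rule does not split them.
    message = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)
    # Replace any remaining run of digits.
    message = re.sub(r"\d+", "<n>", message)
    return message.strip()


def summarize(lines, top=10):
    """Return the `top` most frequent normalized messages with their counts."""
    counts = Counter(normalize(line) for line in lines if line.strip())
    return counts.most_common(top)


if __name__ == "__main__":
    # Made-up sample messages, for illustration only.
    sample = [
        "Timeout after 30 seconds in job 12",
        "Timeout after 45 seconds in job 7",
        "Bad allocation context for BannerChooser",
    ]
    for message, count in summarize(sample):
        print(f"{count:5d}  {message}")
```

A count per normalized message is deliberately crude; it gives the person doing triage a ranked list to file tasks from, not a replacement for reading the actual stack traces.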

Event Timeline

greg created this task. Feb 9 2015, 10:32 PM
greg raised the priority of this task from to High.
greg updated the task description.
greg added a project: Release-Engineering-Team.
greg added subscribers: greg, ori, demon.
Restricted Application added a subscriber: Aklapper. Feb 9 2015, 10:32 PM
greg renamed this task from Plan log responsibilities changes to Make log responsibilities changes. Feb 9 2015, 10:33 PM
greg set Security to None.
greg updated the task description. Feb 9 2015, 11:43 PM
Tgr added a subscriber: Tgr. Feb 10 2015, 11:58 PM

From a user perspective, we also still need to fix T40095: namely, the fact that a user who gets "Fatal exception of type MWException" has no way to find meaningful information, and once the issue is reported, no developer can fix it until someone gets hold of a shell user on IRC to look up the actual error.

greg added a comment. Feb 11 2015, 9:12 PM

From a user perspective, we also still need to fix T40095: namely, the fact that a user who gets "Fatal exception of type MWException" has no way to find meaningful information, and once the issue is reported, no developer can fix it until someone gets hold of a shell user on IRC to look up the actual error.

That's also true, yes, but offtopic for this task. Thanks for reminding me of it.

My point was: since 2012 we pretend that sysadmins are paying a lot of attention to logs, to the point we can afford not telling actual errors to the users. Since 2015, it would be nice for that to be true, hence huge +1 to this bug. :)

demon added a comment. Feb 11 2015, 9:27 PM

My point was: since 2012 we pretend that sysadmins are paying a lot of attention to logs, to the point we can afford not telling actual errors to the users.

That wasn't why it was changed. It was because it leaked personal information all the time.

greg added a comment. Feb 11 2015, 9:44 PM

(This is the email I just sent to the engineering@ list, posting here as well.)

As the person who sits where the buck proverbially stops on this issue I need to figure out how to make the rubber hit the road (ok, done with overused metaphors). ;)

I'm not going to halt all deployments next week until all fatals/exceptions are resolved; that's just not fair right now. There could be a situation where I do halt all (new code) deployments like that but not without warning/time to fix issues.

Right now I'm not sure exactly how to get us to a point that I'm happy with. Ori is right: we let this go on FAR too long. The amount of "noise" (it's not really noise, it's information, just people ignore it so it effectively becomes noise, where the real solution is to address it as information and fix the issue) in the logs is far too high for a functioning and trusted fast deployment organization. We're pretty fast with our deployment cadence relatively, but things like this show how much we've left behind in the wake in the quest for speed.

Just today Chad reported T89258 ("Bad allocation context for BannerChooser")[0] after looking at the logs. What should he or I or anyone do to get that addressed more quickly?

There's also the issue of tracking which fatals/exceptions are already reported in Phabricator, which is only exacerbated by having more than one person do the work. But at the same time one person can't do all of the work unless that one person only does this work and not much else. I think a tool like Sentry[1] would help here and I look forward to working with Gergo to move forward the work that he has started[2].

Moving forward today:

  • I've asked Chad and Mukunda to take on the task of reporting issues and getting them addressed as much as they can.
  • I will probably (unless someone convinces me not to) set up a Phabricator tag/project to track (as much as humanly possible) tasks reported from fatal/exception logs. I hope this will enable us to collectively address the issues, with the tooling we have now (ie: without Sentry) in a coordinated, efficient, and speedy fashion.
  • I ask everyone to respond to these bug reports as real and important work; if the developers/teams who write (or wrote) the code don't respond effectively then I'll have to move to more drastic measures. Again, I agree with Ori that the status quo is not sustainable and I will do anything within my power to get us on better footing.

If you have any other suggestions please share them.

NB: Even though I completely agree with Ori, and have had the same feelings as he has, I didn't make this a team priority during the last cycle of planning. That's a long-winded/manager way of saying: My team will have to reassess our current workload/priorities/allocations to take this on in a concerted way.

Greg

[0] https://phabricator.wikimedia.org/T89258
[1] https://getsentry.com/welcome/
[2] http://sentry-beta.wmflabs.org

T85188 is only useful for 3rd-party wikis; on the WMF clusters logstash will replace it.

hashar assigned this task to greg. May 29 2015, 4:29 PM
hashar added a subscriber: hashar.

Seems @greg is acting as the team leader on this. We have already made progress and do look at and report errors in logstash. Not sure whether any more actions are needed though.

@greg @hashar Anything needed from TPG for this? If not, we'd like to remove it from our board, assuming it remains on yours. :)

greg closed this task as Resolved. Jul 15 2015, 9:57 PM

I'm going to call this done, actually. @demon and company have done awesome things lately and we're at a MUCH better place than we were 4 months ago.

Yup, @demon definitely took the lead and we have had weekly triage since early June. We had a few iterations to refine the dashboard column on Wikimedia-production-error and killed off most of the old bugs.

So it is all in pretty good shape altogether :-]