Make log responsibilities changes
Closed, Resolved, Public

Assigned To

Authored By

	greg
	Feb 9 2015, 10:32 PM

Description

From Ori in the MW Core weekly meeting:

exception.log and fatal.log teem with sludge like New York sewer pipes.
It’s crazy that no one “owns” them.
It also means we need more disk space, more logstash hardware...
No amount of structured logging can compensate for ignoring errors and fatals.
During the HHVM migration, Brad was in charge of reviewing and triaging HHVM exceptions and fatals, filing bugs for anything that had not been encountered previously.
We should do this on an ongoing basis.
By “we” I mean “RelEng”
And by “RelEng” I mean Chad, since he has the most experience by far.
(Please.)

See also Ori's email to engineering@ (WMF staff only) "[Engineering] Log ownership and deployment process".

I (Greg) agree.

Proposed parts of a solution/next steps:

daily triage
weekly summary for the deployment/roadmap meeting (with naming culprits)
a sprint to squash them down (if we can't get it there through other means)
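
As a rough illustration of what a daily triage pass could start from, here is a minimal sketch that groups log lines by a normalized signature and lists the most frequent ones. Everything in it is an assumption for illustration: the `signature` and `triage_summary` helpers are hypothetical names, the normalization rules are guesses, and the real exception.log/fatal.log format is not described in this task.

```python
import re
from collections import Counter


def signature(line: str) -> str:
    """Collapse variable parts of a log line so repeated errors group together.

    Hypothetical normalization: hex values and digit runs are replaced with
    placeholders. A real log format would likely need more rules than this.
    """
    line = re.sub(r"0x[0-9a-fA-F]+", "<hex>", line)
    line = re.sub(r"\d+", "<n>", line)
    return line.strip()


def triage_summary(lines, top=10):
    """Return the `top` most frequent signatures as (signature, count) pairs."""
    counts = Counter(signature(line) for line in lines if line.strip())
    return counts.most_common(top)


if __name__ == "__main__":
    # Made-up sample lines, not taken from any real log.
    sample = [
        "Timeout after 30 seconds on db1042",
        "Timeout after 45 seconds on db1051",
        "Bad allocation context for BannerChooser",
    ]
    for sig, count in triage_summary(sample):
        print(f"{count:5d}  {sig}")
```

The same counts could feed the weekly summary: signatures that stay at the top of the list from one week to the next are the ones to raise at the deployment/roadmap meeting.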

Related Objects

Status	Assigned	Task
Resolved	greg	T89049 Make log responsibilities changes
Resolved	greg	T89292 Create project/tag to collect fatal/exception log related bugs
Open	None	T91724 Set up and test Sentry on Labs for non-JS error logging
Resolved	Tgr	T84956 Create basic puppet role for Sentry
Resolved	jcrespo	T112228 Need to run postgresql::user twice to set the password
Declined	Tgr	T85239 Channel PHP errors from Logstash to Sentry on the beta cluster
Open	None	T84845 improve cron spam visibility
Declined	None	T91358 Log Scribunto errors in Sentry
Resolved	Tgr	T90083 Log EventLogging schema validation errors in Sentry
Resolved	Tgr	T105374 Investigate if the XSS vulnerability addressed in Sentry 7.6.1 affects us

Event Timeline

greg created this task. Feb 9 2015, 10:32 PM

greg raised the priority of this task from to High.

greg updated the task description.

greg added a project: Release-Engineering-Team.

greg added subscribers: greg, ori, • demon.

Restricted Application added a subscriber: Aklapper. Feb 9 2015, 10:32 PM

greg renamed this task from Plan log responsibilities changes to Make log responsibilities changes. Feb 9 2015, 10:33 PM

greg set Security to None.

greg updated the task description. Feb 9 2015, 11:43 PM

Krenair subscribed. Feb 10 2015, 4:17 AM

• Cmcmahon subscribed. Feb 10 2015, 5:29 PM

Tgr subscribed. Feb 10 2015, 11:58 PM

fgiunchedi subscribed. Feb 11 2015, 9:44 AM

From a user perspective, we also still need to fix T40095: namely, the fact that a user who gets "Fatal exception of type MWException" is unable to look for meaningful information, and when the issue is reported, no developer can fix it until someone gets hold of a shell user on IRC to look up the actual error.

In T89049#1032549, @Nemo_bis wrote:

From a user perspective, we also still need to fix T40095: namely, the fact that a user who gets "Fatal exception of type MWException" is unable to look for meaningful information, and when the issue is reported, no developer can fix it until someone gets hold of a shell user on IRC to look up the actual error.

That's also true, yes, but off-topic for this task. Thanks for reminding me of it.

Legoktm subscribed. Feb 11 2015, 9:12 PM

My point was: since 2012 we have pretended that sysadmins are paying a lot of attention to logs, to the point that we can afford not to show actual errors to users. Now that it is 2015, it would be nice for that to be true, hence a huge +1 to this bug. :)

In T89049#1032571, @Nemo_bis wrote:

My point was: since 2012 we have pretended that sysadmins are paying a lot of attention to logs, to the point that we can afford not to show actual errors to users.

That wasn't why it was changed. It was because it leaked personal information all the time.

(This is the email I just sent to the engineering@ list, posting here as well.)

As the person who sits where the buck proverbially stops on this issue I need to figure out how to make the rubber hit the road (ok, done with overused metaphors). ;)

I'm not going to halt all deployments next week until all fatals/exceptions are resolved; that's just not fair right now. There could be a situation where I do halt all (new code) deployments like that but not without warning/time to fix issues.

Right now I'm not sure exactly how to get us to a point that I'm happy with. Ori is right: we let this go on FAR too long. The amount of "noise" in the logs is far too high for a functioning and trusted fast-deployment organization. (It's not really noise, it's information; people just ignore it, so it effectively becomes noise. The real solution is to treat it as information and fix the issues.) Our deployment cadence is relatively fast, but things like this show how much we've left behind in our wake in the quest for speed.

Just today Chad reported T89258 ("Bad allocation context for BannerChooser")[0] after looking at the logs. What should he or I or anyone do to get that addressed more quickly?

There's also the issue of tracking which fatals/exceptions are already reported in Phabricator, which is only exacerbated by having more than one person do the work. But at the same time, one person can't do all of the work unless that one person does only this work and not much else. I think a tool like Sentry[1] would help here, and I look forward to working with Gergo to move forward the work that he has started[2].

Moving forward today:

I've asked Chad and Mukunda to take on the task of reporting issues and getting them addressed as much as they can.
I will probably (unless someone convinces me not to) set up a Phabricator tag/project to track (as much as humanly possible) tasks reported from fatal/exception logs. I hope this will enable us to collectively address the issues with the tooling we have now (i.e., without Sentry) in a coordinated, efficient, and speedy fashion.
I ask everyone to respond to these bug reports as real and important work; if the developers/teams who are writing (or wrote) the code don't respond effectively, then I'll have to move to more drastic measures. Again, I agree with Ori that the status quo is not sustainable, and I will do anything within my power to get us on better footing.

If you have any other suggestions please share them.

NB: Even though I completely agree with Ori, and have had the same feelings as he has, I didn't make this a team priority during the last cycle of planning. That's a long-winded/manager way of saying: My team will have to reassess our current workload/priorities/allocations to take this on in a concerted way.

Greg

[0] https://phabricator.wikimedia.org/T89258
[1] https://getsentry.com/welcome/
[2] http://sentry-beta.wmflabs.org

greg added subtasks: T1345: Set up and test Sentry on Labs for JS error logging, T85188: Add PHP error logging to Sentry extension. Feb 11 2015, 9:45 PM

T85188 is only useful for 3rd-party wikis; on the WMF clusters, logstash will replace it.

greg closed subtask T89292: Create project/tag to collect fatal/exception log related bugs as Resolved. Feb 12 2015, 5:18 PM

• ggellerman added a project: Team-Practices. Mar 2 2015, 10:02 PM

• Awjrichards edited projects, added Team-Practices (This-Week); removed Team-Practices. Mar 3 2015, 10:56 PM

• Awjrichards edited projects, added Team-Practices; removed Team-Practices (This-Week). Mar 3 2015, 10:59 PM

• Awjrichards moved this task from To Triage to Team radar on the Team-Practices board. Mar 4 2015, 1:05 AM

Tgr edited subtasks, added: T91724: Set up and test Sentry on Labs for non-JS error logging; removed: T1345: Set up and test Sentry on Labs for JS error logging. Mar 6 2015, 1:42 AM

It seems @greg is acting as the team leader on this. We have already made progress and do look at and report errors in logstash. I'm not sure whether any more actions are needed, though.

hashar moved this task from INBOX to In-progress on the Release-Engineering-Team board. May 29 2015, 4:29 PM

@greg @hashar Anything needed from TPG for this? If not, we'd like to remove it from our board, assuming it remains on yours. :)

I'm going to call this done, actually. @demon and company have done awesome things lately and we're in a MUCH better place than we were 4 months ago.

Yup, @demon definitely took the lead and we have had weekly triage since early June. We had a few iterations to refine the dashboard column on Wikimedia-production-error and killed off most of the old bugs.

So it is all in pretty good shape altogether :-]

greg moved this task from In-progress to Done on the Release-Engineering-Team board. Jul 28 2015, 8:37 PM
