Page MenuHomePhabricator

[EPIC] Enable JS error reporting in the mobile website
Closed, ResolvedPublic

Description

Goal

We want to start getting visibility of client side errors are users are having. We'll start small to give us an insight into ways we should detect client side errors.

Background

When the code breaks in the browser, we should notice that it is broken to fix it ASAP. Now we don't.

Goals

  • Understand the state of errors in the mobile site
  • Through our experience inform the adoption of Sentry

Next steps

Longer term we want to use of something like Sentry and make bugs discoverable just as they are in logstash.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Jdlrobson triaged this task as Medium priority.EditedJun 12 2017, 5:33 PM
Jdlrobson added subscribers: bmansurov, phuedx.

Ready for dev @bmansurov, @pmiazga @phuedx ?

js errors are captured via window.onerror and mw.track and logged to the event logging schema using mw.trackSubscribe

Could we clarify this? Are we using the mw.eventLog.Schema class or the EventLogging subscriber protocol?

@phuedx I've clarified the item to mention both approaches. I believe mobilefrontend heavily uses eventlogging.Schema instances so we should probably follow suit here. What do you think?

What would our schema look like? Why limit ourselves to MF? Why not do it for all extensions at once?

@bmansurov Something like Schema:UploadWizardExceptionFlowEvent but instead of the flow* we should probably log the full URL (with hash fragments too).

Why limit ourselves to MF? Why not do it for all extensions at once?

I think getting error logging for 50% of the visitors is better than the 0% we have. I believe we should have something like sentry for all projects, but from what I've researched there's a non trivial amount of work to do to get it ready.

If you dive in and identify tasks we can help out with it'd be great to put them in our backlog to help when we have room.

It is also unclear how we're going to query and act on the logged exceptions, so maybe we should think about that too before we implement anything.

Why limit ourselves to MF? Why not do it for all extensions at once?

If we use Sentry in MobileFrontend won't we be capturing all bugs inside the mobile experience?
Anyway, we should start small like we do everything to identify/tease out the issues. I've added the epic for all extensions.

@Jdlrobson Sentry is not available on production and the Sentry project is stalled for now.

Jdlrobson updated the task description. (Show Details)

We ended up on estimates of 3, with @pmiazga the lone 5. We're gonna talk about this again some more. In mean time let's think about our different estimates and how we can align in our understanding.

@pmiazga, @Jdlrobson; Sorry I missed the end of the conversation. What's the reasoning behind the 5?

Jdlrobson moved this task from Backlog to Team: web on the MobileFrontend board.
Jdlrobson moved this task from Team: web to Team:Growth on the MobileFrontend board.
Jdlrobson set the point value for this task to 5.

Is it possible to throttle events on the server? I would hate to miss rare errors (that we might even be trying to repro locally) due to sampling.

I don't think so sadly, and you are totally right.

Do we need to talk about this some more?
Also given the disk space issues with EventLogging right now, maybe this would not be a good idea. @phuedx thoughts?

Is it something we want to do? I'm still convinced that we should use something like Sentry instead of reinventing the wheel. We have a sentry instance and a Sentry extension. @Tgr worked on it some time ago, now that task is stalled but maybe it's a good time to work on it? I'm more than happy to make the Sentry available on production wikis.

This might be a good technical goal for next quarter if we want to increase the scope. It's certainly something that would be very useful.

Adding event logging for errors is significantly less effort than pushing Sentry to production. Actually making use of that data is, IMO, not. Reporting errors is - relatively - the easy part; most of Sentry is about managing reports, alerting, allowing users to communicate about reports etc. So I think it's worthwhile to do use Sentry instead, even if it's a fair amount of work. (In the case of UW it was sort of an emergency thing: we knew the bounce rate was ridiculously high, we suspected JS errors as the cause, and needed some way to identify them. Also since it was a one-time effort and not maintenance, just running SQL queries to identify the top N errors was an acceptable approach. I left the team by the time the data was actually used so @matmarex can offer more insight on how well that worked out.)

Also note that Sentry can collect basically everything (PHP, errors from Toolforge projects in arbitrary languages, puppet, app crash reports, CSP reports...) and once the instance is set up, adding new types is near-zero effort, so this would be very valuable beyond Reading Web. Maybe there is a way to trade that value off for resources, e.g. get the Cloud team to help out.

For my work on errors found via UploadWizard's abuse of EventLogging to log JavaScript errors, you can view T136230 and its subtasks. The last time folks asked about it, I wrote this comment: T137660#2379881 (there's a short description of how the UploadWizard error logging works there).

The biggest limitation is that EventLogging events have very limited length – specifically, there's no space to include a backtrace with errors. (In rare cases, you might not even have enough for a full URL.) Depending on your codebase, this can make it pretty much impossible to decipher certain errors. It can definitely help fix a lot of low-hanging fruit, though.

I'm leaning towards tagging this with epic and for us to start considering this as a quarterly goal rather than a 5 point task. Thoughts?

Thanks for the feedback @Tgr and @matmarex.

I agree @Jdlrobson.

We should definitely aim to have sentry, but we also don't know anything about our runtime errors. Having the event logging added for some time and seeing what volume of errors we get could be very beneficial to help fix a lot of low-hanging fruit as @matmarex mentioned, but also to provide justification to get resources to help with the development of sentry. It may be worth doing it if we have cycles.

Anyway if we change this to an epic then the existing AC is one of the options or just a small part of the plan.

Jdlrobson renamed this task from Enable JS error reporting in the mobile website to [EPIC] Enable JS error reporting in the mobile website.Oct 17 2017, 9:05 PM
Jdlrobson added a project: Epic.
Jdlrobson updated the task description. (Show Details)
Jdlrobson removed the point value for this task.
Jdlrobson moved this task from Needs Prioritization to Epics/Goals on the Web-Team-Backlog board.

It seems like we're advocating for using EventLogging as it'd be cheaper to catch low hanging fruit in the short-term without taking the time to find out:

  • What's left to do to set up Sentry
  • Whether other teams can spare bandwidth to help us out
    • This'd require that we plan well and kick off this project at a quarter boundary but that's fine, right?
  • Who's going to own Sentry moving forward
    • RelEng?
    • Ops?
    • What I mean is, if we help push Sentry past the post, then who do we hand it off to?

and underestimating (or not addressing) the cost of reporting these errors and managing those reports.

What I mean to say is, I think that the following AC should be a blocker before adding code to the codebase:

We have informed ourselves of how we'll make use of [Sentry] at some point when it is ready and if we can help move the project forward.

Is it possible to throttle events on the server? I would hate to miss rare errors (that we might even be trying to repro locally) due to sampling.

The sampling AC is based on the hypothesis that there are so many errors happening the wild that we'd flood the EventLogging pipeline without it.

A small first step to take would be to simply count the number of errors (by incrementing a counter in StatsD, say). This'd give us an idea of the volume of events, which, as @Jhernandez pointed out in T167699#3656874, we don't know. Knowing the volume of events would allow us to select an appropriate sampling rate.

If we'd rather skip that step, then we could simply not write the events to the MySQL backend, only to the Hive backend. The former is the reason we collaborate with Analytics Engineering when targetting a rate of more than 10 events/second. Per T176469#3689416, we know that Hive can handle 300 events/second without issue.

@Jdlrobson, since we're going to be making lots of changes to JavaScript next year, do we want to consider this for the refactor project?

@Niedzielski conmitting to getting sentry setup is too big a project to commit to but we could commit time to setting up some EventLogging capturing of errors. Is that what you had in mind?

@Jdlrobson, yeah, let's just report the errors and we can check them in our chores and on Tuesdays. The sooner the better!

Related task and schema on Android.

I'm afraid that if we do the reporting via EventLogging we will never do the Event handling properly.

@Niedzielski kindly (correct me if I'm wrong) is going to write a strawman proposal (subtask) to capture this. The team feels like error capturing would be useful, if an expectation is set that we will tear it down and write about success/struggles at the end of the refactor project.

Removed description as a lot of it is confusing given how we tackled T202026.

It seems that all subtasks of this epic are closed \o/ Is there any work remaining?

I think after T206702 has been verified and no follow ups are needed, i think we can close this goal card.

For the long term solution we have T106915 which we can hopefully bump priority for when we have data from the above two things.

Jdlrobson claimed this task.

We're done here. It's been a long time coming, but we're done. @phuedx I've marked this with a cake on https://www.mediawiki.org/wiki/Reading/Web/Release_timeline#October