[EPIC] Enable JS error reporting in the mobile website
Closed, Resolved, Public

Authored By

	• Jhernandez
	Jun 12 2017, 5:07 PM

Description

Goal

We want to start getting visibility into the client-side errors our users are having. We'll start small, so that we gain insight into how we should detect client-side errors.

Background

When code breaks in the browser, we should notice that it is broken so that we can fix it ASAP. Currently we don't.

Module that captures errors in core
How it is done in #uploadwizard
- JS exceptions
- ResourceLoader errors

Goals

Understand the state of errors in the mobile site
Through our experience inform the adoption of Sentry

Next steps

Longer term, we want to make use of something like Sentry and make bugs discoverable just as they are in Logstash.

Related Objects

Status	Assigned	Task
Resolved	Jdlrobson	T195473 [GOAL] Invest in the MobileFrontend & MinervaNeue frontend architecture
Resolved	Jdlrobson	T195475 [EPIC] Automate asset bundling in MobileFrontend
Resolved	Jdlrobson	T166905 [EPIC] Talk about and improve our frontend code architecture
Declined	None	T106915 Use Sentry in production
Resolved	Jdlrobson	T167699 [EPIC] Enable JS error reporting in the mobile website
Resolved	Jdlrobson	T202026 Report client-side JavaScript errors in MobileFrontend practically
Resolved	Jdlrobson	T203814 Turn on MinervaErrorLogSamplingRate (Schema:WebClientError)
Resolved	None	T203815 [Bug] Beta cluster events not logged
Resolved	• nray	T205582 It should be possible to count errors via statsv in production
Resolved	Jdlrobson	T206702 Enable client side error counting on Minerva production (wikipedia only)

Event Timeline

Ready for dev @bmansurov, @pmiazga @phuedx ?

Jdlrobson added a subscriber: pmiazga.Jun 12 2017, 5:36 PM

looks ready for dev

Jdlrobson moved this task from Needs Prioritization to Triaged but Future on the Web-Team-Backlog board.Jun 12 2017, 6:34 PM

js errors are captured via window.onerror and mw.track and logged to the event logging schema using mw.trackSubscribe

Could we clarify this? Are we using the mw.eventLog.Schema class or the EventLogging subscriber protocol?

@phuedx I've clarified the item to mention both approaches. I believe mobilefrontend heavily uses eventlogging.Schema instances so we should probably follow suit here. What do you think?
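For reference, the capture path described above (a global error handler that publishes to a tracking topic, with a subscriber that does the logging) could look roughly like the sketch below. This is only an illustration: `createTracker` is a small stand-in for `mw.track` / `mw.trackSubscribe` so that the snippet runs outside MediaWiki, and the topic name, `fakeWindow`, and the `logged` array (standing in for the logging backend) are all hypothetical.

```javascript
// Minimal stand-in for a publish/subscribe tracker so the sketch runs
// outside MediaWiki. In MediaWiki, mw.track and mw.trackSubscribe play this role.
function createTracker() {
  const subscribers = [];
  return {
    // Publish an event to every subscriber whose prefix matches the topic.
    track(topic, data) {
      subscribers
        .filter((s) => topic.startsWith(s.prefix))
        .forEach((s) => s.callback(topic, data));
    },
    // Register a callback for all topics starting with `prefix`.
    trackSubscribe(prefix, callback) {
      subscribers.push({ prefix, callback });
    }
  };
}

// Hypothetical topic name, chosen for illustration only.
const ERROR_TOPIC = 'example.error';

// Wire a global error handler on `target` (window in a browser) to the tracker.
function installErrorReporting(target, tracker) {
  target.onerror = function (message, source, lineno, colno) {
    tracker.track(ERROR_TOPIC, { message, source, lineno, colno });
    // Returning false leaves the browser's default error handling in place.
    return false;
  };
}

// Demo with a fake window object; `logged` stands in for the logging backend.
const fakeWindow = {};
const demoTracker = createTracker();
const logged = [];
demoTracker.trackSubscribe(ERROR_TOPIC, (topic, data) => logged.push(data));
installErrorReporting(fakeWindow, demoTracker);
fakeWindow.onerror('Uncaught TypeError: x is undefined', 'https://example.org/app.js', 10, 5);
console.log(logged.length);
```

Keeping the handler and the subscriber separate means the logging destination can change later without touching the code that captures errors.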

What would our schema look like? Why limit ourselves to MF? Why not do it for all extensions at once?

Jdlrobson moved this task from Triaged but Future to Upcoming on the Web-Team-Backlog board.Jun 14 2017, 4:08 AM

@bmansurov Something like Schema:UploadWizardExceptionFlowEvent but instead of the flow* we should probably log the full URL (with hash fragments too).

Why limit ourselves to MF? Why not do it for all extensions at once?

I think getting error logging for 50% of visitors is better than the 0% we have. I believe we should have something like Sentry for all projects, but from what I've researched there's a non-trivial amount of work to do to get it ready.

If you dive in and identify tasks we can help out with it'd be great to put them in our backlog to help when we have room.

It is also unclear how we're going to query and act on the logged exceptions, so maybe we should think about that too before we implement anything.

Why limit ourselves to MF? Why not do it for all extensions at once?

If we use Sentry in MobileFrontend won't we be capturing all bugs inside the mobile experience?
Anyway, we should start small, as we do with everything, to identify and tease out the issues. I've added the epic for all extensions.

phuedx updated the task description. (Show Details)Jul 5 2017, 11:35 AM

@Jdlrobson Sentry is not available on production and the Sentry project is stalled for now.

phuedx updated the task description. (Show Details)Jul 11 2017, 4:38 PM

Jdlrobson updated the task description. (Show Details)Jul 11 2017, 4:44 PM

Jdlrobson updated the task description. (Show Details)

We ended up on estimates of 3, with @pmiazga the lone 5. We're gonna talk about this again some more. In the meantime, let's think about our different estimates and how we can align our understanding.

Jdlrobson moved this task from Upcoming to Needs Prioritization on the Web-Team-Backlog board.Jul 11 2017, 4:48 PM

@pmiazga, @Jdlrobson; Sorry I missed the end of the conversation. What's the reasoning behind the 5?

Jdlrobson updated the task description. (Show Details)Jul 13 2017, 5:45 PM

Jdlrobson moved this task from Backlog to Team: web on the MobileFrontend board.

Jdlrobson moved this task from Team: web to Team:Growth on the MobileFrontend board.

Jdlrobson moved this task from Needs Prioritization to Upcoming on the Web-Team-Backlog board.Jul 17 2017, 4:25 PM

Jdlrobson updated the task description. (Show Details)Jul 18 2017, 4:46 PM

Jdlrobson set the point value for this task to 5.

ovasileva added a project: Readers-Web-Kanbanana-Board-Old.Jul 19 2017, 3:43 PM

Jdlrobson moved this task from Upcoming to 2017-18 Q1 on the Web-Team-Backlog board.Jul 25 2017, 5:28 PM

Jdlrobson removed a project: Readers-Web-Kanbanana-Board-Old.Aug 9 2017, 1:46 PM

Jdlrobson moved this task from 2017-18 Q1 to Upcoming on the Web-Team-Backlog board.

Jdlrobson mentioned this in T173309: Mobile target cannot have a default enabled gadget for anonymous users.Aug 21 2017, 10:17 PM

Is it possible to throttle events on the server? I would hate to miss rare errors (that we might even be trying to repro locally) due to sampling.

I don't think so sadly, and you are totally right.
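To put a rough number on the sampling concern: if each occurrence of an error is reported independently with probability r, an error that occurs n times goes completely unreported with probability (1 - r)^n. A small sketch follows; the independence assumption and the example figures are illustrative only, not measurements from the mobile site.

```javascript
// Probability that an error occurring `occurrences` times is never reported,
// assuming each occurrence is sampled independently at `samplingRate`.
function probabilityMissed(samplingRate, occurrences) {
  return Math.pow(1 - samplingRate, occurrences);
}

// Illustrative numbers only: a 1% sampling rate and an error that occurs 50 times.
const missChance = probabilityMissed(0.01, 50);
console.log(missChance.toFixed(3));
```

With these example figures the error is more likely than not to be missed entirely, which is the worry raised above about rare errors.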

Do we need to talk about this some more?
Also given the disk space issues with EventLogging right now, maybe this would not be a good idea. @phuedx thoughts?

Is it something we want to do? I'm still convinced that we should use something like Sentry instead of reinventing the wheel. We have a Sentry instance and a Sentry extension. @Tgr worked on it some time ago; that task is now stalled, but maybe it's a good time to work on it? I'm more than happy to make Sentry available on production wikis.

This might be a good technical goal for next quarter if we want to increase the scope. It's certainly something that would be very useful.

Adding event logging for errors is significantly less effort than pushing Sentry to production. Actually making use of that data is, IMO, not. Reporting errors is - relatively - the easy part; most of Sentry is about managing reports, alerting, allowing users to communicate about reports etc. So I think it's worthwhile to use Sentry instead, even if it's a fair amount of work. (In the case of UW it was sort of an emergency thing: we knew the bounce rate was ridiculously high, we suspected JS errors as the cause, and needed some way to identify them. Also, since it was a one-time effort and not maintenance, just running SQL queries to identify the top N errors was an acceptable approach. I had left the team by the time the data was actually used, so @matmarex can offer more insight on how well that worked out.)

Also note that Sentry can collect basically everything (PHP, errors from Toolforge projects in arbitrary languages, puppet, app crash reports, CSP reports...) and once the instance is set up, adding new types is near-zero effort, so this would be very valuable beyond Reading Web. Maybe there is a way to trade that value off for resources, e.g. get the Cloud team to help out.

For my work on errors found via UploadWizard's abuse of EventLogging to log JavaScript errors, you can view T136230 and its subtasks. The last time folks asked about it, I wrote this comment: T137660#2379881 (there's a short description of how the UploadWizard error logging works there).

The biggest limitation is that EventLogging events have very limited length – specifically, there's no space to include a backtrace with errors. (In rare cases, you might not even have enough for a full URL.) Depending on your codebase, this can make it pretty much impossible to decipher certain errors. It can definitely help fix a lot of low-hanging fruit, though.
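One way to live with a length limit is to keep the message and URL intact and give the backtrace whatever room is left. The sketch below shows that idea. The 2000-character budget and the event field names are assumptions made for illustration; the actual EventLogging limit and schema are not stated in this thread.

```javascript
// Hypothetical size budget for a serialized event; the real limit is not stated here.
const MAX_EVENT_LENGTH = 2000;

// Build an error event whose JSON form fits in `maxLength` by keeping the
// longest stack prefix that still fits. The message and URL are never cut,
// so if they alone exceed the budget the event stays over it.
function fitErrorEvent(error, url, maxLength = MAX_EVENT_LENGTH) {
  const fullStack = String(error.stack || '');
  const event = { message: String(error.message), url: url, stack: '' };
  // Binary search for the longest stack prefix that keeps the event in budget.
  let lo = 0;
  let hi = fullStack.length;
  while (lo < hi) {
    const mid = Math.ceil((lo + hi) / 2);
    event.stack = fullStack.slice(0, mid);
    if (JSON.stringify(event).length <= maxLength) {
      lo = mid;
    } else {
      hi = mid - 1;
    }
  }
  event.stack = fullStack.slice(0, lo);
  return event;
}

// Demo with a deliberately oversized stack and a small budget.
const bigError = { message: 'boom', stack: 'x\n'.repeat(5000) };
const fitted = fitErrorEvent(bigError, 'https://example.org/wiki/Page#section', 200);
console.log(JSON.stringify(fitted).length <= 200);
```

The serialized length is measured rather than the raw string length because JSON escaping (for example of newlines in a stack) makes the two differ.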

I'm leaning towards tagging this with epic and for us to start considering this as a quarterly goal rather than a 5 point task. Thoughts?

Thanks for the feedback @Tgr and @matmarex.

I agree @Jdlrobson.

We should definitely aim to have sentry, but we also don't know anything about our runtime errors. Having the event logging added for some time and seeing what volume of errors we get could be very beneficial to help fix a lot of low-hanging fruit as @matmarex mentioned, but also to provide justification to get resources to help with the development of sentry. It may be worth doing it if we have cycles.

Anyway if we change this to an epic then the existing AC is one of the options or just a small part of the plan.

Jdlrobson renamed this task from Enable JS error reporting in the mobile website to [EPIC] Enable JS error reporting in the mobile website.Oct 17 2017, 9:05 PM

Jdlrobson added a project: Epic.

Jdlrobson updated the task description. (Show Details)

Jdlrobson removed the point value for this task.

Jdlrobson moved this task from Needs Prioritization to Epics/Goals on the Web-Team-Backlog board.

It seems like we're advocating for using EventLogging as it'd be cheaper to catch low hanging fruit in the short-term without taking the time to find out:

What's left to do to set up Sentry
Whether other teams can spare bandwidth to help us out
- This'd require that we plan well and kick off this project at a quarter boundary but that's fine, right?
Who's going to own Sentry moving forward
- RelEng?
- Ops?
- What I mean is, if we help push Sentry past the post, then who do we hand it off to?

and underestimating (or not addressing) the cost of reporting these errors and managing those reports.

What I mean to say is, I think that the following AC should be a blocker before adding code to the codebase:

We have informed ourselves of how we'll make use of [Sentry] at some point when it is ready and if we can help move the project forward.

In T167699#3566937, @Niedzielski wrote:

Is it possible to throttle events on the server? I would hate to miss rare errors (that we might even be trying to repro locally) due to sampling.

The sampling AC is based on the hypothesis that there are so many errors happening the wild that we'd flood the EventLogging pipeline without it.

A small first step to take would be to simply count the number of errors (by incrementing a counter in StatsD, say). This'd give us an idea of the volume of events, which, as @Jhernandez pointed out in T167699#3656874, we don't know. Knowing the volume of events would allow us to select an appropriate sampling rate.

If we'd rather skip that step, then we could simply not write the events to the MySQL backend, only to the Hive backend. The former is the reason we collaborate with Analytics Engineering when targeting a rate of more than 10 events/second. Per T176469#3689416, we know that Hive can handle 300 events/second without issue.
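The count-first approach described in this comment can be sketched as follows: count errors for a while, then derive a sampling rate that keeps the expected logged volume at or below a target such as the 10 events/second figure mentioned above. The in-memory counter is a hypothetical stand-in for a StatsD-style counter, and the observed volume is made up for the example.

```javascript
// In-memory stand-in for a StatsD-style counter (hypothetical; no real client is used).
function createCounter() {
  let count = 0;
  return {
    increment() { count += 1; },
    value() { return count; }
  };
}

// Pick a sampling rate so the expected logged volume stays at or below the target.
function chooseSamplingRate(errorCount, windowSeconds, targetEventsPerSecond) {
  const observedPerSecond = errorCount / windowSeconds;
  if (observedPerSecond <= targetEventsPerSecond) {
    return 1; // Volume is low enough to log every error.
  }
  return targetEventsPerSecond / observedPerSecond;
}

// Illustrative numbers: 3000 errors counted over 60 seconds, target of 10 events/second.
const errorCounter = createCounter();
for (let i = 0; i < 3000; i++) {
  errorCounter.increment();
}
const chosenRate = chooseSamplingRate(errorCounter.value(), 60, 10);
console.log(chosenRate);
```

Here the observed volume is 50 errors/second, so logging one error in five keeps the expected rate at the target.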

• bmansurov unsubscribed.Dec 22 2017, 9:48 PM

@Jdlrobson, since we're going to be making lots of changes to JavaScript next year, do we want to consider this for the refactor project?

@Niedzielski Getting Sentry set up is too big a project to commit to, but we could commit time to setting up some EventLogging capture of errors. Is that what you had in mind?

@Jdlrobson, yeah, let's just report the errors and we can check them in our chores and on Tuesdays. The sooner the better!

Related task and schema on Android.

• Niedzielski awarded a token.Jun 19 2018, 7:08 PM

I'm afraid that if we do the reporting via EventLogging we will never do the Event handling properly.

@Niedzielski kindly (correct me if I'm wrong) is going to write a strawman proposal (subtask) to capture this. The team feels like error capturing would be useful, if an expectation is set that we will tear it down and write about success/struggles at the end of the refactor project.

@Jdlrobson, that's right. Done in T202026.

Jdlrobson added a subtask: T203814: Turn on MinervaErrorLogSamplingRate (Schema:WebClientError).Sep 7 2018, 4:56 PM

Jdlrobson edited projects, added Web-Team-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q1); removed Web-Team-Backlog.

Jdlrobson moved this task from To Do to Quarterly Goals on the Web-Team-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q1) board.

• Niedzielski added a subtask: T203815: [Bug] Beta cluster events not logged.Sep 7 2018, 4:56 PM

Jdlrobson added a parent task: T195475: [EPIC] Automate asset bundling in MobileFrontend.Sep 7 2018, 4:57 PM

• Nuria closed subtask T203815: [Bug] Beta cluster events not logged as Resolved.Sep 7 2018, 11:46 PM

Jdlrobson added a subtask: T205582: It should be possible to count errors via statsv in production.Sep 26 2018, 9:06 PM

ovasileva moved this task from Readers-Web-Kanbanana-Board-2018-19-Q1 to Readers-Web-Kanbanana-Board-2018-19-Q2 on the Web-Team-Backlog board.Oct 2 2018, 5:30 PM

ovasileva edited projects, added Web-Team-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2); removed Web-Team-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q1).

MBinder_WMF moved this task from To Do to Quarterly Goals on the Web-Team-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2) board.Oct 2 2018, 5:30 PM

Removed description as a lot of it is confusing given how we tackled T202026.

Jdlrobson closed subtask T202026: Report client-side JavaScript errors in MobileFrontend practically as Resolved.Oct 4 2018, 8:02 PM

Jdlrobson closed subtask T203814: Turn on MinervaErrorLogSamplingRate (Schema:WebClientError) as Resolved.Oct 5 2018, 6:12 PM

• nray closed subtask T205582: It should be possible to count errors via statsv in production as Resolved.Oct 12 2018, 3:42 PM

It seems that all subtasks of this epic are closed \o/ Is there any work remaining?

Once T206702 has been verified and no follow-ups are needed, I think we can close this goal card.

For the long term solution we have T106915 which we can hopefully bump priority for when we have data from the above two things.

Jdlrobson closed subtask T206702: Enable client side error counting on Minerva production (wikipedia only) as Resolved.Oct 18 2018, 11:52 PM

We're done here. It's been a long time coming, but we're done. @phuedx I've marked this with a cake on https://www.mediawiki.org/wiki/Reading/Web/Release_timeline#October

phuedx mentioned this in T217142: [Proposal] Use the Kafka-Logstash logging infrastructure to log client-side errors.Feb 27 2019, 2:20 PM
