
[EPIC] Enable WebClientError on production
Closed, Declined · Public

Description

WebClientError EventLogging reports are currently only permitted on the beta cluster. statsv is used in production but doesn't contain as much detail. This task is to identify an acceptable sampling rate and submit a patch to enable reporting in production at said rate. Additionally, consider removing the statsv logging if it is no longer needed.

Developer notes

Tagging as epic as there are several things to do here still.

  • Understand the EventLogging URL length constraints and how to deal with them. Since the first version was a proof of concept to count errors, we kept the original pass very simple but did hit T206257. We'd need to work out what fields to trim and how to do this (we bikeshedded a little when attempting to do this in the first pass).
  • Talk to analytics and get buy-in/permission. The previous attempt to enable this in production was descoped (see T203814#4576030 and description edits).
  • Add stack traces to WebClientError events. Right now we're not including them in the current implementation because of the URL length. These will be essential. See T202026 for more information.
  • Work out how to deal with error spiking. Even at a small sampling rate, an error that hits 100% of users could bring down our entire EventLogging cluster. We'd need to talk to analytics about whether rolling back a deploy is enough to mitigate such a spike.
  • Work out how to deal with the frustration of non-deduplication. The feedback we got from people who have used EventLogging is that having the stack traces allowed them to fix the common bugs, but the other ones were harder to track down. We have no idea of the spread of our errors - all of them could be different, or many of them could be duplicates. We should think about what queries we can use to de-duplicate errors (And work out how many people they impact) and how our stack trace can be structured to help facilitate that.
  • Work out a suitable sampling rate, based on the possibility of error spikes and the need to identify some of the bugs our users are hitting.
  • Progressive roll out - once all of the above is done we should ramp up the sampling rate gradually.
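
To make the sampling and spike concerns above more concrete, here is a minimal sketch of a client-side sampling decision combined with a per-page-view cap. The names isInSample and createCappedReporter are hypothetical and are not part of EventLogging or any MediaWiki extension; the rate and the cap are placeholders, not recommendations.

```javascript
// Hypothetical helpers for illustration only; not an existing API.

// Decide whether this page view is in the sample.
// `rate` is a fraction between 0 and 1; `random` is injectable for testing.
function isInSample( rate, random = Math.random ) {
	return random() < rate;
}

// Wrap a send function so that one page view reports at most `maxPerPage`
// errors. This bounds the load from a single page stuck in an error loop,
// but it does not protect against one error hitting every sampled user.
function createCappedReporter( send, rate, maxPerPage, random = Math.random ) {
	const sampled = isInSample( rate, random );
	let sent = 0;
	return function report( errorEvent ) {
		if ( !sampled || sent >= maxPerPage ) {
			return false;
		}
		sent++;
		send( errorEvent );
		return true;
	};
}
```

A cap like this only limits per-client volume. Whether rolling back a deploy is enough to handle a cluster-wide spike remains a question for analytics, as noted in the list above.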

Event Timeline

Restricted Application changed the subtype of this task from "Deadline" to "Task". · Nov 12 2018, 4:14 PM
Restricted Application added a subscriber: Aklapper.
Jdlrobson renamed this task from Enable WebClientError on production to [EPIC] Enable WebClientError on production. · Nov 12 2018, 4:39 PM
Jdlrobson added a project: Epic.
Jdlrobson updated the task description.
Jdlrobson moved this task from Needs Prioritization to Epics/Goals on the Web-Team-Backlog board.
Jdlrobson lowered the priority of this task from Medium to Low. · Jan 3 2019, 10:38 PM

Adding @phuedx who I think has a personal interest in client side error reporting. Please feel free to ignore otherwise.

Add stack traces to WebClientError events. Right now we're not including them in the current implementation because of the URL length. These will be essential. See T202026 for more information.

While I was investigating T217142: [Proposal] Use the Kafka-Logstash logging infrastructure to log client-side errors, I figured that, until RL supports source maps, the RL URLs in the stacktrace are meaningless and can be discarded without loss of information, e.g.

maybeLog @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:4
get @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:4
(anonymous) @ load.php?debug=false&lang=en&modules=ext.centralNotice.choiceData|ext.navigationTiming%2CwikimediaEvents|ext.quicksurveys.lib|ext.relatedArticles.readMore.bootstrap|ext.visualEditor.targetLoader|jquery%2Coojs-router|jquery.client|mediawiki.Title%2CUri%2Capi%2CjqueryMsg%2Clanguage%2Cuser%2Cutil|mediawiki.ui.anchor|mobile.init%2Csite%2Cstartup|mobile.messageBox.styles|mobile.pagelist.styles|mobile.pagesummary.styles|mobile.startup.images|mobile.startup.images.variants|skins.minerva.icons.images.scripts.misc|skins.minerva.icons.images.variants|skins.minerva.icons.page.issues.default.color|skins.minerva.icons.page.issues.medium.color|skins.minerva.icons.page.issues.uncolored|skins.minerva.mainMenu.icons%2Cstyles|skins.minerva.notifications%2Coptions%2Cscripts%2Ctalk%2Ctoggling|skins.minerva.notifications.badge|skins.minerva.options.share.icon&skin=minerva&version=0zmdnc2:409
mw.loader.implement.css @ load.php?debug=false&lang=en&modules=ext.centralNotice.choiceData|ext.navigationTiming%2CwikimediaEvents|ext.quicksurveys.lib|ext.relatedArticles.readMore.bootstrap|ext.visualEditor.targetLoader|jquery%2Coojs-router|jquery.client|mediawiki.Title%2CUri%2Capi%2CjqueryMsg%2Clanguage%2Cuser%2Cutil|mediawiki.ui.anchor|mobile.init%2Csite%2Cstartup|mobile.messageBox.styles|mobile.pagelist.styles|mobile.pagesummary.styles|mobile.startup.images|mobile.startup.images.variants|skins.minerva.icons.images.scripts.misc|skins.minerva.icons.images.variants|skins.minerva.icons.page.issues.default.color|skins.minerva.icons.page.issues.medium.color|skins.minerva.icons.page.issues.uncolored|skins.minerva.mainMenu.icons%2Cstyles|skins.minerva.notifications%2Coptions%2Cscripts%2Ctalk%2Ctoggling|skins.minerva.notifications.badge|skins.minerva.options.share.icon&skin=minerva&version=0zmdnc2:413
runScript @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:13
(anonymous) @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
flushCssBuffer @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
requestAnimationFrame (async)
addEmbeddedCSS @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
execute @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:15
doPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:7
requestIdleCallback (async)
requestPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
setAndPropagate @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
markModuleReady @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:13
runScript @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
(anonymous) @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
flushCssBuffer @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
requestAnimationFrame (async)
addEmbeddedCSS @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
execute @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:15
doPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:7
requestIdleCallback (async)
requestPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
setAndPropagate @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
markModuleReady @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:13
runScript @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
(anonymous) @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
flushCssBuffer @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
requestAnimationFrame (async)
addEmbeddedCSS @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
execute @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:15
doPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:7
requestIdleCallback (async)
requestPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
setAndPropagate @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
markModuleReady @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:13
runScript @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
(anonymous) @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
flushCssBuffer @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
requestAnimationFrame (async)
addEmbeddedCSS @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
execute @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:15
doPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:7
requestIdleCallback (async)
requestPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
setAndPropagate @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
implement @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:21
(anonymous) @ load.php?debug=false&lang=en&modules=ext.centralNotice.choiceData|ext.navigationTiming%2CwikimediaEvents|ext.quicksurveys.lib|ext.relatedArticles.readMore.bootstrap|ext.visualEditor.targetLoader|jquery%2Coojs-router|jquery.client|mediawiki.Title%2CUri%2Capi%2CjqueryMsg%2Clanguage%2Cuser%2Cutil|mediawiki.ui.anchor|mobile.init%2Csite%2Cstartup|mobile.messageBox.styles|mobile.pagelist.styles|mobile.pagesummary.styles|mobile.startup.images|mobile.startup.images.variants|skins.minerva.icons.images.scripts.misc|skins.minerva.icons.images.variants|skins.minerva.icons.page.issues.default.color|skins.minerva.icons.page.issues.medium.color|skins.minerva.icons.page.issues.uncolored|skins.minerva.mainMenu.icons%2Cstyles|skins.minerva.notifications%2Coptions%2Cscripts%2Ctalk%2Ctoggling|skins.minerva.notifications.badge|skins.minerva.options.share.icon&skin=minerva&version=0zmdnc2:1

Can be reduced to:

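// st holds the stack trace above as a single string (elided here).
const st = '...';
// Drop everything from " @" to the end of each line, keeping only the frame name.
const st2 = st.replace( / @.+\n?/g, '\n' );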

// =>
// maybeLog
// get
// (anonymous)
// mw.loader.implement.css
// runScript
// (anonymous)
// flushCssBuffer
// requestAnimationFrame (async)
// addEmbeddedCSS
// execute
// doPropagation
// requestIdleCallback (async)
// requestPropagation
// setAndPropagate
// markModuleReady
// runScript
// (anonymous)
// flushCssBuffer
// requestAnimationFrame (async)
// addEmbeddedCSS
// execute
// doPropagation
// requestIdleCallback (async)
// requestPropagation
// setAndPropagate
// markModuleReady
// runScript
// (anonymous)
// flushCssBuffer
// requestAnimationFrame (async)
// addEmbeddedCSS
// execute
// doPropagation
// requestIdleCallback (async)
// requestPropagation
// setAndPropagate
// markModuleReady
// runScript
// (anonymous)
// flushCssBuffer
// requestAnimationFrame (async)
// addEmbeddedCSS
// execute
// doPropagation
// requestIdleCallback (async)
// requestPropagation
// setAndPropagate
// implement
// (anonymous)


console.log( st.length, st2.length, st2.length / st.length * 100 ); // => 6841 788 11.5187...
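
For anyone wanting to try the reduction without pasting the full trace, here is a small self-contained variant. The sample trace is shortened and made up (it only mimics the "name @ url:line" shape above), and stripFrameUrls is a hypothetical name. It uses the m flag with $ instead of matching the newline, so it does not leave a trailing newline the way the snippet above does.

```javascript
// Shortened, made-up sample in the same "name @ url:line" shape as the
// Chrome-style trace above.
const sample = [
	'maybeLog @ load.php?modules=startup&only=scripts:4',
	'requestAnimationFrame (async)',
	'runScript @ load.php?modules=startup&only=scripts:13'
].join( '\n' );

// Drop everything from " @" to the end of each line, keeping the frame name.
// Lines without " @" (such as the async markers) are left untouched.
function stripFrameUrls( stack ) {
	return stack.replace( / @.+$/gm, '' );
}

console.log( stripFrameUrls( sample ) );
```

This assumes the Chrome console format shown above. Other browsers format stack traces differently and would need their own handling.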

Work out how to deal with the frustration of non-deduplication. The feedback we got from people who have used EventLogging is that having the stack traces allowed them to fix the common bugs, but the other ones were harder to track down. We have no idea of the spread of our errors - all of them could be different, or many of them could be duplicates. We should think about what queries we can use to de-duplicate errors (And work out how many people they impact) and how our stack trace can be structured to help facilitate that.

Fingerprinting an error on the client and then including that as part of the event should work.
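
As a rough illustration of what such a fingerprint could look like, the sketch below hashes the error message together with the frame names only, so that two reports differing only in their load.php URLs get the same value. Everything here is an assumption: fingerprintError and fnv1a32 are hypothetical names, the hash is an FNV-1a-style 32-bit hash over UTF-16 code units chosen purely for brevity, and the frame format is the Chrome-style "name @ url:line" shape from the trace earlier in this thread.

```javascript
// FNV-1a-style 32-bit hash over UTF-16 code units, returned as 8 hex digits.
// Chosen for brevity; any stable, compact hash would do for this purpose.
function fnv1a32( str ) {
	let hash = 0x811c9dc5;
	for ( let i = 0; i < str.length; i++ ) {
		hash ^= str.charCodeAt( i );
		hash = Math.imul( hash, 0x01000193 );
	}
	return ( hash >>> 0 ).toString( 16 ).padStart( 8, '0' );
}

// Hypothetical fingerprint: message plus frame names, with the " @ url:line"
// part of each frame removed so that URL changes do not alter the result.
function fingerprintError( message, stack ) {
	const frames = stack.replace( / @.+$/gm, '' );
	return fnv1a32( message + '\n' + frames );
}
```

Events carrying the same fingerprint could then be grouped and counted in a query, which would also give an estimate of how many users each distinct error affects.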

FWIW the old plan for the Multimedia team (which was never resourced beyond the MVP level) was T382: RfC: Server-side Javascript error logging and the associated RfC, which is now very dated but might contain some useful information.

@phuedx was wondering about the state of this today and whether we can remove the EventLogging code ($wgMinervaErrorLogSamplingRate) given recent developments.

Declining in favor of the direction we are taking in T217142.