Page MenuHomePhabricator

[WIP] [Proposal] Use the Kafka-Logstash logging infrastructure to log client-side errors
Open, Needs TriagePublic

Description

There've been numerous discussions about logging client-side errors, which, for myriad reasons, have stalled. This proposal describes a practical approach to logging client-side errors using technologies that are already deployed and actively maintained by WMF and aims to leverage the work already being done on this infrastructure by SRE.

We propose that client-side errors are caught, normalised, encoded, and sent to a "beacon endpoint" (i.e. /beacon/error). Requests to that endpoint are tailed, and the associated information formatted and added to Kafka on a well-known topic. Logstash will then consume that stream of information and transform it as necessary.

Some well-known examples of this "requests to beacon endpoint to Kafka to $consumer" pipeline are EventLogging and statsv.

1 Prior art

1.1 Within WMF

1.1.1 Readers Web's MinervaClientError metric

Readers Web began counting the number of client-side errors occurring for users using the Minerva skin in T205582: It should be possible to count errors via statsv in production as a result of a fairly-isolated discussion about the problem (see T167699). We now have some sense of how many client-side errors are occurring and how that varies over time: https://grafana.wikimedia.org/d/000000566/overview?orgId=1&from=now-30d&to=now&panelId=15&fullscreen

1.2 Without WMF

1.2.1 Sentry

TBD

2 Considerations

2.1 Limitations in URL length

Requests made to a beacon endpoint are currently expected to use the HTTP GET method with the data in the URL's query string. However, we're limited to the amount of data we can include in the URL, e.g. EventLogging has a maximum URL size of 2000.

However, unless the user is browsing the site with ResourceLoader's debug mode enabled (or ResourceLoader is made to support source maps), the majority of the information in a stacktrace can be safely discarded, e.g. consider the following stacktrace:

maybeLog @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:4
get @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:4
(anonymous) @ load.php?debug=false&lang=en&modules=ext.centralNotice.choiceData|ext.navigationTiming%2CwikimediaEvents|ext.quicksurveys.lib|ext.relatedArticles.readMore.bootstrap|ext.visualEditor.targetLoader|jquery%2Coojs-router|jquery.client|mediawiki.Title%2CUri%2Capi%2CjqueryMsg%2Clanguage%2Cuser%2Cutil|mediawiki.ui.anchor|mobile.init%2Csite%2Cstartup|mobile.messageBox.styles|mobile.pagelist.styles|mobile.pagesummary.styles|mobile.startup.images|mobile.startup.images.variants|skins.minerva.icons.images.scripts.misc|skins.minerva.icons.images.variants|skins.minerva.icons.page.issues.default.color|skins.minerva.icons.page.issues.medium.color|skins.minerva.icons.page.issues.uncolored|skins.minerva.mainMenu.icons%2Cstyles|skins.minerva.notifications%2Coptions%2Cscripts%2Ctalk%2Ctoggling|skins.minerva.notifications.badge|skins.minerva.options.share.icon&skin=minerva&version=0zmdnc2:409
mw.loader.implement.css @ load.php?debug=false&lang=en&modules=ext.centralNotice.choiceData|ext.navigationTiming%2CwikimediaEvents|ext.quicksurveys.lib|ext.relatedArticles.readMore.bootstrap|ext.visualEditor.targetLoader|jquery%2Coojs-router|jquery.client|mediawiki.Title%2CUri%2Capi%2CjqueryMsg%2Clanguage%2Cuser%2Cutil|mediawiki.ui.anchor|mobile.init%2Csite%2Cstartup|mobile.messageBox.styles|mobile.pagelist.styles|mobile.pagesummary.styles|mobile.startup.images|mobile.startup.images.variants|skins.minerva.icons.images.scripts.misc|skins.minerva.icons.images.variants|skins.minerva.icons.page.issues.default.color|skins.minerva.icons.page.issues.medium.color|skins.minerva.icons.page.issues.uncolored|skins.minerva.mainMenu.icons%2Cstyles|skins.minerva.notifications%2Coptions%2Cscripts%2Ctalk%2Ctoggling|skins.minerva.notifications.badge|skins.minerva.options.share.icon&skin=minerva&version=0zmdnc2:413
runScript @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:13
(anonymous) @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
flushCssBuffer @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
requestAnimationFrame (async)
addEmbeddedCSS @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
execute @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:15
doPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:7
requestIdleCallback (async)
requestPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
setAndPropagate @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
markModuleReady @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:13
runScript @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
(anonymous) @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
flushCssBuffer @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
requestAnimationFrame (async)
addEmbeddedCSS @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
execute @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:15
doPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:7
requestIdleCallback (async)
requestPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
setAndPropagate @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
markModuleReady @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:13
runScript @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
(anonymous) @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
flushCssBuffer @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
requestAnimationFrame (async)
addEmbeddedCSS @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
execute @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:15
doPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:7
requestIdleCallback (async)
requestPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
setAndPropagate @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
markModuleReady @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:13
runScript @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
(anonymous) @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
flushCssBuffer @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
requestAnimationFrame (async)
addEmbeddedCSS @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
execute @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:15
doPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:7
requestIdleCallback (async)
requestPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
setAndPropagate @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
implement @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:21
(anonymous) @ load.php?debug=false&lang=en&modules=ext.centralNotice.choiceData|ext.navigationTiming%2CwikimediaEvents|ext.quicksurveys.lib|ext.relatedArticles.readMore.bootstrap|ext.visualEditor.targetLoader|jquery%2Coojs-router|jquery.client|mediawiki.Title%2CUri%2Capi%2CjqueryMsg%2Clanguage%2Cuser%2Cutil|mediawiki.ui.anchor|mobile.init%2Csite%2Cstartup|mobile.messageBox.styles|mobile.pagelist.styles|mobile.pagesummary.styles|mobile.startup.images|mobile.startup.images.variants|skins.minerva.icons.images.scripts.misc|skins.minerva.icons.images.variants|skins.minerva.icons.page.issues.default.color|skins.minerva.icons.page.issues.medium.color|skins.minerva.icons.page.issues.uncolored|skins.minerva.mainMenu.icons%2Cstyles|skins.minerva.notifications%2Coptions%2Cscripts%2Ctalk%2Ctoggling|skins.minerva.notifications.badge|skins.minerva.options.share.icon&skin=minerva&version=0zmdnc2:1`

The above could be reduced to the following:

// const st = ...
const st2 = st.replace( / @.+\n?/g, '\n' );

console.log( st.length, st2.length ); // 6841 788

// For completeness:
console.log( st2 );
// =>
// maybeLog
// get
// (anonymous)
// mw.loader.implement.css
// runScript
// (anonymous)
// flushCssBuffer
// requestAnimationFrame (async)
// addEmbeddedCSS
// execute
// doPropagation
// requestIdleCallback (async)
// requestPropagation
// setAndPropagate
// markModuleReady
// runScript
// (anonymous)
// flushCssBuffer
// requestAnimationFrame (async)
// addEmbeddedCSS
// execute
// doPropagation
// requestIdleCallback (async)
// requestPropagation
// setAndPropagate
// markModuleReady
// runScript
// (anonymous)
// flushCssBuffer
// requestAnimationFrame (async)
// addEmbeddedCSS
// execute
// doPropagation
// requestIdleCallback (async)
// requestPropagation
// setAndPropagate
// markModuleReady
// runScript
// (anonymous)
// flushCssBuffer
// requestAnimationFrame (async)
// addEmbeddedCSS
// execute
// doPropagation
// requestIdleCallback (async)
// requestPropagation
// setAndPropagate
// implement
// (anonymous)

2.2 Burstiness

If a syntax error were introduced in a JavaScript asset that's delivered to all clients, then we'd see upwards of 5500 errors reported per second. The simplest way of dealing with this issue is to enable client-side error reporting for 1% of all pageviews.

We might also consider creating a service that acts as:

  • A leaky bucket if we don't want to permit bursts or token bucket if we do (but at a fixed rate);
  • A classful version of the above, if we want to "roll up" errors based on one or more normalized properties.

These services would allow us to maximise the number of clients that can report errors, thereby increasing the likelihood of capturing relatively low-rate errors. However, the introduction of any such service would require a long-term maintenance commitment from at least SRE.

2.3 Pre-existing tools for exploration

In 1.1.1, it was noted that Readers Web are already counting the number of client-side errors for users using the Minerva skin. We do so using statsv. Currently, we leverage that statsv makes requests to a beacon endpoint and that the Analytics team provides tools like Turnilo, which allow us to explore request data at a high-level, to find trends in client-side errors, e.g. a graph client-side errors for users using the Minerva skin by continent. It's worth noting that if this proposal were implemented, this facility would still be available.

Event Timeline

phuedx created this task.Feb 26 2019, 3:32 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 26 2019, 3:32 PM
phuedx updated the task description. (Show Details)Feb 27 2019, 2:20 PM

CC @Jhernandez, @Tgr, @faidon, @fgiunchedi, @herron for visibility. My apologies in advance for the description churn, y'all.

fgiunchedi added a subscriber: Ottomata.

Looks good to me from a quick read @phuedx ! Adding Wikimedia-Logstash and cc @Ottomata for visibility.

Jdlrobson added a subscriber: Jdlrobson.
phuedx updated the task description. (Show Details)Feb 27 2019, 6:11 PM
phuedx updated the task description. (Show Details)Feb 28 2019, 9:42 AM

I think the URL length limitation will bite us if we stick to GET beacons only. How does Sentry work to intake the errors?

Ottomata removed a subscriber: Ottomata.Feb 28 2019, 3:22 PM

@Ottomata Would it be possible to send a JSON body via POST to be able to send a bit more payload if we route the events through event-gate?

Yes. We'd likely have to an EventGate endpoint implementation to allow unschemaed data through, but this would not be difficult. Unless we have a generic schema for client error logs! :) Might not be a bad idea.

Related task T77110: Push messages to logstash from JS. Not sure if it should be a subtask of this one or something else.

phuedx updated the task description. (Show Details)Mar 6 2019, 1:59 PM
phuedx added a comment.Mar 6 2019, 2:05 PM

Unless we have a generic schema for client error logs! :) Might not be a bad idea.

I think this is a great idea.

phuedx updated the task description. (Show Details)Mar 6 2019, 2:06 PM
pmiazga added a subscriber: pmiazga.Mar 7 2019, 5:38 PM
CDanis added a subscriber: CDanis.Mar 13 2019, 4:22 PM

Notes from today's meeting https://etherpad.wikimedia.org/p/clients-error-logging

Main takeaway(s) (IMHO, please integrate as needed):

  • Client side errors need better visibility (e.g. what caused the error, not just error count)
  • Something like Sentry might be what we need, however deployment to production isn't trivial
  • A PoC based on logging pipeline and visualized in kibana will get us off the ground quickly/cheaply
  • Client error reporting is a use case for MEP / eventgate as well
Tgr added a comment.Thu, Apr 11, 12:55 AM

To recap what was said at the meeting and in some earlier conversations, there are three independent work streams:

  • Capturing errors on the client side.
    • Extension:Sentry does this (it loads Raven.js, the old Sentry client on demand when an error happens) but the code is four years old now so probably should be updated to the current Sentry client, @sentry/browser (which might or might not be simple). The old code contained some awkwards workarounds, largely because Raven.js was a wrapper around TraceKit which was a valuable library with lots of hard-won compatibility logic for various browser quirks but very inflexible in how it expected to be used (e.g. it really did not expect to be loaded only after the error was thrown). Back then they were on the verge of removing it and rewriting Raven.js so that probably happened since then.
    • We could also write our own client. The benefit of using the official one is support for lots of browser quirks, including parsing of different stack trace renderings in different browsers (which could be reimplemented but it takes time). The benefit of a custom client is that it will probably end up being much smaller (@sentry/browser is 50K minified). Also stack traces are only useful if there are source maps, plus ResourceLoader client side caching would also have to be dealt with (probably also doable with source maps - see T90524 for details).
  • A pipeline for getting error logs to their destination.
    • In theory this could be a Sentry server. When I showed our traffic numbers to Sentry devs a few years ago they said a single beefy bare-metal server could probably take it (but their own traffic numbers were a bit below our worst case predictions so it was just a guess). But there's no guarantee and anyway we'd probably want decoupling and a standard pipeline for all events.
    • Using an EventLogging-ish process with a beacon endpoint being read into Kafka by varnishkafka, as described in this task (that was the old plan as well: T501). Problematic due to size limits (we'd have to drop stack traces at the minimum).
    • Using EventGate to send reports to Kafka, which seems the clear winner. With both this and the previous option, we'd want some kind of rate limiting in Kafka (Kafka itself scales well).
  • Storing/displaying the erros.
    • Sentry would be the nice option here (it has lots of features for supporting browsing, searching, displaying and managing errors) but it's a fair bit of work to package (unless we decide running third-party docker containers is acceptable now). Bonus is that it has clients for just about any programming language and we have plenty of systems (both WMF and community-maintained) where scaling is not a concern so we can just stick in a client and get error management for free. So probably a good long-term goal.
    • Just dumping the errors into logstash would probably get us decent searching and trend monitoring capabilities for free, so that seems the reasonable thing to do initially. (This was the old plan - T502.) Even if we use Sentry eventually, we might want to keep Logstash in the pipline for deduplication.

The first could be done by Reading Web or whichever frontend team is interested in this, the second needs help from Analytics, the third needs help from SRE.

Thanks Gergo, great summary! :D

colewhite added a comment.EditedWed, Apr 24, 3:48 PM

Copying thoughts to task:

I have two concerns:

  1. Mitigating the thundering-herd -- In the event of all clients emitting errors simultaneously, how will this system keep from tipping over and possibly cascading and taking other services with it?
  1. Mitigating abuse -- In the event of a malicious user repeatedly hit this endpoint at a high rate, either in a distributed or non-distributed fashion, will this endpoint blindly accept any valid-looking data or can we discriminate signal from the noise?

I apologize if these questions were answered elsewhere, but I want to make sure they are at least voiced.

Ottomata added a comment.EditedWed, Apr 24, 3:58 PM

Qs:

  • Would all error events go to the same Kafka topic (I think Filippo said yes)
  • Should (Sentry?) error events be validated against a JSONSchema? (Andrew thinks yes!)
  • Does Sentry have a JSONSchema for its data format(s)? (looks like no? https://github.com/getsentry/sentry/issues/1018)
Tgr added a comment.Wed, Apr 24, 4:40 PM

So currently on an error in the beta cluster, Extension:Sentry will send a GET to https://sentry-beta.wmflabs.org/api/4/store/?sentry_version=4&sentry_client=raven-js/1.1.16&sentry_key=...&sentry_data=... where the second ... is an URL-encoded JSON object like this:

{
  "project": "4",
  "logger": "javascript",
  "platform": "javascript",
  "request": {
    "url": "https://meta.wikimedia.beta.wmflabs.org/wiki/Special:Version",
    "headers": {
      "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
      "Referer": "https://meta.wikimedia.beta.wmflabs.org/wiki/Main_Page"
    }
  },
  "exception": {
    "value": "Script error."
  },
  "stacktrace": {
    "frames": [
      {
        "filename": "https://meta.wikimedia.beta.wmflabs.org/wiki/Special:Version",
        "in_app": true
      }
    ]
  },
  "culprit": "https://meta.wikimedia.beta.wmflabs.org/wiki/Special:Version",
  "message": "Script error.",
  "tags": {
    "version": "1.34.0-alpha",
    "debug": false,
    "skin": "vector",
    "action": "view",
    "ns": -1,
    "pageName": "Special:Version",
    "userGroups": [
      "autopatrolled",
      "oauthadmin",
      "*",
      "user",
      "autoconfirmed"
    ],
    "language": "en"
  },
  "event_id": "8752081ad1d147a6b672ac6e3611009d"
}

(The stack trace is truncated due to some browser security restrictions around the method I use to generate fake errors, normally there would be more info there.)

We can set the reporting URL to something else with a change in operations/mediawiki-config, it comes from the $wgSentryDsn global. If we want to send the JSON in a nicer way (POST body, navigator.sendBeacon etc), I didn't look into that now but my recollection is that you can define your own reporting callback.

Note that this is four years old code, we'll want to upgrade to the current version of the Sentry client (Raven.js) which might use a different format (I haven't checked). We don't need to do that for the beta prototype though.

EventGate only accepts events via POST, so we'll at least need to do that. I'll also either need to make a JSONSchema for whatever sentry sends, or make an eventgate implementation (requires some minimal coding) that doesn't do any event validation.

If we eventually do use this, I'd like for us to make Mediawiki data fields closer to the event schema conventions we've been using for EventBus events. E.g. page_title and user_groups instead of pageName and userGroups. That doesn't have to happen for beta prototype :)

Tgr added a subscriber: Milimetric.Wed, Apr 24, 6:45 PM

EventGate only accepts events via POST, so we'll at least need to do that.

At least in the old client version that's currently on Beta that seems trivial to do, just override globalOptions.shouldSendCallback to do the reporting and cancel the default reporting. Maybe @Milimetric is interested in picking that up as a first task?

If we eventually do use this, I'd like for us to make Mediawiki data fields closer to the event schema conventions we've been using for EventBus events. E.g. page_title and user_groups instead of pageName and userGroups.

Also trivial, it's handled here.