Page MenuHomePhabricator

[Proposal] Use the Kafka-Logstash logging infrastructure to log client-side errors
Closed, ResolvedPublic

Description

There've been numerous discussions about logging client-side errors, which, for myriad reasons, have stalled. This proposal describes a practical approach to logging client-side errors using technologies that are already deployed and actively maintained by WMF and aims to leverage the work already being done on this infrastructure by SRE.

We propose that client-side errors are caught, normalised, encoded, and sent to a "beacon endpoint" (i.e. /beacon/error). Requests to that endpoint are tailed, and the associated information formatted and added to Kafka on a well-known topic. Logstash will then consume that stream of information and transform it as necessary.

Some well-known examples of this "requests to beacon endpoint to Kafka to $consumer" pipeline are EventLogging and statsv.

1 Prior art

1.1 Within WMF

1.1.1 Readers Web's MinervaClientError metric

Readers Web began counting the number of client-side errors occurring for users using the Minerva skin in T205582: It should be possible to count errors via statsv in production as a result of a fairly-isolated discussion about the problem (see T167699). We now have some sense of how many client-side errors are occurring and how that varies over time: https://grafana.wikimedia.org/d/000000566/overview?orgId=1&from=now-30d&to=now&panelId=15&fullscreen

1.2 Without WMF

1.2.1 Sentry

TBD

2 Considerations

2.1 Limitations in URL length

Requests made to a beacon endpoint are currently expected to use the HTTP GET method with the data in the URL's query string. However, we're limited to the amount of data we can include in the URL, e.g. EventLogging has a maximum URL size of 2000.

However, unless the user is browsing the site with ResourceLoader's debug mode enabled (or ResourceLoader is made to support source maps), the majority of the information in a stacktrace can be safely discarded, e.g. consider the following stacktrace:

maybeLog @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:4
get @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:4
(anonymous) @ load.php?debug=false&lang=en&modules=ext.centralNotice.choiceData|ext.navigationTiming%2CwikimediaEvents|ext.quicksurveys.lib|ext.relatedArticles.readMore.bootstrap|ext.visualEditor.targetLoader|jquery%2Coojs-router|jquery.client|mediawiki.Title%2CUri%2Capi%2CjqueryMsg%2Clanguage%2Cuser%2Cutil|mediawiki.ui.anchor|mobile.init%2Csite%2Cstartup|mobile.messageBox.styles|mobile.pagelist.styles|mobile.pagesummary.styles|mobile.startup.images|mobile.startup.images.variants|skins.minerva.icons.images.scripts.misc|skins.minerva.icons.images.variants|skins.minerva.icons.page.issues.default.color|skins.minerva.icons.page.issues.medium.color|skins.minerva.icons.page.issues.uncolored|skins.minerva.mainMenu.icons%2Cstyles|skins.minerva.notifications%2Coptions%2Cscripts%2Ctalk%2Ctoggling|skins.minerva.notifications.badge|skins.minerva.options.share.icon&skin=minerva&version=0zmdnc2:409
mw.loader.implement.css @ load.php?debug=false&lang=en&modules=ext.centralNotice.choiceData|ext.navigationTiming%2CwikimediaEvents|ext.quicksurveys.lib|ext.relatedArticles.readMore.bootstrap|ext.visualEditor.targetLoader|jquery%2Coojs-router|jquery.client|mediawiki.Title%2CUri%2Capi%2CjqueryMsg%2Clanguage%2Cuser%2Cutil|mediawiki.ui.anchor|mobile.init%2Csite%2Cstartup|mobile.messageBox.styles|mobile.pagelist.styles|mobile.pagesummary.styles|mobile.startup.images|mobile.startup.images.variants|skins.minerva.icons.images.scripts.misc|skins.minerva.icons.images.variants|skins.minerva.icons.page.issues.default.color|skins.minerva.icons.page.issues.medium.color|skins.minerva.icons.page.issues.uncolored|skins.minerva.mainMenu.icons%2Cstyles|skins.minerva.notifications%2Coptions%2Cscripts%2Ctalk%2Ctoggling|skins.minerva.notifications.badge|skins.minerva.options.share.icon&skin=minerva&version=0zmdnc2:413
runScript @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:13
(anonymous) @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
flushCssBuffer @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
requestAnimationFrame (async)
addEmbeddedCSS @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
execute @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:15
doPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:7
requestIdleCallback (async)
requestPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
setAndPropagate @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
markModuleReady @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:13
runScript @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
(anonymous) @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
flushCssBuffer @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
requestAnimationFrame (async)
addEmbeddedCSS @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
execute @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:15
doPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:7
requestIdleCallback (async)
requestPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
setAndPropagate @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
markModuleReady @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:13
runScript @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
(anonymous) @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
flushCssBuffer @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
requestAnimationFrame (async)
addEmbeddedCSS @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
execute @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:15
doPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:7
requestIdleCallback (async)
requestPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
setAndPropagate @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
markModuleReady @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:13
runScript @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
(anonymous) @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:14
flushCssBuffer @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
requestAnimationFrame (async)
addEmbeddedCSS @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:6
execute @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:15
doPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:7
requestIdleCallback (async)
requestPropagation @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
setAndPropagate @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:8
implement @ load.php?debug=false&lang=en&modules=startup&only=scripts&skin=minerva&target=mobile:21
(anonymous) @ load.php?debug=false&lang=en&modules=ext.centralNotice.choiceData|ext.navigationTiming%2CwikimediaEvents|ext.quicksurveys.lib|ext.relatedArticles.readMore.bootstrap|ext.visualEditor.targetLoader|jquery%2Coojs-router|jquery.client|mediawiki.Title%2CUri%2Capi%2CjqueryMsg%2Clanguage%2Cuser%2Cutil|mediawiki.ui.anchor|mobile.init%2Csite%2Cstartup|mobile.messageBox.styles|mobile.pagelist.styles|mobile.pagesummary.styles|mobile.startup.images|mobile.startup.images.variants|skins.minerva.icons.images.scripts.misc|skins.minerva.icons.images.variants|skins.minerva.icons.page.issues.default.color|skins.minerva.icons.page.issues.medium.color|skins.minerva.icons.page.issues.uncolored|skins.minerva.mainMenu.icons%2Cstyles|skins.minerva.notifications%2Coptions%2Cscripts%2Ctalk%2Ctoggling|skins.minerva.notifications.badge|skins.minerva.options.share.icon&skin=minerva&version=0zmdnc2:1`

The above could be reduced to the following:

// const st = ...
const st2 = st.replace( / @.+\n?/g, '\n' );

console.log( st.length, st2.length ); // 6841 788

// For completeness:
console.log( st2 );
// =>
// maybeLog
// get
// (anonymous)
// mw.loader.implement.css
// runScript
// (anonymous)
// flushCssBuffer
// requestAnimationFrame (async)
// addEmbeddedCSS
// execute
// doPropagation
// requestIdleCallback (async)
// requestPropagation
// setAndPropagate
// markModuleReady
// runScript
// (anonymous)
// flushCssBuffer
// requestAnimationFrame (async)
// addEmbeddedCSS
// execute
// doPropagation
// requestIdleCallback (async)
// requestPropagation
// setAndPropagate
// markModuleReady
// runScript
// (anonymous)
// flushCssBuffer
// requestAnimationFrame (async)
// addEmbeddedCSS
// execute
// doPropagation
// requestIdleCallback (async)
// requestPropagation
// setAndPropagate
// markModuleReady
// runScript
// (anonymous)
// flushCssBuffer
// requestAnimationFrame (async)
// addEmbeddedCSS
// execute
// doPropagation
// requestIdleCallback (async)
// requestPropagation
// setAndPropagate
// implement
// (anonymous)

2.2 Burstiness

If a syntax error were introduced in a JavaScript asset that's delivered to all clients, then we'd see upwards of 5500 errors reported per second. The simplest way of dealing with this issue is to enable client-side error reporting for 1% of all pageviews.

We might also consider creating a service that acts as:

  • A leaky bucket if we don't want to permit bursts or token bucket if we do (but at a fixed rate);
  • A classful version of the above, if we want to "roll up" errors based on one or more normalized properties.

These services would allow us to maximise the number of clients that can report errors, thereby increasing the likelihood of capturing relatively low-rate errors. However, the introduction of any such service would require a long-term maintenance commitment from at least SRE.

2.3 Pre-existing tools for exploration

In 1.1.1, it was noted that Readers Web are already counting the number of client-side errors for users using the Minerva skin. We do so using statsv. Currently, we leverage that statsv makes requests to a beacon endpoint and that the Analytics team provides tools like Turnilo, which allow us to explore request data at a high-level, to find trends in client-side errors, e.g. a graph client-side errors for users using the Minerva skin by continent. It's worth noting that if this proposal were implemented, this facility would still be available.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Yes. We'd likely have to an EventGate endpoint implementation to allow unschemaed data through, but this would not be difficult. Unless we have a generic schema for client error logs! :) Might not be a bad idea.

Related task T77110: Push messages to logstash from JS. Not sure if it should be a subtask of this one or something else.

Unless we have a generic schema for client error logs! :) Might not be a bad idea.

I think this is a great idea.

Notes from today's meeting https://etherpad.wikimedia.org/p/clients-error-logging

Main takeaway(s) (IMHO, please integrate as needed):

  • Client side errors need better visibility (e.g. what caused the error, not just error count)
  • Something like Sentry might be what we need, however deployment to production isn't trivial
  • A PoC based on logging pipeline and visualized in kibana will get us off the ground quickly/cheaply
  • Client error reporting is a use case for MEP / eventgate as well

To recap what was said at the meeting and in some earlier conversations, there are three independent work streams:

  • Capturing errors on the client side.
    • Extension:Sentry does this (it loads Raven.js, the old Sentry client on demand when an error happens) but the code is four years old now so probably should be updated to the current Sentry client, @sentry/browser (which might or might not be simple). The old code contained some awkwards workarounds, largely because Raven.js was a wrapper around TraceKit which was a valuable library with lots of hard-won compatibility logic for various browser quirks but very inflexible in how it expected to be used (e.g. it really did not expect to be loaded only after the error was thrown). Back then they were on the verge of removing it and rewriting Raven.js so that probably happened since then.
    • We could also write our own client. The benefit of using the official one is support for lots of browser quirks, including parsing of different stack trace renderings in different browsers (which could be reimplemented but it takes time). The benefit of a custom client is that it will probably end up being much smaller (@sentry/browser is 50K minified). Also stack traces are only useful if there are source maps, plus ResourceLoader client side caching would also have to be dealt with (probably also doable with source maps - see T90524 for details).
  • A pipeline for getting error logs to their destination.
    • In theory this could be a Sentry server. When I showed our traffic numbers to Sentry devs a few years ago they said a single beefy bare-metal server could probably take it (but their own traffic numbers were a bit below our worst case predictions so it was just a guess). But there's no guarantee and anyway we'd probably want decoupling and a standard pipeline for all events.
    • Using an EventLogging-ish process with a beacon endpoint being read into Kafka by varnishkafka, as described in this task (that was the old plan as well: T501). Problematic due to size limits (we'd have to drop stack traces at the minimum).
    • Using EventGate to send reports to Kafka, which seems the clear winner. With both this and the previous option, we'd want some kind of rate limiting in Kafka (Kafka itself scales well).
  • Storing/displaying the erros.
    • Sentry would be the nice option here (it has lots of features for supporting browsing, searching, displaying and managing errors) but it's a fair bit of work to package (unless we decide running third-party docker containers is acceptable now). Bonus is that it has clients for just about any programming language and we have plenty of systems (both WMF and community-maintained) where scaling is not a concern so we can just stick in a client and get error management for free. So probably a good long-term goal.
    • Just dumping the errors into logstash would probably get us decent searching and trend monitoring capabilities for free, so that seems the reasonable thing to do initially. (This was the old plan - T502.) Even if we use Sentry eventually, we might want to keep Logstash in the pipline for deduplication.

The first could be done by Reading Web or whichever frontend team is interested in this, the second needs help from Analytics, the third needs help from SRE.

Thanks Gergo, great summary! :D

Copying thoughts to task:

I have two concerns:

  1. Mitigating the thundering-herd -- In the event of all clients emitting errors simultaneously, how will this system keep from tipping over and possibly cascading and taking other services with it?
  1. Mitigating abuse -- In the event of a malicious user repeatedly hit this endpoint at a high rate, either in a distributed or non-distributed fashion, will this endpoint blindly accept any valid-looking data or can we discriminate signal from the noise?

I apologize if these questions were answered elsewhere, but I want to make sure they are at least voiced.

Qs:

  • Would all error events go to the same Kafka topic (I think Filippo said yes)
  • Should (Sentry?) error events be validated against a JSONSchema? (Andrew thinks yes!)
  • Does Sentry have a JSONSchema for its data format(s)? (looks like no? https://github.com/getsentry/sentry/issues/1018)

So currently on an error in the beta cluster, Extension:Sentry will send a GET to https://sentry-beta.wmflabs.org/api/4/store/?sentry_version=4&sentry_client=raven-js/1.1.16&sentry_key=...&sentry_data=... where the second ... is an URL-encoded JSON object like this:

{
  "project": "4",
  "logger": "javascript",
  "platform": "javascript",
  "request": {
    "url": "https://meta.wikimedia.beta.wmflabs.org/wiki/Special:Version",
    "headers": {
      "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
      "Referer": "https://meta.wikimedia.beta.wmflabs.org/wiki/Main_Page"
    }
  },
  "exception": {
    "value": "Script error."
  },
  "stacktrace": {
    "frames": [
      {
        "filename": "https://meta.wikimedia.beta.wmflabs.org/wiki/Special:Version",
        "in_app": true
      }
    ]
  },
  "culprit": "https://meta.wikimedia.beta.wmflabs.org/wiki/Special:Version",
  "message": "Script error.",
  "tags": {
    "version": "1.34.0-alpha",
    "debug": false,
    "skin": "vector",
    "action": "view",
    "ns": -1,
    "pageName": "Special:Version",
    "userGroups": [
      "autopatrolled",
      "oauthadmin",
      "*",
      "user",
      "autoconfirmed"
    ],
    "language": "en"
  },
  "event_id": "8752081ad1d147a6b672ac6e3611009d"
}

(The stack trace is truncated due to some browser security restrictions around the method I use to generate fake errors, normally there would be more info there.)

We can set the reporting URL to something else with a change in operations/mediawiki-config, it comes from the $wgSentryDsn global. If we want to send the JSON in a nicer way (POST body, navigator.sendBeacon etc), I didn't look into that now but my recollection is that you can define your own reporting callback.

Note that this is four years old code, we'll want to upgrade to the current version of the Sentry client (Raven.js) which might use a different format (I haven't checked). We don't need to do that for the beta prototype though.

EventGate only accepts events via POST, so we'll at least need to do that. I'll also either need to make a JSONSchema for whatever sentry sends, or make an eventgate implementation (requires some minimal coding) that doesn't do any event validation.

If we eventually do use this, I'd like for us to make Mediawiki data fields closer to the event schema conventions we've been using for EventBus events. E.g. page_title and user_groups instead of pageName and userGroups. That doesn't have to happen for beta prototype :)

EventGate only accepts events via POST, so we'll at least need to do that.

At least in the old client version that's currently on Beta that seems trivial to do, just override globalOptions.shouldSendCallback to do the reporting and cancel the default reporting. Maybe @Milimetric is interested in picking that up as a first task?

If we eventually do use this, I'd like for us to make Mediawiki data fields closer to the event schema conventions we've been using for EventBus events. E.g. page_title and user_groups instead of pageName and userGroups.

Also trivial, it's handled here.

At the last meeting one of the topics has been the Sentry js client, and its potential impact on performance even though we'd be loading it on demand (i.e. on errors). At this stage we're not close to deploying to production but looping in Performance-Team for an early heads up.

Ok, I've set up an eventgate-logging instance in beta at http://deployment-eventgate-1.deployment-prep.eqiad.wmflabs:8992/v1/events.

This instance only knows about a single schema:

title: client/error
description: Represents a MW User Blocks Change event
$id: /client/error/1.0.0
$schema: https://json-schema.org/draft-07/schema#
type: object
additionalProperties: true
properties:
  # global event fields
  $schema:
    type: string
    description: >
      The URI identifying the jsonschema for this event. This may be just
      a short uri containing only the name and revision at the end of the
      URI path.  e.g. /schema_name/12345 is acceptable. This often will
      (and should) match the schema's $id field.
  ### Meta data object.  All events schemas should have this.
  meta:
    type: object
    properties:
      uri:
        type: string
        format: uri-reference
        maxLength: 8192
        description: the unique URI identifying the event / resource
      request_id:
        type: string
        description: The unique ID of the request that caused the event.
      id:
        type: string
        pattern: '^[a-fA-F0-9]{8}(-[a-fA-F0-9]{4}){3}-[a-fA-F0-9]{12}$'
        maxLength: 36
        description: the unique ID of this event; should match the dt field
      dt:
        type: string
        format: date-time
        maxLength: 26
        description: the time stamp of the event, in ISO8601 format
      domain:
        type: string
        description: the domain the event pertains to
        minLength: 1
      stream:
        type: string
        description: The name of the stream/queue that this event belongs in.
        minLength: 1
    required:
      - stream

required:
  - $schema
  - meta

additionalProperties: true means that anything really will be allowed through, but you must at the very least set: "$schema": "/client/error/1.0.0" and "meta": {"stream": "client.error"}}

Here's the example event I used:

{"$schema": "/client/error/1.0.0", "meta": {"stream": "client.error"}, "message": "bad error"}

These events will be produced to the kafka main cluster at e.g. deployment-kafka-main-1.deployment-prep.eqiad.wmnet:9092 topic name 'eqiad.client.error'.

Current Sentry events look like this:

1{
2 "project": "4",
3 "logger": "javascript",
4 "platform": "javascript",
5 "request": {
6 "url": "https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page",
7 "headers": {
8 "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"
9 }
10 },
11 "exception": {
12 "type": "TypeError",
13 "value": "Cannot read property 'getParamValue' of undefined"
14 },
15 "stacktrace": {
16 "frames": [
17 {
18 "filename": "https://en.wikipedia.beta.wmflabs.org/w/load.php?lang=en&modules=startup&only=scripts&skin=vector",
19 "lineno": 7,
20 "colno": 824,
21 "function": "doPropagation",
22 "in_app": true
23 },
24 {
25 "filename": "https://en.wikipedia.beta.wmflabs.org/w/load.php?lang=en&modules=startup&only=scripts&skin=vector",
26 "lineno": 15,
27 "colno": 376,
28 "function": "execute",
29 "in_app": true
30 },
31 {
32 "filename": "https://en.wikipedia.beta.wmflabs.org/w/load.php?lang=en&modules=startup&only=scripts&skin=vector",
33 "lineno": 14,
34 "colno": 3,
35 "function": "runScript",
36 "in_app": true
37 }
38 ]
39 },
40 "culprit": "https://en.wikipedia.beta.wmflabs.org/w/load.php?lang=en&modules=startup&only=scripts&skin=vector",
41 "message": "Cannot read property 'getParamValue' of undefined",
42 "tags": {
43 "version": "1.34.0-alpha",
44 "debug": false,
45 "skin": "vector",
46 "action": "view",
47 "ns": 0,
48 "pageName": "Main_Page",
49 "userGroups": [
50 "abusefilter",
51 "bureaucrat",
52 "oversight",
53 "sysop",
54 "*",
55 "user",
56 "autoconfirmed"
57 ],
58 "language": "en",
59 "source": "module-execute",
60 "module": "ext.centralNotice.startUp"
61 },
62 "event_id": "8bfb9c90a00749c2a3c09a4814f2ce9a"
63}

I guess $schema and meta need to be added; is there any other pre-processing needed?

We should eventually review the custom fields we set, but I believe it would POST just fine with the addition of $schema and meta.stream.

@Milimetric I just set up a webproxy so this can be POSTed to publicly: http://eventgate-logging.wmflabs.org/v1/events

Change 510606 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[mediawiki/extensions/Sentry@master] [WIP] Point beta Sentry client at EventGate

https://gerrit.wikimedia.org/r/510606

Sentry extension change ready to review ^

Change 510606 merged by jenkins-bot:
[mediawiki/extensions/Sentry@master] Allow Sentry client to POST events to EventGate

https://gerrit.wikimedia.org/r/510606

Change 519243 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/mediawiki-config@master] Enable sending JS errors to EventGate

https://gerrit.wikimedia.org/r/519243

Change 519243 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable sending JS errors to EventGate

https://gerrit.wikimedia.org/r/519243

Mentioned in SAL (#wikimedia-operations) [2019-06-26T16:14:43Z] <urbanecm@deploy1001> Synchronized wmf-config/CommonSettings.php: [[:gerrit:519243|Enable sending JS errors to EventGate]] (T217142) (duration: 00m 55s)

Change 519249 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/mediawiki-config@master] Fix $wgSentryEventGateUri

https://gerrit.wikimedia.org/r/519249

Change 519249 merged by jenkins-bot:
[operations/mediawiki-config@master] Fix $wgSentryEventGateUri

https://gerrit.wikimedia.org/r/519249

Mentioned in SAL (#wikimedia-operations) [2019-06-26T16:42:41Z] <urbanecm@deploy1001> Synchronized wmf-config/CommonSettings.php: [[:gerrit:519249|Fix $wgSentryEventGateUri]] (T217142) (duration: 09m 52s)

Tested with https://en.wikipedia.beta.wmflabs.org/wiki/User:Tgr/common.js , the errors are sent to https://eventgate-logging.wmflabs.org/v1/events and result in a 201. I don't see anything in https://logstash-beta.wmflabs.org , maybe there's some processing delay?

Per IRC discussion, beta Logstash needs to be configured to ingest from the eqiad.client.error Kafka topic.

Change 519603 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: add consumer for client errors

https://gerrit.wikimedia.org/r/519603

With 519603 cherry-picked on beta puppetmaster and @Tgr's js I'm seeing the errors flowing into logstash now! https://logstash-beta.wmflabs.org/goto/63fbde5f0288c92e214d6c03e4dffe7a

Neat!

{
  "_index": "logstash-2019.06.28",
  "_type": "clienterror",
  "_id": "AWuePijpAbPxH_Ymd-b9",
  "_score": 1,
  "_source": {
    "exception": {
      "type": "Error",
      "value": "Unknown module: jquery.wikibase.sitelinkview"
    },
    "request": {
      "headers": {
        "User-Agent": "*SNIP*"
      },
      "url": "https://wikidata.beta.wmflabs.org/wiki/Q15905"
    },
    "$schema": "/client/error/1.0.0",
    "level": "ERROR",
    "logger": "javascript",
    "project": "4",
    "message": "Unknown module: jquery.wikibase.sitelinkview",
    "type": "clienterror",
    "platform": "javascript",
    "normalized_message": "Unknown module: jquery.wikibase.sitelinkview",
    "tags": [
      "es",
      "normalized_message_untrimmed"
    ],
    "culprit": "https://wikidata.beta.wmflabs.org/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector",
    "@timestamp": "2019-06-28T13:20:02.266Z",
    "stacktrace": {
      "frames": [
        {
          "in_app": "true",
          "filename": "https://wikidata.beta.wmflabs.org/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector",
          "lineno": "97",
          "colno": "386",
          "function": "?"
        },
        {
          "in_app": "true",
          "filename": "https://wikidata.beta.wmflabs.org/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector",
          "lineno": "97",
          "colno": "190",
          "function": "?"
        },
        {
          "in_app": "true",
          "filename": "https://wikidata.beta.wmflabs.org/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector",
          "lineno": "21",
          "colno": "126",
          "function": "load"
        },
        {
          "in_app": "true",
          "filename": "https://wikidata.beta.wmflabs.org/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector",
          "lineno": "9",
          "colno": "656",
          "function": "resolveStubbornly"
        },
        {
          "in_app": "true",
          "filename": "https://wikidata.beta.wmflabs.org/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector",
          "lineno": "9",
          "colno": "306",
          "function": "sortDependencies"
        },
        {
          "in_app": "true",
          "filename": "https://wikidata.beta.wmflabs.org/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector",
          "lineno": "8",
          "colno": "792",
          "function": "sortDependencies"
        }
      ]
    },
    "meta": {
      "stream": "client.error"
    },
    "original_tags": [
      [
        "debug",
        "false"
      ],
      [
        "ns",
        "0"
      ],
      [
        "page_name",
        "Q15905"
      ],
      [
        "skin",
        "vector"
      ],
      [
        "action",
        "view"
      ],
      [
        "language",
        "en"
      ],
      [
        "source",
        "resolve"
      ],
      [
        "version",
        "1.34.0-alpha"
      ],
      [
        "user_groups",
        [
          "*"
        ]
      ],
      "input-clienterror-eqiad",
      "kafka",
      "truncated_by_filter_truncate"
    ],
    "@version": "1"
  },
  "fields": {
    "@timestamp": [
      1561728002266
    ]
  }
}

Awesome! I think this particular report is captured in T226321.

Dashboard: https://logstash-beta.wmflabs.org/app/kibana#/dashboard/AWukuPDnAbPxH_YmoOuJ

  • tags (which contain lots of useful information such as MediaWiki version, language, skin...) are handled extremely poorly by kibana in their current format. They should be converted into top-level properties (preferably with field names matching those in PHP errors so that Kibana visualisations / saved searches can be reused). Not sure what's the right place for that: client, EventGate, logstash filter? (The last one has its own domain language so it would be the most painful to maintain.)
  • the huge ResourceLoader URLs make stack traces visually noisy, and the culprit (name of script where the error occurred) useless. That will be fixed by source maps in some far future date but for now it would be nice to have functionname-only stack traces, and to display the function name from the top stack frame as the culprit. I see no way to do that in kibana (via the GUI at least) so again that would require some preprocessing.

On the plus side, the aggregate report for error messages is very useful, plus we have stack traces, even if they are a bit ugly.

Not sure what's the right place for that: client, EventGate, logstash filter?

I think client is the right place. We need to bikeshed the actual error schema we will use. EventGate will only validate that the incoming events conform to the schema.

  • tags (which contain lots of useful information such as MediaWiki version, language, skin...) are handled extremely poorly by kibana in their current format. They should be converted into top-level properties (preferably with field names matching those in PHP errors so that Kibana visualisations / saved searches can be reused). Not sure what's the right place for that: client, EventGate, logstash filter? (The last one has its own domain language so it would be the most painful to maintain.)

To be fair, the DSL for Logstash filters is simple (https://www.elastic.co/guide/en/logstash/current/plugins-filters-mutate.html#plugins-filters-mutate-rename; see also https://gerrit.wikimedia.org/g/operations/puppet/+/a7c77e6048f956bf8abe575ee05498e3eafa56fe/modules/profile/files/logstash/filter-eventlogging.conf for an in-production example). However, I think that having the client send as simple and immediately useful a message as possible should be a goal – assuming adequate validation using EventGate.

I think that having the client send as simple and immediately useful a message as possible should be a goal

Indeed, as we will likely also import these events into Hadoop/Hive for longer term querying and analysis. The fewer transformations we have to do, the better.

Change 522924 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/Sentry@master] Make EventGate message format more RL- and ELK-friendly

https://gerrit.wikimedia.org/r/522924

Change 522924 merged by Milimetric:
[mediawiki/extensions/Sentry@master] Make EventGate message format more RL- and ELK-friendly

https://gerrit.wikimedia.org/r/522924

Change 519603 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: add consumer for client errors

https://gerrit.wikimedia.org/r/519603

phuedx renamed this task from [WIP] [Proposal] Use the Kafka-Logstash logging infrastructure to log client-side errors to [Proposal] Use the Kafka-Logstash logging infrastructure to log client-side errors.May 6 2020, 12:28 PM

This is no longer a Work In Progress. Thanks to the concerted efforts of @fgiunchedi, @jlinehan, @Ottomata, and @Krinkle along with support from many others, the MVP of this is Done™.

FYI, the error logging is still only deployed to testwiki and hawiki.
Further rollouts to be handled by PI! :)