Page MenuHomePhabricator

Develop small client-side error logger for MediaWiki
Closed, ResolvedPublic

Description

@Milimetric and @jlinehan will pair to come up with a client that:

  • is small enough to be accepted for deployment on every page load
  • is robust and maintainable in the way a third party library like Sentry would be
  • conforms to a schema approved by WMF for sending client errors

@Krinkle should chime in with his latest thoughts on performance
@phuedx should hang out with us and opine on the interface and what devs would like to see from this client

Related Objects

Event Timeline

fdans triaged this task as Medium priority.Oct 10 2019, 4:55 PM
fdans moved this task from Incoming to Event Platform on the Analytics board.

What is the timeline for this development, in which quarter is the perf review needed?

What is the timeline for this development, in which quarter is the perf review needed?

Currently we're aiming for having a client in production on one wiki by the end of the quarter. The wiki can be low-traffic, but some kind of performance review would probably need to happen this quarter.

To be perfectly honest i doubt we can do this work by EOQ, we can probably start some of this work this quarter but not much else.

I think an attainable goal is to have an client side error library ready in vagrant for @Krinkle to CR by end of quarter cc @Milimetric @Ottomata

Nuria renamed this task from Prototype client to log errors to Prototype client to log errors in vagrant.Oct 21 2019, 2:17 AM

Prototype error producer: https://github.com/linehan/wmf-error-mvp

Created a prototype client MVP by reducing Sentry down to its bare essentials. It turns out that if you do that, there is really not much left (still about ~2.1Kb minified, but it can probably be reduced further).

  1. A callback replacing the default window.onerror function.
  2. A function that uses some regex to re-format the browser-specific Error.stack or Error.stacktrace strings into an array of JSON objects (one for each stack frame) which can then be re-stringified.

That's really it. Given this, plus the fact that we want to be as low-footprint as possible, I don't know whether it makes sense for us to add an external dependency by maintaining a fork of Sentry, along with our own build target in the fork, etc. etc., for a bunch (seriously... a ton) of features and ridiculously complicated plumbing that we will never use. The Sentry code, despite what appear to be good intentions, is genuine spaghetti. OTOH, any engineer can understand what the MVP code is doing in a couple of minutes.

The main reason to track upstream is because stacktraces are not standardized and might change. That's still true, but

  1. Changes will cause the software to fail into a state where we simply don't receive any errors, which is what we currently have.
  2. We can still track upstream manually -- the change will only amount to adding a regex rule.
  3. Until this gets on a standards track, the browser vendors are unlikely to change -- they'd have to re-write their browser code, too, which uses these.

I don't think we need to get much fancier than the MVP, either. Right now we have nothing for error logging, so let's get a +1 into production, rather than trying to plop a +10 in all at once. If we're happy with our +1, then we can think about how we want to iterate.

@Milimetric take a look and we can talk architecture etc. and see what we want to do.

Prototype error producer: https://github.com/linehan/wmf-error-mvp

Nice, agreeing 100% that we should just use this code rather than trying to add a new dependency,

Thinking about the MVP it should be possible (and this is just an idea) to add this code to WikimediaEvents and report errors per browser via mw.track as counters to graphite. The final objective is to report this via eventgate public instance cause until we do that we would not have traces, just counts. If we can get this code CR-ed in mediawiki codebases (in WikimediaEvemts or other extension) by end of quarter I think we will be doing great.

mmm.. seeing this T236386: Set up eventgate-logging-external in production looks like we are readay to start thinking about deploying eventgate main for errors. In that case it seems that the next step that needs to happen here is @Milimetric and @jlinehan agreeing on an error schema

In that case it seems that the next step that needs to happen here is @Milimetric and @jlinehan agreeing on an error schema

Yep, submitted a new patchset for review https://gerrit.wikimedia.org/r/536792, so coming right along.

client/error schema has been merged.

eventgate-logging.wmflabs.org has been updated to used this schema. There's no stream config schema enforcement in beta, so you can submit an event that validates with this schema with any meta.stream set. It is producing to Kafka at deployment-logstash2.deployment-prep.eqiad.wmflabs:9092.

you can submit an event that validates with this schema with any meta.stream set. It is producing to Kafka at deployment-logstash2.deployment-prep.eqiad.wmflabs:9092.

Now working. Was able to use the MVP to produce an actual event with stacktrace etc. to the remote, and verified its receipt in Kafka.

Here is the test setup:

function three() { 
        throw new Error("petaQ!");    
}   

function two() { 
        three();                                                                                                
}                                                                                                               

function one() {       
        two();                                                                                                 
}                                                                                                               

 window.onload = function() {
         document.getElementById("clicker").addEventListener("click", function(e) {
                 one(); 
         }); 
}

And here is what the event looks like (taken directly from kafkacat and pretty-printed):

{
  "meta": {
    "id": "123e4567-e89b-12d3-a456-426655440000",
    "dt": "2019-10-31T14:20:25.889Z",
    "stream": "error_test"
  },
  "$schema": "/client/error/1.0.0",
  "type": "[object Error]",
  "message": "petaQ!",
  "url": "http://olm.ec/misc/tracekit/wmf-error-mvp/test/",
  "user_agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
  "stack_trace": "[{\"url\":\"http://olm.ec/misc/tracekit/wmf-error-mvp/test/T_T.test.js\",\"func\":\"three\",\"args\":[],\"line\":3,\"column\":15},{\"url\":\"http://olm.ec/misc/tracekit/wmf-error-mvp/test/T_T.test.js\",\"func\":\"two\",\"args\":[],\"line\":8,\"column\":9},{\"url\":\"http://olm.ec/misc/tracekit/wmf-error-mvp/test/T_T.test.js\",\"func\":\"one\",\"args\":[],\"line\":13,\"column\":9},{\"url\":\"http://olm.ec/misc/tracekit/wmf-error-mvp/test/T_T.test.js\",\"func\":\"HTMLButtonElement.<anonymous>\",\"args\":[],\"line\":21,\"column\":17}]",
  "tags": {
    "user_language": "klingon",
    "user_forehead": "rippled"
  }
}

The stack trace is pretty ugly, but I guess that's true of most stack traces. Rather than just serialize the intermediate JSON representation, we could also opt to re-format it client-side into a single string format, e.g.

{
  "meta": {
    "id": "123e4567-e89b-12d3-a456-426655440000",
    "dt": "2019-10-31T14:41:16.629Z",
    "stream": "error_test"
  },
  "$schema": "/client/error/1.0.0",
  "type": "[object Error]",
  "message": "petaQ!",
  "url": "http://olm.ec/misc/tracekit/wmf-error-mvp/test/",
  "user_agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
  "stack_trace": "at three (http://olm.ec/misc/tracekit/wmf-error-mvp/test/T_T.test.js:3:15), at two (http://olm.ec/misc/tracekit/wmf-error-mvp/test/T_T.test.js:8:9), at one (http://olm.ec/misc/tracekit/wmf-error-mvp/test/T_T.test.js:13:9), at HTMLButtonElement.<anonymous> (http://olm.ec/misc/tracekit/wmf-error-mvp/test/T_T.test.js:21:17), ",
  "tags": {
    "user_language": "klingon",
    "user_forehead": "rippled"
  }
}

The above example uses Chrome's stacktrace format. I'll leave it up to the folks who are going to be reading these regularly what their preferences are, if any.

Was able to use the MVP to produce an actual event with stacktrace etc.

Did we decided where is the error client code going to live in mediawiki stack?

Did we decided where is the error client code going to live in mediawiki stack?

Not yet, it might be a good point to have a meeting with @fgiunchedi, @phuedx, and the rest of the error logging gang, and discuss how to go forward on that.

Let's include the perf team as well, this is client side code cc @Gilles and @Krinkle

Did we decided where is the error client code going to live in mediawiki stack?

Not yet, it might be a good point to have a meeting with @fgiunchedi, @phuedx, and the rest of the error logging gang, and discuss how to go forward on that.

Sounds good to me -- I can tomorrow or Wed < 13.00 UTC and I'm OOO after that until next week, whichever suits best!

Per our meeting we are going to submit a code patch for gerrit for this code to live in mw.core

And @Tgr gets to be our well versed MW developer that can code review first patches!

Will comment on architecture next, but just jotting down thoughts about code as I read it:

https://github.com/linehan/wmf-error-mvp/blob/master/src/T_T.js#L203 - we should output blank/unknown frames even when no regex matches, that way we know if we're missing something. Maybe if frame is null here, make frame.url something like 'guessed Opera but could not parse ' + line[i].

https://github.com/linehan/wmf-error-mvp/blob/master/src/T_T.js#L71 - yeah, I think we need a util for browser detection somewhere central anyway, right now multiple extensions do it for different reasons. It would save some regex calls, especially for firefox :), but that's negligible since we'll probably cap error sending per user agent. The main thing is less code to maintain (maybe store split / regex line parsing / frame-building functions in a config object that gets applied generally?)

I started out thinking Sentry was the gold standard and I now prefer the way you're going about it, Jason. Because even if we want all those weird corner cases it makes much more sense for us to build up to them as we need them than to not understand exactly what we're seeing and why. Also love that this is much easier to maintain and get through code review, and would bring value to core.

Will comment on architecture next, but just jotting down thoughts about code as I read it:

https://github.com/linehan/wmf-error-mvp/blob/master/src/T_T.js#L203 - we should output blank/unknown frames even when no regex matches

I think I agree, we want to get the event, even without the normalized stacktrace. If the stacktrace can't be normalized, it will be omitted, yeah? We have whole browses like KaiOS browser and UA browser that aren't captured by these regex.

https://github.com/linehan/wmf-error-mvp/blob/master/src/T_T.js#L71 - yeah, I think we need a util for browser detection somewhere central anyway

Like this idea but I guess it's probably out of scope for right now at least :)

The main thing is less code to maintain (maybe store split / regex line parsing / frame-building functions in a config object that gets applied generally?)

The thing that makes it annoying to factor as-written is that once you detect the browser, it's not just selecting the regex you want to use, but also the mapping from the regex to the frame object's fields. You can get clever with it, or give each its own function, it might be worth doing. Like, figure out what browser and then pass everything into the function and build the frames and return them. Only one extra call of overhead, and you only need to compute the branch the first time. So that's probably better. Well, adding more functions tends to harm compressibility, but tends to help comprehensibility :)

I do really want to watch out for compressibility though -- by which I mean during minification. If the code is explicit and clear when unminified, and also compresses to a small minified size, that's the best of both worlds imo. So, for example, changing the frame object into an array that uses just positional arguments, would save all the bytes used to indicate the fields. Stuff like that adds up. There's a right and a wrong way to do it though, I don't want it to be a mess.

Let's move this to gerrit and we can code review there.

ua regexes already exist on https://github.com/wikimedia/mediawiki/blob/2aed14b686/resources/src/startup/startup.js

Sounds good, I'll submit a patch today and flag everybody and we can go from there.

jlinehan raised the priority of this task from Medium to High.Nov 19 2019, 5:24 PM

Change 553376 had a related patch set uploaded (by Jason Linehan; owner: Jason Linehan):
[mediawiki/core@master] Adds module to instrument client runtime errors.

https://gerrit.wikimedia.org/r/553376

Rough first patchset submitted to get the conversation started. Very little is certain here since this is essentially the MVP transplanted into core with very few alterations, so bring your opinion. Hopefully everybody involved will be generous with their input and we can iterate through various patchsets and get to something good.

I am pinging @awight here for his comments as a long time mediawiki developer, adding him to CR

I am pinging @awight here for his comments as a long time mediawiki developer, adding him to CR

Thanks that's great, the more eyes the better. Right now this thing is ugly.

@fgiunchedi I'm curious about these stack traces. I think we may have covered this before, but in production the code is minified, right? So what parts of these stack traces are actually helpful to engineers? I guess i'm trying to understand the workflow.

The reason I ask is because right now, in my opinion, this module is pretty heavy for what it is doing, and I want to make sure we have a reason for everything it is doing. For example, if we were to provide a simplified stacktrace consisting only of the names of the functions, it would massively reduce the size and complexity of this module.

Maybe we don't need to go that far, but are things like line and column numbers really crucial? Is the URL of the script in which the error was thrown? The URL (or something like it) is being provided in a separate field. This kind of thing.

What is the critical/key information that developers are looking for when analyzing errors? Or is this primarily for use in a development setting where minification has not yet happened?

Or are we using the stacktraces to "fingerprint" the errors? Is that the idea?

@Nuria etc anybody else is welcome to chime in too.

Or is this primarily for use in a development setting where minification has not yet happened?

The primary usage I think will be production where it will be enabled for a few wikis and it will run on minimized code, thus the "nameOfFunction" might not help as much as the full stack trace .

Looks fun! I'm wondering whether there has been consideration of https://github.com/wikimedia/mediawiki/blob/master/resources/src/mediawiki.base/mediawiki.errorLogger.js yet, though? I think it might do everything we need…

@fgiunchedi I'm curious about these stack traces. I think we may have covered this before, but in production the code is minified, right? So what parts of these stack traces are actually helpful to engineers? I guess i'm trying to understand the workflow.

In my limited experience, stack traces are almost always valuable. The main drawback is that they're heavy, and for most errors you'll get thousands of the same thing, when one stack trace would have been enough. We might want to do some kind of sampling here?

Minified code can be mapped back to the full source using standardized source maps, I'm sure we already generate these.

The primary usage I think will be production where it will be enabled for a few wikis and it will run on minimized code, thus the "nameOfFunction" might not help as much as the full stack trace .

Yeah! Exactly, which is why I'm curious how this will be used. Will engineers see the stack trace and then visit the line,column location in the minified production code and try to decipher it?

Not all names will be replaced during minification, depending on how the encapsulation happens. Are we relying on this? What about the arguments to the functions?

Would it be helpful if I produced some samples of what a stack trace looks like with minified code?

Again, the reason is that every one of these fields we don't need will be a substantial savings to complexity and size of the module.

Looks fun! I'm wondering whether there has been consideration of https://github.com/wikimedia/mediawiki/blob/master/resources/src/mediawiki.base/mediawiki.errorLogger.js yet, though? I think it might do everything we need…

Good, great! @awight! @Nuria and I were just talking about the fact that we can't be the first code to override the onerror handler. I agree this looks like the perfect place. What about the stack trace normalization code. Should we look to move that into this module as well?

In my limited experience, stack traces are almost always valuable.

Absolutely, I'd never want to eliminate the entire thing, I just mean is there a strict subset of (script URL, line number, column number, function name, function arguments) that is just as useful to us in practice as the full set. If there were, then this code could get much simpler. If there is no such subset, that's fine but at least we asked.

Minified code can be mapped back to the full source using standardized source maps, I'm sure we already generate these.

Yeah I asked some folks about this (maybe not the right folks) and it seemed like we don't. If we have source maps (or plan to have them to use with these stack traces) then that's a whole different conversation.

Edit: turned 3 separate comments into 1 :)

We welcome any advice of how to integrate these two.

That would depend on how flexible we can be about the acceptance criterion, I'd think:

conforms to a schema approved by WMF for sending client errors

So far, I see the desired schema is quite granular, so maybe we should have a refinery step after raw data collection?

The existing logger emits into the mw.track error stream global.error, which seems to only be subscribed by a feature-specific sink that @EBernhardson is working on . That uses custom sampling logic rather than mw.eventLogging.eventInSample, so I'm not entirely sure, but it seems to be logging at 100%. Anyway, you could have a second (sampling?) subscriber to that stream, augment with more raw data, and event log or whatever is the ideal here.

it seems to be logging at 100%.

So it sounds like the volume is low enough to consume unsampled. Although you might still consider *only loading the module* for some small proportion of pageviews, to save some script load overhead.

So it sounds like the volume is low enough to consume unsampled.

Rather, we will be adding config and enabling it for just 1 wiki at first. Probably it will be enabled for a few wikis forever, never for the whole firehouse

@phuedx should hang out with us and opine on the interface and what devs would like to see from this client

I was OoO November through December. I'm happy to meet whenever you're free.

@phuedx please do chime in on CR, that would be best as we are aiming to merge this code next week.

Change 573338 had a related patch set uploaded (by Jason Linehan; owner: Jason Linehan):
[mediawiki/extensions/WikimediaEvents@master] Adds instrumentation for MediaWiki client runtime errors.

https://gerrit.wikimedia.org/r/573338

Change 573338 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Add instrumentation for MediaWiki client runtime errors

https://gerrit.wikimedia.org/r/573338

Change 573338 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Add instrumentation for MediaWiki client runtime errors

https://gerrit.wikimedia.org/r/573338

Adding as (next week's) train blocker becaue this landed before the core commit it depends on (https://gerrit.wikimedia.org/r/553376).

AH! Sorry I didn't quite realize I wasn't supposed to merge. We can revert until the dependency is merged, shall we?

Although, it isn't quite a blocker, is it? The code isn't active unless it is enabled via config.

Change 573589 had a related patch set uploaded (by Jason Linehan; owner: Jason Linehan):
[mediawiki/core@master] Adds cross-browser stack trace to error logging object.

https://gerrit.wikimedia.org/r/573589

Ah ha, this is the dependency^ let's go! (Trying to get all this merged for next week's train)

Change 553376 abandoned by Jason Linehan:
Adds module to instrument client runtime errors.

Reason:
Will be continued as a patch to mw.errorLogger and a patch to WikimediaEvents

https://gerrit.wikimedia.org/r/553376

Change 553376 restored by Krinkle:
Adds module to instrument client runtime errors.

Reason:
I'm uploading a rebased version here so as to keep the discussion and iteration for future reference, this can be useful when investigating this in the future and is the general practice for this.

https://gerrit.wikimedia.org/r/553376

Change 573589 abandoned by Krinkle:
Adds cross-browser stack trace to error logging object.

Reason:
Re-uploaded at I57cc46458a741 with the coding style issues amended. Test failure not yet addressed.

https://gerrit.wikimedia.org/r/573589

Change 553376 merged by jenkins-bot:
[mediawiki/core@master] mediawiki.base: Add cross-browser normal stack trace to error object

https://gerrit.wikimedia.org/r/553376

Thanks to @Krinkle and @Ottomata and @phuedx for all the timely code review, this is now merged and we can move on to how and where we enable it and start looking at results.

jlinehan renamed this task from Prototype client to log errors in vagrant to Develop small client-side error logger for MediaWiki.Feb 21 2020, 3:22 PM