Page MenuHomePhabricator

Do log errors with no url or file_uri instead of dropped from client side error instrumentation
Closed, ResolvedPublic

Description

Originally reported by @Krinkle

[…]

  • There seems to be something fundamentally broken about the pipeline since the overall client-error traffic, even after filtering out logged-in traffic, seems to be heavily biased toward Firefox. More than 95% in fact. Barring a miracle growth in market share in the past 24 hours, or something special about Commons' audience in pariticular, I would guess there is a bug in the instrumentation that is causing the vast majority of (real) errors to dropped by our instrumentation and never reported.
    • I suspect the file_url check in the instrumentation is at fault. This was added to filter out a bit of noise from cross-site script imports of which the URL is hidden by the browser for security reasons, but I suspect it is instead causing > 90% of genuine errors to be dropped as well because cached script run from localStorage and thus don't have an HTTP-based URL. Due to performance issues in Firefox, we have disabled use of localStorage in Firefox, which would explain why the errors we do get are mostly from the <5% of users using Firefox, or from edge cases where scripts couldn't be cached.

Errors that have no file_url or url are dropped by our client side error instrumentation.
For modules in Chrome, certain errors (currently only reproducable for ResourceLoaderWikiModule modules) are unexpectedly being triggered without this information so are correctly being dropped. We would like these errors to not be dropped but for that we need to ensure they carry the required information.

Replication steps

Put the following code inside MediaWiki:Common.js

window.error = function () {
console.log('window.error', arguments);
}
mw.trackSubscribe( 'global.error', function ( s, yp ) { 
console.log('s', yp);
}) 
    setTimeout(function () {
      mw.undefinedExample();
    }, 1000);

and visit a page in the Chrome browser.

The error will be reported with an empty url, no file_url and anonymized stack trace.

The issue does not seem to impact issues inside the code itself. When I put the same code in ~/git/core/skins/Vector/resources/skins.vector.js/skin.js the file_url and url are both present.

Error thrown by module created from MediaWiki:Common.js

{errorMessage: "Uncaught TypeError: mw.undefinedExample is not a function", url: "", lineNumber: 3, columnNumber: 26, stackTrace: "at <anonymous>:3:26↵", …}columnNumber: 26errorMessage: "Uncaught TypeError: mw.undefinedExample is not a function"errorObject: TypeError: mw.undefinedExample is not a function
    at <anonymous>:3:26lineNumber: 3stackTrace: "at <anonymous>:3:26↵"url: ""__proto__: Objectconstructor: ƒ Object()hasOwnProperty: ƒ hasOwnProperty()isPrototypeOf: ƒ isPrototypeOf()propertyIsEnumerable: ƒ propertyIsEnumerable()toLocaleString: ƒ toLocaleString()toString: ƒ toString()valueOf: ƒ valueOf()__defineGetter__: ƒ __defineGetter__()__defineSetter__: ƒ __defineSetter__()__lookupGetter__: ƒ __lookupGetter__()__lookupSetter__: ƒ __lookupSetter__()get __proto__: ƒ __proto__()set __proto__: ƒ __proto__()

Error thrown by Vector script

{errorMessage: "Uncaught TypeError: mw.undefinedExample is not a function", url: "http://localhost:8888/w/load.php?lang=en&modules=j…Csite%7Cskins.vector.js&skin=vector&version=n8enw", lineNumber: 155, columnNumber: 1579, stackTrace: "at http://localhost:8888/w/load.php?lang=en&module…ins.vector.js&skin=vector&version=n8enw:155:1579↵", …}
columnNumber: 1579
errorMessage: "Uncaught TypeError: mw.undefinedExample is not a function"
errorObject: TypeError: mw.undefinedExample is not a function at http://localhost:8888/w/load.php?lang=en&modules=jquery%2Csite%7Cskins.vector.js&skin=vector&version=n8enw:155:1579
lineNumber: 155
stackTrace: "at http://localhost:8888/w/load.php?lang=en&modules=jquery%2Csite%7Cskins.vector.js&skin=vector&version=n8enw:155:1579↵"
url: "http://localhost:8888/w/load.php?lang=en&modules=jquery%2Csite%7Cskins.vector.js&skin=vector&version=n8enw"

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Absence of a URI doesn't have anything to do with wiki modules specifically. It applies to all modules once they are cached. If anything, wiki modules are cached less often due to them sometimes being user-specific or uncacheable for other reasons. I believe this is why we are seeing a disproportionate amount of errors from gadgets (and much fewer from our own deployed code), exactly because this issue is not affecting wiki modules such as gadgets as much.

While local storage execution has no URL (browser UIs sometimes call this "VM:123" or "(anonymous)" but it seems file_uri remains empty since these aren't valid URIs), it does have a stack trace.

I think it would make sense to simply remove this check, and instead check only for absence of a stack trace, which I suspect was the intent of the check intially - on the assumption that file_uri would be the first uri in any multi-line stack trace. However, we now know that this is incorrect.

Krinkle renamed this task from ResourceLoaderWikiModules create errors with no url or file_uri meaning they are dropped from client side error instrumentation to Do log errors with no url or file_uri instead of dropped from client side error instrumentation.Nov 2 2020, 9:29 PM
Krinkle removed Krinkle as the assignee of this task.

leaving aside the known issue that almost all client errors are currently being dropped due to the faulty file_uri check.

That's not correct. As I've pointed out in my investigations in T266517 it looks like we're only dropping client errors in Chrome from ResourceLoaderWiki modules […] I encourage you to do your own investigations here. I am sure you will reach the same conclusion.

Some evidence that goes against your theory:

I already did. I wouldn't have said this based on a theory.

  • Right now our highest volume error is in TimedMediaHandler (T262947) […] That was the highest error before we added the file_uri check and continued to be the highest after and wasn't impacted by the change to WikimediaEvents.

When following T262947's task description steps in a fresh Chrome browser with no previous Wikipedia page views, we find an error with a long load.php URL for many freshly fetched modules. When reproducing it again after a refresh, or more generallly on non-initial page views, we still find a load.php URL associated with the error, albeit a much shorter one. This time it fetched only jquery.ui and Kaltura. It having such a URL means it didn't load from localStorage cache.

This is because TMH uses jquery.ui, which is larger than our 100 kB threshold for modules cached in localStorage. As such, they always load from a URL. (From the third page view on, these will enjoy fragmented client-side HTTP-based caching instead).

  • If we consider all errors, in a 24hr period 11,771 of our errors come from Chrome 86 and 5,241 from Firefox 82. If I filter by the (albeit flawed) gadgets DSL filter, the Firefox to Chrome ratio changes drastically. (you must remember this has been my obsession for the last month and I've been logging at logstash daily and I would have noticed if we were dropping high profile errors with the file_uri change)

OK. Let's take a closer look. Upon loading the mw-client-errors Logstash dashboard (without any filters enabled) the top five user agents are:

#CountUser-Agent
*50,600(all, past 24 hours)
18,300Windows 10, Firefox 82.0
27,200Windows 10 Chrome 86.0.4240.111
33,800Windows 10 Chrome 86.0.4240.183
41,800macOS 10.15.7, Safari 14.0
51,800Windows 10, Edge 86.0.622.61

Firefox being more common than any of these major browsers, whilst saddening to my heart, should be suspicious no matter what. Yes, combining the two Chrome patch releases would, just about, give Chrome the top position as most common UA. Having latest Firefox be 80% of Chrome's share is still highly suspect though, and that's before we even filter out unrelated noise from Chrome extensions.

When focussing on the top Chrome version only, I find from spot-checking the top 10 most common error messages that the majority falls in one of two categories:

  1. Having a URL like chrome-extension://, which are not related to this task or indeed our sites in general.
  2. Having a message like Error: module already implemented, which, as all JS stacks, have their error URL set to where the error is thrown from, not where the stack started. This means these errors have their URL is attributed to load.php?module=startup, the only unfragmented always-cachable load.php URL, which bootstraps the client and thus can't come from localStorage. If we look at any of the individual samples, the stack trace is generally missing the URLs higher up the stack, because the code came from localStorage, attributing it to "", "anonymous", "VM", or (article URL) depending on which browser and how it was loaded. Except for those loaded by users by raw URL (as opposed to by normally loading a gadget module), since that bypasses caching.

All others are from page views where the faulty code was either loaded for the first time on that page and thus not yet cached, or from modules like TMH that are too big for localStorage.

I think it would make sense to simply remove this check

I don't doubt we are dropping events, but right now I am not convinced that we are dropping them significantly. Flooding logstash with errors that cannot be easily debugged due to anonymous stack traces and absent file uris would only give us inflated numbers and more noise to contend with, not alert us to new unknown errors. As the main person championing this it would also cause me unnecessary work.

When the change ( (I9a5fc15fe0aa93a856ab0d835882300f00984d1b ) went live on 19th August, we were still at low volume < 1000 errors a day, adding new wikis and fixing various gadgets as well as hot fixing production code, so it's hard to know exactly how many errors are not being unreported but I should note that deploy didn't drastically drop our numbers:

Screen Shot 2020-11-02 at 1.56.30 PM.png (440×2 px, 76 KB)

It's possible we are underreporting the number of errors, but that's okay. Once we have deployed to English Wikipedia, fixed more of our existing bugs, and are seeing a reasonable volume of errors with lots of wiggle room I am happy for us to explore modifying or removing the check, but right now this is keeping triaging manageable.

Here's an easy way to demonstrate the issue.

  1. In Chrome, load https://en.m.wikipedia.org/wiki/Main_Page, focus the search bar, and type an "x". (This makes us fetch all relevant code from the server)
  2. Refresh the page.
  3. Run the following in the console:
mw.trackSubscribe('', console.log.bind(console)); // To see global.error entries
mw.config.values = {}; // This will make MobileFrontend search code crash, unfairly
  1. Focus the search bar and type an "x". (This time we're loading code from local cache)

Observe the empty url and thus dropped event:

global.error 
{
  "errorMessage": "Uncaught TypeError: Cannot read property 'prefix' of null",
  "url": "",
  "lineNumber": 168,
  "columnNumber": 306,
  "stackTrace": "at a.getApiData  <anonymous>:168:306\nat a.search  <anonymous>:169:272\nat <anonymous>:192:482\n",
  "errorObject": {}
}

capture.png (1×1 px, 176 KB)

This is applicable to all regular production source code.

I think it would make sense to simply remove this check

I don't doubt we are dropping events, but right now I am not convinced that we are dropping them significantly.

I hope the above removes any remaining doubt.

Flooding logstash with errors that cannot be easily debugged due to anonymous stack traces and absent file uris would only give us inflated numbers and more noise to contend with, not alert us to new unknown errors. […]

Imho file urls aren't particularly relevant or useful when debugging JavaScript. The stacktrace is all you need imho, which we get either way. Once we have source maps working end-to-end, we can consider compressing variable names for a tiny perfwin, but until then our names are generally more than unique enough to figure things out.

Besides, on the URLs we do get today, it only tells you it was one of 100+ mostly unrelated modules.

I don't see how this could only inflate known errors. Afaik there is no meaningful correlation between loading code from cache and exposing an error. If anything, I imagine it's quite common for most code to first be loaded and not fully excercise all code paths on the first view in that feature's interaction. Pretty much all errors that happen after the first view or use of a feature, we are completely missing out on.

Right now this is blocking us from expanding our pipeline to load errors from Chrome browser that occur on start up (T266517)

If I understand correctly, you are suggesting that the error load we plan for is limited to the ideal scenario where we don't have significant errors. That sounds dangerous to me. If it is true that we couldn't handle this error happening and while opening up Chrome, then we must not under any circumstances open up Chrome. Whether we fix the Graph bug must not factor into infrastructure capacity, especially infrastructure meant to handle unexpected errors.

Given we have enabled to major wikis already, I assume we have somewhere documented an answer to these two questions or questions similar to it:

  • What is the order of magnitude influx we can expect if a majority of Grade A page loads encounters one or two errors? (including which wikis the estimate is based on)
  • What is the order of magnitude influx EventGate, Kafka, and Logstash have agreed to support for this feature?

Can you point to these and conclude whether we are currently over capacity? (if it weren't for the Chrome bug). Based on that we need to e.g. reduce wikis, introduce sampling, or apply other measures to limit the maximum amount of influx when unexpected errors happen. Or, if it turns out that the planned capacity already accounts for this and the current rate is orders of magnitudes below it (even with the Graph bug), then it's fine to go ahead.

This is not about eventgate capacity, this is about my own capacity :)

Change 634295 had a related patch set uploaded (by Jdlrobson; owner: Jason Linehan):
[mediawiki/extensions/WikimediaEvents@master] clientError: Adds new fields and updates behavior.

https://gerrit.wikimedia.org/r/634295

We're now in a good place to do this, now that we've removed the user generated content based Graph errors and have a good handle on all our existing bugs.

When allowing these into the pipeline, it's important we can debug them via other means than the file url and filter them if necessary so I've folded this into the pending schema changes.

Change 668141 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/WikimediaEvents@master] Revise file url logging rules

https://gerrit.wikimedia.org/r/668141

Change 668141 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Revise file url logging rules

https://gerrit.wikimedia.org/r/668141

So this is live now, but not really useful. It seems these additional errors are being anomyized and send as "Script error" without a stack trace so they are totally useless for debugging purposes.

On group 1 I'm seeing 350 errors in the last hour without a file uri, all from Chrome, and not a single one of them is actionable as none of them have useful stack traces. In total there are 516 errors in the last hour. 311 of them are the error "Script error."

So my hypothesise in T266517#6604763 is currently holding true:
We were underreporting errors at the trade-off of actionable errors.

You can filter these with file_url=undefined.

Screen Shot 2021-03-10 at 3.24.49 PM.png (658×2 px, 162 KB)

What was your goal here @Krinkle ? I ignore script errors in triaging, so this only impacts the overall error count. It would be useful to understand why code in Chrome loaded from localStorage is anonymized as "Script error" and fix that if possible, but we should resolve this ticket and probably create a new bug for that.

This has led to a huge spike in "Script error"

Screen Shot 2021-03-11 at 1.06.38 PM.png (926×2 px, 135 KB)

Screen Shot 2021-03-11 at 1.06.50 PM.png (846×2 px, 142 KB)

but only a tiny spike in other errors

Screen Shot 2021-03-11 at 1.07.32 PM.png (374×2 px, 70 KB)

I recommend we now either explicitly filter out these inactionable errors or restore the support for filtering out errors where file uri is undefined.

I should note the other errors are pretty useless to us. All of them have stack traces such as "at <anonymous>:1:60" so without the file uri there's little we can do about them.

Change 670945 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/WikimediaEvents@master] Do not log script errors without file uri

https://gerrit.wikimedia.org/r/670945

Change 670879 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/WikimediaEvents@wmf/1.36.0-wmf.34] Do not log script errors without file uri

https://gerrit.wikimedia.org/r/670879

Change 670945 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Do not log script errors without file uri

https://gerrit.wikimedia.org/r/670945

Jdlrobson claimed this task.

The problem statement given in the title is resolved, however I don't think it's resolved in the way that was originally intended. My guess is that code executed from localStorage is somehow being anonymized as "Script error." and we should do some further investigation in ResourceLoader to understand why and avoid it [1] .

[1] https://www.sitepoint.com/what-the-heck-does-script-error-mean/

Change 670879 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@wmf/1.36.0-wmf.34] Do not log script errors without file uri

https://gerrit.wikimedia.org/r/670879

Mentioned in SAL (#wikimedia-operations) [2021-03-11T22:34:18Z] <brennen@deploy1002> Synchronized php-1.36.0-wmf.34/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: Backport: [[gerrit:670879|Do not log script errors without file uri (T266517)]] (duration: 01m 07s)

Mentioned in SAL (#wikimedia-operations) [2021-03-11T22:36:41Z] <brennen> train status: 1.36.0-wmf.34 (T274938): T277229 and T266517 related issues hopefully resolved, rolling forward to all wikis

[…]

So my hypothesise in T266517#6604763 is currently holding true:
We were underreporting errors at the trade-off of actionable errors.

Were you able to reproduce these scripts errors yourself by triggering an error from a cached script? Did we not already confirm weeks ago that errors from scripts executed after local storage cache have a useful message and trace? They only lack a file becuase they don't come from a networked URL.

I may be wrong here, but I don't think the problems introduced by the above commits have anything to do with the task at hand.

The hypothesis I think is obviously false as indicated by T266517#6604780. The current instrumentation is only logging errors triggered by users in the same page load as when those scripts were downloaded. Not only does that drastically reduce the probability of catching in general (since most script execution happens from a cache, thankfully), it also makes it impossible to catch erorrs from interactions that by design can only happen after a past page load that has downloaded the code in question.

Re-opening since it's not solved. (Feel free to re-prioritise.)

Krinkle lowered the priority of this task from Unbreak Now! to Medium.Mar 12 2021, 2:01 AM

i'll setup a chat next week so we can get on the same page here and will create a new ticket with what we discuss. At this point this ticket is far too confusing to follow, in particular the title "Do log errors with no url or file_uri instead of dropped from client side error instrumentation". We are logging errors with no file uri.

Change 670996 had a related patch set uploaded (by Krinkle; owner: Jdlrobson):
[mediawiki/extensions/WikimediaEvents@master] clientError: Revert back to ignoring errors with undefined file_url

https://gerrit.wikimedia.org/r/670996

Change 670996 abandoned by Jdlrobson:
[mediawiki/extensions/WikimediaEvents@master] clientError: Revert back to ignoring errors with undefined file_url

Reason:
i debugged this morning with Sam. 1.36.0-wmf.34 is in a weird state, master code is fine.

https://gerrit.wikimedia.org/r/670996