Page MenuHomePhabricator

Do log errors with no url or file_uri instead of dropped from client side error instrumentation
Open, Needs TriagePublic

Description

Originally reported by @Krinkle

[…]

  • There seems to be something fundamentally broken about the pipeline since the overall client-error traffic, even after filtering out logged-in traffic, seems to be heavily biased toward Firefox. More than 95% in fact. Barring a miracle growth in market share in the past 24 hours, or something special about Commons' audience in pariticular, I would guess there is a bug in the instrumentation that is causing the vast majority of (real) errors to dropped by our instrumentation and never reported.
    • I suspect the file_url check in the instrumentation is at fault. This was added to filter out a bit of noise from cross-site script imports of which the URL is hidden by the browser for security reasons, but I suspect it is instead causing > 90% of genuine errors to be dropped as well because cached script run from localStorage and thus don't have an HTTP-based URL. Due to performance issues in Firefox, we have disabled use of localStorage in Firefox, which would explain why the errors we do get are mostly from the <5% of users using Firefox, or from edge cases where scripts couldn't be cached.

Errors that have no file_url or url are dropped by our client side error instrumentation.
For modules in Chrome, certain errors (currently only reproducable for ResourceLoaderWikiModule modules) are unexpectedly being triggered without this information so are correctly being dropped. We would like these errors to not be dropped but for that we need to ensure they carry the required information.

Replication steps

Put the following code inside MediaWiki:Common.js

window.error = function () {
console.log('window.error', arguments);
}
mw.trackSubscribe( 'global.error', function ( s, yp ) { 
console.log('s', yp);
}) 
    setTimeout(function () {
      mw.undefinedExample();
    }, 1000);

and visit a page in the Chrome browser.

The error will be reported with an empty url, no file_url and anonymized stack trace.

The issue does not seem to impact issues inside the code itself. When I put the same code in ~/git/core/skins/Vector/resources/skins.vector.js/skin.js the file_url and url are both present.

Error thrown by module created from MediaWiki:Common.js

{errorMessage: "Uncaught TypeError: mw.undefinedExample is not a function", url: "", lineNumber: 3, columnNumber: 26, stackTrace: "at <anonymous>:3:26↵", …}columnNumber: 26errorMessage: "Uncaught TypeError: mw.undefinedExample is not a function"errorObject: TypeError: mw.undefinedExample is not a function
    at <anonymous>:3:26lineNumber: 3stackTrace: "at <anonymous>:3:26↵"url: ""__proto__: Objectconstructor: ƒ Object()hasOwnProperty: ƒ hasOwnProperty()isPrototypeOf: ƒ isPrototypeOf()propertyIsEnumerable: ƒ propertyIsEnumerable()toLocaleString: ƒ toLocaleString()toString: ƒ toString()valueOf: ƒ valueOf()__defineGetter__: ƒ __defineGetter__()__defineSetter__: ƒ __defineSetter__()__lookupGetter__: ƒ __lookupGetter__()__lookupSetter__: ƒ __lookupSetter__()get __proto__: ƒ __proto__()set __proto__: ƒ __proto__()

Error thrown by Vector script

{errorMessage: "Uncaught TypeError: mw.undefinedExample is not a function", url: "http://localhost:8888/w/load.php?lang=en&modules=j…Csite%7Cskins.vector.js&skin=vector&version=n8enw", lineNumber: 155, columnNumber: 1579, stackTrace: "at http://localhost:8888/w/load.php?lang=en&module…ins.vector.js&skin=vector&version=n8enw:155:1579↵", …}
columnNumber: 1579
errorMessage: "Uncaught TypeError: mw.undefinedExample is not a function"
errorObject: TypeError: mw.undefinedExample is not a function at http://localhost:8888/w/load.php?lang=en&modules=jquery%2Csite%7Cskins.vector.js&skin=vector&version=n8enw:155:1579
lineNumber: 155
stackTrace: "at http://localhost:8888/w/load.php?lang=en&modules=jquery%2Csite%7Cskins.vector.js&skin=vector&version=n8enw:155:1579↵"
url: "http://localhost:8888/w/load.php?lang=en&modules=jquery%2Csite%7Cskins.vector.js&skin=vector&version=n8enw"

Event Timeline

Restricted Application added a project: Performance-Team. · View Herald TranscriptOct 26 2020, 10:46 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Jdlrobson updated the task description. (Show Details)Oct 26 2020, 10:46 PM
Jdlrobson added a subscriber: Krinkle.
Jdlrobson updated the task description. (Show Details)Oct 26 2020, 10:53 PM
Gilles assigned this task to Krinkle.Nov 2 2020, 7:38 PM
Gilles moved this task from Inbox to Backlog: Small & Maintenance on the Performance-Team board.
Krinkle updated the task description. (Show Details)Nov 2 2020, 8:02 PM

Absence of a URI doesn't have anything to do with wiki modules specifically. It applies to all modules once they are cached. If anything, wiki modules are cached less often due to them sometimes being user-specific or uncacheable for other reasons. I believe this is why we are seeing a disproportionate amount of errors from gadgets (and much fewer from our own deployed code), exactly because this issue is not affecting wiki modules such as gadgets as much.

While local storage execution has no URL (browser UIs sometimes call this "VM:123" or "(anonymous)" but it seems file_uri remains empty since these aren't valid URIs), it does have a stack trace.

I think it would make sense to simply remove this check, and instead check only for absence of a stack trace, which I suspect was the intent of the check intially - on the assumption that file_uri would be the first uri in any multi-line stack trace. However, we now know that this is incorrect.

Krinkle renamed this task from ResourceLoaderWikiModules create errors with no url or file_uri meaning they are dropped from client side error instrumentation to Do log errors with no url or file_uri instead of dropped from client side error instrumentation.Nov 2 2020, 9:29 PM
Krinkle removed Krinkle as the assignee of this task.
Krinkle added a comment.EditedNov 5 2020, 1:11 AM

leaving aside the known issue that almost all client errors are currently being dropped due to the faulty file_uri check.

That's not correct. As I've pointed out in my investigations in T266517 it looks like we're only dropping client errors in Chrome from ResourceLoaderWiki modules […] I encourage you to do your own investigations here. I am sure you will reach the same conclusion.

Some evidence that goes against your theory:

I already did. I wouldn't have said this based on a theory.

  • Right now our highest volume error is in TimedMediaHandler (T262947) […] That was the highest error before we added the file_uri check and continued to be the highest after and wasn't impacted by the change to WikimediaEvents.

When following T262947's task description steps in a fresh Chrome browser with no previous Wikipedia page views, we find an error with a long load.php URL for many freshly fetched modules. When reproducing it again after a refresh, or more generallly on non-initial page views, we still find a load.php URL associated with the error, albeit a much shorter one. This time it fetched only jquery.ui and Kaltura. It having such a URL means it didn't load from localStorage cache.

This is because TMH uses jquery.ui, which is larger than our 100 kB threshold for modules cached in localStorage. As such, they always load from a URL. (From the third page view on, these will enjoy fragmented client-side HTTP-based caching instead).

  • If we consider all errors, in a 24hr period 11,771 of our errors come from Chrome 86 and 5,241 from Firefox 82. If I filter by the (albeit flawed) gadgets DSL filter, the Firefox to Chrome ratio changes drastically. (you must remember this has been my obsession for the last month and I've been logging at logstash daily and I would have noticed if we were dropping high profile errors with the file_uri change)

OK. Let's take a closer look. Upon loading the mw-client-errors Logstash dashboard (without any filters enabled) the top five user agents are:

#CountUser-Agent
*50,600(all, past 24 hours)
18,300Windows 10, Firefox 82.0
27,200Windows 10 Chrome 86.0.4240.111
33,800Windows 10 Chrome 86.0.4240.183
41,800macOS 10.15.7, Safari 14.0
51,800Windows 10, Edge 86.0.622.61

Firefox being more common than any of these major browsers, whilst saddening to my heart, should be suspicious no matter what. Yes, combining the two Chrome patch releases would, just about, give Chrome the top position as most common UA. Having latest Firefox be 80% of Chrome's share is still highly suspect though, and that's before we even filter out unrelated noise from Chrome extensions.

When focussing on the top Chrome version only, I find from spot-checking the top 10 most common error messages that the majority falls in one of two categories:

  1. Having a URL like chrome-extension://, which are not related to this task or indeed our sites in general.
  2. Having a message like Error: module already implemented, which, as all JS stacks, have their error URL set to where the error is thrown from, not where the stack started. This means these errors have their URL is attributed to load.php?module=startup, the only unfragmented always-cachable load.php URL, which bootstraps the client and thus can't come from localStorage. If we look at any of the individual samples, the stack trace is generally missing the URLs higher up the stack, because the code came from localStorage, attributing it to "", "anonymous", "VM", or (article URL) depending on which browser and how it was loaded. Except for those loaded by users by raw URL (as opposed to by normally loading a gadget module), since that bypasses caching.

All others are from page views where the faulty code was either loaded for the first time on that page and thus not yet cached, or from modules like TMH that are too big for localStorage.

I think it would make sense to simply remove this check

I don't doubt we are dropping events, but right now I am not convinced that we are dropping them significantly. Flooding logstash with errors that cannot be easily debugged due to anonymous stack traces and absent file uris would only give us inflated numbers and more noise to contend with, not alert us to new unknown errors. As the main person championing this it would also cause me unnecessary work.

When the change ( (I9a5fc15fe0aa93a856ab0d835882300f00984d1b ) went live on 19th August, we were still at low volume < 1000 errors a day, adding new wikis and fixing various gadgets as well as hot fixing production code, so it's hard to know exactly how many errors are not being unreported but I should note that deploy didn't drastically drop our numbers:

It's possible we are underreporting the number of errors, but that's okay. Once we have deployed to English Wikipedia, fixed more of our existing bugs, and are seeing a reasonable volume of errors with lots of wiggle room I am happy for us to explore modifying or removing the check, but right now this is keeping triaging manageable.

Krinkle added a comment.EditedNov 5 2020, 1:32 AM

Here's an easy way to demonstrate the issue.

  1. In Chrome, load https://en.m.wikipedia.org/wiki/Main_Page, focus the search bar, and type an "x". (This makes us fetch all relevant code from the server)
  2. Refresh the page.
  3. Run the following in the console:
mw.trackSubscribe('', console.log.bind(console)); // To see global.error entries
mw.config.values = {}; // This will make MobileFrontend search code crash, unfairly
  1. Focus the search bar and type an "x". (This time we're loading code from local cache)

Observe the empty url and thus dropped event:

global.error 
{
  "errorMessage": "Uncaught TypeError: Cannot read property 'prefix' of null",
  "url": "",
  "lineNumber": 168,
  "columnNumber": 306,
  "stackTrace": "at a.getApiData  <anonymous>:168:306\nat a.search  <anonymous>:169:272\nat <anonymous>:192:482\n",
  "errorObject": {}
}

This is applicable to all regular production source code.

I think it would make sense to simply remove this check

I don't doubt we are dropping events, but right now I am not convinced that we are dropping them significantly.

I hope the above removes any remaining doubt.

Flooding logstash with errors that cannot be easily debugged due to anonymous stack traces and absent file uris would only give us inflated numbers and more noise to contend with, not alert us to new unknown errors. […]

Imho file urls aren't particularly relevant or useful when debugging JavaScript. The stacktrace is all you need imho, which we get either way. Once we have source maps working end-to-end, we can consider compressing variable names for a tiny perfwin, but until then our names are generally more than unique enough to figure things out.

Besides, on the URLs we do get today, it only tells you it was one of 100+ mostly unrelated modules.

I don't see how this could only inflate known errors. Afaik there is no meaningful correlation between loading code from cache and exposing an error. If anything, I imagine it's quite common for most code to first be loaded and not fully excercise all code paths on the first view in that feature's interaction. Pretty much all errors that happen after the first view or use of a feature, we are completely missing out on.