Page MenuHomePhabricator

Edits to pt:MediaWiki:Common.js and new bugs that create client side error spike should log alerts
Closed, ResolvedPublic

Description

On October 5th 2020, an edit to pt:MediaWiki:Common.js led to a spike in client side errors. In the space of an hour 20,000 errors were logged, compared with 15,000 in the 12hrs across all wikis that preceded that [1]. This code appears to have been run for any anonymous user. Had this occurred on English Wikipedia where volume is considerably higher it would have likely brought down the system and is one of the big reasons why we haven't enabled on English Wikipedia.

Screen Shot 2020-10-05 at 1.35.52 PM.png (386×2 px, 67 KB)

In an ideal world, the editor of the site JS would have had access to logstash (or some anonymized equivalent) and have been able to check their change - in 5 minutes it would have been clear there was a problem. However, the error was left for 30 minutes, until luckily I chanced upon it in an open tab set to auto-refresh and was able to rush a fix myself (edit). I won't always catch such errors.

The stack trace gave no indication for where the error originated so these kind of errors cannot be filtered easily. The only clue was "site" module in the name of the file_uri however this applies also for errors we care about. See T262493 for more details on the larger problem here.

at Anonymous function  Unknown script code:2:535
at mightThrow  URL1:50:132
at Anonymous function  URL1:50:808

It's proposed that when this happens an alert is triggered to ensure someone deals with the issue ASAP.

[1]: https://www.mediawiki.org/wiki/Reading/Web/Notable_incidents#October

Event Timeline

@Krinkle I have a ResourceLoader question.
I think the biggest problem here (and with identifying bugs we have is around site and user gadgets) is they run in a global function. Debugging currently involves going to the wiki and then checking all the known wiki pages.

I don't know much of the history about this other than this code comment but presumably it is to allow access to variables such as wgAction

Giving we can use bind in our code now, I was wondering if it would be practical to use a named function loadSiteOrUserScript bound to window to get the same result. The benefit of this, is that it would be 100% clear if errors were coming from site or user scripts making them much easier to diagnose.

Is there any sort of gadgets which would break if run like this (?) :

--- a/includes/resourceloader/ResourceLoader.php
+++ b/includes/resourceloader/ResourceLoader.php
@@ -1224,13 +1224,9 @@ MESSAGE;
                                                $scripts = $content['scripts'] ?? '';
                                                if ( is_string( $scripts ) ) {
                                                        if ( $name === 'site' || $name === 'user' ) {
-                                                               // Legacy scripts that run in the global scope without a closure.
-                                                               // mw.loader.implement will use globalEval if scripts is a string.
-                                                               // Minify manually here, because general response minification is
-                                                               // not effective due it being a string literal, not a function.
-                                                               if ( !$context->getDebug() ) {
-                                                                       $scripts = self::filter( 'minify-js', $scripts ); // T107377
-                                                               }
+                                                               // Legacy scripts are run in the global scope (via binding to window)
+                                                               // They are run inside a named function to provide useful stack traces.
+                                                               $scripts = new XmlJsCode( "(function loadSiteOrUserScript() {\n{$scripts}\n}.bind( window )())" );
                                                        } else {--- a/includes/resourceloader/ResourceLoader.php
+++ b/includes/resourceloader/ResourceLoader.php
@@ -1224,13 +1224,9 @@ MESSAGE;
                                                $scripts = $content['scripts'] ?? '';
                                                if ( is_string( $scripts ) ) {
                                                        if ( $name === 'site' || $name === 'user' ) {
-                                                               // Legacy scripts that run in the global scope without a closure.
-                                                               // mw.loader.implement will use globalEval if scripts is a string.
-                                                               // Minify manually here, because general response minification is
-                                                               // not effective due it being a string literal, not a function.
-                                                               if ( !$context->getDebug() ) {
-                                                                       $scripts = self::filter( 'minify-js', $scripts ); // T107377
-                                                               }
+                                                               // Legacy scripts are run in the global scope (via binding to window)
+                                                               // They are run inside a named function to provide useful stack traces.
+                                                               $scripts = new XmlJsCode( "(function loadSiteOrUserScript() {\n{$scripts}\n}.bind( window )())" );
                                                        } else {

Given we can use bind in our code now, […]

bind() doesn't affect scope.

Is there any sort of gadgets which would break if run like this […]

Yes. Any user script or gadget reading or defining global variables would break. Changing this would be a hard breaking change to pull off, and also break general expectations with regards to JavaScript in a browser since having a scope is how scripts work everywhere else. Whether browser extensions, Greasemonkey scripts, <script> files, ESLint/JSHint linting of code (e.g. as Wikimedia Commons community do), these all assume global scope.

I won't go into details here why this is hard to change for gadgets, but I could at T262493. Although this won't solve that task given that traces are based on execution, not declaration. Anything apart from the most basic top-level synchronous statements would not have this wrapper function in their stack trace.

In an ideal world, the editor of the site JS would have had access to logstash […]

The browser already provides pretty good error reporting to gadget authors, incl preview ability from MW edit screen or browser console. What would end-users gain from seeing Logsash data?

But providing some kind of boolean on/off alert could be useful I suppose to notify editors. Once T262493 is solved, we should let Logstash intake define a Prometheus counter for some of the things we then exclude, which we can alert on per-wiki from Grafana and e.g. have it email wikitech-ambassadors, or just for volunteers to remember to look at from time to time, and probably people will end up developing a bot that notifies WP:VPT on enwiki if it goes off etc.

Okay so to confirm I understand, you are saying gadgets do things like:

var x = 1

and then accessing x in other scripts expecting x to be global scope. Is that correct?

Do we have data around how many different user scripts / gadgets are relying on this?

The browser already provides pretty good error reporting to gadget authors, incl preview ability from MW edit screen or browser console.

That doesn't account for browser specific bugs or perceived edge cases. The bugs I'm seeing are bugs that are not reproducible on all gadgets and all pages. Chinese Wikipedia for example the errors on IE11 were huge due to a script running ES6 code. Given we allow code to run on anonymous users via MediaWiki:Common.js I would argue we need to do a lot better. Right now the onus is on my team to identify and fix these issues as they occur and I'd rather it wasn't. Having a clear way to identify gadgets from WMF code is the problem we need to solve here. I'm wasting a lot of time doing searches across our codebases and global-search to determine where an issue is occurring.

I am wondering if it would be possible to pull these module into its own ResourceLoader grouping so the file uri is different? I can understand there are likely performance reasons not to do it for the site module since it runs for anons, but I think clearly defining code that comes from user scripts would go a long way to reducing the time debugging these.

Jdlrobson renamed this task from Edit to pt:MediaWiki:Common.js led to huge client side error spike to Edits to pt:MediaWiki:Common.js and new bugs that create client side error spike should log alerts.Feb 5 2021, 7:26 PM
Jdlrobson updated the task description. (Show Details)
Jdlrobson added projects: observability, SRE.

I think alerts are the way to go with this. The alert would essentially trigger when over 10,000 errors occur in a single hour (usual rate is < 2,000 errors an hour)
When this happens, IRC notifications would be sent to #wikimedia-releng and #wikimedia-operations and a phabricator ticket would be created tagged "production error" pointing to the spike and tagging teams including Release-Engineering-Team and Web-Team-Backlog (and others TBC). Only one Phabricator ticket would need to be open at any given time when this does happen.

The process for generating an alert from logs is to export a metric based on an ES query then alert off of the generated metric. We use prometheus-es-exporter to do the metrics export. Once there is a metric in place, we can alert on sum_over_time(metric[-1h]) >= 10000 reasonably easy.

As to creating Phabricator tickets from alerts, it involves writing an alert handler script and installing it into Icinga. I imagine deduplication as described would have to occur somewhere in this handler.

@colewhite is there somebody I could pair with to write this metric? Puppet is not by forte... but I'd gladly setup 30 mins in my calendar to get this done. :)

@Jdlrobson I'd be happy to help. Ping me on IRC or elsewhere to coordinate a time and we'll make it happen.

Change 667695 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[operations/puppet@production] prometheus: Setup configuration for client error alerts

https://gerrit.wikimedia.org/r/667695

Change 667695 merged by Cwhite:
[operations/puppet@production] prometheus: Setup configuration for client error alerts

https://gerrit.wikimedia.org/r/667695

Change 667737 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] icinga: add grafana dashboard alert for client errors

https://gerrit.wikimedia.org/r/667737

Change 667737 merged by Cwhite:
[operations/puppet@production] icinga: add grafana dashboard alert for client errors

https://gerrit.wikimedia.org/r/667737

Jdlrobson closed this task as Resolved.EditedMar 3 2021, 1:16 AM
Jdlrobson claimed this task.

Thanks so much for your help with this one @colewhite
Am broadcasting this out here > T276296

Change 668113 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[operations/puppet@production] Use more generic non-team specific name for alerts

https://gerrit.wikimedia.org/r/668113

Change 668113 merged by Cwhite:
[operations/puppet@production] Use more generic non-team specific name for alerts

https://gerrit.wikimedia.org/r/668113