Page MenuHomePhabricator

[Regression pre-wmf.16] Infinite error throwing loop when MediaSearchWidget triggers jQuery JSONP exception in Beta Cluster (due to loop with Sentry)
Closed, ResolvedPublic8 Story Points

Description

Steps to reproduce:

  1. Open Visual Editor
  2. Click on Insert
  3. Click on Media
  4. Click on dialog box for the search panel
  5. Try searching for something other than the default parameter 'sandbox'.

Observation:
The browser tab freezes at this point with no response.
Bug was found in beta environment using google chrome browser.

Details

Related Gerrit Patches:

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 17 2016, 10:29 PM

I can't reproduce this with the same URL in Chrome.

The default search parameter 'sandbox' works. But sometimes when I hit backspace on the search box and retype a different search parameter for a couple of times it fails as demonstrated by the video. There maybe instances it functions all good during the first trial but on the second or third trial it is not functioning for me.

I can reproduce it on Beta cluster with Chrome only. But yeah, it is a bit inconsistent. Maybe happening when you modify the search term very fast? Not sure.

Ryasmeen renamed this task from Browser tab freezes when attempting to Insert media on Visual Editor to [Regression pre-wmf.16] Browser tab freezes when attempting to Insert media on Visual Editor.Aug 18 2016, 12:43 AM

This is also happening when trying to add images inside Gallery. For me it happens when I modify the search term very quick before it loads all the search results.

Ryasmeen renamed this task from [Regression pre-wmf.16] Browser tab freezes when attempting to Insert media on Visual Editor to [Regression pre-wmf.16] VisualEditor crashes when attempting to Insert media on Visual Editor.Aug 19 2016, 2:04 AM
Esanders added a comment.EditedAug 19 2016, 6:04 PM

So jQuery JSONP is firing an error, which is known behaviour when you cancel JSONP requests (the random global function is deleted and then called), but it looks like mw.track (attached to window.onerror) is triggering the infinite loop.

Esanders renamed this task from [Regression pre-wmf.16] VisualEditor crashes when attempting to Insert media on Visual Editor to [Regression pre-wmf.16] Infinite error throwing loop when VE triggers jQuery JSONP exception.Aug 19 2016, 6:24 PM

Looping call stack:

window.onerror is in mediawiki.errorLogger.js

Pinging @Tgr @Krinkle

Tgr added a comment.Aug 19 2016, 6:42 PM

Is there any reason to use JSONP for an unauthenticated crosswiki request that could just be done with non-preflight CORS?

Tgr added a comment.Aug 19 2016, 6:47 PM

Note that Extension:Sentry (which is where traceKitWindowOnError comes from) is deployed in beta but not production so this probably does not affect production.

I can't reproduce this at all (beta, Chrome on Ubuntu). Can someone get a debug mode stacktrace?

That may be a way to work around this issue, but the root cause should be fixed too.

Esanders renamed this task from [Regression pre-wmf.16] Infinite error throwing loop when VE triggers jQuery JSONP exception to [Regression pre-wmf.16] Infinite error throwing loop when MediaSearchWidget triggers jQuery JSONP exception.Aug 19 2016, 6:49 PM
Esanders added a project: Multimedia.
Esanders added a subscriber: matmarex.
Restricted Application added a subscriber: Matanya. · View Herald TranscriptAug 19 2016, 6:49 PM

I can't reproduce either.

Jdforrester-WMF added a comment.EditedAug 19 2016, 7:16 PM

Note that Extension:Sentry (which is where traceKitWindowOnError comes from) is deployed in beta but not production so this probably does not affect production.

That's really bad; when will that be fixed so that we can rely on Beta Cluster to be as like production as possible?

It's a bit fiddly to reproduce - but if you keep typing such that you cancel a valid request it should eventually trigger the jquery_NNNN method not found exception.

I can reproduce that exception, yes, but not the loop.

That's really bad; when will that be fixed so that we can rely on Beta Cluster to be as like production as possible?

The unsaid assumption there is that the Sentry extension contributes a significant chunk of the "behavioral distance" between beta and production; I think you would find that claim hard to defend.

I do hope to get the extension deployed in production at some point (not easy since production is very much not like beta in that it has about four magnitudes more traffic, a huge difference for any application that's not purely client-side), but even apart from that having better error logging in beta at the expense of keeping an extra error handler around seems like a reasonable tradeoff to me.

You can also fake a JSONP abort from the console:

var xhr = $.ajax(
  'https://commons.wikimedia.org/w/api.php?action=query&format=json&generator=search&gsrnamespace=6&iiurlheight=200&iiprop=dimensions%7Curl%7Cmediatype%7Cextmetadata%7Ctimestamp%7Cuser&prop=imageinfo&gsrsearch=test&iiurlwidth=400&gsroffset=0&gsrlimit=15',
  {dataType:'jsonp'}
);
xhr.abort();
Tgr added a comment.Aug 19 2016, 8:39 PM

Sentry installs a global.error handler which, when first invoked, replaces window.onerror with TraceKit's error handler, then uninstalls itself. In theory, the browser should protect from window.onerror loops (errors in the error handler don't trigger the error handler again). So either that does not work in some browser (in which case we can just add a try/catch), or somehow unsubscribing the mw.track handler fails.

Can you trigger the same problem with a different error? E.g. setting the click handler of some DOM event to cause an error, and then clicking it? (Errors caused in the console directly are special.)

greg added a subscriber: greg.Aug 19 2016, 8:59 PM

Note that Extension:Sentry (which is where traceKitWindowOnError comes from) is deployed in beta but not production so this probably does not affect production.

That's really bad; when will that be fixed so that we can rely on Beta Cluster to be as like production as possible?

To be clear, it's not a bug to be "fixed". Extensions are deployed to Beta Cluster before production all the time; it's the way things work: https://www.mediawiki.org/wiki/Review_queue

We should keep the time differential as small as reasonably possible, of course. But I'd also like to say that Gergo is working hard on this (Sentry) as a single agent without much help (regretfully! I personally think Sentry, or similar, would be great for our js devs, just a few days ago Fundraising was asking for such a service).

Tgr added a comment.Aug 19 2016, 9:06 PM

The error itself is pretty boring: it's happening on an external domain so Chrome scrubs all error details and just leaves a Script error message, so it is very unlikely that this error was never triggered before. If you can reproduce, can you share the exact OS/browser version? Also, as I said, a stack trace in debug mode would be very helpful.

Also, what do you think, was there some change in VE that caused this error to be triggered now but not in past branches, or was this specific feature never tested (in this specific way) before, or is there some external reason (e.g. a browser update) that the error started appearing now?

Tgr added a comment.Aug 19 2016, 9:18 PM

So either that does not work in some browser (in which case we can just add a try/catch), or somehow unsubscribing the mw.track handler fails.

On second though it must be the latter if this was the looping part:


since otherwise track would not appear (TraceKit has no such method).

@Esanders are you sure that was an actual loop? It could simply show multiple errors being processed in succession (when I try to reproduce the problem, I do get multiple hits to onerror if I type fast enough, but after going through it 3-4 times error handling finishes normally).

Note that Extension:Sentry (which is where traceKitWindowOnError comes from) is deployed in beta but not production so this probably does not affect production.

That's really bad; when will that be fixed so that we can rely on Beta Cluster to be as like production as possible?

To be clear, it's not a bug to be "fixed". Extensions are deployed to Beta Cluster before production all the time; it's the way things work: https://www.mediawiki.org/wiki/Review_queue

Yes, but briefly, and with as few side-effects as possible. Things that interfere with the full scope of JS should be extremely cautiously added. I was mostly extremely surprised that this was running, as I thought given the Sentry project was (regretfully) paused, it wouldn't be live anywhere.

We should keep the time differential as small as reasonably possible, of course. But I'd also like to say that Gergo is working hard on this (Sentry) as a single agent without much help (regretfully! I personally think Sentry, or similar, would be great for our js devs, just a few days ago Fundraising was asking for such a service).

Absolutely, and VE wants it too, but not enough to qualify as a quarterly goal (which I imagine is the same case for everyone else); the tragedy of the commons, indeed.

In this case, we have documented breakage (this ticket); there's no need to keep it live in Beta Cluster unless it's going live in production within a few weeks, surely?

greg added a comment.Aug 19 2016, 9:54 PM

We should keep the time differential as small as reasonably possible, of course. But I'd also like to say that Gergo is working hard on this (Sentry) as a single agent without much help (regretfully! I personally think Sentry, or similar, would be great for our js devs, just a few days ago Fundraising was asking for such a service).

Absolutely, and VE wants it too, but not enough to qualify as a quarterly goal (which I imagine is the same case for everyone else); the tragedy of the commons, indeed.

Yeah, I'm willing to vouch for (+ moral support) this becoming a shared goal in the near future, if needed. But then, maintenance... ;)

In this case, we have documented breakage (this ticket); there's no need to keep it live in Beta Cluster unless it's going live in production within a few weeks, surely?

I'd like to be flexible for Gergo's sake. He's taking on a large burden here that will help us all. With that said:

If this breakage is relatively easily fixed (all subjective there) then I'd let it stay. If not, then I'd recommend disabling it with the encouragement to try again when there are better ideas for a fix (which, admittedly, will be terribly hard without it being enabled on BC).

Tgr added a comment.Aug 19 2016, 10:06 PM

In this case, we have documented breakage (this ticket); there's no need to keep it live in Beta Cluster unless it's going live in production within a few weeks, surely?

Isn't that the exact point of Beta, to discover nontrivial breakage before it would affect production?

In any case, disabling Sentry in Beta wouldn't hinder development much right now; all immediate blockers are related to puppet for which a labs project is fine. But 1) I can't debug this issue without the help of someone who can actually reproduce it, so it would just reoccur whenever Sentry gets re-added, 2) logstash integration would have to happen in Beta, 3) I still think error collection from all Beta projects is a sensible end goal in itself, regardless of whether it's done the same way in production or not.

@Esanders are you sure that was an actual loop? It could simply show multiple errors being processed in succession (when I try to reproduce the problem, I do get multiple hits to onerror if I type fast enough, but after going through it 3-4 times error handling finishes normally).

The event queue was 40,000 deep at this point, so I think so. I can reproduce it consistently by using the xhr.abort code I pasted above.

This also causes an infinite loop - as you suggested:

$(document.body).one( 'click', function () { throw new Error('Oops') } );

Tgr added a comment.Aug 19 2016, 10:26 PM

Thanks, that's conclusive. I'll add some try-catch and if that doesn't fix it, I'll undeploy it from Beta. For the record, can you state the exact browser version?

Version 52.0.2743.116 (64-bit)

I'll add some try-catch and if that doesn't fix it, I'll undeploy it from Beta.

Thank you! But let's get it working and plumbed-in so we can get the benefit too. :-)

greg added a comment.Aug 22 2016, 8:51 PM

Status update? :)

@matmarex I tagged multimedia because of the issue with the providerQueue using JSONP instead of CORS. Any thoughts?

I'm afraid that neither me nor any Multimedia folks ever had anything to do with this code. My thoughts are yes, providerQueue should probably use CORS, and possibly it should even use mw.Api/mw.ForeignApi rather than $.ajax calls (although on Wikimedia wikis that currently has the overhead of getting a CentralAuth token before every request). I'm not familiar with that code, but I could probably figure it out and make that change if you want me to?

Tgr added a comment.Aug 22 2016, 9:53 PM

I managed to reproduce the problem in vagrant (not exactly the same one as the recursion is stopped by the 5 Sentry reports/page limit, but it is some kind of loop, so I assume the fundamental problem is the same). It is somehow related to mw.track unsubscription not working when called from the function that needs to be unsubscribed, depending on timing (jQuery has the annoying property that promise callbacks can be sync or async depending on when they are called, so they are a pain to reason about). It's not too hard to safeguard against, I just want to figure out how exactly it is happening first.

In any case this is not a deployment blocker. If it's otherwise problematic (e.g. interferes with QA) we can undeploy Sentry.

CORS is generally better than JSONP (more secure, slightly more performant, can handle network failure) and you don't need an auth token for searching. Switching would be a good idea but no need to do it in a hurry.
(The one thing to check is that we support CORS for IE9 which has its own weird way of doing it. I imagine that's handled somewhere in mw.Api or mw.ForeignApi but I don't remember if I ever checked.)

Jdforrester-WMF renamed this task from [Regression pre-wmf.16] Infinite error throwing loop when MediaSearchWidget triggers jQuery JSONP exception to [Regression pre-wmf.16] Infinite error throwing loop when MediaSearchWidget triggers jQuery JSONP exception in Beta Cluster (due to loop with Sentry).Aug 22 2016, 9:54 PM
Jdforrester-WMF triaged this task as Low priority.

but I could probably figure it out and make that change if you want me to?

That would be great. Eventually we'll want to move the media search widget upstream to MW widgets, at which we may hand over ownership.

Tgr added a comment.Sep 10 2016, 4:41 AM

Apologies for being so slow on this (needed a break from walking through jQuery code in a debugger). I tracked it down to Sentry abusing mw.track. The usage pattern looks something like this:

mw.trackSubscribe( 'global.error', function sentrySetup( topic, error ) {
    mw.trackUnsubscribe( sentrySetup );
    // load Sentry
    // process error
    mw.track( topic, error );
} );

mw.track( 'global.error', {...} );
mw.track( 'global.error', {...} );

which, depending on how closely the second error follows the first and how long the loading takes, can result in an infinite loop where the handler function defined inside mw.trackSubscribe buffers up the second error (ie. its seen index does not point at the end of trackQueue) and every time it goes through its for loop, the error is again added to the end of trackQueue.

Change 309767 had a related patch set uploaded (by Bartosz Dziewoński):
ve.dm.MWMediaResourceProvider: Use mw.ForeignApi rather than JSONP

https://gerrit.wikimedia.org/r/309767

Change 309809 had a related patch set uploaded (by Gergő Tisza):
Prevent Raven/TraceKit from taking over window.onerror

https://gerrit.wikimedia.org/r/309809

I suppose this should be split into two tickets...one for the Sentry error and another for the VE MediaSearchWidget issue?

Change 309767 merged by jenkins-bot:
ve.dm.MWMediaResourceProvider: Use mw.ForeignApi rather than JSONP

https://gerrit.wikimedia.org/r/309767

Change 312856 had a related patch set uploaded (by Bartosz Dziewoński):
Revert "ve.dm.MWMediaResourceProvider: Use mw.ForeignApi rather than JSONP"

https://gerrit.wikimedia.org/r/312856

Change 312856 merged by jenkins-bot:
Revert "ve.dm.MWMediaResourceProvider: Use mw.ForeignApi rather than JSONP"

https://gerrit.wikimedia.org/r/312856

Change 312857 had a related patch set uploaded (by Bartosz Dziewoński):
ve.dm.MWMediaResourceProvider: Use mw.ForeignApi rather than JSONP

https://gerrit.wikimedia.org/r/312857

Jdforrester-WMF closed this task as Resolved.Nov 2 2016, 7:42 PM
Jdforrester-WMF assigned this task to matmarex.

Change 312857 merged by jenkins-bot:
ve.dm.MWMediaResourceProvider: Use mw.ForeignApi rather than JSONP

https://gerrit.wikimedia.org/r/312857

Tgr added a comment.Nov 2 2016, 9:28 PM

If someone has too much free time, https://gerrit.wikimedia.org/r/#/c/309809/ still needs review :)

Change 309809 merged by jenkins-bot:
Prevent Raven/TraceKit from taking over window.onerror

https://gerrit.wikimedia.org/r/309809