Page MenuHomePhabricator

EventEmitters must behave appropriately if a handler raises an exception
Closed, ResolvedPublic40 Estimated Story Points

Description

EventEmitter events are supposed to be a one-way communcation channel from the emitter to the handler. This paradigm breaks if the handler throws an exception that propagates back to the emitter and causes execution flow to short circuit. In some cases in VisualEditor right now, that means a UI exception can cause corruption in the data model.

This is best fixed in OOJS, for consistency, even though some code could theoretically break if the emitter somehow depends on the effects of the handler (which is an anti-pattern).

Event Timeline

Mind you, the emitter catching exceptions and running a callback defined at the emit-site is also a bit of a violation of that one-way channel.

I'd sort of like to just make emit try/catch and see if anything breaks.

That's true (I was imagining the callback only being used for debugging purposes).

Steps to reproduce (one specific example of data corruption):

  1. Open the VE standalone demo.
  2. In the browser console, add an event handler that throws an exception: ve.init.target.surface.model.documentModel.connect( null, { transact: function () { throw new Error( 'Boom' ); } } )
  3. Click to put the cursor in the document, then type 'x' to cause a transaction to be processed.
  4. Observe that an exception is thrown.

Expected behaviour: The document model is in a consistent state.

Actual behaviour: ve.init.target.surface.model.getHistory().length is 0, but ve.init.target.surface.model.documentModel.completeHistory contains a transaction. This is because the exception in the handler prevented the newly-processed transaction from being added to the list of staging transactions.

Change 405827 had a related patch set uploaded (by Divec; owner: Divec):
[VisualEditor/VisualEditor@master] Create ve.EventEmitter with emitCatch, and use it throughout ve.dm

https://gerrit.wikimedia.org/r/405827

Change 405827 abandoned by Divec:
Create ve.EventEmitter with emitCatch, and use it throughout ve.dm

Reason:
Abandoning in favour of I9d74b3418c05509aca8e3af494ff7c06bdc80337 in oojs/core - it makes more sense to have one uniform definition of emit.

In the unlikely event we decide not to merge that change, we should port DLynch's and Bartosz's enhancements from there to here.

https://gerrit.wikimedia.org/r/405827

Change 406247 had a related patch set uploaded (by DLynch; owner: DLynch):
[oojs/core@master] [BREAKING CHANGE] EventEmitter.emit: catch exceptions from listeners

https://gerrit.wikimedia.org/r/406247

Change 406247 merged by jenkins-bot:
[oojs/core@master] [BREAKING CHANGE] EventEmitter.emit: catch exceptions from listeners

https://gerrit.wikimedia.org/r/406247

This comment was removed by Krinkle.
@dchan wrote:

EventEmitter events are supposed to be a one-way communication channel from the emitter to the handler. This paradigm breaks if the handler throws an exception that propagates back to the emitter and causes execution flow to short circuit. In some cases in VisualEditor right now, that means a UI exception can cause corruption in the data model.

I'd sort of like to just make emit try/catch and see if anything breaks.

EventEmiter: One-way

At All Hands @dchan showed me this issue and we spent some time in weeks since analysing the problem.

First, I agree EventEmitter is best used as one-way communication channel. To consume data, which the handler (consumer) may use to inform its own behaviour, but not as a way to allow consumers to alter behaviour of the emitter object (producer). In other words, they're not supposed to be "hooks", which are better suited as parameters, custom callbacks, or overridable sub-class methods. In addition to events being one-way, it is also important that consumers have a reliable way to make decisions without depending on current state elsewhere. (More on that later.)

Seeing exceptions caught within OO.EventEmitter does seem a bit unusual. It's where I'd expect a silent catch-all. I couldn't find other implementations of an EventEmitter in the wild doing this, either. Which is reason for suspicion. If this is the right thing to do, then I don't mind being innovative, but we should be mindful of the cost of being non-standard.

If I understand correctly, the current bug being addressed in this task, is about exceptions affecting VisualEditor that are considered "unimportant" and can be safely caught and ignored, in exchange for letting the editor proceed as normal and retaining its internal state integrity. I have no reason to doubt these claims. I'm sure they're safe to catch and ignore, and that nothing breaks elsewhere if we do. The question is, how/where do we catch. Especially considering the seems VE-specific and specific to this individual problem. It would not surprise me if, elsewhere in our projects or even later on elsewhere within VE, there would be problems caused by the silent catch-all. Causing the opposite effect: State being corrupted because we proceeded despite a fatal error. When a run-time exception is not caught, something somewhere ought to be responsible for handling it, or else we should fatal. (More on that later.) We can't know that future exceptions in all code paths will be safe to catch.

Asynchronous events

@dchan explained how VE DataModel (DM) events are consumed by different namespaces (ContentEditable–CE, and UI), and that DM events are "one-way" used to keep CE and UI state in sync. And exceptions in UI shouldn't break the internal processing state of DM, or CE.

My first thought was whether these events could be made async. Emitting them asynchronously would solve the problem, while also having the benefit of allowing DM to first complete its transaction atomically before the reactor (CE/UI) could potentially call more DM methods or make another transaction, which may break things. (Not a current problem, but seems desirable to make impossible by design.)

The principle of emitting events asynchronously, by calling emit() from setTimeout or from process.nextTick, is quite common. Both in the browser and in Node.js. It encourages isolation and higher throughput, by allowing individual tasks to complete and before allowing event handlers to start more tasks. E.g. Don't allow an event handler to pause the task mid-way for an unknown period of time.

David quickly pointed out that DM and CE are tightly coupled currently, and its events could not asynchronous (unless major refactoring takes place). That seems fair, so I won't push for async events. However, I did want to determine what that tight coupling was, as it helps us understand the problem and maybe find another solution.

After some iterations and me not understanding how DM+CE really work, we eventually found three aspects of CE that break with async events:

  1. When CE consumes DM events, the info accessible through event parameters is insufficient. It therefore uses its own reference to DM and LinMod to retrieve additional data. If we'd trigger events asynchronously (even when preserving original event order), these lookups could return different/wrong values. For example, the nodes may've changed several times since. In itself, this doesn't seem like a problem, given that it could actually improve performance by letting older events always update to the current state (de-duplicating events). But this would go wrong because:
    • A) It means changes no longer happen in order (later changes would effectivel merge into earlier changes), which breaks various assumptions.
    • B) CE event handlers make lookups in DM based on offsets, not based on object identity. And it tends to use offsets from the event parameters, but in relation to current DM. If called after the transaction completes, these offsets could point to entirely different nodes.
  2. The CE nodes implement the entire Node interface, including a getOffset method. However, the problem is that unlike all other aspects of CE nodes, their offset is not reproduced or updated through events. Instead, ve.ce.Node#getOffset is a proxy that directly returns the current value from getOffset of the correlating DM node. This causes problems if events were asynchronous because some state is updated through events, whereas other state (like offsets) are accessed directly, which means the offsets would be"ahead" of the regularly updated state, making it incompatible.

I don't know for sure, but these two aspects don't actually seem to be very fundamental (nor relied upon). It might be worth exploring if those could be addressed without much difficulty:

  • idea: Use events to update CE node offsets.
  • idea: Put more data in the DM events so that consumer in CE doesn't need DM lookups.

Fatal error vs Event loop

Back to the original issue. UI throws, which goes uncaught and short-circuits whatever DM/CE was doing. Errors are bad, but it wasn't clear to me why we'd want to mute all errors by default because of it. Normally when we find a fatal, we find the broken thing that threw, and fix it. Again, David reminded me of something very important: The event loop.

Unlike in PHP, when you get a fatal in client-side javascript, things don't stop. Not entirely, anyway. There are still event handlers attached to the DOM, and code will re-awaken regardless of any "uncaught" error, which then results in the corrupted objects being interacted with further, causing more corruption. Likely at this point, performing actions in the editor would result in all kinds of unexpected changes throughout the document. And trying to save the edit will likely fail, or it might serialise to something the user didn't intend.

However, it still wasn't clear to me how silently catching exceptions would solve the problem. At any one point, it might ensure the producer retains its state, but likely at the cost of still having the consumer become corrupted. If this bug was about VE vs non-VE code (eg. events emitted towards plugins or user gadgets), then it might not seem so bad to let the plugin become corrupted (although we should probably still catch it with a generic handler common to only gadgets, and then disconnect the plugin in some way). But in this case we're talking about to two VE-internal components. Presumably a corrupt state is bad for both DM, CE and UI. If we had some code that detect a corrupt CE or UI, and could re-create it based on current DM, then that'd be fine, but I don't think that's the case right now. Besides, even if we had that feature, we'd have to trigger it from a catch handler, which the silent catch-all prevents (e.g. a generic catch-handler for all CE event handlers, and one for all UI event handlers). Something like:

  ve.ce.Something = function () {
-   this.data.on( 'example', this.onExample.bind( this ) );
+   this.data.on( 'example', ve.catch( this, 'onExample' ) );

Where ve.catch would discover whether this is CE or UI, and if the handler throws, tear down and re-create the associated CE or UI surface (based on DM). It's possible that the corruption would happen again if there is a consistent problem where e.g. the CE or UI is unable to represent a certain kind of (valid) DM, or if DM accepted invalid state without throwing. For that, we'd want to avoid an infinite loop. E.g. after 1 attempt, fall back to deactivating the surface and displaying a modal dialog with an option to reload the page and locally stash and restore the data model as it was before the last transaction.

  • idea: Wrapping all handlers is nasty though. Another approach would be to use window.onerror, which gets notified of all uncaught errors, and if its stack trace contains VE code, it would disable the editor, and display the same "Something went wrong" modal from there, with the same recovery option.

This modal could be similar to what Google Docs does when it crashes ("Something went wrong", offering, for example: "Details about the error", "Report the error", "Refresh to try and restore the edit", and "Cancel editing"). We could even report the error by default through EventLogging, then from a processor send the trace to Logstash, and increment a counter in Graphite.

google-docs-error.png (250×938 px, 30 KB)
google_docs_error_message.jpg (316×1 px, 28 KB)

Catchy EventEmitter

So back to EventEmitter. I'd like to propose that we revert the breaking change in oojs/core (https://gerrit.wikimedia.org/r/406247), because as-is it seems very difficult to roll-out as major release affecting other projects.

Instead, we could try some of the above ideas, or first proceed with the same approach within VisualEditor (using ve.EventEmitter). My only recommendation would be to add some logging to the catch handler in ve.EventEmitter calling mw.track('counter.ve.<something>'), to provide insight into the number of muted exceptions. I can also help set up a Grafana alert.

I like this idea of VisualEditor doing something nice with window.onerror .

One of the points I hadn't expressed very well is that the issue causing VE DM corruption is often UI errors rather than CE ones. Something relatively trivial like a dialog box not closing properly had the potential to corrupt the document model, because of uncaught errors in listeners unwinding the call stack right back down to the event loop. (We saw one extreme "spark causes forest fire" example — in a testing session only — where a failure to set a caption on a label was causing profound corruption that could get all the way back to the server and out to *every* subsequent editing client).

For this reason, if we back out the try-catch in OO.EventEmitter, I think we should instead write a ve.EventEmitter that has it.
But I do wonder how often this might be true in other projects too: perhaps listeners are often UI code, and perhaps errors in them normally shouldn't bring down the emitter. Maybe we could take a look at existing project code to get an idea of the frequency of different uses?

As for the VisualEditor CE listeners, on reflection, I think I can explain more clearly why I think in practice they cannot avoid looking into the DM. To avoid this, CE would effectively need its own "shadow" linear model, kept updated by the data in the DM events — at which point the DM becomes almost redundant. (This could be another way of saying the emitter-listener relationship isn't actually the ideal one here, which I've leaned towards thinking for years :))

Change 527510 had a related patch set uploaded (by Divec; owner: Divec):
[VisualEditor/VisualEditor@master] Do not propagate exceptions thrown in OO.EventEmitter#emit listeners

https://gerrit.wikimedia.org/r/527510

Change 527510 is the code from 84df3f0b9f59626749db6e00b197c18979b0bacd , reimplemented inside VisualEditor as an extension of OO.EventEmitter.

I looked into the alternative of implementing a separate ve.EventEmitter, based on OO.EventEmitter but with .emit catching exceptions thrown inside listeners. But OOJS-UI classes use OO.EventEmitter, so you unavoidably end up with both OO.EventEmitter objects and ve.EventEmitter objects instantiated at the same time, which gets very confusing.

I also looked into the alternative of an .emitCatch method (either actually in OOJS or in an extension inside VE) coexisting with the standard .emit method. Again, the inconsistency gets very confusing because, when writing or debugging some piece of code, it's far from obvious whether it will be called inside .emit, .emitCatch or both.

Modifying OO.EventEmitter.prototype.emit is not going to work as it will change the behaviour for everything on the page, not just VE.

All the other solutions here are too complex given how widely used EventEmitters everywhere.

I think we should just fix EventEmitter in OOUI to either rethrow the exception in a different event cycle (the patch David submitted), or rethrowing the exceptions as warnings, which is what jQuery promises do.

Hmm, can we perhaps modify the prototype on loading VE? I think when VE is running the chances of anything (VE or otherwise) relying on the non-catchy behaviour are pretty low.

(I agree applying @DLynch 's OOJS core patch seems the simplest solution. I'm only suggesting fixing this within VE because of the pushback there was on doing it in OOJS core)

Hmm, can we perhaps modify the prototype on loading VE? I think when VE is running the chances of anything (VE or otherwise) relying on the non-catchy behaviour are pretty low.

I really don't think this is a worthwhile avenue to pursue. Lots of code runs alongside VE: CX, Flow, Echo, ..

Ok, the VE engineers have now examined many different possible VE-level fixes to this. Unfortunately, all seem to worsen the codebase.

Therefore we strongly believe the right answer is for OO.EventEmitter#emit to catch errors and rethrow them asynchronously (in a setTimeout, meaning the browser debugger can still pause on them and show the async stack trace). Moreover we continue to believe a listener error taking down the emitter is a fundamental violation of the EventEmitter pattern.

Below are evaluations of different possible VE-level fixes we've examined:

Separate ve.EventEmitter

This would work for classes that mixin ve.EventEmitter directly. However, a large number of VE classes inherit a mixed-in OO.EventEmitter e.g. via OO.ui.Widget. Therefore a custom ve.EventEmitter would have to coexist with standard OO.EventEmitters – which obfuscates whether a particular listener (or code called by it) will get executed from OO.EventEmitter, from ve.EventEmitter or from both.

Modifying OO.EventEmitter.prototype.emit within the VE codebase only

This would fix .emit for all VE usage. But it would also leak to non-VE uses of OO.EventEmitter running on the same page. With highly devious hacks we could probably get round this, but such hacks would obfuscate the code, as would the fact that OO.ui.* classes would sometimes, but not always, be running with a modified #emit method.

Handling uncaught listener errors with window.onerror

This could avoid the EventEmitter being taken down silently. However it doesn't fix the fundamental violation of the pattern, which stipulates the EventEmitter should not care what listeners there are. (In the VE case it means unrecoverable data corruption will still be caused, we just get a chance to tell the user to reload).

Writing an .emitCatch to use instead of .emit

This would fix the issue for cases where .emitCatch is used. But it doesn't fix cases where .emit is called by a base class (e.g. OO.ui.Widget), and so it also obfuscates whether a listener (or code called by it) will get executed from .emit, from .emitCatch or from both.

Change 527510 abandoned by Divec:
Do not propagate exceptions thrown in OO.EventEmitter#emit listeners

Reason:
Abandoned for the reasons in given T185546: there's no way this can coexist with OO.EventEmitter#emit without increasing complexity.

https://gerrit.wikimedia.org/r/527510

Change 529726 had a related patch set uploaded (by Divec; owner: Divec):
[oojs/core@master] EventEmitter.emit: catch exceptions from listeners

https://gerrit.wikimedia.org/r/529726

I filed a separate task about using window.onerror (T230441)

JTannerWMF added a subscriber: JTannerWMF.

I am moving this to stalled on our current workboard due to the team waiting from information from @Krinkle

Ok, the VE engineers have now examined many different possible VE-level fixes to this. Unfortunately, all seem to worsen the codebase.

Below are evaluations of different possible VE-level fixes we've examined:

[..]

Agreed.

Therefore we strongly believe the right answer is for OO.EventEmitter#emit to catch errors and rethrow them asynchronously (in a setTimeout, meaning the browser debugger can still pause on them and show the async stack trace). Moreover we continue to believe a listener error taking down the emitter is a fundamental violation of the EventEmitter pattern.

Agreed, and the kinds of events for which we don't want to observe errors, are also the kind that we don't want to depend on and modify our state or execution. If we depend on them for execution, @dchan mentions that it probably shouldn't be using event emission, but rather bind explicitly against another object, or to use a callback interface instead.

In other words, for clean events, both callback execution and any error would need to be asynchronous.

But, per discussion in-person with @dchan we have several events emitted in VisualEditor that do modify state synchronously, and yet we want them to silently fail. I may've misunderstood if that's common or not. If not, then a local catch over an explicit callback would likely be better long-term.

The other use case for synchronous-but-errorless events, as mentioned by @Esanders, is the forwarding of browser events into an EventEmitter object. These event handlers should have the same privileges and behaviours as native callbacks and thus must not run asynchronously. However, these are also cases where we don't care about catching errors presumably, as these are already asynchronous contexts with their own callstack (throwing can't disrupt any other code).

So what I propose is:

  • Introduce emitAsync, which we encourage adoption of wherever possible. This creates a strict contract of one-way data flow. Callers may not inspect modify or disrupt execution of the host object. This is what "proper" events that originate in our own code would use.
  • Introduce emitSync. This is what proxied events would use, for example where we forward a browser event (possibly via jQuery) into an OO.EventEmitter instance. Here we are not concerned about what an exception might do, as the context is already async from a browser event (It can't disrupt other code). We also want the exception to be observed by browser dev tools when looking for breakpoints on failed events and such, the same way as if each even handler were registered directly with the browser. @Esanders brought up that we'd want to go deeper than emulating jQuery and emulate the native browser behaviour. Specifically, that if one listener throws, other listeners still run after it. We'd catch and re-throw the first (or last?) error, and throw the others async (for debugging purposes). Hm.. bit weird but good compromise?

And lastly, to resolve this task, modify emit to always catch and re-throw async, and encourage OOUI, Echo, Flow and other non-VE adopters of OOjs to migrate their callers to one of these.

As a first pass after we upgrade the OOjs version for MediaWiki core, is that non-VE code bases can blindly migrate from emit to emitSync blindly, which they can use if they want to preserve todays behaviour for them (except that handlers after the first failure now run as well, which can be considered a bug fix).

I'll land David's commit as-is, but before we release and deploy, we'll need at least emitSync to exist for those interest in preserving current behaviour. The introduction of emitAsync can be discussed in a separate task instead, as we might not need it (yet).

Ok, I've added the extra emit method, but with the name emitThrow rather than emitSync, since that makes clearer how this method differs from plain emit: see https://gerrit.wikimedia.org/r/#/c/oojs/core/+/529726/8/src/EventEmitter.js .

Change 529726 merged by jenkins-bot:
[oojs/core@master] [BREAKING CHANGE] EventEmitter.emit: catch exceptions from listeners

https://gerrit.wikimedia.org/r/529726

This is officially live in our codebase, feel free to resolve.