
Editing busted in IE8 (at least) due to EventLogging
Closed, Resolved, Public

Description

EventLogging error msg

Seen in IE8 as anon and as logged-in user, likely to occur in other pre-IE9 versions:

Edit any page in IE8, for instance a user talk page.

  • No editing toolbar appears.
  • The "My Sandbox" link is gone.
  • User JavaScript is gone.
  • The editing toolbar appears only as plain text at the bottom of the edit screen and does not function in any way.

The complete error message is shown in the attachment; I'm running IE8 in a VM and can't copy/paste, but the error is "Expected identifier, string or number" from bits...load.php...EventLogging.


Version: 1.21.x
Severity: major

Attached:

Screen_shot_2012-10-26_at_8.27.34_AM.png (551×992 px, 66 KB)

Details

Reference
bz41428

Event Timeline

bzimport raised the priority of this task to Unbreak Now!. Nov 22 2014, 12:46 AM
bzimport set Reference to bz41428.

Created attachment 11248
broken edit toolbar

Attached:

Screen_shot_2012-10-26_at_8.32.37_AM.png (559×1 px, 82 KB)

This seems to relate to line 115, which by my count is "if ( desc.enum && desc.enum.indexOf( val ) === -1 ) {"

This seems to be a sanity check -- if so, we should probably just comment it out pro-tem as a matter of priority.

We've temporarily reverted EventLogging, and that seems to have fixed the problem on enwiki. Leaving this open since the underlying bug will rear its head if we try to reenable.

I validated that IE8 performs as expected now on enwiki with regard to editing, Sandbox link, etc.

"enum" is a reserved word, which makes IE to bork even when used as an object
property.

You can test this by eval'ing "window.enum = '1'", which produces the same error. Using another random property name doesn't throw the error.

If there's no way to rename that property, use desc['enum'] instead of desc.enum
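
For illustration, a minimal sketch of the pattern (this is not the actual extension code, just the shape of the change). In ES3, which is what IE8's JScript implements, 'enum' is a future reserved word, so it can't follow the dot operator or appear as an unquoted object-literal key; bracket notation and quoted keys parse fine in every browser:

    // Sketch only; names mirror the quoted line, not the real module code.
    // In ES3 a reserved word may not be an unquoted object-literal key either,
    // so the key is quoted here:
    var desc = { type: 'string', 'enum': [ 'click', 'hover' ] };
    var val = 'drag';

    // This form makes IE8 fail at parse time
    // ("Expected identifier, string or number"):
    //   if ( desc.enum && desc.enum.indexOf( val ) === -1 ) { ... }

    // Bracket notation parses in IE8 and newer browsers alike (note that IE8
    // also has no native Array.prototype.indexOf, so this assumes a shim, or
    // jQuery's $.inArray could be used instead):
    if ( desc['enum'] && desc['enum'].indexOf( val ) === -1 ) {
        // val is not one of the allowed values
    }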

Jesús is correct. Fixed in Gerrit change I810bddad.

Jesús, Chris, Jarry, Rob, Chad: thanks very much for your help with this.

Thanks for the fix, and the quick bug->fix time. It's super-duper appreciated. I see Chad turned off EventLogging to control the damage.

That said, is there a lesson here? Do we have automated JavaScript testing up yet (I remember Testswarm working before the move, but haven't heard anything since)? How many manual tests are performed pre-deployment? Is IE8 on that list (market share 10-20%)? Should it be? Should we have reverted even more quickly rather than trying to fix this and/or waiting for a bit? Who can we ask/go to if we want to propose a straight revert?

Basically I'm grumpy I got stood up during a Wikipedia demo to some librarians and now I'm taking it out on Bugzilla :P My questions are serious ones, though; I think we dropped the ball here, if only temporarily.

(In reply to comment #7)

That said, is there a lesson here? Do we have automated JavaScript testing up yet (I remember Testswarm working before the move, but haven't heard anything since)?

Yes. EventLogging is currently one of only five extensions (if http://www.mediawiki.org/wiki/Manual:Hooks/ResourceLoaderTestModules is to be believed) that declare a JavaScript test suite to ResourceLoader for automated testing. I've written quite a few tests for it, but unfortunately, the comprehensiveness of the test suite gave me a false sense of confidence and I was lazy about testing the most recent change on older browsers. TestSwarm is not yet ready AFAIK, but we have a subscription for BrowserStack (http://www.browserstack.com/) that I could have and should have used.
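
To give a sense of what cross-browser runs buy us: a hypothetical QUnit test along these lines (the module name and validateProperty function are invented for illustration, not the real EventLogging suite) passes happily in Chrome or Firefox, but in IE8 it never even runs, because the module containing desc.enum fails to parse there; only executing the suite in IE8 itself, via TestSwarm or BrowserStack, surfaces that.

    // Hypothetical test, for illustration only: the module and function names
    // are invented and do not match the real EventLogging test suite.
    QUnit.module( 'ext.eventLogging.validation' );

    QUnit.test( 'rejects values outside the enum', 2, function ( assert ) {
        var desc = { type: 'string', 'enum': [ 'click', 'hover' ] };
        assert.strictEqual( validateProperty( 'drag', desc ), false,
            'a value not listed in the enum is rejected' );
        assert.strictEqual( validateProperty( 'click', desc ), true,
            'a listed value is accepted' );
    } );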

How many manual tests are performed pre-deployment?

Usually, quite a few. For this particular change set, not enough.

Is IE8 on that list (market share 10-20%)?

Yes.

Should it be?

Yes.

Should we have reverted even more quickly rather than trying to fix this and/or waiting for a bit?

I'm not sure that this was the precise sequence of events. AFAIK as soon as the culprit was identified, ^demon went ahead and disabled it.

Who can we ask/go to if we want to propose a straight revert?

Ideally the developer / deployer responsible for rolling out the bug; failing that, anyone on #wikimedia-tech available to help.

I think we dropped the ball here, if only temporarily.

Agreed. I'm responsible, and I'm sorry for being careless. In the future, I'll be more disciplined about running tests on a complete browser matrix as a precondition for merging / deploying code.

If I can have the luxury of cutting out the bits I agree with...

(In reply to comment #8)

I'm not sure that this was the precise sequence of events. AFAIK as soon as the culprit was identified, ^demon went ahead and disabled it.

[13:29:33] <Jarry1250> Hey all. At an editathon, so can't track down very well (may have already been reported)
[13:29:39] <Jarry1250> but in IE I'm getting an exception in http://bits.wikimedia.org/static-1.21wmf2/extensions/EventLogging/modules/ext.EventLogging.js
[13:32:33] <Jarry1250> Line 115 [...] by my reckoning

I ping you Ori, I ping S Page (you're both still asleep I'm guessing). Then Chris comes along, we have a chat, he files this bug, then thedj pings Reedy, but he's AFK. Then there's a gap in the logs; perhaps discussion continued in MediaWiki-General or whatever. ServerAdminLog thinks ^demon rolled back at 16:49. It would have been nice to have a rollback in under three hours, I think, and that's assuming I was the first one to report this.

I think we dropped the ball here, if only temporarily.

Agreed. I'm responsible, and I'm sorry for being careless. In the future, I'll be more disciplined about running tests on a complete browser matrix as a precondition for merging / deploying code.

Sure, mistakes happen, not a problem. But I'm inclined to think the process could have worked slightly better in some way, though I'm not sure what. Do we have enough staff for 24-hour rollback coverage? Or do we enforce (and make easier to trigger) more pre-deployment tests? Either? Both?

(In reply to comment #9)

Sure, mistakes happen, not a problem. But I'm inclined to think the process could have worked slightly better in some way, though I'm not sure what. Do we have enough staff for 24-hour rollback coverage? Or do we enforce (and make easier to trigger) more pre-deployment tests? Either? Both?

A revert of a particular commit does not roll back the entire MediaWiki platform to an earlier point in time, and is therefore not trivially safe to deploy.

Given the severity of the bug (minor to moderate, in my estimation: the bug affected 10-20% of users, and even for them the site was still basically usable), and the fact that the start of the working day in San Francisco was just coming up, it was probably prudent to wait.

The testing story does get better, though -- the TestSwarm setup integrates with both QUnit and Jenkins to automate the execution of test suites across a spectrum of browsers. It would have flagged this particular problem.

(In reply to comment #10)

A revert of a particular commit does not roll back the entire MediaWiki platform to an earlier point in time, and is therefore not trivially safe to deploy.

I don't follow. MediaWiki was a constant here. Let's assume the previous version of EventLogging was known to be good (or rather known to not be as bad). Then I would infer that rolling back was both trivial and relatively safe.* For MediaWiki as a whole, maybe not, but as we know we *want* to be at a point where rolling back is not a big deal, even if we're not there yet.

The testing story does get better, though -- the TestSwarm setup integrates with both QUnit and Jenkins to automate the execution of test suites across a spectrum of browsers. It would have flagged this particular problem.

Yay :)

  * It occurs to me that I have yet to see an actual revert done on the production cluster directly, rather than reverting in the repo and then deploying a new master. So I could be wrong here.

(In reply to comment #11)

I don't follow. MediaWiki was a constant here.

Looking at the server admin log (http://wikitech.wikimedia.org/view/Server_admin_log), I count 30 deployments / configuration changes that were rolled out between the bug getting deployed and Chris's initial report, including several changes to core.

(In reply to comment #12)

(In reply to comment #11)

I don't follow. MediaWiki was a constant here.

Looking at the server admin log (http://wikitech.wikimedia.org/view/Server_admin_log), I count 30 deployments / configuration changes that were rolled out between the bug getting deployed and Chris's initial report, including several changes to core.

There were some minor tweaks, yes. I'd be interested to know if that was part of the reason that Chad chose to disable the whole extension rather than revert the specific change. In any case, I don't think that any of those 30 would really have precluded an attempt to roll back an incremental update to EventLogging.

This issue prompted a lot of discussion within WMF and we are changing some priorities and procedures as a result.

We have a browser-testing initiative in place, but until now we have been highly focused on making the beta labs environment suitable for fully automated browser testing.

But some recent developments have made it possible to do more pre-deployment browser testing, both automated and manual. We still want to move testing "up the stack" by using beta labs, but we'll be putting in place some more stringent pre-deployment browser testing while we continue to work on beta labs automation.

I'll be posting details to wikitech-l as they become available.

(In reply to comment #13)

There were some minor tweaks, yes. I'd be interested to know if that was part of the reason that Chad chose to disable the whole extension rather than revert the specific change. In any case, I don't think that any of those 30 would really have precluded an attempt to roll back an incremental update to EventLogging.

The reason I disabled it was that I had no clue what was broken with it, just that EventLogging was breaking other things. I saw that people had been trying to ping the responsible team to no avail, so I didn't see much use in me trying to ping them further for an actual fix. I'm not a JavaScript guru, so rather than trying to dig through code to find what was broken, I went for the easiest course of action to restore functionality for our users.