0.5 million errors relating to error caused by rollback "TypeError: null is not an object (evaluating 't[e.title]') on mobile domain"
Closed, Resolved (Public)

Description

A new production error began around 8pm UTC. It is the top error across all wikis. I am assuming it's a fundraising banner campaign, given the timing and the number of wikis involved, but it could also be a problem in MobileFrontend (please untag accordingly). If it is MobileFrontend, it should be a deployment blocker.

https://logstash.wikimedia.org/app/dashboards#/doc/logstash-*/logstash-2021.01.21?id=CmfWJncBjr5R1RLC409N

at c URL1:443:699
at URL1:444:720
at URL1:444:804
at URL1:445:830
at dispatch URL1:155:747

URL1: https://de.m.wikipedia.org/w/load.php?lang=de&modules=ext.centralNotice.choiceData%2Cdisplay%2CgeoIP%2CimpressionDiet%2CkvStore%2CstartUp%7Cext.centralauth.centralautologin%7Cext.eventLogging%2CnavigationTiming%2Cpopups%2CwikimediaEvents%7Cext.quicksurveys.init%7Cext.relatedArticles.readMore.bootstrap%2Cgateway%7Cjquery%2Coojs%2Coojs-router%2Csite%7Cjquery.client%2Ccookie%2Cthrottle-debounce%7Cmediawiki.String%2CTitle%2CUri%2Capi%2Cbase%2Ccldr%2Ccookie%2Cexperiments%2CjqueryMsg%2Clanguage%2Crouter%2Cstorage%2Ctemplate%2Cuser%2Cutil%2Cviewport%7Cmediawiki.libs.pluralruleparser%7Cmediawiki.page.ready%7Cmediawiki.page.watch.ajax%7Cmediawiki.template.mustache%7Cmediawiki.ui.anchor%7Cmobile.init%2Cstartup%7Cmobile.messageBox.styles%7Cmobile.ooui.icons%7Cmobile.pagelist.styles%7Cmobile.pagesummary.styles%7Cmobile.placeholder.images%7Cmobile.startup.images%7Cmw.externalguidance.init%7Cskins.minerva.icons.images.scripts%7Cskins.minerva.icons.images.scripts.misc%7Cskins.minerva.icons.page.issues.default.color%7Cskins.minerva.icons.page.issues.medium.color%7Cskins.minerva.icons.page.issues.uncolored%7Cskins.minerva.options%2Cscripts%7Cuser.defaults&skin=minerva&version=osm4i

Event Timeline

Jdlrobson triaged this task as Unbreak Now! priority. Jan 21 2021, 9:28 PM
Jdlrobson renamed this task from TypeError: null is not an object (evaluating 't[e.title]') to TypeError: null is not an object (evaluating 't[e.title]') on mobile domain. Jan 21 2021, 9:29 PM
Jdlrobson lowered the priority of this task from Unbreak Now! to High.
Jdlrobson added a project: Mobile.

Trending downwards so bumping down to high. Hopefully somebody already noticed it and patched it.

Also tagging MobileFrontend in case the train rolled forward and then back and that's what caused the spike and decrease (in which case this should be considered a deployment blocker).

If not a banner, possibly caused by T253137, given that it is the only recent change.

Also possible this is an issue with cached JS.

Pretty sure this is a MobileFrontend bug from where we switched from localStorage to sessionStorage. The error is possible if new JS runs and then old JS is run. Did we roll back? It's possible that would have caused the burst of errors and left those users in a broken state.

Once I understand what happened with the train today I can suggest a fix. It's too late to revert.

I rolled the train forward to group2 around 20:05 UTC, then back to group1 at 21:27 UTC after a spike of errors that I think turned out to be T270334.

If I'm reading the client errors dashboard correctly, these did start somewhere around the time of the rollback to group1, but are ongoing on a downward trend?

brennen raised the priority of this task from High to Unbreak Now!. Jan 21 2021, 10:07 PM

Marking as UBN! since resolution probably determines what happens with the train.

Mentioned in SAL (#wikimedia-operations) [2021-01-21T22:10:08Z] <brennen> 1.36.0-wmf.27 train status: for avoidance of doubt, no deploys until further notice - sorting out T272638

Okay, I can provide a patch, but I'm at a dentist appointment so it might be an hour or so.

I've commented on what I suspect to be the offending line in the patch. That needs to be changed to a remove to make it possible to roll back safely in future.

For clarity here:

a) What's user impact on this one?

b) We shouldn't roll forward if we can't roll back safely, but at this point that's likely to mean the train sits over the weekend. Would be good to be able to judge whether this is just logspam or causes user-facing breakage.

The user impact is only on users who viewed the site between the deploy and the rollback. Those users will likely see problems with section collapsing on mobile, as the new code left them in a state that the old code treats as an error. This issue will be user-facing (they may not be able to expand sections).

There is no impact on users running the current code in an environment where no rollback occurred.

I'll be back in irc in about 20mins if you want to chat through this. Sorry for the bad timing!

Thanks - I'll be in #wikimedia-operations. @thcipriani and I have been attempting to reproduce the error without much luck.

To reproduce it you need to get into an error state by running the following in the console of the mobile site:

		mw.storage.set( 'expandedSections', null );

It's only possible to get in the error state if the code in master is deployed, executed by the user and then the deploy is rolled back.
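A minimal sketch of how that error state can arise, and why removing the key instead would avoid it. Everything here is an assumption for illustration: `fakeStorage` is a hypothetical in-memory stand-in for `mw.storage` (assumed to keep strings only, as localStorage does), and `oldIsSectionExpanded` is a guess at the shape of the rolled-back reader, not the actual MobileFrontend code.

```javascript
// Hypothetical in-memory stand-in for mw.storage; like localStorage, it keeps strings only.
const fakeStorage = {
	data: new Map(),
	get( key ) {
		return this.data.has( key ) ? this.data.get( key ) : null;
	},
	set( key, value ) {
		this.data.set( key, String( value ) );
	},
	remove( key ) {
		this.data.delete( key );
	}
};

// Assumed shape of the rolled-back reader: expects a JSON object keyed by page title.
function oldIsSectionExpanded( storage, page, sectionId ) {
	const expanded = JSON.parse( storage.get( 'expandedSections' ) || '{}' );
	return Boolean( expanded[ page.title ] && expanded[ page.title ][ sectionId ] );
}

const page = { title: 'Berlin' };

// Cleanup by writing null leaves the string "null" behind for the old code to parse.
fakeStorage.set( 'expandedSections', null );
let caught = null;
try {
	oldIsSectionExpanded( fakeStorage, page, 'History' );
} catch ( e ) {
	caught = e; // TypeError: the parsed value is null, so indexing by title throws
}

// Cleanup by removing the key leaves nothing to misparse.
fakeStorage.remove( 'expandedSections' );
const afterRemove = oldIsSectionExpanded( fakeStorage, page, 'History' );

console.log( caught instanceof TypeError, afterRemove );
```

In this sketch the stored string `"null"` is truthy, so the `|| '{}'` fallback never applies and `JSON.parse` hands back `null`; once the key is removed, the fallback applies and the old reader behaves normally.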

I can now confirm the error doesn't seem to break the UI, but in the event of a rollback it could bring down error logging, given that it throws an error on every page view for impacted users.

Change 657702 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/MobileFrontend@master] Fix toggling storage cleanup

https://gerrit.wikimedia.org/r/657702

Change 657652 had a related patch set uploaded (by Brennen Bearnes; owner: Jdlrobson):
[mediawiki/extensions/MobileFrontend@wmf/1.36.0-wmf.27] Fix toggling storage cleanup

https://gerrit.wikimedia.org/r/657652

Change 657652 merged by jenkins-bot:
[mediawiki/extensions/MobileFrontend@wmf/1.36.0-wmf.27] Fix toggling storage cleanup

https://gerrit.wikimedia.org/r/657652

Mentioned in SAL (#wikimedia-operations) [2021-01-22T00:20:42Z] <brennen@deploy1001> Synchronized php-1.36.0-wmf.27/extensions/MobileFrontend: Backport: [[gerrit:657702|Fix toggling storage cleanup (T272638)]] (duration: 01m 07s)

Change 657702 merged by jenkins-bot:
[mediawiki/extensions/MobileFrontend@master] Fix toggling storage cleanup

https://gerrit.wikimedia.org/r/657702

Jdlrobson lowered the priority of this task from Unbreak Now! to Medium. Jan 22 2021, 12:50 AM

I've summarized what happened in https://www.mediawiki.org/wiki/Reading/Web/Notable_incidents#January :

We switched from localStorage to session storage for tracking open sections in the mobile site. Code for cleaning up localStorage entries had a bug, so when the deploy was rolled back it left the mobile site in an error state and an error was thrown for every page view where the new code had been executed. This is recorded in phab:T272638. We backported a fix in the event we might need to roll back again and resumed the deployment. The errors disappeared after the deploy.

We saw over half a million errors in our error logging pipeline during this time (usually we see under 10,000 in a given day). Amazingly nothing collapsed.

In future, just as we worry about cached HTML, we should make sure that any change to data stored in localStorage is backwards compatible, so that in the event of a rollback we don't hit this problem again.
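One way to make readers of stored data tolerant of values written by a different code version is to validate what comes back before using it. The following is only a sketch of that idea; `readObjectFromStorage` is a hypothetical helper, not an existing MediaWiki or MobileFrontend function.

```javascript
// Hypothetical helper: parse a raw storage value and fall back to an empty
// object whenever the value is missing, malformed, or not a plain object.
function readObjectFromStorage( rawValue ) {
	let parsed;
	try {
		parsed = JSON.parse( rawValue );
	} catch ( e ) {
		// Malformed JSON (or undefined) written by some other code version.
		return {};
	}
	const isPlainObject = parsed !== null &&
		typeof parsed === 'object' &&
		!Array.isArray( parsed );
	return isPlainObject ? parsed : {};
}

// Values a different code version might have left behind.
const fromNull = readObjectFromStorage( 'null' );
const fromGarbage = readObjectFromStorage( 'not json' );
const fromArray = readObjectFromStorage( '[1,2]' );
const fromValid = readObjectFromStorage( '{"Berlin":{"History":true}}' );

console.log( fromNull, fromGarbage, fromArray, fromValid );
```

A reader written this way treats any unexpected value as "nothing stored", so neither a roll forward nor a rollback can turn stale data into an exception on every page view.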

The errors are trailing off now, but someone should check this tomorrow or Monday to make sure they are non-existent.

https://logstash.wikimedia.org/goto/20c804e86faec4d85900dda8c7fc9947

Jdlrobson renamed this task from TypeError: null is not an object (evaluating 't[e.title]') on mobile domain to 0.5 million errors relating to error caused by rollback "TypeError: null is not an object (evaluating 't[e.title]') on mobile domain". Jan 22 2021, 12:50 AM

This is looking good now.