OperationError: The operation failed for an operation-specific reason in generateRandomSessionId
Closed, Resolved (Public)

Description

There were about 29 instances of this error in the production logs in the last 12 hours, from different IPs. I'm not sure what impact this has on code that uses generateRandomSessionId.

error
DOMException: OperationError: The operation failed for an operation-specific reason
trace
at .generateRandomSessionId URL1:730:1235
at .getPageviewToken URL1:730:1624
at .files["core.js"]/</core.eventInSample URL1:480:200
at URL1:475:837
at resolve/</mightThrow URL1:53:141
at resolve/</process< URL1:53:808

URL1: https://fr.wikipedia.org/w/load.php?lang=fr&modules=Spinner%2Cjquery%2Coojs%2Coojs-router%2Coojs-ui-core%2Coojs-ui-widgets%2Csite%7Cext.centralNotice.choiceData%2Cdisplay%2CgeoIP%2CkvStore%2CstartUp%7Cext.centralauth.centralautologin%7Cext.cite.ux-enhancements%7Cext.cx.eventlogging.campaigns%7Cext.eventLogging%2CnavigationTiming%2Cpopups%2CwikimediaEvents%7Cext.growthExperiments.SuggestedEditSession%7Cext.quicksurveys.init%2Clib%7Cext.tmh.OgvJsSupport%7Cext.uls.common%2Ccompactlinks%2Cinit%2Cinterface%2Cpreferences%2Cwebfonts%7Cjquery.client%2Ccookie%2CembedPlayer%2CloadingSpinner%2CmwEmbedUtil%2CtextSelection%2CtriggerQueueCallback%7Cjquery.uls.data%7Cmediawiki.String%2CTitle%2CUri%2Capi%2Cbase%2Ccldr%2Ccookie%2Cexperiments%2CjqueryMsg%2Clanguage%2Cstorage%2Ctoc%2Cuser%2Cutil%2Cviewport%7Cmediawiki.editfont.styles%7Cmediawiki.libs.pluralruleparser%7Cmediawiki.page.ready%7Cmediawiki.ui.button%7Cmmv.bootstrap%2Chead%7Cmmv.bootstrap.autostart%7Cmw.EmbedPlayer.loader%7Cmw.MediaWikiPlayer.loader%7Cmw.MwEmbedSupport%2CPopUpMediaTransform%7Cmw.MwEmbedSupport.style%7Cmw.PopUpMediaTransform.styles%7Cmw.TMHGalleryHook.js%7Cmw.TimedText.loader%7Coojs-ui-core.icons%2Cstyles%7Coojs-ui-widgets.icons%7Coojs-ui.styles.indicators%7Cskins.vector.js%7Cuser.defaults&skin=vector&version=1jh2e

Event Timeline

  • All errors come from the same FF 52 session (a three-year-old browser), so this does not seem relevant. Closing.

There were 37 errors in the last 12 hours from 4 IPs. While 4 of those came from Firefox 67 and 52, 33 came from 2 IPs using Firefox 79, a much more modern browser [1]. That said, the error count is still low (for now), but I think it's wrong to dismiss this just on browser age.

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2020.09.19/clienterror/?id=AXSmrR6XO4h4ahU1Qr9A

https://bugzilla.mozilla.org/show_bug.cgi?id=1406348 might be related, although it's an old, closed bug. It would be nice to say with certainty that this is not an error to be concerned about.

[1] Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0

I've seen this error on Firefox 81 (https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2020.10.19/clienterror/?id=AXVBWcP6LNRtRo5XDWsw), so I think this bug should be reconsidered.

We're not live on English Wikipedia yet, and we're seeing 612 errors every 24 hours on other Wikipedia language projects.

Ottomata added subscribers: jlinehan, mpopov, Milimetric.

I'm trying to read the stack trace a bit. Am I wrong in reading that the exception is thrown in mw.user.generateRandomSessionId? I'm just trying to determine whether this is a problem in EventLogging-specific JS or something in MW core JS.

For reference:
https://github.com/wikimedia/mediawiki/blob/master/resources/src/mediawiki.user.js#L25-L75

My understanding is that this is an internal error occurring inside that function when we call crypto.getRandomValues, so the best we could do is catch it.

Feels radar-ey for us. If folks decide to add a catch there, I'm happy to review the patch. But I'd prefer a fix in mw.user.

Krinkle subscribed.

What does the Web Crypto API spec say? Why does it happen? Is it avoidable? If not, how would we fallback?

@Milimetric one of you two should start this investigation and/or tag a team that will.

If this is not super urgent, I can work on it in my volunteer capacity.

Ottomata subscribed.

The behavior of getRandomValues is specified here: https://www.w3.org/TR/WebCryptoAPI/#Crypto-method-getRandomValues
and the Gecko implementation is here: https://github.com/mozilla/gecko-dev/blob/0db73daa4b03ce7513a7dd5f31109143dc3b149e/dom/base/Crypto.cpp#L38

The error we're seeing here is thrown when the operation to overwrite array elements with random values fails, either because of a failure to get a reference to the random generator service or an error when generating random bytes. The spec does not prescribe an error to be thrown during this step, and the implementers chose to throw the generic OperationError we're seeing here.

I propose simply replacing the if-else block with a try-catch block. This would allow us to fall back to the Math.random approach and return an ID if either Uint16Array is not defined or getRandomValues fails for whatever reason.
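
A rough sketch of what that try-catch fallback could look like, written as a standalone function so it can be run outside MediaWiki. This is not the merged patch: the `cryptoObj` parameter is a hypothetical addition that exists only to make the failure path easy to exercise, and the 5 x 16-bit hex layout is an assumption based on the linked mediawiki.user.js source.

```javascript
// Sketch only: a hypothetical standalone version, not the actual MediaWiki code.
// `cryptoObj` is an illustrative parameter; when omitted, the global crypto object is used.
function generateRandomSessionId( cryptoObj ) {
	var rnds, i,
		hexRnds = new Array( 5 );

	if ( cryptoObj === undefined ) {
		cryptoObj = globalThis.crypto;
	}

	try {
		// Any failure lands in the catch block: Uint16Array undefined,
		// crypto or getRandomValues missing, or getRandomValues itself
		// throwing (for example the OperationError seen in the logs).
		rnds = new Uint16Array( 5 );
		cryptoObj.getRandomValues( rnds );
	} catch ( e ) {
		// Fall back to Math.random: five integers in the range 0..0xFFFF.
		rnds = new Array( 5 );
		for ( i = 0; i < 5; i++ ) {
			rnds[ i ] = Math.floor( Math.random() * 0x10000 );
		}
	}

	for ( i = 0; i < 5; i++ ) {
		// Adding 0x10000 and dropping the leading "1" zero-pads to 4 hex digits.
		hexRnds[ i ] = ( rnds[ i ] + 0x10000 ).toString( 16 ).slice( 1 );
	}
	return hexRnds.join( '' );
}
```

With this shape, a caller such as getPageviewToken would always receive a 20-character hex string, whether or not getRandomValues succeeds. The trade-off is that the fallback IDs come from Math.random, which is not cryptographically strong.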

Change 666387 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/core@master] generateRandomSessionId: Catch and fall back if getRandomValues fails

https://gerrit.wikimedia.org/r/666387

Change 666387 merged by jenkins-bot:
[mediawiki/core@master] mediawiki.user: Catch and fall back if getRandomValues fails

https://gerrit.wikimedia.org/r/666387

Thanks @jlinehan and @Krinkle for reviewing. I'll claim this and follow up in Logstash after the change rolls out.