Page MenuHomePhabricator

Enable client side error counting on Minerva production (wikipedia only)
Closed, ResolvedPublic1 Estimated Story Points

Description

In T205582 we made it possible to count client side errors.

Enable wgMinervaCountErrors on all wikipedia wikis

Enabling it on wikipedia should give us the coverage we need.

Event Timeline

Change 467760 had a related patch set uploaded (by Pmiazga; owner: Pmiazga):
[operations/mediawiki-config@master] Enable client side error counting on Minerva

https://gerrit.wikimedia.org/r/467760

Change 467760 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable client side error counting on Minerva

https://gerrit.wikimedia.org/r/467760

Mentioned in SAL (#wikimedia-operations) [2018-10-17T11:30:34Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: T206702 [[gerrit:467760|Enable client side error counting on Minerva]] (duration: 00m 57s)

Task was deployed to production. There are no stats tracked yet as MinervaSkin is not up to date. Current Minerva version is e679f5207aafa4815298ad923225a0ba8c543b9a (Oct 2nd).

I have a couple of concerns with this change that I spoke about with @pmiazga but wanted to raise here too:

  • There doesn't seem to be a note about how to monitor the additional load on Varnish or Graphite. For the latter, I would advise watching https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?refresh=1m&orgId=1 closely (though I can't find any clarification on-wiki about the difference between the local and frontend graphs)
  • Further to the above, we appear to be going from 0 to 100% without testing what the additional load on Varnish or Graphite is. Perhaps we could roll this out to one or two large wikis first?

Also, @pmiazga and I have both arrived at the following performance optimisation independently: since we're just counting the number of errors and discarding any detail, it should be possible to count the number of errors per pageview and increment our counter when the page unloads. This would minimise the number of HTTP requests we make to the statsv beacon endpoint.

I made a patch to optimize number of requests: https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/MinervaNeue/+/467984, but I'm afraid that this can be not a best way to do it. There are 2 problems:

  • not all browsers support onbeforeunload
  • not all browsers support AJAX requests during unload events.

Most probably we want to track all errors, especially for older browsers. But older browsers might not support sending stats onbeforeunload and because of that, we might lose some data.

Moving to Needs More Work to reflect the current state.

  • not all browsers support onbeforeunload

https://caniuse.com/#search=beforeunload begs to differ (acknowledging that we can't strictly rely on CIU) but do you have another source?

  • not all browsers support AJAX requests during unload events.

statsv requests are sent by the Beacon API when it's supported.


As we discussed, I'm grateful that you raised these points as it highlights a potential problem with the proposed performance improvement: it's likely that those browsers that we deliver JS assets to but don't support the Beacon API are those most likely to be affected by JS errors but we might stand to lose the most data for. I'm not sure if this problem becomes moot if we're only concerned with counting the global error rate.

  • not all browsers support onbeforeunload

https://caniuse.com/#search=beforeunload begs to differ (acknowledging that we can't strictly rely on CIU) but do you have another source?

Hmm, I didn't verify the CIU first, but I remember having some issues with that event, but maybe this was related to using alerts/confirms, not to the event itself. If CIU says it's supported -> let's use it

Jdlrobson added a subscriber: pmiazga.

Moving to Needs More Work to reflect the current state.

Please create a new task for the additional work proposed. This task was only about deploying the change, which looks to have been done.
Given T205582 is in 1.32.0-wmf.26 we won't see this change in production for a while.
I'm moving to backlog for the time being. I'll move back into board when it's possible to sign this off.

… and how's Graphite coping?

@phuedx the UDP packet rate increased but then ~7am UTC today it dropped. Graphs looks ok, All other graphs looks ok, nothing we should worry about.