
Client side error logging production launch
Open, Needs Triage · Public

Description

This umbrella task will be used to track the production launch of the "client side error logging" project.

Paraphrasing from https://etherpad.wikimedia.org/p/clients-error-logging, we'll be lining up as many ducks as we can in Q1, with launch in Q2 to many/most wikis. The end of Q2 is naturally risky for high traffic wikis because of donation campaigns, so we'll have to weigh the benefits and risks of launching to big wikis too.

Q1 FY2019/2020

  • Schema validation for eventgate events - @Tgr
  • kubernetes setup for eventgate deployment (i.e. backend component) initially set to receive errors from low traffic wikis - @Ottomata
  • Choose a js client to send errors (raven.js ? our own?)
  • Security review of the js client we'll be using
  • Performance review of the js client we'll be using
  • Verify we have enough ingestion capacity on the Logstash side @fgiunchedi
  • Related to the above, make sure deduplication/rate limiting in depth (i.e. both on the client side, and on the backend side) is in place before high traffic wikis launch.
  • Verify events show up in logstash and we have a Kibana dashboard available @fgiunchedi
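
To make the client-side half of the deduplication/rate limiting item concrete, here is a minimal sketch of what it could look like. Everything in it is an assumption for illustration: `ErrorThrottle`, `ThrottleOptions` and the option names are hypothetical and do not come from MediaWiki, raven.js or EventGate. The backend side would still need its own limits.

```typescript
// Hypothetical sketch only: these names are not part of any existing library.
interface ThrottleOptions {
  maxPerWindow: number; // maximum number of reports allowed per window
  windowMs: number;     // window length in milliseconds
}

class ErrorThrottle {
  private options: ThrottleOptions;
  private seen = new Set<string>(); // keys of errors already reported
  private windowStart = 0;
  private sentInWindow = 0;

  constructor(options: ThrottleOptions) {
    this.options = options;
  }

  // Returns true if the error should be sent, false if it should be dropped.
  shouldSend(message: string, stack: string, nowMs: number): boolean {
    const key = message + "\n" + stack;
    if (this.seen.has(key)) {
      return false; // deduplication: same message and trace already reported
    }
    if (nowMs - this.windowStart >= this.options.windowMs) {
      // A new window begins: reset the counter.
      this.windowStart = nowMs;
      this.sentInWindow = 0;
    }
    if (this.sentInWindow >= this.options.maxPerWindow) {
      return false; // rate limit reached for the current window
    }
    this.seen.add(key);
    this.sentInWindow += 1;
    return true;
  }
}
```

Errors dropped by the rate limit are deliberately not recorded as seen, so the same error can still be reported once a new window opens.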

Q2 FY2019/2020

  • Enable client side error logging to increasingly high traffic wikis, work out any kinks discovered

@dr0ptp4kt @phuedx @Ottomata @Milimetric @CDanis @colewhite @Tgr @Krinkle please check and adjust the above plan as needed! What do you think?

Event Timeline

Restricted Application added a subscriber: Aklapper. Jul 1 2019, 12:26 PM

Looks good, I can take the k8s task. I can start the schema but we'll need to bikeshed that one together.

I can continue to help with the JS client. Before we decided to put something up quickly, I had started thinking about what Raven.js gives us and what else we could do.

Milimetric moved this task from Incoming to Radar on the Analytics board. Jul 8 2019, 3:54 PM
fgiunchedi updated the task description. Jul 18 2019, 9:45 AM
Tgr added a comment. Jul 18 2019, 10:02 AM

"Schema validation for eventgate events" - does that only mean adding to mediawiki/event-schemas? If so, I can do that.

"Choose a js client to send errors (raven.js ? our own?)" Not sure who and how should make that choice. I imagine it will come down to performance, as using the official client is preferable in all other ways. @Krinkle do you have any thoughts on what information or experiments would help with making that decision?

"Schema validation for eventgate events" - does that only mean adding to mediawiki/event-schemas? If so, I can do that.

That's the first step yes! I actually need this in order to build the deployment image to set up the service in k8s.

Ottomata updated the task description. Jul 18 2019, 1:17 PM

Actually, @Tgr... we are beginning to use some new tooling to aid in managing mediawiki/event-schemas: https://gerrit.wikimedia.org/r/c/mediawiki/event-schemas/+/523745

You can be our guinea pig! I'll ask Petr to review that patch today so we can merge it.

I am starting to worry that I won't have enough time before my paternity leave to work on the client. Someone else should go for it and I can assist while I'm around.

Wanted to mention this in today's meeting but couldn't find it in time: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/NotErrorLogging. The main reason to not use EL for error logging, that I agree with, is that if EL goes down it's not a big deal, but if we are blind to client-side errors it affects our users directly.

Also, it seems there's a way to build a minimal @sentry/browser client via https://github.com/getsentry/sentry-javascript/tree/master/packages/minimal. The main build has all this crazy stuff like React and Angular support. I liked Timo's idea from today's meeting to set up something super minimal until there's an error. We could then pull in this minimal build to log the error.

Nuria added a subscriber: Nuria. Jul 24 2019, 4:34 PM

Also, per https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/NotErrorLogging let's please keep in mind that analytics storage is 1) not optimal for grouping stack traces/errors (Sentry is made for that) and 2) not tier-1. Error logging should be considered tier-1 infrastructure, and it makes total sense for errors to go to a backend like Logstash.

Tgr added a comment. Jul 24 2019, 4:39 PM

Per today's meeting, the next step on the client side (alongside defining the EventGate schema) is to write a minimal client that does not try to normalize errors and just ships the error message and trace (and basic data like URL, wiki ID and code version). That way we don't have to worry about client size and can revisit the issue once we have more data and are better aware of the shortcomings of the minimalistic approach.

I'm inclined to just put it in mediawiki.errorLogger.js and not bother with lazy-loading since it will probably be just a few lines. That way we don't have to deal with deploying the Sentry extension to production, or worry about the client download triggering on errors and e.g. biasing analytics. And having functionality for sending JS errors to an operator-defined URL seems like a reasonable addition to MediaWiki core.
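
As a rough illustration of how small such a client could be, the sketch below builds a report with the fields mentioned above (message, trace, URL, wiki ID, code version) and hands it to a transport. All names here (`ErrorContext`, `ClientErrorReport`, `buildErrorReport`, `reportError`, `Transport`) and the field names are hypothetical; they are not the EventGate schema and not the contents of mediawiki.errorLogger.js. The transport is passed in as a function so the sketch stays independent of how the request is actually made; in a browser it might wrap something like `navigator.sendBeacon` where that is available.

```typescript
// Hypothetical sketch: field and function names are illustrative only.
interface ErrorContext {
  url: string;         // page URL where the error occurred
  wikiId: string;      // wiki identifier
  codeVersion: string; // deployed code version
}

interface ClientErrorReport {
  message: string;
  stack: string;
  url: string;
  wikiId: string;
  codeVersion: string;
}

// No normalization: just copy the message and stack trace as they are.
function buildErrorReport(error: unknown, context: ErrorContext): ClientErrorReport {
  let message = String(error);
  let stack = "";
  if (error instanceof Error) {
    message = error.message;
    stack = error.stack || "";
  }
  return {
    message: message,
    stack: stack,
    url: context.url,
    wikiId: context.wikiId,
    codeVersion: context.codeVersion,
  };
}

// The transport is supplied by the caller (assumption: it posts the body
// to the operator-defined URL).
type Transport = (endpoint: string, body: string) => void;

function reportError(
  error: unknown,
  context: ErrorContext,
  endpoint: string,
  send: Transport
): void {
  send(endpoint, JSON.stringify(buildErrorReport(error, context)));
}
```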

From what I can tell, most of the objections on EventLogging/NotErrorLogging are about EventLogging-specific stuff. Event Platform is more modular and configurable; the error events will be consumed by non-analytics systems.

And having functionality for sending JS errors to an operator-defined URL seems like a reasonable addition to MediaWiki core.

See wgEventServiceStreamConfig in InitialiseSettings.php, and the services it refers to in ProductionServices.php. We just need the client to get this config. Or, I guess for an MVP, the operator-defined URL could just be configured separately for now.

alongside defining the EventGate schema

@Tgr, lemme know when you start working on this. mediawiki/event-schemas now uses jsonschema-tools to help with DRYing up common bits of schemas and schema versioning. See how test/event/current.yaml uses a $ref to /common/1.0.0.
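
For anyone unfamiliar with what dereferencing a $ref means in practice, here is a small sketch of the idea. It is not how jsonschema-tools is implemented and does not reproduce the actual contents of test/event/current.yaml; the `dereference` function, the `Schema` type and the registry shape are assumptions made up for illustration.

```typescript
// Hypothetical sketch of $ref dereferencing; not the jsonschema-tools code.
type Schema = { [key: string]: unknown };

// Replaces every { $ref: "<uri>" } object with the schema registered under
// that URI. Assumes there are no cyclic references (a cycle would recurse
// forever).
function dereference(node: unknown, registry: { [uri: string]: Schema }): unknown {
  if (Array.isArray(node)) {
    return node.map((item) => dereference(item, registry));
  }
  if (typeof node !== "object" || node === null) {
    return node; // primitives are returned unchanged
  }
  const obj = node as Schema;
  const ref = obj["$ref"];
  if (typeof ref === "string") {
    const target = registry[ref];
    if (target === undefined) {
      throw new Error("Unresolved $ref: " + ref);
    }
    return dereference(target, registry);
  }
  const out: Schema = {};
  for (const key of Object.keys(obj)) {
    out[key] = dereference(obj[key], registry);
  }
  return out;
}
```

The point of generating dereferenced, versioned files is that consumers can read one self-contained schema without resolving references themselves.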

my fault - I confused this with our mediawiki-storage repo, I should've read the title more carefully. Will work on fixing.

Nuria added a comment. Jul 24 2019, 7:31 PM

I would like to suggest a deployment strategy for this code that I think would make things simple for an MVP (feel free to disregard if it does not seem useful). Rather than sampling on the client, which might be error prone and introduces state, let's roll out the error logging code on just 1 wiki for 1 browser. It could be the most used browser or, rather, the one at the bottom of the support range, like IE11. That way we have a "small" enough stream of errors, we can look at the "whole" picture and decide what makes (or doesn't make) sense to send. I will run some numbers to find a wiki with a suitable browser profile.

Nuria added a comment. Jul 24 2019, 9:03 PM

Hawaiian Wikipedia has about 5000K pageviews daily, about 3000 from users, with a good representation of IE11 (20%).
Even if we have an error stream several times the magnitude of the user pageview flow the backend should be able to handle it.

I propose we launch the error feed 100% on this one site.
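
The "one wiki, one browser" gate proposed above could be sketched roughly as follows. The names `RolloutConfig` and `shouldLogErrors` are hypothetical, and the wiki ID `hawwiki` used in the usage note is an assumption about how the Hawaiian Wikipedia is identified. The user agent check relies on IE11 identifying itself with `Trident/7.0` and `rv:11` rather than `MSIE`.

```typescript
// Hypothetical sketch of a stateless rollout gate (no client-side sampling).
interface RolloutConfig {
  enabledWikis: string[];   // wiki IDs where error logging is switched on
  userAgentPattern: RegExp; // browsers that should send errors
}

// IE11 user agents contain "Trident/7.0" followed by "rv:11".
const ie11Pattern = /Trident\/7\.0.*rv:11/;

function shouldLogErrors(
  wikiId: string,
  userAgent: string,
  config: RolloutConfig
): boolean {
  return (
    config.enabledWikis.indexOf(wikiId) !== -1 &&
    config.userAgentPattern.test(userAgent)
  );
}
```

With a config such as `{ enabledWikis: ["hawwiki"], userAgentPattern: ie11Pattern }`, widening the rollout later is a config change rather than a code change.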

@Tgr, for when you start working on the error schema, here is an example of adding a new schema with the new jsonschema-tools workflow:

https://gerrit.wikimedia.org/r/c/mediawiki/event-schemas/+/525562

npm install . will install the git pre-commit hook to help auto generation of versioned dereferenced files.
All I did was create jsonschema/swift/upload/complete/current.yaml. When I did git add && git commit, the versioned files were auto generated and added to my commit for me.