
Turn on MinervaErrorLogSamplingRate (Schema:WebClientError)
Closed, Resolved · Public · 3 Estimated Story Points

Description

We have wired up client side errors to EventLogging to give us an insight into how many client side errors we have in our code. We would like to turn this on to measure progress and code quality in our refactor process. We use the schema https://meta.wikimedia.org/wiki/Schema:WebClientError
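
For illustration, a minimal sketch of how such wiring might look on the client. This is not the actual MinervaNeue implementation; the config key and schema field names below are assumptions.

```
// Hedged sketch only: the config key and schema fields are assumptions.
var samplingRate = mw.config.get( 'wgMinervaErrorLogSamplingRate' ) || 0;

window.addEventListener( 'error', function ( e ) {
	// Client-side sampling: only a fraction of clients emit the event.
	if ( Math.random() >= samplingRate ) {
		return;
	}
	mw.eventLog.logEvent( 'WebClientError', {
		message: e.message,
		url: e.filename,
		stackTrace: e.error && e.error.stack,
		isAnon: mw.user.isAnon()
	} );
} );
```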

Acceptance criteria

  • Increase the sampling rate of MinervaErrorLogSamplingRate from 0 to 1 on the beta cluster. This will help us identify bugs before they hit production. Monitoring those is blocked on T204088
  • Make sure it's not turned on in production!

Sign off steps

  • The rate of events on the beta cluster should not exceed ~100 / sec at any given time. If it does, roll back. If it gets close, stop increasing the value so we can talk to Analytics.
  • Set up a dashboard and add a monitoring step to https://www.mediawiki.org/wiki/Reading/Web/Chores

Event Timeline

Jdlrobson added a subscriber: Ottomata.

@Ottomata can you give me guidance on a hard limit on events per minute for an EventLogging schema, so during deploy we can ensure we stay well below it?

Please consider this a heads up that we'd like to use EventLogging for client side error reporting (similar to https://meta.wikimedia.org/wiki/Schema:UploadWizardErrorFlowEvent) for the mobile site and please please let me know if that scares you in any way!

Naw that should be fine! In general

  • if your rate is ~100 / sec or less, no need to notify.
  • if your rate is > ~100 / sec, notify, and we need to blacklist from MySQL.
    • if also rate > ~800 / sec notify, and we will keep an eye out
      • if rate is larger than this, let's think about if we really need this many events!

It's not that we can't handle more than 1000 events / second right now, it's just that you would be the only users really doing that many. VirtualPageView is the highest rate atm at around 800 / sec (IIRC).
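
As a rough back-of-envelope (the figures below are hypothetical, not numbers from this task), the sampling rate needed to stay under a given tier is just the target event rate divided by the expected raw error rate:

```
// Illustrative only: the traffic figures used here are hypothetical.
function maxSamplingRate( expectedErrorsPerSec, targetEventsPerSec ) {
	return Math.min( 1, targetEventsPerSec / expectedErrorsPerSec );
}

// e.g. if a bad deploy produced ~2000 errors/sec site-wide, staying under the
// ~100/sec "no need to notify" tier would need a rate of at most 0.05.
maxSamplingRate( 2000, 100 ); // 0.05
```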

I think blacklisting from MySQL might be sensible anyway, given a bug in MediaWiki:Mobile.js by an editor could spike the number of events.

What happens if this kind of thing does happen and the number of events is crazy high (e.g. every page view) (apart from the obvious reacting to the spike and fixing that JS error ASAP)?

I think blacklisting from MySQL might be sensible anyway, given a bug in MediaWiki:Mobile.js by an editor could spike the number of events.

Great. We are actually considering moving to whitelisting for MySQL import soon, since most everything goes to Hive anyway.

What happens if this kind of thing does happen and the number of events is crazy high (e.g. every page view) (apart from the obvious reacting to the spike and fixing that JS error ASAP)?

Depends on how high. If blacklisted from MySQL, likely the system will handle it. However, we process all eventlogging events on a single server only (with no redundancy), so if it gets overloaded, all events will stop processing. The raw events are queued in Kafka for 7 days, so we'd have at least that long to fix.

If you expect a LOT of events, maybe you can slowly ramp up sampling, instead of going from 0 to 1 immediately?

I'm hoping for few events and we do plan on ramping up the sampling, but because of the nature of the logging (we're logging client-side errors) and the fact that editors can edit site-wide JS, I'm a little uneasy that an error introduced for all users could cause us problems by generating an event for every page view. In such an event, if it was detected, we'd turn the sampling rate down to 0 until the problem was fixed, but I'm not sure if that's feasible?

@Jdlrobson +1 to @Ottomata's suggestion. I do not think sending this schema to MySQL is a viable option. EL is really not the best tool for error logging, as 1) it is impossible to anticipate a sampling rate (errors are bursty), 2) EL is not designed to manipulate long streams of text, and 3) EL lacks grouping of errors in order to see the most prevalent occurrences.

All this functionality is provided by Sentry, which really is what we should be looking towards installing here (I understand that support from the SRE team is needed for this, cc @Tgr).

Let's please postpone any sampling rate changes until this schema is blacklisted in MySQL and its events only go to hadoop.

In such an event, if it was detected, we'd turn the sampling rate down to 0 until the problem was fixed, but I'm not sure if that's feasible?

With caching of our JS resources the answer is no, it is really not feasible: there will be a time period in which events are sent regardless of the config setting, as it has been cached on the client.
That is also another reason why we really do not think EL should be used for error reporting: bursty (and unpredictable) traffic.

Change 459574 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Blacklist WebClientError from EventLogging MySQL

https://gerrit.wikimedia.org/r/459574

Change 459574 merged by Ottomata:
[operations/puppet@production] Blacklist WebClientError from EventLogging MySQL

https://gerrit.wikimedia.org/r/459574

Mentioned in SAL (#wikimedia-analytics) [2018-09-10T16:26:08Z] <ottomata> restarting eventlogging-processors to pick up blacklist of WebClientError schema for MySQL - T203814

Let's please postpone any sampling rate changes until this schema is blacklisted in MySQL and its events only go to hadoop.

Done

EL is really not the best tool to do error logging

Completely agree for the reasons you mention! Right now, however, we are keen to document the total number of client-side errors we have, to build a business case for further exploration in this area and as an indicator that our code is improving during a refactor. Right now we plan on counting errors rather than analyzing them (although we do log the errors themselves, so we can scratch that itch).

All this functionality is provided by Sentry, which really is what we should be looking towards installing here (I understand that for this support from SRE team is needed, cc @Tgr) .

Totally! However, we've also been advised this is a long way off so we're trying to provide data to show why this needs to be done sooner. It feels like making it clear why this is needed would be a good first step.

Let's please postpone any sampling rate changes until this schema is blacklisted in MySQL and its events only go to hadoop.

It sounds like this has now been done? Do you have any concerns with us proceeding (maybe with an upper cap on the sampling rate?)

Completely agree for the reasons you mention! Right now, however, we are keen to document the total number of client-side errors we have, to build a business case for further exploration

In my opinion, in this case absolute numbers are of little help: a mild error in Chrome that still lets you use the page will dwarf any other errors from browsers that are less used. I would look, at minimum, into rates of error types per browser.

It sounds like this has now been done? Do you have any concerns with us proceeding (maybe with an upper cap on the sampling rate?)

If you are going to just "count" errors as you describe, a 1-5% sampling rate seems more than sufficient, as you are likely to catch all grade A traffic. Given that your schema includes a stack trace, I think you might run into issues client side with too-long messages. Those should be visible in the Grafana dashboard for this schema.
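
One simple guard against over-long payloads would be to truncate the stack trace before logging. The cap below is an arbitrary example, not a known EventLogging limit:

```
// Sketch: cap the stack trace length client side (1000 chars is arbitrary).
function truncateStack( stack ) {
	var MAX_LENGTH = 1000;
	return stack && stack.length > MAX_LENGTH ?
		stack.slice( 0, MAX_LENGTH ) : stack;
}
```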

EL is really not the best tool to do error logging

Eh? Why not? Perhaps EL as we have it now, but the system intake for errors can be the same as it is for events. Errors really are just events too.

BTW, if you want to keep things simple, and are just counting, you could use statsv instead of EventLogging:
https://wikitech.wikimedia.org/wiki/Graphite#statsv
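
A sketch of what the statsv route could look like, assuming the WikimediaEvents statsv bridge (which forwards `counter.*` topics) is loaded; the metric name here is made up:

```
// Hedged sketch: counts errors only, no messages or stack traces.
window.addEventListener( 'error', function () {
	mw.track( 'counter.MediaWiki.minerva.WebClientError', 1 );
} );
```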

IMHO we should enable this only for users without gadgets (maybe anons only?). I'm afraid that we might get lots of errors and spend way too much time on finding what's causing the issue.

ovasileva set the point value for this task to 3. Sep 11 2018, 4:38 PM

Completely agree for the reasons you mention! Right now, however, we are keen to document the total number of client-side errors we have, to build a business case for further exploration

In my opinion, in this case absolute numbers are of little help: a mild error in Chrome that still lets you use the page will dwarf any other errors from browsers that are less used. I would look, at minimum, into rates of errors per browser.

On that note, it might be worth ingesting the data into Druid (leaving out most or all of the many-valued fields, a bit like T202751), to enable quick exploration of error frequency by browser, OS and skin in Superset.

IMHO we should enable this only for users without gadgets (maybe anons only?). I'm afraid that we might get lots of errors and spend way too much time on finding what's causing the issue.

It looks like the schema already allows excluding logged-in users, via the isAnon field.
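
A sketch of how that restriction might look client side, assuming the event payload (`errorData`, a hypothetical name) is assembled elsewhere:

```
// Only emit the event for anonymous users, so gadget/user-script errors
// from logged-in users don't dominate the counts.
if ( mw.user.isAnon() ) {
	mw.eventLog.logEvent( 'WebClientError', errorData );
}
```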

After talking with @Ottomata on IRC about the reasons why EventLogging is not well suited to error logging, I have written a wikitech page in this regard: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/NotErrorLogging#Eventlogging_is_not_Well_suited_to_do_error_logging

This topic has come up several times before in conversations with @Tgr, @Fjalapeno and @faidon, and I really support the idea that client-side error logging is needed. Now, the solution to that problem is Sentry. Current blockers for Sentry are around debianizing the package, if I remember correctly.

On that note, it might be worth ingesting the data into Druid

I doubt this would be useful: errors without stack traces are not of much use, and Druid could not deal with stack traces. The main use for a backend that manipulates errors is to group them by stack trace: https://docs.sentry.io/learn/rollups/
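
For illustration, a crude sketch of the kind of grouping a tool like Sentry does server-side: normalise the stack trace into a grouping key so that "the same" error from many clients rolls up into one bucket. This is not Sentry's actual algorithm.

```
// Strip URLs and line/column numbers so near-identical traces collapse together.
function groupKey( stack ) {
	return ( stack || '' )
		.split( '\n' )
		.slice( 0, 5 )
		.map( function ( line ) {
			return line.replace( /\(?https?:\/\/\S+\)?/g, '' ).replace( /:\d+:\d+/g, '' );
		} )
		.join( '|' );
}
```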

@Nuria so I completely agree with everything you're saying.

However, I would like some indication of how buggy the mobile website is right now, given big changes are happening there and the lack of any information makes me nervous. I'm also confident that if I can make a grand statement that there are "X undiagnosed client-side errors on our mobile site a day", I can get Sentry prioritised.

What would you recommend I do?
Should I report this as explicitly blocked by analytics, or would it be okay to proceed with EventLogging + a low sampling rate to get a sense of the breakages?

Since it's imperative we get a count, would using statsv be safer?

After talking with @Ottomata on irc about the reasons why eventlogging is not well suited to do error logging I have written a wikitech page on this regard: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/NotErrorLogging#Eventlogging_is_not_Well_suited_to_do_error_logging

Thanks for taking the time to write this.

Current blockers for Sentry are around debianizing the package, if I remember correctly.

I spoke with @faidon towards the beginning of this calendar year (I think – I didn't datestamp my notes for some reason) about deploying Sentry and he said that SRE would be receptive to help from Audiences (Readers Infra?) with standing up Sentry but that SRE couldn't commit resources until Q1-Q2 FY2019-20.

I think the case for client-side error reporting is undeniable and I'm confident that the choice of Sentry is the right one, but it doesn't seem like standing up any solution to the problem has been a high priority. I think having numbers will only strengthen the argument Audiences folk will have to make to dedicate resources to this, as we'll have a better handle on the current situation.

The case for using statsv to count client-side errors is quite strong and IMO it's about as easy to implement as it is with EventLogging. However, using EventLogging does afford us some limited follow-on analysis with tools like Druid/Superset as @Tbayer pointed out in T203814#4575680 (e.g. breaking down event rates by OS and/or browser family) as well as being able to take a limited dive into the data if we're so inclined.

The above being said we have to be careful that a temporary solution doesn't become relied upon too heavily by us or third parties.

Sentry for JS errors is a combination of a logging server and a JS error processing library (Raven.js).

Raven parses stack traces into object structures and does other kinds of normalization. It could be deployed on its own; it needs to be done at some point anyway, and it might help with counting (grouping errors by stack trace hash) or filtering errors where a certain extension is involved.
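
For context, basic Raven.js usage looks roughly like the following; the DSN is hypothetical, and routing the payload somewhere other than a Sentry server would need extra work (e.g. a custom transport):

```
// Hedged sketch of standalone Raven.js usage; the DSN below is made up.
Raven.config( 'https://examplePublicKey@sentry.example.org/1' ).install();

try {
	somethingRisky(); // hypothetical application code
} catch ( e ) {
	// Raven normalises the error and its stack trace before reporting it.
	Raven.captureException( e );
}
```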

There needs to be a way to get the data into the Sentry server; connecting to it directly is probably not a good idea. I originally thought of using varnishkafka, although that probably runs into request length limits, but in any case there should be some way to load the errors into Kafka. If the Sentry server is not ready for use, that data could just go to statsd, or better into Logstash (which allows stack trace inspection and flexible searching/filtering).

Neither of those are trivial tasks, but they are not blocked on Ops or doing anything with the Sentry server application, can be worked on right now, and they would improve error logging a lot.

using EventLogging does afford us some limited follow-on analysis with tools like Druid/Superset

The point I was trying to make above is that neither Druid nor Superset is a good fit for error logging, as errors are quite useless w/o stack traces. Druid is a tool to analyze timeseries data; it cannot deal with text and it really is not well suited for data that is schema-less. I suggest you use Hive to analyze this data if you are set on collecting it. I think once you start a 1% collection you might find that many of your events are not persisted due to validation issues; I might be wrong about that, so we shall see. A sure thing is that validating client-side errors against a schema might not be the best usage of CPU cycles.

The above being said we have to be careful that a temporary solution doesn't become relied upon too heavily by us or third parties.

Thanks for pointing this out. This is my concern number 1: our team can work on analytics use cases, this is just not one of them.

The point I was trying to make above is that neither Druid nor Superset is a good fit for error logging, as errors are quite useless w/o stack traces. Druid is a tool to analyze timeseries data; it cannot deal with text and it really is not well suited for data that is schema-less. I suggest you use Hive to analyze this data if you are set on collecting it.

Or logstash?

Jdlrobson updated the task description. (Show Details)
Jdlrobson moved this task from Needs Prioritization to Upcoming on the Web-Team-Backlog board.

I've revised this task to be about turning this on on the beta cluster.
I've also created T205582 to handle error counting via statsv.

Sound good?

Sounds fine. Note that the beta cluster has no Hadoop component, so you will need to consume errors directly from Kafka or elsewhere.

Jdlrobson claimed this task.

All done here!