WebClientError EventLogging reports are currently only permitted on the beta cluster. statsv is used in production but doesn't contain as much detail. This task is to identify an acceptable sampling rate and submit a patch to enable reporting in production at said rate. Additionally, consider removing the statsv logging if it is no longer needed.
Tagging as epic as there are several things to do here still.
- Understand the EventLogging URL length constraints and how to deal with them. Since the first version was a proof of concept to count errors, we kept the original pass very simple but still hit T206257. We need to work out which fields to trim and how to trim them (we bikeshedded a little when attempting this in the first pass).
- Talk to analytics and get buy-in/permission. The previous attempt to enable this in production was descoped (see T203814#4576030 and description edits).
- Add stack traces to WebClientError events. The current implementation leaves them out because of the URL length constraints, but they will be essential. See T202026 for more information.
- Work out how to deal with error spiking. Even at a small sampling rate, an error that hits 100% of users could bring down our entire EventLogging cluster. We need to talk to analytics about whether rolling back a deploy is enough to mitigate a spike.
- Work out how to deal with the frustration of non-deduplication. The feedback we got from people who have used EventLogging is that having the stack traces allowed them to fix the common bugs, but the other ones were harder to track down. We have no idea of the spread of our errors: all of them could be different, or many of them could be duplicates. We should think about what queries we can use to de-duplicate errors (and work out how many people they impact) and how our stack trace can be structured to facilitate that.
- Work out a suitable sampling rate, based on the possibility of error spikes and the need to identify at least some of the bugs our users are hitting.
- Progressive roll out. Once all of the above is done, we should ramp up the sampling rate slowly.
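
To make three of the items above more concrete (de-duplication, URL length, sampling), here is a rough Python sketch for discussion only. It is not the existing WebClientError implementation. Everything in it is an assumption: the function names, the `message` and `stack` field names, the 2000-character budget, and the idea that an event travels as URL-encoded JSON. The fingerprint normalisation is deliberately naive and would need tuning against real stack traces.

```python
import hashlib
import json
import re
from urllib.parse import quote

# Hypothetical budget; the real EventLogging URL limit is not stated in this task.
MAX_ENCODED_LENGTH = 2000


def fingerprint(message: str, stack: str, frames: int = 3) -> str:
    """Hash the message plus the top frames, ignoring volatile details."""
    lines = [ln.strip() for ln in stack.splitlines() if ln.strip()][:frames]
    normalised = []
    for ln in lines:
        ln = re.sub(r"\?[^\s:)]*", "", ln)   # drop query strings (e.g. cache busters)
        ln = re.sub(r":\d+(:\d+)?", "", ln)   # drop line and column numbers
        normalised.append(ln)
    key = message + "\n" + "\n".join(normalised)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


def encoded_length(event: dict) -> int:
    """Length of the event as compact, URL-encoded JSON (assumed wire format)."""
    return len(quote(json.dumps(event, separators=(",", ":"))))


def trim_to_budget(event: dict, budget: int = MAX_ENCODED_LENGTH) -> dict:
    """Drop stack frames from the bottom until the encoded event fits.

    If the event is still too long with an empty stack, it is returned as is;
    deciding which other fields to trim is left open here.
    """
    trimmed = dict(event)
    frames = trimmed.get("stack", "").splitlines()
    while frames and encoded_length(trimmed) > budget:
        frames.pop()
        trimmed["stack"] = "\n".join(frames)
    return trimmed


def in_sample(session_id: str, rate: float) -> bool:
    """Deterministically place a session in or out of the sample for a given rate."""
    digest = hashlib.sha1(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < rate
```

Grouping events by such a fingerprint would let a query count both distinct errors and the number of sessions each one affects. Sampling by session rather than by event also means a ramp-up only ever adds sessions: anyone in the sample at a low rate stays in it at a higher rate.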