There have been numerous discussions about logging client-side errors, which, for myriad reasons, have stalled. This proposal describes a practical approach to logging client-side errors using technologies that are already deployed and actively maintained by WMF, and it aims to leverage the work SRE is already doing on this infrastructure.

We propose that client-side errors are caught, normalised, encoded, and sent to a "beacon endpoint" (i.e. /beacon/error). Requests to that endpoint should be tailed, and the associated information formatted and added to Kafka on a well-known topic. Logstash will then consume that stream of information and transform it as necessary.

Some well-known examples of this "requests to beacon endpoint to Kafka to `$consumer`" pipeline are [[ https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging | EventLogging ]] and [[ https://wikitech.wikimedia.org/wiki/Graphite#statsv | statsv ]].

== 1 Prior art

=== 1.1 Within WMF

==== 1.1.1 Readers Web's MinervaClientError metric

Readers Web began counting the number of client-side errors occurring for users of the Minerva skin in {T205582} as a result of a fairly isolated discussion about the problem (see T167699). We now have some sense of how many client-side errors are occurring and how that number varies over time: https://grafana.wikimedia.org/d/000000566/overview?orgId=1&from=now-30d&to=now&panelId=15&fullscreen

=== 1.2 Without WMF

==== 1.2.1 Sentry

TBD

== 2 Considerations

=== 2.1 Limitations in URL length

Requests made to a beacon endpoint are currently expected to use the HTTP GET method with the data in the URL's query string. However, the amount of data we can include in the URL is limited, e.g. [[ https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging/core.js$44 | EventLogging has a maximum URL size of 2000 ]].
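
To make the client-side half of the proposal and the URL length limit concrete, here is a minimal sketch of how an error might be caught, normalised, encoded, and sent. Everything in it is an assumption for illustration: the function names, the choice of fields (`name`, `message`, `stack`, `url`), the reuse of EventLogging's 2000 limit, and the policy of trimming the stack before the message are not part of any existing WMF library.

```javascript
// Hypothetical sketch: these names are illustrative, not an existing WMF API.
const BEACON_PATH = '/beacon/error'; // endpoint named in this proposal
const MAX_URL_LENGTH = 2000; // assumption: same limit as EventLogging

// Reduce an arbitrary thrown value to a small, flat record of strings.
function normaliseError(err, pageUrl) {
  const isError = err instanceof Error;
  return {
    name: isError ? err.name : 'NonError',
    message: String(isError ? err.message : err),
    stack: isError && err.stack ? String(err.stack) : '',
    url: String(pageUrl)
  };
}

// Encode the record as a query string on the beacon path.
function encodeBeaconUrl(record) {
  const query = Object.keys(record)
    .map((key) => encodeURIComponent(key) + '=' + encodeURIComponent(record[key]))
    .join('&');
  return BEACON_PATH + '?' + query;
}

// Build a URL that fits the limit, halving the stack first and the message
// second. Returns null if the record cannot be made to fit or cannot be
// encoded, so that the reporter never throws.
function buildBeaconUrl(err, pageUrl) {
  try {
    const record = normaliseError(err, pageUrl);
    for (const field of ['stack', 'message']) {
      while (encodeBeaconUrl(record).length > MAX_URL_LENGTH && record[field].length > 0) {
        // Split by code point so that a surrogate pair is never cut in half.
        const chars = Array.from(record[field]);
        record[field] = chars.slice(0, Math.floor(chars.length / 2)).join('');
      }
    }
    const url = encodeBeaconUrl(record);
    return url.length <= MAX_URL_LENGTH ? url : null;
  } catch (e) {
    return null;
  }
}

// In a browser, report uncaught errors with a GET request via an image load.
if (typeof window !== 'undefined') {
  window.addEventListener('error', (event) => {
    const url = buildBeaconUrl(event.error || event.message, window.location.href);
    if (url !== null) {
      new Image().src = url;
    }
  });
}
```

Halving the stack trace before touching the message is only one possible policy. It keeps the sketch short, but a real implementation might prefer to keep the top few stack frames instead.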

=== 2.2 Burstiness

If a syntax error were introduced in a JavaScript asset that's delivered to all clients, then we'd see upwards of 5500 errors reported per second. The simplest way of dealing with this issue is to enable client-side error reporting for 1% of all pageviews.

We might also consider creating a service that acts as:

* A [[ https://en.wikipedia.org/wiki/Leaky_bucket | leaky bucket ]] if we don't want to permit bursts, or a [[ https://en.wikipedia.org/wiki/Token_bucket | token bucket ]] if we do;
* A classful version of the above, if we want to "roll up" errors based on their normalised properties.

These services would allow us to maximise the number of clients that can report errors, thereby increasing the likelihood of capturing relatively low-rate errors. However, the introduction of any such service would require a long-term maintenance commitment from at least SRE.

=== 2.3 Pre-existing tools for exploration

In 1.1.1, it was noted that Readers Web are already counting the number of client-side errors for users of the Minerva skin. We do so using [[ https://wikitech.wikimedia.org/wiki/Graphite#statsv | statsv ]]. Currently, we take advantage of the fact that [[ https://wikitech.wikimedia.org/wiki/Graphite#statsv | statsv ]] makes requests to a beacon endpoint and that the Analytics team provides tools like [[ https://turnilo.wikimedia.org/ | Turnilo ]], which allow us to explore request data at a high level, to find trends in client-side errors, e.g. [[ https://turnilo.wikimedia.org/#webrequest_sampled_128/3/N4IgbglgzgrghgGwgLzgFwgewHYgFwhLYCmAtAMYAWcATmiADQgYC2xyOx+IAomuQHoAqgBUAwoxAAzCAjTEaUfAG1QaAJ4AHLgVZcmNYlO4B9E3sl6ACgqwATJXlUg7MGuiy4CVgOwARSy0dQnRiKHomcOJNfFIARgBfAF0EhjUg7nCaCGwAc0lDYwI3CBNNdEpJOHIMHG4cyTBEGDCVEAEAI2JqnAFw9CgwECSmbEx6PClEKGJU9O1MtGy8gqNuEpMARxaadSqaz25yHDQ4HKUmJoQWx2UQAHViDrEkYmw0HhoaTBph0fH8FMEDNkpFNEg0Ld5sEsjl8kw7BA2NgoIcCMd3jk3hEQFAfhNQIVuJQIJDJIjDAc6gQ7GFyG9EStUoQkaT8NgYAgEHNmBldEj9C4BSi0SAzBYmLl3ByELRSXtvCI4gAJSR4uj4QlrAjigXkiCU2peEBwKD07CM/LMpAsNl4GXckYgNimtytPCgaAAWU5GEB02IkThCGCJLJTBYvogbTDShSTE0ORIdj8wtROFuTsT2GTAGV8cTSRdCMRcgzNULkenjRiMCR3pIbXaAKzMkm5ShIDsTB0JIA== | a graph of client-side errors for users of the Minerva skin by continent ]]. It's worth noting that if this proposal were implemented, this facility would still be available.
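
Returning to the burstiness options in 2.2: at the suggested 1% sampling rate, the 5500 errors per second in the example would fall to roughly 55 per second. The sketch below shows what client-side sampling, a token bucket, and one possible reading of the "classful" variant (one bucket per normalised error key) might look like. All names, capacities, and rates here are hypothetical and chosen only for illustration, and the limiter is shown as in-process code although the text above describes it as a service.

```javascript
// Hypothetical sketch: names and rates are illustrative, not an existing WMF service.

// Decide once per pageview whether this client reports errors at all.
// `random` is injectable so the decision can be tested deterministically.
function isSampled(rate, random = Math.random) {
  return random() < rate;
}

// Token bucket: permits bursts of up to `capacity` reports, then admits
// `refillPerSecond` reports per second on average.
class TokenBucket {
  constructor(capacity, refillPerSecond, nowMs = 0) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full
    this.lastMs = nowMs;
  }

  // Returns true if a report arriving at `nowMs` should be accepted.
  tryAccept(nowMs) {
    const elapsedSeconds = Math.max(0, nowMs - this.lastMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastMs = Math.max(this.lastMs, nowMs);
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Classful variant: one bucket per normalised error key, so that a flood of
// one error cannot crowd out reports of a rarer one.
class ClassfulLimiter {
  constructor(capacity, refillPerSecond) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.buckets = new Map();
  }

  tryAccept(key, nowMs) {
    if (!this.buckets.has(key)) {
      this.buckets.set(key, new TokenBucket(this.capacity, this.refillPerSecond, nowMs));
    }
    return this.buckets.get(key).tryAccept(nowMs);
  }
}
```

A leaky bucket that forbids bursts would differ mainly in draining accepted reports at a fixed rate rather than letting up to `capacity` through at once. Note also that this sketch never evicts idle buckets, which a long-running service would need to do.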
