Please see the client-side errors on citation data. At 200 per minute, I think this is significant data loss that should probably be addressed:
|mediawiki/extensions/WikimediaEvents||master||+4 -4||Update schema revisions for CitationUsage and CitationUsagePageLoad|
I see that you do not consider the percentage significant for the research to be done; that is ultimately up to you. However, there are other considerations about being good citizens of the ecosystem. Errors cost user bandwidth, and the volume of errors on this schema is significant: it is one order of magnitude higher than on the other schemas. We should not ignore them but rather try to minimize them if possible.
According to Grafana, on average we're getting 613 events/second for the CitationUsagePageLoad schema. We're also getting about 41 client-side errors/minute.
As for CitationUsage, we're getting about 37 events/second, while the error rate is 12 errors/minute.
Given the above, the normalized values are:
|Schema name||Events/second per error/minute|
|CitationUsagePageLoad||15 (= 613/41)|
|CitationUsage||3 (= 37/12)|
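The arithmetic behind the table can be sketched as follows. This is only an illustration using the rates quoted above; the function names are made up for this comment. Note that the table divides events/second by errors/minute, so the ratio is a relative index rather than a true "events per error" count. The second helper converts both rates to per-minute figures to estimate errors as a share of events, under the assumption that the two Grafana numbers are directly comparable.

```python
# Rates quoted in this comment (read from Grafana).
RATES = {
    "CitationUsagePageLoad": {"events_per_second": 613, "errors_per_minute": 41},
    "CitationUsage": {"events_per_second": 37, "errors_per_minute": 12},
}


def normalized_ratio(events_per_second, errors_per_minute):
    """Index used in the table: events/second divided by errors/minute."""
    return events_per_second / errors_per_minute


def error_share(events_per_second, errors_per_minute):
    """Errors as a fraction of events, with both rates expressed per minute.

    Assumption: the event and error rates are directly comparable.
    """
    return errors_per_minute / (events_per_second * 60)


if __name__ == "__main__":
    for schema, r in RATES.items():
        ratio = normalized_ratio(**r)
        share = error_share(**r)
        print(f"{schema}: ratio={ratio:.1f}, error share={share:.3%}")
```

A lower ratio means more errors relative to traffic, which is why CitationUsage looks worse than CitationUsagePageLoad despite its lower absolute error rate.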
Since the citation link text field is part of the CitationUsage schema only, limiting it would only reduce the number of errors for CitationUsage. Even if we assume that the link text length is responsible for 2/3 of the errors, the remaining 1/3 would be explained by other fields. In order to fully fix the issue, we'd have to limit other fields too. The question is, how would this affect the study?
Also, there's no deployment this week, so our change would be in production on 10/18. We're planning on turning off data collection on 10/29. I wonder if we should avoid changing anything while we're collecting data (even though Nuria makes a good point at T206083#4637142), so as not to ruin the uniformity of data collection.
@Nuria I'd appreciate your review of https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaEvents/+/468490/ before the branch cut today. I'd like to get the fix in to go out this Thursday. Thanks!