Please see the client-side errors on citation data. At 200 per minute, I think this is significant data loss that should probably be addressed.
For CitationUsagePageLoad we're getting about 450-800 events per second, or roughly 37,500 events per minute (taking about 625 events per second as the midpoint). At 200 errors per minute, we get one error every 187.5 events. @Miriam and I did not find this significant, which is why I submitted this patch.
I see that you do not consider the percentage significant for the research to be done, and that is ultimately up to you. There are other considerations, though, about being good citizens of the ecosystem. Errors cost user bandwidth, and the volume of errors on this schema is significant: it is one order of magnitude higher than on the other schemas. We should not ignore them but rather try to minimize them if possible.
@Nuria that makes sense. Rather than limiting URL length (so that we don't get incomplete data), would it be a good idea to not report these errors? So I'd detect long URLs and not have EL ping these URLs. Would that work?
@bmansurov: errors are reported as a way for us to monitor the client, so turning errors off is like "turning monitoring off". That is probably not an acceptable option.
@bmansurov yes, sorry for the delay. We propose to cap the citation link text in order to avoid these errors. Would that be ok? Thanks!
According to grafana, on average we're getting 613 events/second for the CitationUsagePageLoad schema. We're also getting about 41 client side errors/minute.
As for CitationUsage, we're getting about 37 events/second, while the error rate is 12 errors/minute.
Given the above, the normalized values are:
| Schema name | Events/second per error/minute |
| CitationUsagePageLoad | 15 (= 613/41) |
| CitationUsage | 3 (= 37/12) |
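The ratios in the table divide a per-second event rate by a per-minute error rate. As a rough sketch (not part of the original analysis), the same grafana figures can be put on a common per-minute basis to estimate how many events occur per error. The function name and the `rates` dictionary are illustrative only.

```python
SECONDS_PER_MINUTE = 60


def events_per_error(events_per_second: float, errors_per_minute: float) -> float:
    """Estimate events per client-side error, with both rates expressed per minute."""
    events_per_minute = events_per_second * SECONDS_PER_MINUTE
    return events_per_minute / errors_per_minute


# (events per second, errors per minute), as quoted from grafana above
rates = {
    "CitationUsagePageLoad": (613, 41),
    "CitationUsage": (37, 12),
}

for schema, (eps, epm) in rates.items():
    print(f"{schema}: about {events_per_error(eps, epm):.0f} events per error")
```

With these inputs the estimate comes out to roughly 897 events per error for CitationUsagePageLoad and 185 for CitationUsage, so the relative gap between the two schemas stays at about a factor of five.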
Since the citation link text field is part of the CitationUsage schema only, limiting it would only reduce the number of errors for CitationUsage. Even if we assume that the link text length is responsible for 2/3 of the errors, the remaining 1/3 would be explained by other fields. To fully fix the issue we'd have to limit other fields too. The question is, how would this affect the study?
Also, there's no deployment this week, so our change would be in production on 10/18. We're planning on turning off data collection on 10/29. I wonder if we should leave things unchanged while we're collecting data (even though Nuria makes a good point at T206083#4637142) so as not to ruin the uniformity of data collection.
Change 468490 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/extensions/WikimediaEvents@master] CitationUsage: limit some parameter lengths
@Miriam I've submitted a patch to limit the link text to 100 characters and page title to 200 characters. Let me know if these numbers need to change. Thanks!
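To illustrate the idea of capping field lengths, here is a minimal sketch. It is not the code from the patch: the field names `link_text` and `page_title` and the helper `cap_event_fields` are assumptions made for this example, and only the 100 and 200 character limits come from the comment above.

```python
MAX_LINK_TEXT_LENGTH = 100   # limit proposed for the citation link text
MAX_PAGE_TITLE_LENGTH = 200  # limit proposed for the page title

# Hypothetical field names; the real schema fields may be named differently.
FIELD_LIMITS = {
    "link_text": MAX_LINK_TEXT_LENGTH,
    "page_title": MAX_PAGE_TITLE_LENGTH,
}


def cap_event_fields(event: dict) -> dict:
    """Return a copy of the event with the configured string fields truncated."""
    capped = dict(event)
    for field, limit in FIELD_LIMITS.items():
        value = capped.get(field)
        if isinstance(value, str) and len(value) > limit:
            capped[field] = value[:limit]
    return capped


event = {"link_text": "x" * 250, "page_title": "Example", "section_id": "History"}
capped = cap_event_fields(event)
print(len(capped["link_text"]), capped["page_title"])
```

Fields without a configured limit pass through unchanged, so the cap only affects the values most likely to push the event URL over the length limit.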
@Nuria I'd appreciate your review of https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaEvents/+/468490/ before the branch cut today. I'd like to get the fix in to go out this Thursday. Thanks!
Change 468490 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Update schema revisions for CitationUsage and CitationUsagePageLoad