
Many client side errors on citation data, significant percentages of data lost
Closed, Resolved, Public


Please see the client side errors on citation data. At 200 per minute, I think this is significant data loss that should probably be addressed.

Event Timeline

These errors are most likely happening because the URLs are too long.

For CitationUsagePageLoad we're getting about 450-800 events per second, which at the midpoint of roughly 625 events per second gives us 37,500 events per minute. At 200 errors per minute, we get one error every 187.5 events. @Miriam and I did not find this significant, and that's why we submitted this patch.
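The arithmetic in the comment above can be reproduced with a short sketch. It assumes that the 37,500 figure comes from the midpoint of the quoted 450-800 range; the variable names are illustrative only.

```python
# Figures quoted in the comment above.
events_per_second_low = 450
events_per_second_high = 800
errors_per_minute = 200

# Assumption: 37,500 events/minute is the midpoint of the range times 60.
midpoint_per_second = (events_per_second_low + events_per_second_high) / 2
events_per_minute = midpoint_per_second * 60

# How many events are logged for each client side error.
events_per_error = events_per_minute / errors_per_minute

# The same ratio expressed as a percentage of events lost to errors.
error_share_percent = 100 * errors_per_minute / events_per_minute

print(f"{events_per_minute:.0f} events/minute")
print(f"one error every {events_per_error} events")
print(f"about {error_share_percent:.2f}% of events")
```

At the two ends of the quoted range the same calculation gives one error every 135 events (450 per second) and one every 240 events (800 per second).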

Nuria renamed this task from Many client side errors on citation dat, significant percentages of data lost to Many client side errors on citation data, significant percentages of data lost. Oct 2 2018, 11:10 PM

I see that you do not think the percentage is significant for the research to be done; that is ultimately up to you. However, there are other considerations about being good citizens of the ecosystem. Errors cost user bandwidth, and the volume of errors on this schema is significant: it is one order of magnitude higher than for the other schemas. We should not ignore them but rather try to minimize them if possible.

@Nuria that makes sense. Rather than limiting URL length (so that we don't get incomplete data), would it be a good idea to not report these errors? So I'd detect long URLs and not have EL ping these URLs. Would that work?

@bmansurov: errors are reported as a way for us to monitor the client, so turning errors off is like "turning monitoring off". That is probably not an acceptable option.

@Miriam any updates on this? Did you get a chance to talk with Michele and Tiziano?

@bmansurov yes, sorry for the delay. We propose to cap the citation link text in order to avoid these errors. Would that be ok? Thanks!

According to grafana, on average we're getting 613 events/second for the CitationUsagePageLoad schema. We're also getting about 41 client side errors/minute.

As for CitationUsage, we're getting about 37 events/second, while the error rate is 12 errors/minute.

Given the above, the normalized values are:

| Schema name | Events per second / errors per minute |
| CitationUsagePageLoad | ≈ 15 (= 613/41) |
| CitationUsage | ≈ 3 (= 37/12) |
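The table divides a per-second event rate by a per-minute error rate. As a rough cross-check, the sketch below reproduces those ratios and also computes events per error with both rates expressed per minute. The function name is hypothetical; the input figures are the ones quoted from grafana above.

```python
def events_per_error(events_per_second: float, errors_per_minute: float) -> float:
    """Events logged per client side error, with both rates per minute."""
    return events_per_second * 60 / errors_per_minute


# (events per second, errors per minute) as quoted in the comment above.
rates = {
    "CitationUsagePageLoad": (613, 41),
    "CitationUsage": (37, 12),
}

for schema, (eps, epm) in rates.items():
    table_ratio = eps / epm  # the ratio as written in the table
    per_minute_ratio = events_per_error(eps, epm)
    print(f"{schema}: table ratio {table_ratio:.2f}, "
          f"events per error {per_minute_ratio:.0f}")
```

Converting to common units multiplies both rows by 60, so the comparison between the two schemas is unchanged: CitationUsage still produces errors roughly five times as often per event as CitationUsagePageLoad.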

Since the citation link text field is part of the CitationUsage schema only, limiting it would only reduce the number of errors for CitationUsage. Even if we think that the link text length is responsible for 2/3 of the error events, the remaining 1/3 are explained by other fields. In order to fully fix the issue we'd have to limit other fields too. The question is, how would this affect the study?

Also, there's no deployment this week, so our change would be in production on 10/18. We're planning on turning off data collection on 10/29. I wonder if we should not change anything while we're collecting data (even though Nuria makes a good point at T206083#4637142) so as not to ruin the uniformity of data collection.

Change 468490 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/extensions/WikimediaEvents@master] CitationUsage: limit some parameter lengths

@Miriam I've submitted a patch to limit the link text to 100 characters and page title to 200 characters. Let me know if these numbers need to change. Thanks!
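The capping described above can be illustrated with a minimal sketch. This is not the code from the patch, which lives in the WikimediaEvents extension; the field names `link_text` and `page_title` and the function name are hypothetical, and only the limits of 100 and 200 characters come from the comment.

```python
# Limits taken from the comment above; field names are hypothetical.
FIELD_LIMITS = {
    "link_text": 100,
    "page_title": 200,
}


def cap_event_fields(event: dict) -> dict:
    """Return a copy of the event with the capped fields truncated."""
    capped = dict(event)
    for field, limit in FIELD_LIMITS.items():
        value = capped.get(field)
        if isinstance(value, str):
            # Slicing keeps at most `limit` characters and leaves
            # shorter strings unchanged.
            capped[field] = value[:limit]
    return capped


event = {"link_text": "a" * 250, "page_title": "b" * 300, "section_id": 3}
capped = cap_event_fields(event)
print(len(capped["link_text"]), len(capped["page_title"]))
```

A character cap like this reduces URL length but does not strictly bound it, since non-ASCII characters expand when percent-encoded.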

@Nuria I'd appreciate your review before the branch cut today. I'd like to get the fix in so that it goes out this Thursday. Thanks!

Change 468490 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Update schema revisions for CitationUsage and CitationUsagePageLoad

bmansurov added a project: Research.
bmansurov moved this task from Staged to Done (current quarter) on the Research board.