
[Spike, 1hr]: Analyse Catalan Wikipedia error logs for iOS error
Closed, Resolved · Public · Spike

Description

Blocked on T258073

The error tracking we have for MediaWiki.org is great, but it doesn't represent the majority of our users. Looking at the WebClientError counts, most of our errors occur on English or Japanese Wikipedia.

A high-profile iOS error is going unaccounted for and is not showing up in mediawiki.org's error logs. On MediaWiki.org, 3.1k of the 43.3k total errors are from iOS (7%).

On Catalan Wikipedia, however, 17.5% of errors come from iOS (50.6k iOS errors per day out of 290.8k errors per day).
KM Wikipedia looks like a good choice: 22% of errors came from iOS, and it sees fewer errors than mediawiki.org (6.9k of 31.1k in a day).
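
As a quick sanity check, the shares quoted above can be recomputed from the raw counts. This is only a sketch: the `ios_share` helper and the dictionary keys are made up for illustration, and the inputs are the rounded figures from this description, read as thousands of errors per day.

```python
def ios_share(ios_errors: float, total_errors: float) -> float:
    """Return iOS errors as a percentage of all errors."""
    if total_errors <= 0:
        raise ValueError("total_errors must be positive")
    return 100.0 * ios_errors / total_errors


# Rounded daily counts (in thousands) as quoted in the task description:
# (iOS errors, total errors). The keys are illustrative labels only.
counts = {
    "mediawiki.org": (3.1, 43.3),
    "ca.wikipedia": (50.6, 290.8),
    "km.wikipedia": (6.9, 31.1),
}

shares = {
    wiki: round(ios_share(ios, total), 1)
    for wiki, (ios, total) in counts.items()
}
```

Because the inputs are already rounded, the recomputed shares agree with the quoted 7%, 17.5% and 22% only to within a few tenths of a percentage point.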

We will be enabling error tracking on Catalan Wikipedia. Once that's done, analyze the data coming in and make tasks for potential sources of the bug.

Notes

My understanding is this would require pushing errors to WMEClientErrorIntakeURL.

Based on turnilo iOS errors and error totals

  • If any of the wikis mentioned on T249297 can be considered, that would be helpful for Desktop Improvements and Vue.js search.
  • Example config patch

Event Timeline

Jdlrobson added a subscriber: Krinkle.
Reedy renamed this task from Enable error tracking on mobile KM.Wikipedia wiki to catch iOS specific bugs to Enable error tracking on mobile KM.Wikipedia wiki to catch iOS specific bugs. Jun 11 2020, 9:19 PM

The error tracking we have for MediaWiki.org is great, but it doesn't represent the majority of our users. Looking at the WebClientError counts, most of our errors occur on English or Japanese Wikipedia.

Is this something @jlinehan can help with?

@Niedzielski yep, does this task have a priority?

@Jdlrobson, @ovasileva can we prioritize this task? We need similar functionality for Vector Vue.js work: T249826.

ovasileva triaged this task as Medium priority. Jul 6 2020, 10:39 AM
Jdlrobson updated the task description. Jul 6 2020, 4:21 PM
Jdlrobson added a subscriber: JTannerWMF.

@JTannerWMF @jlinehan @ovasileva happy to talk through this in a videochat if that's helpful. I think first of all we're just keen to know if it's possible to do this and secondly what needs to happen (and who) to enable this.

Jdlrobson renamed this task from Enable error tracking on mobile KM.Wikipedia wiki to catch iOS specific bugs to Enable error tracking on a Wikipedia mobile wiki to catch and fix high volume bugs. Jul 9 2020, 11:18 PM

@JTannerWMF @ovasileva I think we need to make this high priority and meet with @jlinehan to discuss the problem

To put this in perspective, our end users are hitting 15.5 million bugs every hour, with Chrome Mobile and Mobile Safari being the most affected, and 15.1 million of those bugs were from Wikipedia. I really think that being able to capture errors from at least one Wikipedia project should help us identify what's happening. For comparison, only 2k errors happen in an hour on mobile on MediaWiki.org.

Ideally we'd pick a project from this list (breakdown per project) where the risk of enabling it is low and the likelihood of uncovering bugs that affect all our projects is high.

Krinkle added a comment. Edited Jul 10 2020, 1:35 AM

We don't need to distinguish between mobile and desktop. enwiki and jawiki are too big I think.

Looks like this would be covered by T246030, which is about enabling it on a small Wikipedia. The client error instrumentation has since been enabled on haw.wikipedia.org (naturally including mobile). If that one is too small, perhaps we should enable it on ca.wikipedia (Catalan) instead of haw.wikipedia. Catalan is about 100x larger in traffic (300K views/mo vs 30M views/mo), but still fairly small – mediawiki.org is currently 10M views/mo. (stats.wikimedia.org)

ovasileva raised the priority of this task from Medium to High. Jul 10 2020, 10:09 AM

first of all we're just keen to know if it's possible to do this

Yes, it certainly is. Better documentation is coming; for now, I'm sorry for it being a little ad hoc.

and secondly what needs to happen (and who) to enable this.

Background: right now we aren't sampling the errors: if an error happens, it gets sent. There are reasons why not sampling might be desirable once the overall error rate comes down (we don't want to miss a rare bug because it was out of sample, etc). The result is that on a high or even medium traffic wiki, the number of error events could be very large and strain some parts of the backend. So the solution for now is to enable the instrument on a low-ish traffic wiki and suppose that the errors there will act sort of like a sample of the errors on any other wiki.
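
To make the trade-off concrete, here is a rough sketch of what per-session sampling could look like. It is not how the instrument works today (as noted above, nothing is sampled), and `in_sample`, its parameters, and the hashing scheme are all hypothetical.

```python
import hashlib


def in_sample(session_id: str, sample_rate: float) -> bool:
    """Deterministically decide whether a session's errors would be sent.

    sample_rate=1.0 reproduces the unsampled behaviour described above
    (every error is sent); lower rates keep a stable subset of sessions.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Hashing the session ID rather than drawing a random number means a given session is either always in or always out of the sample, so a rare bug hitting an out-of-sample session is never seen. That is the risk mentioned above.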

So for what needs to happen, first is to decide on a project that has sufficient traffic that you feel like there is a good enough sample to catch the high-volume bugs. Currently we have it enabled on Hawaiian Wikipedia, but perhaps that is too low-traffic. What do you think about @Krinkle's suggestion of Catalan wiki?

After that, all that needs to happen is a change to the MediaWiki config to enable the instrument on that Wiki. I can help with that.

@JTannerWMF @ovasileva I think we need to make this high priority and meet with @jlinehan to discuss the problem

I'll reach out to you and set something up so we can talk. I'll put any takeaways back on this ticket.

Jdlrobson added a comment. Edited Jul 10 2020, 3:49 PM

Catalan could work. According to the breakdown, Catalan is seeing around 11.9k errors on mobile (from our error counting) every hour compared to 1.6k on mediawiki.org, so this seems like it should allow us to make some significant discoveries. Thank you!

Niedzielski updated the task description. Jul 14 2020, 6:12 PM
Niedzielski updated the task description.

Per our chat today, @jlinehan and @Mholloway will enable on Catalan Wikipedia. Once that's done, please poke me here, and I'll create the required follow-up tasks on our side.

Jdlrobson reassigned this task from ovasileva to jlinehan. Jul 15 2020, 5:13 PM
Jdlrobson added subscribers: LGoto, ovasileva.

@LGoto we met with @jlinehan and @Mholloway yesterday and it was our understanding that product infrastructure would be making this config change this week. I noticed, however, that you moved this to tracking - does that mean I misunderstood?

LGoto added a comment. Jul 15 2020, 5:22 PM

@Jdlrobson Thanks for checking, @jlinehan has a separate, more specific task (T258073) which is on the PI kanban board and currently in progress. He advised that this broader task be moved to tracking in the meanwhile. Let me know if you have any concerns!

Jdlrobson renamed this task from Enable error tracking on a Wikipedia mobile wiki to catch and fix high volume bugs to [Spike, 1hr]: Analyse Catalan Wikipedia error logs for iOS error. Jul 15 2020, 5:27 PM
Jdlrobson removed jlinehan as the assignee of this task.
Jdlrobson updated the task description.
Restricted Application changed the subtype of this task from "Task" to "Spike". Jul 15 2020, 5:27 PM

Thanks! Okay I've repurposed this task as a spike on our end.

I had a look at the new errors coming in and this has exposed T257872 as a bigger problem than I realised. It has high potential for high volume, so it is certainly a possible explanation for the sheer numbers we're seeing on mobile.

Jdlrobson added a comment. Edited Jul 17 2020, 6:25 PM

@jlinehan after analyzing the logs a little more - the errors we were seeing were actually a large number of errors from a small number of users - in fact, in a 12hr period, there were 3,278 errors on mobile, 3,029 of which came from 1 individual across 2 pages.

Expanding the search query to include extensions and gadgets, I should note that 38,830 of the 48,175 errors logged in a 12 hr period came from the same IP address and same set of user scripts which is insane!
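
For anyone repeating this analysis, the "how concentrated are the errors" question can be sketched as below. The `top_source_share` helper and the client labels are hypothetical, and the list is synthetic data shaped to match the mobile figures above (3,029 of 3,278), not real log rows.

```python
from collections import Counter
from typing import Iterable, Tuple


def top_source_share(sources: Iterable[str]) -> Tuple[str, int, float]:
    """Return the noisiest source, its error count, and its share in percent."""
    counts = Counter(sources)
    if not counts:
        raise ValueError("no errors to analyse")
    source, n = counts.most_common(1)[0]
    total = sum(counts.values())
    return source, n, 100.0 * n / total


# Synthetic stand-in for the 12hr mobile sample: one client with 3,029
# errors and 249 other clients with one error each (3,278 in total).
sample = ["client-a"] * 3029 + [f"client-{i}" for i in range(249)]
source, count, share = top_source_share(sample)
```

With these numbers a single client accounts for roughly 92% of the mobile errors. The same arithmetic on the wider query (38,830 of 48,175) gives roughly 81%.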

We're going to investigate limiting the counting of errors to a maximum of 5 per page/user session, to see what impact that has on our own error counts during next week's deploy. I think what we learn from that is going to be important for rolling out the client-side error tracking further. What we're seeing is that most errors come from a small number of places. Based on that, we may want to do one of two things:

  • Have some kind of IP blacklist for editors with problematic user scripts/extensions
  • Limit the number of errors from a single user/IP like we're doing.
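
The second option can be sketched as a simple counter keyed on session and page. This is only an illustration of the logic in Python, assuming a cap of 5; the real change lives in client-side code, and `ErrorLimiter`, `should_report`, and the session/page identifiers here are hypothetical names.

```python
from collections import defaultdict


class ErrorLimiter:
    """Allow at most `max_errors` reports per (session, page) pair."""

    def __init__(self, max_errors: int = 5) -> None:
        self.max_errors = max_errors
        self._seen = defaultdict(int)

    def should_report(self, session_id: str, page: str) -> bool:
        key = (session_id, page)
        if self._seen[key] >= self.max_errors:
            # Cap reached: drop the error instead of logging it.
            return False
        self._seen[key] += 1
        return True


# A single noisy session throwing 3,029 errors on one page is only
# reported up to the cap.
limiter = ErrorLimiter(max_errors=5)
reported = sum(
    limiter.should_report("session-1", "Page_A") for _ in range(3029)
)
```

Keying on session and page rather than IP avoids penalising several users who share an address, at the cost of letting one user report up to the cap on every page they visit.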

I wanted to cut a bug for that but I wasn't sure where to put it... what do you recommend?

I'll sign off

Early signs indicate the error rate has been reduced by almost half since we limited the number of tracked errors. Am waiting for more data before drawing conclusions.

LGoto removed a subscriber: LGoto. Jul 28 2020, 2:15 AM
Jdlrobson closed this task as Resolved. Edited Aug 4 2020, 4:15 PM

Limiting errors per client to 5 halved the errors we were counting on Minerva.

The "Uncaught Error: Set map center and zoom first." error (T255204) was sending 1,782 events per day on Catalan Wikipedia, meaning roughly 356 users were sending 5 errors each. After a fix for that bug was provided and backported on 30th July at 11am, the errors disappeared from Catalan Wikipedia.
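
The user estimate is just the daily event count divided by the cap, assuming every affected client hit the limit of 5. A minimal sketch of that arithmetic (variable names are illustrative):

```python
events_per_day = 1_782   # events logged for T255204 on Catalan Wikipedia
cap_per_client = 5       # maximum errors recorded per client

# Number of clients that could have sent a full 5 errors, plus any
# events left over after those clients are accounted for.
capped_clients, leftover_events = divmod(events_per_day, cap_per_client)
```

This gives 356 clients at the cap with 2 events left over. If some clients sent fewer than 5 errors, the true number of affected users would be higher, so 356 is best read as a lower bound.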

On Monday, when this change rolled out to English Wikipedia, there was not much change to the error count, which suggests that while this error was prominent, limiting to 5 errors per person was enough to extinguish the noise.

Using the logs from Catalan Wikipedia, I have reached out to various editors to fix problematic gadgets (and fixed some myself). As a result of these changes, over a seven-day period I'm seeing 3,567 errors per 24hrs from Catalan Wikipedia (for comparison, it was around 30,000+ every 3hrs when we first enabled it).

Based on this, I'm resolving the task and suggesting the follow-up work in this order:

  • T259371 - Limit the number of errors from a single client (suggested: maximum 5 errors should be recorded for a single IP on a single page)
  • T259383 - Filter out non-Wikimedia domains
  • T258099 - Enable on Hebrew Wikipedia
  • T256173 - Allow filtering of errors from logged in users
  • T259369 - "Script error." Gadgets loaded from other domains should have more actionable error messages

cc @jlinehan :)