Filter out non-wikimedia domains and extensions from web-client-errors
Closed, Resolved (Public)

Description

Amongst the errors recorded in Logstash, I'm noticing errors originating from non-Wikimedia domains. These include:

These should not end up in logstash. They should be filtered out.

We may also want to filter out errors with the following in the stack traces (or send them to a separate channel):

  • moz-extension/
  • chrome-extension/
  • debugger eval code
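
A minimal sketch of the kind of stack-trace check described above, assuming the logged stack trace is available as a plain string. The function name `isOutOfOurControl` is hypothetical. The trailing slash from the markers listed above is dropped so that both `moz-extension/` as written in this task and the `moz-extension://` form used in browser URLs would match; this is not necessarily how the eventual patch does it.

```typescript
// Markers taken from the task description (without the trailing slash).
const EXTENSION_MARKERS: readonly string[] = [
  "moz-extension",
  "chrome-extension",
  "debugger eval code",
];

// Returns true when the stack trace mentions a browser extension or
// code evaluated in the developer console.
function isOutOfOurControl(stackTrace: string): boolean {
  return EXTENSION_MARKERS.some((marker) => stackTrace.includes(marker));
}
```

A caller could either drop such errors or route them to a separate channel, depending on which of the two options above is preferred.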

Event Timeline

Jdlrobson renamed this task from Filter out untrusted domains and extensions from web-client-errors to Filter out non-wikimedia domains and extensions from web-client-errors. Jul 31 2020, 8:47 PM
Jdlrobson updated the task description.

Hmm yes. Could we use a regex or pattern of some kind in the ResourceLoader URL (if we want to focus exclusively on the MediaWiki client)? @Krinkle what do you think? We'll be cutting out errors from browser extensions, but maybe that's another channel like you say, @Jdlrobson.

jlinehan triaged this task as High priority.
jlinehan moved this task from Task Backlog to Doing on the Product-Data-Infrastructure board.

I would recommend against filtering these out from the client or in the ingestion pipeline. If these scripts are executing on our pages (not in a browser extension thread, but actually on our pages), that seems fine to keep and could actually help in some way.

Blocking these and/or requiring opt-in is part of the ongoing CSP endeavour and is outside the scope of the web client error handling.

I'd recommend focussing only on filtering out those without URLs or from browser extensions. Anything else, if needed, will either be in the noise, or can be filtered out in a Kibana dashboard preset that we enable by default.

That sounds good. So for now why don't we see what kind of impact https://phabricator.wikimedia.org/T259369 has on the event volume, and then play with some filters in logstash and see if we can tune enough of this out. We can revisit afterwards. Sound ok @Jdlrobson?

My main concern here is what kind of traffic the endpoint can take. If we're happy to log these errors from external clients that's fine, but I am personally not interested in them and my goal is having this error logging on all domains including English Wikipedia.

Separating these out into a second channel seems sensible to me, but we can reconsider as we roll out to further larger wikis.

jlinehan lowered the priority of this task from High to Low. Aug 24 2020, 3:23 PM
jlinehan moved this task from Doing to Reviewing on the Product-Data-Infrastructure board.

I'd like to chat through this one and up its priority.
While it seems this isn't a problem from an ingestion point of view, there's a lot of noise from foreign domains at this point, which puts a burden on the end user (e.g. me) to filter these out.

While the moz-extension and chrome-extension ones are harmless and easy to filter, every new domain I discover requires an additional filter. I have 38 filters on our board relating to this, and rising!

Filters can be combined into a single multi-value filter instead of many separate ones. I'd also recommend enabling it by default in the saved dashboard (they're currently disabled presets that one has to enable one by one, ad hoc. I don't know if this was intentional?).

I'll also repeat from other tasks that proactively excluding each domain does not seem a useful investment, given that most are far below any reasonable threshold we'd set. My general recommendation would be to 1) upon proactive triage, look at the top 3 most common ones by aggregate normalized_message over the past N hours and then move on with other chores, and 2) only go further if and when the total volume for a given wiki domain has reached a predefined alerting threshold, at which point we'd see where the majority is coming from and fix or exclude accordingly. If the spike is a false positive and comes from a third-party domain, we'd add it to the default filter of this dashboard.
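
The two-step triage described above could be sketched roughly as follows. This is only an illustration over an in-memory list of errors: the `wiki` field, the function names, and the threshold value are assumptions, and in practice the aggregation would more likely be done in a Kibana visualization than in code.

```typescript
interface LoggedError {
  wiki: string;               // hypothetical field: the wiki domain the error was logged for
  normalized_message: string; // field named in the comment above
}

// Count occurrences per key and return [key, count] pairs, most frequent first.
function countBy(
  errors: readonly LoggedError[],
  key: (e: LoggedError) => string,
): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const e of errors) {
    const k = key(e);
    counts.set(k, (counts.get(k) ?? 0) + 1);
  }
  return Array.from(counts).sort((a, b) => b[1] - a[1]);
}

// Step 1: the most common messages in the window being triaged.
function topMessages(errors: readonly LoggedError[], limit = 3): Array<[string, number]> {
  return countBy(errors, (e) => e.normalized_message).slice(0, limit);
}

// Step 2: wikis whose total volume has reached the alerting threshold.
function wikisOverThreshold(errors: readonly LoggedError[], threshold: number): string[] {
  return countBy(errors, (e) => e.wiki)
    .filter(([, count]) => count >= threshold)
    .map(([wiki]) => wiki);
}
```

Only the wikis returned by the second step would warrant a closer look at where the majority of errors comes from.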

We could also invert the filter and only specify the instrumented first-party domains, that might be simpler right now.
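
As a rough sketch of the inverted approach, assuming a URL field from the logged error is compared against a list of first-party domain suffixes. The suffix list and the function name `isFirstParty` are illustrative assumptions; the actual set of instrumented domains is not given in this discussion.

```typescript
// Hypothetical allowlist; replace with the domains that are actually instrumented.
const FIRST_PARTY_SUFFIXES: readonly string[] = ["wikipedia.org", "wikimedia.org"];

// Returns true when the URL's host is one of the allowed domains or a
// subdomain of one. Missing or unparseable URLs are treated as not first-party.
function isFirstParty(
  rawUrl: string,
  suffixes: readonly string[] = FIRST_PARTY_SUFFIXES,
): boolean {
  let host: string;
  try {
    host = new URL(rawUrl).hostname;
  } catch {
    return false;
  }
  // Match on a label boundary so that "evilwikipedia.org" does not pass.
  return suffixes.some((s) => host === s || host.endsWith("." + s));
}
```

With an allowlist, newly discovered third-party domains need no extra filter; only newly instrumented first-party domains have to be added.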

Do those errors contain any useful information? I would expect errors originating from off-domain scripts to have their details hidden due to CORS restrictions.

I am aware that filters can be multi-value. I am relying on this, and updating the dozens of active filters as we roll out. I wish I wasn't, but right now I don't see much choice. Note that many errors may occur at low frequency every day, but that doesn't mean they are not important. For example, there was an issue this week with data loss on ContentTranslation.

The presets are disabled by default intentionally. Right now, as we roll out, we are concerned with volume, so no filters are applied and we can monitor that. The reading web team has their own dashboard, which filters out gadgets and external domains. My assumption is that other teams will do the same, and this one will be non-filtered by default, but the disabled presets are there to help those who need them.

@Tgr most are "script error". The errors from other domains do contain stack traces in the case where the host of url is the same as that of file_uri (e.g. translatoruser-int.com)
e.g. https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2020.09.29/clienterror/?id=AXTbopKOLNRtRo5X2W5F

at Object.i.getStyleValue https://www.translatoruser-int.com/static/js/BVToolkit.min.js:1:123029
at dt https://www.translatoruser-int.com/static/js/BVToolkit.min.js:1:8534
at lu https://www.translatoruser-int.com/static/js/BVToolkit.min.js:1:1769
at https://www.translatoruser-int.com/static/js/BVToolkit.min.js:1:3294

Chrome and Firefox extensions also have stack traces.

Some of the scripts loaded from elsewhere have stack traces; however, these might relate to AJAX request callbacks
e.g. https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2020.09.29/clienterror/?id=AXTcCcr3LNRtRo5X_6I_

Having an EventGate or Logstash filter that sets a flag on errors likely originating in non-production code, and then using that flag in Kibana to filter dashboards, seems unobjectionable to me, if you think Logstash/EventGate is a better place to handle the filter conditions. It could even be done on the client side; that has performance implications, but probably only trivial ones.
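
A sketch of what such a flag could look like, written as a plain function rather than as real EventGate or Logstash configuration. The `stack_trace` field name, the `likely_non_production` flag name, and the caller-supplied `isProductionHost` predicate are all assumptions made for illustration; only `file_uri` is a field mentioned in this thread.

```typescript
interface ClientError {
  file_uri: string;                // script in which the error was raised
  stack_trace: string;             // hypothetical field name
  likely_non_production?: boolean; // hypothetical flag name
}

// Extract the host of a URL, or null when it is missing or unparseable.
function hostOf(rawUrl: string): string | null {
  try {
    return new URL(rawUrl).hostname;
  } catch {
    return null;
  }
}

// Returns a copy of the event with the flag set instead of dropping the
// event, so dashboards can choose whether to filter on it.
function withNonProductionFlag(
  event: ClientError,
  isProductionHost: (host: string) => boolean,
): ClientError {
  const fileHost = hostOf(event.file_uri);
  const fromExtension = /(moz|chrome)-extension/.test(event.stack_trace);
  // Errors without a usable file URL are flagged as well.
  const offDomain = fileHost === null || !isProductionHost(fileHost);
  return { ...event, likely_non_production: fromExtension || offDomain };
}
```

Because nothing is discarded, the unfiltered volume remains visible while a dashboard preset can hide the flagged events.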

FWIW, in the last 24 hours we recorded 38,523 errors. When I filter out bugs originating from other domains, that number drops to 21,041. I'm not sure how big a problem this is in terms of volume for when we enable English Wikipedia, but while it is interesting to know these errors are occurring, nobody cares about those scripts just yet.

Change 654470 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/WikimediaEvents@master] Do not log errors for browser extensions out of our control

https://gerrit.wikimedia.org/r/654470

I think as a temporary measure we should disable these before rolling out to English Wikipedia. There are a lot of browser extensions running on English Wikipedia, and that will make the January rollout a lot less stressful.

The total log volume of Logstash is about a hundred million items a day, so I wouldn't be too worried about it. I don't know about EventGate, but surely it's also in the millions.

That said one could make a privacy argument for filtering: we shouldn't log what browser extensions our readers are using, it's a kind of browser fingerprinting.

Change 654470 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Do not log errors for browser extensions out of our control

https://gerrit.wikimedia.org/r/654470

FYI, at the time of writing, extension errors account for 10,000 of the 30,000 errors we currently see.
@Tgr thanks for the reassurance around Logstash volume; however, sadly we don't have any sense of how common errors from extensions will be at English Wikipedia traffic levels. Agree that privacy is also a good motivation here. We can reconsider our approach once we've rolled out to English Wikipedia.