Filter out non-wikimedia domains and extensions from web-client-errors
Closed, ResolvedPublic

Assigned To

Authored By

	Jdlrobson
	Jul 31 2020, 7:30 PM

Description

Amongst the errors recorded in logstash I'm noticing errors originating from non-Wikimedia domains. These include:

http://localhost
https://siteprerender.com/
https://3001.scriptcdn.net
www.translatoruser-int.com
interpreter.caiyunai.com
translate.googleusercontent.com
https://crisgrey.com/

These should not end up in logstash. They should be filtered out.

We may also want to filter out errors with the following in the stack traces (or send them to a separate channel):

moz-extension/
chrome-extension/
debugger eval code
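
A minimal sketch of what such a stack trace check could look like. This is not the code from the eventual patch: the function name `looksLikeExtensionError` is hypothetical, and matching on the full `moz-extension://` and `chrome-extension://` schemes (rather than the shortened markers listed above) is an assumption based on how browsers write extension URLs.

```javascript
// Hypothetical helper for illustration only; not taken from WikimediaEvents.
// Assumption: extension frames appear in the stack with their URL scheme,
// e.g. "chrome-extension://<id>/content.js:1:2".
const EXTENSION_SCHEME = /(?:moz|chrome)-extension:\/\//;

// Firefox labels code typed into the developer console this way.
const CONSOLE_MARKER = 'debugger eval code';

function looksLikeExtensionError( stack ) {
	if ( typeof stack !== 'string' || stack === '' ) {
		// No stack trace: nothing to match against, so do not classify it here.
		return false;
	}
	return EXTENSION_SCHEME.test( stack ) || stack.includes( CONSOLE_MARKER );
}
```

A caller could either drop matching errors or route them to a separate channel, as suggested above.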

Details

	Subject	Repo	Branch	Lines +/-
	Do not log errors for browser extensions out of our control	mediawiki/extensions/WikimediaEvents	master	+15 -3


Related Objects

Status	Assigned	Task
Open	None	T49145 Formally deprecate jQuery UI after we've stopped using jQuery UI in extensions and core
Open	None	T100270 Replace use of jQuery UI and MW UI with OOUI across all Wikimedia-deployed extensions and core
Open	None	T85394 Use OOUI suggestions/autocompletion components only (instead of jquery.suggestions, jquery.ui.autocomplete)
Open	None	T125725 [epic] Update autocomplete search box with metadata and remove and delete the old searchSuggest system
Open	None	T177251 Dead keys prevent autocomplete in search box
Resolved	ovasileva	T244392 [GOAL] Deploy the new Vue.js search experience
Resolved	ovasileva	T275200 Analyze results of A/B test for new search widget
Resolved	ovasileva	T249297 Deploy the new Vue.js search experience
Resolved	• jlinehan	T255585 [EPIC] Extend client-side error logging coverage to include English Wikipedia
Resolved	• jlinehan	T259383 Filter out non-wikimedia domains and extensions from web-client-errors

Event Timeline

Jdlrobson created this task. Jul 31 2020, 7:30 PM

Restricted Application added a subscriber: Aklapper. Jul 31 2020, 7:30 PM

Jdlrobson renamed this task from Filter out untrusted domains and extensions from web-client-errors to Filter out non-wikimedia domains and extensions from web-client-errors. Jul 31 2020, 8:47 PM

Jdlrobson updated the task description.

Jdlrobson updated the task description. Jul 31 2020, 8:56 PM

• jlinehan added a project: Product-Data-Infrastructure. Aug 3 2020, 2:35 PM

• jlinehan moved this task from Inbox to Task Backlog on the Product-Data-Infrastructure board. Aug 3 2020, 3:49 PM

Jdlrobson mentioned this in T255204: [Spike, 1hr]: Analyse Catalan Wikipedia error logs for iOS error. Aug 4 2020, 4:15 PM

• jlinehan added a parent task: T255585: [EPIC] Extend client-side error logging coverage to include English Wikipedia. Aug 6 2020, 2:32 PM

Hmm yes. Could we use a regex or pattern of some kind in the ResourceLoader URL (if we want to focus exclusively on the MediaWiki client)? @Krinkle what do you think? We'll be cutting out errors from browser extensions, but maybe that's another channel like you say, @Jdlrobson.

• jlinehan claimed this task. Aug 11 2020, 2:59 PM

• jlinehan triaged this task as High priority.

• jlinehan moved this task from Task Backlog to Doing on the Product-Data-Infrastructure board.

• jlinehan mentioned this in T259369: "Script error." Scripts loaded from other domains with empty file_uri and no stack trace should not be included. Aug 11 2020, 3:19 PM

I would recommend against filtering these out from the client or in the ingestion pipeline. If these scripts are executing on our pages (not in a browser extension thread, but actually on our pages), that seems fine to keep and could actually help in some way.

Blocking these and/or requiring opt-in is part of the ongoing CSP endeavour and is outside the scope of the web client handling.

I'd recommend focussing only on filtering out those without URLs or from browser extensions. Anything else, if needed, will either be in the noise, or can be filtered out in a Kibana dashboard preset that we enable by default.

In T259383#6376520, @Krinkle wrote:

I would recommend against filtering these out from the client or in the ingestion pipeline. If these scripts are executing on our pages (not in a browser extension thread, but actually on our pages), that seems fine to keep and could actually help in some way.

That sounds good. So for now why don't we see what kind of impact https://phabricator.wikimedia.org/T259369 has on the event volume, and then play with some filters in logstash and see if we can tune enough of this out. We can revisit afterwards. Sound ok @Jdlrobson?

My main concern here is what kind of traffic the endpoint can take. If we're happy to log these errors from external clients that's fine, but I am personally not interested in them and my goal is having this error logging on all domains including English Wikipedia.

Separating these out into a second channel seems sensible to me, but we can reconsider as we roll out to further larger wikis.

• jlinehan lowered the priority of this task from High to Low. Aug 24 2020, 3:23 PM

• jlinehan moved this task from Doing to Reviewing on the Product-Data-Infrastructure board.

I'd like to chat through this one and up its priority.
While it seems this isn't a problem from an ingestion point of view, there's a lot of noise from foreign domains at this point, which puts a burden on the end user (e.g. me) to filter these out.

While the moz-extension and chrome-extension ones are harmless and easy to filter, every new domain I discover requires an additional filter. I have 38 filters on our board relating to this and rising!

• jlinehan moved this task from Reviewing to Next on the Product-Data-Infrastructure board. Sep 15 2020, 9:15 PM

Filters can be combined into a single multi-value filter instead of many separate ones. I'd also recommend enabling it by default in the saved dashboard (they're currently disabled presets that one has to enable one by one, ad hoc; I don't know if this was intentional).

I'll also repeat from other tasks that proactively excluding each domain does not seem a useful investment, given that most are far below any reasonable threshold we'd set. My general recommendation would be to 1) upon proactive triage, look at the top 3 most common ones by aggregate normalized_message over the past N hours and then move on with other chores, and 2) only go further if and when the total volume for a given wiki domain has reached a predefined alerting threshold, at which point we'd see where the majority is coming from and fix or exclude accordingly. If the spike is a false positive from a third-party domain, we'd add it to the default filter of this dashboard.

We could also invert the filter and only specify the instrumented first-party domains, that might be simpler right now.
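
A rough sketch of what the inverted (allowlist) check could look like if it were expressed in code rather than as a Kibana filter. The function name `isFirstPartyUrl` and the two-entry suffix list are assumptions for illustration; the real list of instrumented domains would be longer.

```javascript
// Assumption: an illustrative, incomplete list of first-party domain suffixes.
const FIRST_PARTY_SUFFIXES = [ 'wikipedia.org', 'wikimedia.org' ];

// Hypothetical helper: true only when the URL's hostname is one of the
// allowed suffixes or a subdomain of one.
function isFirstPartyUrl( fileUri, suffixes = FIRST_PARTY_SUFFIXES ) {
	let hostname;
	try {
		hostname = new URL( fileUri ).hostname;
	} catch ( e ) {
		// Empty string or not an absolute URL (e.g. a bare hostname).
		return false;
	}
	return suffixes.some(
		// Compare on a label boundary so "evilwikipedia.org" does not match.
		( suffix ) => hostname === suffix || hostname.endsWith( '.' + suffix )
	);
}
```

With an allowlist, a newly discovered third-party domain needs no new filter; only a newly instrumented first-party domain does.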

Krinkle mentioned this in T262627: Error: vs Uncaught Error: in client error messages. Sep 28 2020, 9:35 PM

Do those errors contain any useful information? I would expect errors originating from off-domain scripts to have their details hidden due to CORS restrictions.

I am aware that filters can be multi-value. I am relying on this, and updating the dozens of active filters as we roll out. I wish I wasn't, but right now I don't see much choice. Note that many errors may occur at low frequency every day, but that doesn't mean they are not important. For example, there was an issue this week with data loss on ContentTranslation.

The presets are disabled by default intentionally. Right now, as we roll out, we are concerned with volume, so no filters are applied and we can monitor that. The reading web team has their own dashboard which filters out gadgets and external domains. My assumption is that other teams will do the same, and this one will be non-filtered by default, but the disabled presets are there to help those who need them.

@Tgr most are "Script error". The errors from other domains do contain stack traces in the case where the host of url is the same as that of file_uri (e.g. translatoruser-int.com)
e.g. https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2020.09.29/clienterror/?id=AXTbopKOLNRtRo5X2W5F

	at Object.i.getStyleValue  https://www.translatoruser-int.com/static/js/BVToolkit.min.js:1:123029
	at dt  https://www.translatoruser-int.com/static/js/BVToolkit.min.js:1:8534
	at lu  https://www.translatoruser-int.com/static/js/BVToolkit.min.js:1:1769
	at https://www.translatoruser-int.com/static/js/BVToolkit.min.js:1:3294

Chrome and Firefox extensions also have stack traces.

Some of the scripts loaded from elsewhere have stack traces; however, these might relate to AJAX request callbacks
e.g. https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2020.09.29/clienterror/?id=AXTcCcr3LNRtRo5X_6I_

Having an EventGate or Logstash filter that sets a flag on errors likely originating in non-production code, and then using that flag in Kibana to filter dashboards, seems unobjectionable to me, if you think Logstash/EventGate is a better place to handle the filter conditions. It could even be done on the client side; that has performance implications, but probably only trivial ones.
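
To make the flag idea concrete, here is a sketch of the tagging step written as plain JavaScript; the same conditions could in principle live in a Logstash or EventGate filter instead. The function name `flagNonProductionError`, the `non_production` flag and the `stack_trace` field name are all assumptions (only `file_uri` is a field name mentioned in this task), and the suffix list is passed in by the caller.

```javascript
// Hypothetical sketch: returns a copy of the event with a boolean
// `non_production` flag. Nothing is dropped; dashboards would filter on the flag.
// Assumed event shape: { file_uri: string, stack_trace: string, ... }.
function flagNonProductionError( event, firstPartySuffixes ) {
	const fileUri = event.file_uri || '';
	const stack = event.stack_trace || '';

	// Browser extension frames carry their own URL scheme.
	const fromExtension = /(?:moz|chrome)-extension:\/\//.test( fileUri + '\n' + stack );

	let hostname = '';
	try {
		hostname = new URL( fileUri ).hostname;
	} catch ( e ) {
		// Empty or unparseable file_uri: the hostname stays unknown.
	}
	const firstParty = hostname !== '' && firstPartySuffixes.some(
		( suffix ) => hostname === suffix || hostname.endsWith( '.' + suffix )
	);

	// Object.assign onto a fresh object leaves the original event untouched.
	return Object.assign( {}, event, { non_production: fromExtension || !firstParty } );
}
```

Because the event is only annotated, the unfiltered volume stays visible while per-team dashboards can exclude flagged errors with a single condition.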

Jdlrobson edited projects, added DoNotUse---Instrument-ClientError; removed Product-Data-Infrastructure. Oct 5 2020, 9:25 PM

FWIW in the last 24hrs, we recorded 38,523 errors. When I filter out bugs originating from other domains, that number drops to 21,041. I'm not sure how big a problem this is in terms of volume for when we enable English Wikipedia, but while it's interesting to know these errors are occurring, nobody cares about those scripts just yet.

Change 654470 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/WikimediaEvents@master] Do not log errors for browser extensions out of our control

https://gerrit.wikimedia.org/r/654470

gerritbot added a project: Patch-For-Review. Jan 5 2021, 5:54 PM

I think as a temporary measure we should disable these before rolling out to English Wikipedia. There are a lot of browser extensions running on English Wikipedia, and disabling them will make the January rollout a lot less stressful.

The total log volume of Logstash is about a hundred million items a day, so I wouldn't be too worried about it. I don't know about EventGate, but surely it's also in the millions.

That said one could make a privacy argument for filtering: we shouldn't log what browser extensions our readers are using, it's a kind of browser fingerprinting.

Change 654470 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Do not log errors for browser extensions out of our control

https://gerrit.wikimedia.org/r/654470

FYI, at the time of writing, extension errors account for 10,000 of the 30,000 errors we currently see.
@Tgr thanks for the reassurance around logstash volume; however, sadly we don't have any sense of how common errors from extensions will be at English Wikipedia traffic levels. Agree that privacy is also a good motivation here. We can reconsider our approach once we've rolled out to English Wikipedia.

ReleaseTaggerBot added a project: MW-1.36-notes (1.36.0-wmf.26; 2021-01-12). Jan 6 2021, 11:00 PM

Maintenance_bot removed a project: Patch-For-Review. Jan 6 2021, 11:10 PM

Aklapper edited projects, added Instrument-ClientError; removed DoNotUse---Instrument-ClientError. Nov 24 2022, 1:55 PM
