
Bot throwing large amount of errors
Closed, Resolved · Public

Description

In the client-side error handling we're seeing several thousand errors from a single IP address.

IP address inside https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2020.09.30/clienterror?id=AXTgRozJLNRtRo5XZX3F&_g=h@44136fa

Uncaught TypeError: $(...).updateTooltipAccessKeys is not a function

So far the IP address seems to be fixed.

We need a reliable way to quickly stop such problematic clients from recording errors to the logs, or some kind of limit on the number of errors we accept from a single IP.

Event Timeline

Ottomata added a subscriber: Ottomata.

FYI, we are likely to remove client IPs in error logs from logstash: T262626: Remove http.client_ip from EventGate default schema (again)

IIRC client-side error logging has (or should have) some kind of aggregation or limiting if the stack traces are exactly the same. I can't remember for sure though; @jlinehan can confirm.

Anyway Product Infra should take a look. Not sure what the right thing to do is.

Why does it matter that it came from a specific IP address or user ID? If the error is common enough to hit our monitoring threshold, and investigation shows it is caused by something we can't or don't want to fix, then we can exclude it by error message, right?

IPs generally change quite often, so this isn't particularly stable anyway. It also suggests that the problem is specific to a single user, which seems unlikely over the long term.

Alternatively, if we believe something or someone is abusing our instrumentation and artificially causing errors that could be genuine but aren't, then we may want to provide a more granular and less sensitive way to filter that, e.g. by running a cheap hash (short, not one-to-one mappable, e.g. fnv32) over the wgUserId value or the GeoIP cookie value. That would be reasonably stable and yet not identifiable. We could add that to WikimediaEvents with a few lines of code.
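As a rough illustration of the hashing idea, here is a minimal sketch of 32-bit FNV-1a in JavaScript. It assumes "fnv32" means the FNV-1a variant. The `clientToken` helper, the `|` separator, and the choice to combine both values into one token are hypothetical and not taken from WikimediaEvents.

```javascript
// FNV-1a, 32-bit: a cheap, non-cryptographic hash.
const FNV_OFFSET_BASIS = 0x811c9dc5;
const FNV_PRIME = 0x01000193;

function fnv1a32(input) {
  // Hash the UTF-8 bytes so non-ASCII input is handled consistently.
  const bytes = new TextEncoder().encode(String(input));
  let hash = FNV_OFFSET_BASIS;
  for (const byte of bytes) {
    hash ^= byte;
    // Math.imul does the 32-bit multiply; >>> 0 keeps the result unsigned.
    hash = Math.imul(hash, FNV_PRIME) >>> 0;
  }
  return hash;
}

// Hypothetical helper: fold a user ID and a GeoIP cookie value into one
// short hex token that can be attached to an error event for grouping.
function clientToken(userId, geoCookie) {
  return fnv1a32(userId + '|' + geoCookie).toString(16).padStart(8, '0');
}
```

The token is stable for as long as both inputs stay the same, so errors from one client can be grouped without storing the raw values. Whether a 32-bit hash over these inputs is non-identifying enough is a judgment this sketch does not settle.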

Why does it matter that it came from a specific IP address or user ID?

I thought T265131 makes this pretty clear: we have a lot of errors that stem from the common.js of a single individual or a very small number of users. For example, I've seen a single client throw over 1000 errors in a single day, simply because they had included wikitext in their common JS. Without knowing whether errors are unique to a given IP, troubleshooting those sorts of problems becomes really hard, and it gives the impression that there's a problem with the code or MediaWiki:Common.js. The main reason we were able to diagnose https://phabricator.wikimedia.org/T264665 was the site module name in the file URI and knowing that it was impacting hundreds of IPs. Scrutinizing this data over the last 2 months has shown me that even if a gadget is used by only 2 people it can still produce a lot of noise. Being able to filter these out is extremely important, as these errors can drown out important lower-volume errors, such as errors in VisualEditor that cause data loss (for example T244114).

I think in this particular case however all that's being asked for is something to prevent denial of service attacks from a single session and we don't need to rely on IPs for that. We could do that in code on the client. Right now we limit to 5 errors per page view, but this issue made it clear to me that we probably want to limit errors from a single client.
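A client-side cap along these lines might look like the following sketch. It is not the existing WikimediaEvents code: `createErrorLimiter`, the storage key, and the per-client limit value are all assumptions, and `storage` stands for anything with `getItem`/`setItem`, such as `window.sessionStorage`.

```javascript
// Sketch of a two-level cap: per page view and per client session.
function createErrorLimiter(storage, options) {
  const perPageLimit = options.perPageLimit;     // 5 is the figure quoted above
  const perClientLimit = options.perClientLimit; // assumed value, to be tuned
  const key = options.storageKey || 'clientErrorCount'; // hypothetical key
  let pageCount = 0; // resets naturally on every page view

  // Returns true if the error should be logged, false if it should be dropped.
  return function shouldLog() {
    if (pageCount >= perPageLimit) {
      return false;
    }
    const stored = parseInt(storage.getItem(key), 10);
    const clientCount = Number.isNaN(stored) ? 0 : stored;
    if (clientCount >= perClientLimit) {
      return false;
    }
    pageCount += 1;
    storage.setItem(key, String(clientCount + 1));
    return true;
  };
}

// In-memory stand-in for sessionStorage so the sketch runs outside a browser.
function createMemoryStorage() {
  const data = new Map();
  return {
    getItem: (k) => (data.has(k) ? data.get(k) : null),
    setItem: (k, v) => { data.set(k, String(v)); }
  };
}
```

In a browser, an error handler would call `shouldLog()` before sending each event. Because the per-client count lives in storage rather than in a page-level variable, it survives across page views within the same session.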

E.g. by running a cheap hash (short, not one-to-one mappable, e.g. fnv32) over the wgUserId value or the GeoIP cookie value. That would be reasonably stable and yet not identifiable. We could add that to WikimediaEvents with a few lines of code.

Yeah, this sounds good. I think the generated session ID would be fine for this. We may want to look into T263041 first, though (on the plus side, those errors would magically get filtered out if we relied on it :) )

So this is now happening at an even higher frequency than before. We are dropping IPs from these soon, so I'm not sure what the plan going forward is for filtering these. The main user agent this client seems to use is "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.14.2 Chrome/77.0.3865.129 Safari/537.36", so perhaps we can filter by user agent?
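For illustration only, a user-agent filter could be as small as the sketch below. The blocklist and the `isBlockedUserAgent` name are hypothetical; the single entry is the QtWebEngine token from the user agent quoted above. Where such a check would run (client, EventGate, or Logstash) is left open.

```javascript
// Hypothetical blocklist; the only entry is the token from the user agent above.
const BLOCKED_UA_SUBSTRINGS = ['QtWebEngine/5.14.2'];

// Returns true if the user agent contains any blocked substring.
function isBlockedUserAgent(userAgent, blocked = BLOCKED_UA_SUBSTRINGS) {
  if (typeof userAgent !== 'string') {
    return false;
  }
  return blocked.some((needle) => userAgent.includes(needle));
}
```

Matching on a user-agent token also drops errors from every other client that shares that browser build, so this is coarser than a per-client limit.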

Jdlrobson claimed this task.

This is less of a problem on my side now.
If it becomes a problem again I recommend allowing a maximum of 50 errors from a single client.
Until then I'll close. Can reopen later if necessary.