
Rate limit requests in violation of User-Agent policy more aggressively
Open, Medium, Public

Description

Wikimedia's User-Agent policy specifically forbids using generic values for the User-Agent request header.

Apply stricter rate limiting to requests violating the policy.

Event Timeline

Restricted Application added a subscriber: Aklapper.
ema triaged this task as Medium priority. · Jun 3 2019, 3:04 PM

Change 514017 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_upload: return HTTP 403 to requests violating UA policy

https://gerrit.wikimedia.org/r/514017

Change 514017 merged by Ema:
[operations/puppet@production] cache_upload: return HTTP 403 to requests violating UA policy

https://gerrit.wikimedia.org/r/514017
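
For reference, the shape of that change in VCL terms is roughly the following. This is a minimal sketch, not the contents of the patch: the UA pattern list and the response text are assumptions.

```
vcl 4.0;

# Sketch: refuse requests whose User-Agent is missing or matches a few
# well-known generic values. The pattern list below is illustrative only.
sub vcl_recv {
    if (!req.http.User-Agent ||
        req.http.User-Agent ~ "^(?i)(python-requests|curl|wget|java|go-http-client)") {
        return (synth(403, "Forbidden: please follow the User-Agent policy"));
    }
}
```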

For Tech News: Bots and other scripts that do not set an identifiable User-Agent may find their requests blocked until they identify themselves properly.

Not sure if it applies here, but please remember that we allow Api-User-Agent as an alternative to User-Agent for JavaScript solutions.
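
On the Varnish side, honouring that alternative could look roughly like this. A sketch only: X-Policy-UA is a hypothetical scratch header used for illustration, not something the deployed VCL is known to use.

```
vcl 4.0;

sub vcl_recv {
    # Browsers do not let JavaScript override User-Agent, so API clients
    # may send Api-User-Agent instead; prefer it when checking the policy.
    if (req.http.Api-User-Agent) {
        set req.http.X-Policy-UA = req.http.Api-User-Agent;  # hypothetical header
    } else {
        set req.http.X-Policy-UA = req.http.User-Agent;
    }
    # Any policy checks would then match against req.http.X-Policy-UA
    # rather than req.http.User-Agent directly.
}
```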

ema renamed this task from Return HTTP 403 to requests in violation of User-Agent policy to Rate limit requests in violation of User-Agent policy more aggressively. · Jun 5 2019, 2:48 PM
ema updated the task description.

We (Traffic) have decided to continue allowing requests that violate the UA policy. Instead of blocking them, we will apply stricter rate limiting.

Change 513596 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: cache_upload rate limit

https://gerrit.wikimedia.org/r/513596

Change 513596 merged by Ema:
[operations/puppet@production] varnish: cache_upload miss/pass rate limit

https://gerrit.wikimedia.org/r/513596
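
For those following along, a miss/pass rate limit of this kind is typically built on the vsthrottle vmod from varnish-modules. The sketch below is an assumption about the shape, not a copy of the patch: the helper name, thresholds, and UA patterns are invented, and vsthrottle's is_denied() signature varies slightly across varnish-modules releases.

```
vcl 4.0;

import vsthrottle;

# Hypothetical helper: throttle clients sending missing/generic User-Agents.
sub ua_policy_rate_limit {
    if (!req.http.User-Agent ||
        req.http.User-Agent ~ "^(?i)(python-requests|curl|wget)") {
        # Illustrative budget: 10 requests per 10 seconds per client IP.
        if (vsthrottle.is_denied("generic-ua:" + client.ip, 10, 10s)) {
            return (synth(429, "Too Many Requests"));
        }
    }
}

# Applied on miss and pass only, so cache hits are never throttled.
sub vcl_miss { call ua_policy_rate_limit; }
sub vcl_pass { call ua_policy_rate_limit; }
```

Keying on client IP and limiting only backend-bound traffic means well-behaved cached requests are unaffected while scraping that would hit the origin is capped.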

TechNews: I've added it to the upcoming edition with this edit, which will be frozen for translation in about 18 hours. Please amend it before then if needed. (And thank you @Legoktm for writing the initial version!) Cheers!

Even with the current rate limiting, some crawlers are regularly causing issues, wasting precious SRE time.

I'd like to revisit this task to be stricter on user agents, perhaps progressively tightening how we enforce our policy. For example:

  • Keep rate limiting for generic curl and other command-line/testing tools
  • Forbid generic scripting UAs (e.g. python-requests, empty) from cloud providers
  • Ideally, later on, forbid generic scripting UAs from the whole Internet (except WMCS)

A variant would be to apply the above only on the upload cluster, but the fewer exceptions, the better. A sketch of the tiered approach follows below.
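
Here is what that tiering could look like, with everything hedged: the ACL contents are a documentation placeholder (a real list would have to be generated from published cloud-provider IP ranges), and the UA patterns and limits are illustrative.

```
vcl 4.0;

import vsthrottle;

# Placeholder ACL; a real one would enumerate cloud-provider ranges.
acl cloud_providers {
    "192.0.2.0"/24;  # TEST-NET-1, documentation range only
}

sub vcl_recv {
    if (!req.http.User-Agent ||
        req.http.User-Agent ~ "^(?i)(python-requests|curl|wget)") {
        if (client.ip ~ cloud_providers) {
            # Stricter tier: generic scripting UAs from cloud providers.
            return (synth(403, "Forbidden: please follow the User-Agent policy"));
        }
        # Default tier: keep the existing rate limit for everyone else
        # (a WMCS exemption, if kept, would be another ACL check here).
        if (vsthrottle.is_denied("generic-ua:" + client.ip, 10, 10s)) {
            return (synth(429, "Too Many Requests"));
        }
    }
}
```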


Agreed to all that, though I would not exempt WMCS: by virtue of already being in the cluster, WMCS can generate significant amounts of traffic much faster, and people using WMCS are generally Wikimedians who should be more familiar with our policies than someone who just wants to scrape wiki pages.

I would also add that, after a DoS ~2 months ago, I spent a while advertising the UA policy and our general API usage guidelines: [1], [2].