
zerofetcher in production is getting throttled for API logins
Closed, ResolvedPublic

Description

Getting this from cronjobs today, as we've expanded the pool of cache cluster machines which are fetching the Zero-rating JSON data. Can a limit be increased substantially somewhere?

Exception: API login phase2 gave result Throttled, expected "Success"

Event Timeline

BBlack raised the priority of this task from to Needs Triage.
BBlack updated the task description. (Show Details)
BBlack added projects: Traffic, ZeroPortal, Zero.
BBlack subscribed.
Restricted Application added a subscriber: Aklapper.

Change 236080 had a related patch set uploaded (by BBlack):
Disable zerofetcher on text caches T111045

https://gerrit.wikimedia.org/r/236080

Change 236080 merged by BBlack:
Disable zerofetcher on text caches T111045

https://gerrit.wikimedia.org/r/236080

Is this proxying requests for external clients? If so they should probably be added to the list of legitimate proxies sending X-Forwarded-For headers.

No, the zerofetcher is just a custom script that runs on the caches and fetches zero-rating metadata from the zero portal periodically, to feed it to varnish for X-CS/X-F-B processing of real requests.

jcrespo triaged this task as Medium priority. Sep 7 2015, 5:51 PM
jcrespo removed a project: Patch-For-Review.
jcrespo set Security to None.
jcrespo subscribed.

We can add entries to wgRateLimitsExcludedIPs in mediawiki-config if there is a list of internal hosts which need it.
It'd need to be kept updated though.

@Krenair is that just single IPs, or can we add networks to it, like the wgSquidServers-style lists allow?

Just single IPs.

Maybe there's a better way somewhere...

Do we know what is actually doing the rate check and blocking? Is that a backend feature / an extension?

> Do we know what is actually doing the rate check and blocking? Is that a backend feature / an extension?

The only place in core and deployed extensions that I see returning LoginForm::THROTTLED is core itself, via LoginForm::incLoginThrottle(). The throttle should be cleared by a successful login, though. And it's keyed on both IP and user, so unless your machines are all seen as coming from the same IP, that shouldn't make a difference.
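To make the behaviour described above concrete, here is a hedged sketch of a per-(IP, user) login throttle in the spirit of LoginForm::incLoginThrottle(). The function names, data structure, and limit are illustrative, not MediaWiki's actual implementation:

```python
# Sketch of a login throttle keyed on (IP, user): failed attempts
# increment a counter, a successful login clears it. The limit of 5
# is an assumption for illustration, not MediaWiki's real value.

THROTTLE_LIMIT = 5   # max failed attempts before returning "Throttled"
_attempts = {}       # (ip, user) -> failed-attempt count

def check_throttle(ip, user):
    """Return True if this (ip, user) pair is currently throttled."""
    return _attempts.get((ip, user), 0) >= THROTTLE_LIMIT

def record_failure(ip, user):
    _attempts[(ip, user)] = _attempts.get((ip, user), 0) + 1

def record_success(ip, user):
    # A successful login clears the throttle for this (ip, user) pair.
    _attempts.pop((ip, user), None)
```

Because the key includes the IP, a hundred machines each logging in from distinct addresses would each have their own counter, which is why the shared-IP question above matters.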

> We can add entries to wgRateLimitsExcludedIPs in mediawiki-config if there is a list of internal hosts which need it.
> It'd need to be kept updated though.

wgRateLimitsExcludedIPs doesn't appear to be used for the login throttle.

In this case, we're logging into the same account from many IP addresses (let's say ~100), and each of those IP addresses is logging in every 5 minutes (log in, fetch one chunk of data from an API, log out). Could it just be the concurrency of the logins from multiple machines?
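The back-of-the-envelope aggregate rate, using the figures from the comment above (~100 hosts, one login cycle per host every 5 minutes):

```python
# Aggregate login rate against the single shared account, assuming
# ~100 hosts each logging in once per 5-minute cron cycle.
hosts = 100
interval_minutes = 5

logins_per_minute = hosts / interval_minutes  # ~20 logins/min on one account
print(logins_per_minute)
```

Even if no single machine is aggressive, the shared account sees roughly 20 logins per minute in aggregate, which is the kind of pattern a per-user throttle could plausibly trip on.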

The code doing the fetching, in case that provides any details as to what kind of login it's using, is: https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/files/zerofetch.py
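The exception in the task description suggests the script checks the "result" field returned by the second phase of the legacy two-phase action=login API (phase 1 returns "NeedToken", phase 2 returns e.g. "Success" or "Throttled"). A minimal sketch of that kind of check, with the function name and structure being mine rather than copied from zerofetch.py:

```python
# Hypothetical sketch of the phase-2 result check that would produce
# the error quoted in the task description. "response" stands for the
# parsed JSON body of the second action=login request.

def check_phase2_result(response):
    """Raise if the phase-2 login result is anything but Success."""
    result = response["login"]["result"]
    if result != "Success":
        raise Exception(
            'API login phase2 gave result %s, expected "Success"' % result
        )
    return result
```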

Change 242888 had a related patch set uploaded (by BBlack):
zero_update: randomize cron every 15 minutes

https://gerrit.wikimedia.org/r/242888

This issue is becoming a blocker for doing a better job at DDoS mitigation (so that we can port XFF-decoding used by Zero to other clusters and not accidentally false-positive-ratelimit things like OperaMini). Any objections from Zero on carriers/proxies data updates being on a 15-minute rather than 5-minute schedule in the patch above?
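One plausible reading of "randomize cron every 15 minutes" is a per-host splay in the spirit of Puppet's fqdn_rand(): derive a stable pseudo-random minute offset from the hostname so the caches spread their logins across the window instead of all hitting the portal at once. The actual patch may compute this differently; this is a hedged sketch:

```python
import hashlib

# Deterministic per-host cron splay: hash the hostname to a stable
# minute offset in [0, period), then run once per period at that
# offset. Hostname below is illustrative.

def splay_minute(hostname, period=15):
    """Stable pseudo-random offset in [0, period) for a hostname."""
    digest = hashlib.md5(hostname.encode()).hexdigest()
    return int(digest, 16) % period

def cron_minutes(hostname, period=15):
    """Minutes of the hour at which this host's cron job fires."""
    offset = splay_minute(hostname, period)
    return list(range(offset, 60, period))
```

With a 15-minute period each host runs four times an hour, but different hosts land on different minutes, cutting the instantaneous login concurrency.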

Change 242888 merged by BBlack:
zero_update: randomize cron every 15 minutes

https://gerrit.wikimedia.org/r/242888

Change 243033 had a related patch set uploaded (by BBlack):
re-enable zerofetcher for cache_text T111045

https://gerrit.wikimedia.org/r/243033

Change 243033 merged by BBlack:
re-enable zerofetcher for cache_text T111045

https://gerrit.wikimedia.org/r/243033

BBlack claimed this task.