
Allowlist Cloud VPS instances that need XFF header passed through the web proxy
Closed, ResolvedPublic

Description

Currently, labs instances with a webserver can see the IP address of a user. Default webserver installs in labs don't record access logs, but it would be easy for an instance owner to start logging the XFF header for requests, and keep IP<->Account information on the host, which is considered private under the WMF privacy policy.

Since labs instances have their own privacy policies, this isn't a violation of the WMF policy. However, it would be nice to not give instances the option.

It appears that only one instance is known to actually need that data. Let's whitelist that, and any others that specifically need the XFF data, and remove the header for other instances.

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. · View Herald Transcript

(the one instance is UTRS).

We can do this by maintaining a whitelist of IPs / domains in puppet, and adding an if in nginx for that. That should be fast enough and simple.
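The "whitelist plus an if in nginx" idea could look roughly like this (a sketch only, not the actual patch; the variable names and the upstream name `backend` are made up, and the map body would be generated by puppet from the whitelist):

```nginx
# Map the requested host to a flag: 1 = allowed to receive XFF.
# Puppet would render the entries below from the whitelist.
map $host $xff_allowed {
    default              0;
    utrs.wmflabs.org     1;
}

server {
    listen 443 ssl;

    location / {
        # Only forward the client IP for whitelisted hosts;
        # all other backends get an empty X-Forwarded-For header.
        set $xff "";
        if ($xff_allowed) {
            set $xff $proxy_add_x_forwarded_for;
        }
        proxy_set_header X-Forwarded-For $xff;
        proxy_pass http://backend;
    }
}
```

A `map` lookup is evaluated once per request and is cheap, so this stays "fast enough and simple" even with a longer whitelist.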

We should announce this and then Make It So.

Labs instances can get public IPs though

account-creation-assistance also currently has (and continues to need - T61662 ) access to the XFF data from the web proxy.

We can easily just whitelist utrs and account-creation-assistance, and blacklist all others. Need to announce it first, though...

Current status of this is that T61662: Pass XFF headers through the proxy configured by Special:NovaProxy started passing the XFF headers to all upstream servers behind the domainproxy deployment of dynamicproxy (*.wmflabs.org). XFF headers are not passed to upstream servers behind the urlproxy (tools.wmflabs.org/*.toolforge.org).

In retrospect, the fix for T61662 should not have happened this way.

Change 583098 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] dynamicproxy: add support for dynamic XFF per FQDN

https://gerrit.wikimedia.org/r/583098

Change 583316 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud: refactor novaproxy role

https://gerrit.wikimedia.org/r/583316

Mentioned in SAL (#wikimedia-cloud) [2020-03-25T12:18:46Z] <arturo> disable puppet in the 2 VMs to try refactoring the puppet role https://gerrit.wikimedia.org/r/c/operations/puppet/+/583316 (T135046)

Change 583316 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud: refactor novaproxy role

https://gerrit.wikimedia.org/r/583316

Change 583318 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud: novaproxy: fix default value for String typed hiera keys

https://gerrit.wikimedia.org/r/583318

Change 583318 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud: novaproxy: fix default value for String typed hiera keys

https://gerrit.wikimedia.org/r/583318

Mentioned in SAL (#wikimedia-cloud) [2020-03-25T12:34:53Z] <arturo> enable puppet in the 2 VMs. Role is now role::wmcs::novaproxy. Hiera was updated accordingly too (T135046)

Ready for review & merge: https://gerrit.wikimedia.org/r/c/operations/puppet/+/583098

The patch will drop the XFF header for all backends except the ones we specify, so we first need to collect them and add them to a hiera list (in horizon):

profile::wmcs::novaproxy::xff_fqdns:
 - myfqdn1.wmflabs.org <-- will recv XFF
 - myfqdn2.wmflabs.org <-- will recv XFF
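On the puppet side, that hiera list could be rendered into nginx with a template along these lines (a hypothetical sketch; the real template in operations/puppet may differ):

```erb
# dynamicproxy nginx template sketch: emit one map entry per FQDN
# from the profile::wmcs::novaproxy::xff_fqdns hiera list.
map $host $xff_enabled {
    default 0;
<% @xff_fqdns.each do |fqdn| -%>
    <%= fqdn %> 1;
<% end -%>
}
```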

From the backlog I get that only accounts.wmflabs.org needs it?

This might be something that requires cloud-announce posts and a window of
time for people to submit their things to be added.

It would be useful to get accounts-dev.wmflabs.org whitelisted too, so we don't have to test stuff on the "prod" tool. I'm not too fussed if that's not possible though.

I think UTRS needs this as well? @DeltaQuad can confirm on that one.

This might be something that requires cloud-announce posts and a window of
time for people to submit their things to be added.

Done: https://lists.wikimedia.org/pipermail/cloud-announce/2020-April/000272.html

We will introduce the change on 2020-04-15. Until then, I hope we collect a list of stuff that requires whitelisting. As of today this list is:

  • accounts.wmflabs.org
  • accounts-dev.wmflabs.org

As mentioned above, UTRS does need this also on both:
utrs-beta.wmflabs.org
utrs.wmflabs.org

Hello! I help maintain XTools, which as I think you know is already subject to significant disruptive automation, web crawlers (that ignore robots.txt), and more rarely just overzealous users, any of which can hog all the db connections and result in downtime. The service is something people heavily depend on, so I have put significant work into keeping out this disruption.

I will admit, I never really thought about getting the IP from the XFF header... that would have saved me an immense amount of time, and going by it would mean a lot fewer false positives. Instead, I'm going by user agent and other heuristics (see /var/www/config/request_blacklist.yml on xtools-prod06). This largely works well, but some bots I just can't block because they have the same UA as innocent humans. Worse, Chrome will soon be freezing user agents (e.g. T242825)... and that accounts for most users, bots included. So in about a year or so, going by UAs just won't work anymore.

There are still a LOT of bots scraping XTools. I made a log-only system that exposes possible crawlers (it looks for rapid requests that change the interface language, meaning the client is clicking on all the links in the language dropdown -- something humans don't do). You can review /var/www/var/log/crawler.log to see what we're dealing with every day.
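The heuristic described above can be sketched as follows (an illustration only, not the actual XTools code; the window size and threshold are assumptions):

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 30   # assumption: tune to real traffic
LANG_THRESHOLD = 5    # assumption: distinct languages that flag a crawler


class CrawlerDetector:
    """Flag clients that request many different interface languages
    in a short window -- humans don't click every entry in the
    language dropdown, but link-following crawlers do."""

    def __init__(self):
        # client key (e.g. UA + heuristics) -> deque of (timestamp, lang)
        self._seen = defaultdict(deque)

    def observe(self, client: str, lang: str, now: float) -> bool:
        """Record one request; return True if the client looks like a crawler."""
        q = self._seen[client]
        q.append((now, lang))
        # Drop requests that fell out of the sliding window.
        while q and now - q[0][0] > WINDOW_SECONDS:
            q.popleft()
        distinct_langs = {l for _, l in q}
        return len(distinct_langs) >= LANG_THRESHOLD
```

A log-only deployment would just record the flagged clients rather than blocking them, which matches the crawler.log approach above.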

There is the nuclear option of just putting everything behind a login wall, and that can be considered. It is a last resort, though, because many outside researchers use XTools. We would much prefer to not make them go through account creation. We also need to figure out why login sessions don't persist (T224382), but I digress.

So, I'm wondering if XTools could be whitelisted too, given its popularity and issues with disruptive automation? Incidentally, all maintainers are staff, plus one volunteer who has signed the NDA for non-public information.

Another idea -- perhaps it's possible to pass the instances a one-way hash of the IP? As long as it's unique, that'd be enough for our purposes and also maintain the user's privacy. As I understand it, this is similar to what Anti-Harassment is planning to do for temporary accounts in MediaWiki.
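The one-way hash idea could look something like this (a sketch under assumptions: the secret name and any rotation policy are hypothetical, and nothing like this was implemented on the proxy):

```python
import hashlib
import hmac

# A server-side secret the proxy would hold; rotating it periodically
# would prevent long-term correlation of hashed IPs (assumption).
SECRET_SALT = b"rotate-me-periodically"


def pseudonymize_ip(ip: str) -> str:
    """Return a stable, non-reversible token for an IP address.

    The proxy could send this instead of the raw XFF value: it stays
    unique per client (so rate limiting and blocking still work) but
    cannot be reversed to recover the address. HMAC rather than a bare
    hash prevents dictionary attacks over the small IPv4 space.
    """
    return hmac.new(SECRET_SALT, ip.encode(), hashlib.sha256).hexdigest()
```

The same IP always maps to the same token while the salt is unchanged, so per-client blocking keeps working without the instance ever seeing the real address.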

@MusikAnimal your use case seems sensible. We can keep it whitelisted.

Applied this change to hiera in horizon:

https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/refs/heads/master%5E%21/#F0

diff --git a/project-proxy/_.yaml b/project-proxy/_.yaml
index 705d0a0..9b64aaa 100644
--- a/project-proxy/_.yaml
+++ b/project-proxy/_.yaml

@@ -4,3 +4,10 @@
 - proxy-02.project-proxy.eqiad.wmflabs
 profile::wmcs::novaproxy::block_ref_re: koeri.boun.edu.tr
 profile::wmcs::novaproxy::use_ssl: true
+profile::wmcs::novaproxy::xff_fqdns:
+- accounts.wmflabs.org
+- accounts-dev.wmflabs.org
+- utrs-beta.wmflabs.org
+- utrs.wmflabs.org
+- xtools.wmflabs.org
+- xtools-dev.wmflabs.org

Change 583098 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] dynamicproxy: add support for dynamic XFF per FQDN

https://gerrit.wikimedia.org/r/583098

I spot checked the apache logs for the affected projects, and only the account project seems to dump the XFF header in the logs. In case you need it, they do something like:

LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" proxy

The change is now live, thank you all!

@aborrero Looks like the google-api-proxy project needs to be whitelisted, too. See T250312: HTTP 403 Error when using google-api-proxy on VPS. See the puppet role. Right now all requests to the proxy are getting 403s. It looks like it's simply checking the incoming IP to ensure it's internal. I thought you could just do that via the security groups on Horizon? Do we need the puppet role, too?

Resolved, thank you!

Mentioned in SAL (#wikimedia-cloud) [2020-04-15T20:22:54Z] <bd808> Added google-api-proxy.wmflabs.org & googlevision-api-proxy.wmflabs.org to profile::wmcs::novaproxy::xff_fqdns (T135046, T250312)

@aborrero For the same reasons as T135046#6056334, I am requesting that XFF headers be allowed for the wikisource VPS project, maintained by Community-Tech. We have had a long battle with fighting disruptive bots, and sometimes they use human UAs and I can't feasibly block them without collateral damage. Could this be arranged? Let me know if I should open a new task. Thank you!

Yes, open a new task.

bd808 renamed this task from Whitelist labs instances that need XFF header passed through the web proxy to Allowlist Cloud VPS instances that need XFF header passed through the web proxy.Jun 28 2023, 5:06 PM