
Maybe restrict domains accessible by webproxy
Open, Low, Public

Description

https://wikitech.wikimedia.org/wiki/HTTP_proxy currently allows access to external web domains from our internal networks.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/753029 is about maintaining a list of sites that are allowed to be accessed.

This change needs to be discussed with the teams and engineers that use the webproxy in their day-to-day work, especially from within the Analytics Cluster. Analytics Cluster users do a lot of ad-hoc data exploration and experimentation, and restricting external resources may make their lives harder.

Event Timeline

Thanks for creating this task, Andrew. I just wanted to copy and paste the following from the parent task, in case there are some logstash experts here who may be able to make use of the data to see what is currently in use :)

FYI, I have added the squid logs to logstash (no visualisations yet).

The events have the following information:

HTTP
{
   "client.ip" : "2620:0:861:101:10:64:0:240",
   "ecs.version" : "1.7.0",
   "event.category" : [
      "network",
      "web"
   ],
   "event.dataset" : "squid.access",
   "event.duration" : "1",
   "event.kind" : "event",
   "event.outcome" : "unknown",
   "event.type" : [
      "access",
      "connectiona"
   ],
   "host.domain" : "wikimedia.org",
   "host.hostname" : "install1003.wikimedia.org",
   "host.ip" : "208.80.154.32",
   "host.name" : "install1003",
   "http.request.method" : "GET",
   "http.request.referrer" : "-",
   "http.response.bytes" : 393,
   "http.response.status_code" : 304,
   "http.version" : "1.1",
   "labels.hierarchy_status" : "HIER_DIRECT",
   "labels.request_status" : "TCP_REFRESH_UNMODIFIED",
   "service.type" : "squid",
   "source.ip" : "2620:0:861:101:10:64:0:240",
   "timestamp" : "2022-01-25T10:28:54+0000",
   "url.domain" : "deb.debian.org",
   "url.full" : "http://deb.debian.org/debian/dists/buster-updates/InRelease",
   "url.path" : "/debian/dists/buster-updates/InRelease",
   "user_agent.original" : "Debian APT-HTTP/1.3 (1.8.2.3)"
}
HTTPS
{
   "client.ip" : "2620:0:861:105:10:64:21:118",
   "ecs.version" : "1.7.0",
   "event.category" : [
      "network",
      "web"
   ],
   "event.dataset" : "squid.access",
   "event.duration" : "87",
   "event.kind" : "event",
   "event.outcome" : "unknown",
   "event.type" : [
      "access",
      "connectiona"
   ],
   "host.domain" : "wikimedia.org",
   "host.hostname" : "install1003.wikimedia.org",
   "host.ip" : "208.80.154.32",
   "host.name" : "install1003",
   "http.request.method" : "CONNECT",
   "http.request.referrer" : "-",
   "http.response.bytes" : 42550,
   "http.response.status_code" : 200,
   "http.version" : "1.1",
   "labels.hierarchy_status" : "HIER_DIRECT",
   "labels.request_status" : "TCP_TUNNEL",
   "service.type" : "squid",
   "source.ip" : "2620:0:861:105:10:64:21:118",
   "timestamp" : "2022-01-25T10:21:00+0000",
   "url.domain" : "www.wikidata.org",
   "url.full" : "www.wikidata.org:443",
   "url.path" : "-",
   "user_agent.original" : "WMDE Wikidata metrics gathering"
}

If needed we can add more elements from http://www.squid-cache.org/Doc/config/logformat/.

Hello! I would prefer to not have an allowlist for external domains, but if the final decision is to have one then I would like the following added to the initial list:

  • r-project.org (installing R packages from cran.r-project.org and cloud.r-project.org)
  • rstudio.com (mainly for RStudio's CRAN mirror cran.rstudio.com, but also their public package manager)

This would affect the research team, especially if the stat machines are also included in this restriction. For example:

  • pip / conda install (python packages)
  • github / gitlab / gerrit (code)
  • downloading/uploading datasets, e.g. figshare, zenodo, and a long tail of others
  • libraries that depend on the web, e.g. when working with pre-trained models in tensorflow hub/huggingface
  • APIs (mediawiki / toolforge / cloud VPS)

Especially for one-off use cases, where one would have to whitelist a URL (via a Phabricator/Gerrit task) only to have very little use for it afterwards, this solution does not seem ideal. Bypassing this restriction by using your local machine is not convenient, for large datasets in particular.

For tasks running on yarn, that restriction certainly makes sense, or at least we shouldn't add the webproxy env config without good reason; e.g. wmfdata automatically propagates the webproxy to all workers if it is set in the notebook.
Would it be possible to work with whitelists for the worker machines but not for the stat machines?

For some production jobs we still use the proxy to access:

  • MW APIs (all our sites)
  • ores.wikimedia.org

For dev purposes I use the proxy to download some dependencies:

MW APIs (all our sites)

BTW, the proper way to access MW APIs from within our networks is to use e.g. https://api-ro.discovery.wmnet and set the HTTP Host header to the domain of the site you want to access, e.g. www.wikidata.org.

Then you don't need a proxy.
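
(For illustration, a minimal sketch in Python of the pattern described above, using requests. The endpoint and Host header come from the comment; the API parameters are arbitrary examples, and the certificate handling against the internal endpoint is an assumption that may need the production CA bundle.)

import requests

# Hedged sketch: target the internal read-only endpoint and select the wiki
# via the HTTP Host header, instead of going through the web proxy.
resp = requests.get(
    "https://api-ro.discovery.wmnet/w/api.php",
    params={"action": "query", "meta": "siteinfo", "format": "json"},
    headers={"Host": "www.wikidata.org"},
    timeout=10,
)
print(resp.json()["query"]["general"]["sitename"])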

MW APIs (all our sites)

BTW, the proper way to access MW APIs from within our networks is to use e.g. https://api-ro.discovery.wmnet and set the HTTP Host header to the domain of the site you want to access, e.g. www.wikidata.org.

Then you don't need a proxy.

It's not just a BTW, unfortunately. There are multiple benefits to not going through the web proxy to reach the MW APIs, e.g.

  • avoiding artificially polluting the reverse proxies' cache, which is populated organically (that is, by human user traffic),
  • avoiding obfuscating logs with IPs that do not belong to the internal endpoint that actually talks to the MW API (or other endpoints),
  • avoiding a SPOF (there aren't that many web proxies, nor is it a highly available setup, because there isn't any need to),
  • avoiding saturating a host that is offering this service alongside other services,
  • avoiding the webproxy's own cache,
  • avoiding another 4 intermediaries (1 outgoing proxy, 1 TLS terminator, 2 reverse proxies) in the path to the desired content,
  • avoiding adding latency to requests,
  • probably others that I am missing.

We should probably update https://wikitech.wikimedia.org/wiki/HTTP_proxy to point out that it should never be used to access Wikimedia resources, but rather only resources that exist strictly outside of the production networks.

Oh and ORES is also available under https://ores.discovery.wmnet (and it's the exact same service!)

avoiding a SPOF (there aren't that many web proxies, nor is it a highly available setup, because there isn't any need to),

This has bitten me before when I used to use the webproxy internally. Don't do it! :)

This has bitten me before when I used to use the webproxy internally. Don't do it! :)

It's worth mentioning that when I took a quick look at the traffic going through the proxy, after excluding apt traffic, the majority of domains being fetched were internal.

Hahah, maybe what we should do is excludelist the internal domains in the webproxy!

A couple of questions/comments:

It's worth mentioning that when I took a quick look at the traffic going through the proxy, after excluding apt traffic, the majority of domains being fetched were internal.

Hahah, maybe what we should do is excludelist the internal domains in the webproxy!

First: How difficult & how much overhead would it be to make the proxy redirect requests made to internal domains to api-ro.discovery.wmnet (and have it set the HTTP Host header automatically)? That is, optimize this internal traffic invisibly, behind the scenes.

Second: In my almost 7 years here I have never heard of using api-ro.discovery.wmnet for accessing MW API internally. Neither Wikitech nor MediaWiki returned satisfactory results when searching for it just now, and the points made by @akosiaris in T300977#7700803 are pretty significant and the benefits (for all parties) are crystal clear! For something so beneficial it is surprising how much of a secret this appears to be. This is a practice worth popularizing within multiple teams and yet it appears only SREs know about it, and (as best as I could find) it is not documented anywhere other than the comment above.

Oh and ORES is also available under https://ores.discovery.wmnet (and it's the exact same service!)

Doesn't appear to be available from stats network. I was told years ago that we were not allowed to talk stats<->prod and built an entire system around shuffling things through kafka+swift to work around the firewalls. Is the intention to allow us to talk to prod in a more general fashion then?

ebernhardson@an-airflow1001:~$ nc -w 2 ores.discovery.wmnet 443
nc: connect to ores.discovery.wmnet port 443 (tcp) timed out: Operation now in progress

First: How difficult & how much overhead would it be to make the proxy redirect requests made to internal domains to api-ro.discovery.wmnet (and have it set the HTTP Host header automatically)? That is, optimize this internal traffic invisibly, behind the scenes.

Also, another benefit of such an approach is that currently there is no way to employ the recommended (api-ro.discovery.wmnet) method when using mwapi (Python) and WikipediR (R) wrappers – neither of which is actively maintained these days and probably wouldn't support this even if someone made pull requests to them. The recommended method only works for someone who wrote their API-calling code from scratch, rather than relying on a wrapper.

Update: Filed issue on python-mwapi requesting support for optimized internal traffic

Is the intention to allow us to talk to prod in a more general fashion then?

I think so, see the parent ticket: {T298087} :)

I was told years ago that we were not allowed to talk stats<->prod

Although, I don't think it was ever true that you weren't 'allowed' to talk stats<->prod, just that there is a firewall that disallows it by default. We just need to add exceptions in the network VLAN firewall ACLs to allow this. This is how you can talk to Kafka clusters, etc.

api-ro.discovery.wmnet [...] it is surprising how much of a secret this appears to be.

Yeah I also didn't know about it until I had the SPOF bite me. :)

First: How difficult & how much overhead would it be to make the proxy redirect requests made to internal domains to api-ro.discovery.wmnet (and have it set the HTTP Host header automatically)? That is, optimize this internal traffic invisibly, behind the scenes.

As internal requests are all HTTPS, this would require significant changes to the proxy config, as we currently just tunnel SSL traffic. However, the more important point is that this would not resolve the other issues mentioned by @akosiaris.

Hahah, maybe what we should do is excludelist the internal domains in the webproxy!

Heh, I have been trying to do this in url-downloader (which is the sister service of webproxy, reserved for use by applications in k8s and MediaWiki) since 2015. See https://gerrit.wikimedia.org/r/q/topic:%2522url_downloader%2522+(status:open+OR+status:merged)+LVS. And that's arguably less problematic than webproxy. Perhaps I should try once more...

A couple of questions/comments:

It's worth mentioning that when I took a quick look at the traffic going through the proxy, after excluding apt traffic, the majority of domains being fetched were internal.

Hahah, maybe what we should do is excludelist the internal domains in the webproxy!

First: How difficult & how much overhead would it be to make the proxy redirect requests made to internal domains to api-ro.discovery.wmnet (and have it set the HTTP Host header automatically)? That is, optimize this internal traffic invisibly, behind the scenes.

Even if it is possible (I have no idea if it is, tbh), it wouldn't solve the other problems mentioned above (still a SPOF, still 4 extra layers of caching and extra latency) and it would create the false impression that this is the Right way of talking to the MW APIs when it clearly isn't.

Second: In my almost 7 years here I have never heard of using api-ro.discovery.wmnet for accessing MW API internally. Neither Wikitech nor MediaWiki returned satisfactory results when searching for it just now, and the points made by @akosiaris in T300977#7700803 are pretty significant and the benefits (for all parties) are crystal clear! For something so beneficial it is surprising how much of a secret this appears to be. This is a practice worth popularizing within multiple teams and yet it appears only SREs know about it, and (as best as I could find) it is not documented anywhere other than the comment above.

You are right that it isn't documented well nor communicated well. They started being introduced in 2017 for the purposes of the DC switchovers we perform every now and then. I don't remember offhand for how long we've been suggesting that people use them instead of going via the edge.

By the way, those .discovery.wmnet endpoints exist for all production services, and the point is that they abstract the active datacenter from the view of applications. The DC-specific ones (e.g. api-ro.svc.eqiad.wmnet or api-rw.svc.codfw.wmnet) have existed for way longer, and we always suggested (even pre-discovery) to use those instead of going via the edge caches. The docs for the discovery ones are at https://wikitech.wikimedia.org/wiki/DNS/Discovery and are pretty generic indeed. A listing of which ones exist and which DCs are active for each one is at https://config-master.wikimedia.org/discovery/discovery-basic.yaml (but I'll confess it's not meant for easy human consumption).
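
(As a rough illustration only: fetching and dumping that listing in Python. The exact schema of discovery-basic.yaml is not documented here, so the snippet makes no assumption beyond it being YAML.)

import requests
import yaml  # PyYAML

# Hedged sketch: fetch the discovery listing mentioned above and print it.
# The schema is treated as unknown; it is only parsed as YAML and echoed.
url = "https://config-master.wikimedia.org/discovery/discovery-basic.yaml"
data = yaml.safe_load(requests.get(url, timeout=10).text)
if isinstance(data, dict):
    for name, entry in sorted(data.items()):
        print(f"{name}: {entry}")
else:
    print(data)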

I guess all of this is to say: point taken. As SREs we probably need to communicate more broadly (and not just in the narrower scope of what services get deployed to production) what the preferred ways of talking to the APIs are.

First: How difficult & how much overhead would it be to make the proxy redirect requests made to internal domains to api-ro.discovery.wmnet (and have it set the HTTP Host header automatically)? That is, optimize this internal traffic invisibly, behind the scenes.

Even if it is possible (I have no idea if it is, tbh), it wouldn't solve the other problems mentioned above (still a SPOF, still 4 extra layers of caching and extra latency) and it would create the false impression that this is the Right way of talking to the MW APIs when it clearly isn't.

Well, if you keep the proxy as a local sidecar to the application, this is the Istio approach: don't configure the applications, just funnel all HTTP requests to host:port via a local proxy. There are a series of reasons why we chose not to do that - the most important being that we want to make it explicit to deployers whether we're proxying requests or not. We recently adopted this strategy for MediaWiki on Kubernetes so that it could still reach the API (by adding support in the software for a local proxy).

Second: In my almost 7 years here I have never heard of using api-ro.discovery.wmnet for accessing MW API internally. Neither Wikitech nor MediaWiki returned satisfactory results when searching for it just now, and the points made by @akosiaris in T300977#7700803 are pretty significant and the benefits (for all parties) are crystal clear! For something so beneficial it is surprising how much of a secret this appears to be. This is a practice worth popularizing within multiple teams and yet it appears only SREs know about it, and (as best as I could find) it is not documented anywhere other than the comment above.

You are right that it isn't documented well nor communicated well. They started being introduced in 2017 for the purposes of the DC switchovers we perform every now and then. I don't remember offhand for how long we've been suggesting that people use them instead of going via the edge.

There is no documentation because that's an implementation detail of production that people don't necessarily need to know, IMHO. A developer should just keep in mind that their code might need to connect to an IP/hostname/port that is different from the public hostname/port of the asset they're reaching. In production, the configuration will not even tell your application to reach api-ro.discovery.wmnet when it wants to call enwiki, but rather to call localhost:6501, which is the local proxy port to reach that asset - in fact, all of our services run an Envoy sidecar for this job.
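
(Hedged sketch of what that sidecar pattern looks like from the application's side in Python: the code only ever talks to a local proxy port, 6501 being the example from the comment above; the path, parameters and routing behaviour are illustrative assumptions, not the actual production configuration.)

import requests

# The application addresses a local sidecar port; the sidecar (Envoy in the
# production setup described above) routes the request to the right backend.
# Port 6501 is the example given in the comment; everything else is made up.
resp = requests.get(
    "http://localhost:6501/w/api.php",
    params={"action": "query", "meta": "siteinfo", "format": "json"},
    timeout=5,
)
print(resp.status_code)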

@Joe I think most of the usages of webproxy that folks here are concerned with aren't by production services. These are human users running commands on stat boxes.

@Joe I think most of the usages of webproxy that folks here are concerned with aren't by production services. These are human users running commands on stat boxes.

Yeah, that. The solutions we chose for MediaWiki and related services lately aren't necessarily applicable in these use cases.

What this discussion makes obvious, however, is that an infrastructure that was set up and designed with a specific use case in mind (allowing the servers to reach out to fetch their Ubuntu/Debian packages and similar resources) has been co-opted to serve different use cases out of necessity (in part due to the analytics VLAN firewalling).

Perhaps a way forward would be to find a way to serve those use cases by design instead of by accident.

jbond triaged this task as Low priority. Feb 16 2022, 4:57 PM

Perhaps a way forward would be to find a way to serve those use cases by design instead of by accident.

+1, do you have a rough idea in mind?

Thanks for this task! Now that we have a clear path forward in T298087, it makes sense to focus on this one as well.

BTW, T298087 will solve issues like:

ebernhardson@an-airflow1001:~$ nc -w 2 ores.discovery.wmnet 443
nc: connect to ores.discovery.wmnet port 443 (tcp) timed out: Operation now in progress

My understanding is that this task branches in 2 directions:

Going through proxies for internal resources, explained by @akosiaris in (for example) T300977#7700803

To help solve this, a better use of no_proxy in T278315 seems like a low-enough hanging fruit.
The data path won't be the most direct (it still hits the caches, etc.) but it would at least remove some layers while being transparent to users and relatively quick to implement.

In parallel we should improve our proxy documentation and communicate the better alternatives (internal discovery endpoints, etc).
Transparently redirecting queries to the matching discovery endpoint seems overly complex for this use case.

Adding ACLs to the webproxies: here we first need to figure out whether we need them (for both production and analytics hosts), and what threat we're trying to protect ourselves against.

Production ACLs should be easy-ish to manage, and could be tied to Puppet profiles/roles.
Analytics ACLs could become complex due to the wide variety of usages (e.g. stat box users), which raises the question of who would manage them.

On that second part, we discussed it within Infrastructure Foundation.

With the webproxies (and url-downloader) wide open, malware accidentally downloaded (compromised library dependency, infected executable, etc.) could easily "phone home", provide backdoor access to our infrastructure, exfiltrate data, etc.
Adding ACLs, even based on a loose allow-list, would be a significant security improvement, protecting us against that type of threat.

Pinging @JBennett to get security's feedback (especially to know if you agree on the risk) and decide collectively on a policy and way forward.

The goal is of course to improve security across the infra without making life more difficult for users.

One suggestion is for example:

  • SRE will take care of the tooling around squid ACL management, an example is already staged by John in https://gerrit.wikimedia.org/r/c/operations/puppet/+/753029/
  • Craft the initial allow list based on historical data (e.g. the last 3 months) and feedback from the data engineering team (e.g. on the gerrit change)
  • Once merged, users send CRs to add exceptions when needed
    • And can always SCP files if the exception has not been merged yet
  • Document the processes and best practices so they're easy to use for anyone

Chiming in as a heavy user of the stat boxes. It's difficult for me to follow this conversation so I'm mainly asking for clarification right now on the current proposal. Thanks all for working through this.

stat machine -> Mediawiki etc. APIs

Going through proxies for internal resources, explained by @akosiaris in (for example) T300977#7700803
To help solve this, a better use of no_proxy in T278315 seems like a low-enough hanging fruit.

If I had a Python script on the stat machines that made a request to the Mediawiki APIs, would this block that request and tell me to use the other internal endpoints mentioned (and then it's up to me to make that fix)? Or does it allow the request to go through but try to make it a little less of a problem for other users (and separately we try to guide folks to use the internal endpoints)? Or something else?

stat machine -> any other external resource

One suggestion is for example:

  • SRE will take care of the tooling around squid ACL management, an example is already staged by John in https://gerrit.wikimedia.org/r/c/operations/puppet/+/753029/
  • Craft the initial allow list based on historical data (e.g. the last 3 months) and feedback from the data engineering team (e.g. on the gerrit change)
  • Once merged, users send CRs to add exceptions when needed
    • And can always SCP files if the exception has not been merged yet
  • Document the processes and best practices so they're easy to use for anyone

With your suggestion @ayounsi, I'm assuming this initial allow list would likely contain standard Python/R library endpoints (PyPi, conda-forge, Github, r-project/rstudio), Toolforge/Cloud VPS, and maybe a number of standard dataset repositories (zenodo, figshare, etc.). Then if e.g., I need a dataset that a researcher has hosted on their personal webpage, I have two options:

  • download it locally and scp it to the stat machine ("fast" in theory and useful for one-offs though not great for very large datasets + folks who are far from eqiad1)
  • or put in a request to add the endpoint (maybe fast but realistically takes a day or two at least?)

malware accidentally downloaded (compromised library dependency, infected executable, etc.) could easily "phone home"

Craft the initial allow list based on historical data (e.g. the last 3 months) and feedback from the data engineering team

I appreciate the intention here, but I'm not sure if the combo of these two things will actually accomplish the goal of excluding software that might 'phone home'. The allowlist that will be needed is quite broad (I think what Isaac mentioned is just a small portion). If we allowlist all of the possible software package repos we might need, are we really mitigating the ability for someone to install software that might have a vulnerability in it?

And can always SCP files if the exception has not been merged yet

I agree that this could be quite a pain. Even software distribution release tarballs I sometimes download for testing can be large enough to make this difficult, especially if I'm working while traveling, using cafe WiFi or my cellular data. If I wanted to just download something once to test it out, would we want to make an allowlist exception for its source?

I appreciate the intention here, but I'm not sure if the combo of these two things will actually accomplish the goal of excluding software that might 'phone home'.
...
If we allowlist all of the possible software package repos we might need, are we really mitigating the ability for someone to install software that might have a vulnerability in it?

I see your point too but think that the primary goal of this allowlist is not focused so much on excluding the malware (although that would naturally be good) - its primary goal is limiting the capability of any such malware to 'phone home' to a command & control endpoint.

To me it seems that the chance of some software that has been compromised being hosted on GitHub or PyPI or conda-forge etc. is relatively high, when compared against the chance of a malicious command & control server operating out of the same IP address.

I suspect that it's much more likely that compromised software would try to contact an innocuous-looking domain or IP address, thereby joining a botnet or whatever. This allowlist approach would at least block that attack vector, and we would have a record of the attempt.

its primary goal is limiting the capability of any such malware to 'phone home' to a command & control endpoint.

Ah! I missed that point, that does make a little more sense. Got it.

Okay, this allowlist will still be annoying :p, but I do see the benefit now.

It's clear from the above that we have two distinct use cases that have emerged for the web proxies:

# | Name | Use case | Location | Notes
A | Production services | Primarily APT, but also some other services | All data centres | This was the original use case for which the install servers were deployed
B | Legitimate user activity | Analytics users, researchers etc. | Eqiad for now (codfw in the medium term?) | This is for members of analytics-privatedata-users and other authorised shell account holders

I think we can agree that we don't need to include internal domains in this - such as the discussion about api-ro.discovery.wmnet above - right?
To me this is fundamentally about the security of North-South traffic that has to leave our private networks, even if at the moment the proxies are mixed up with serving microservices and other East-West traffic. We can fix that with a lot of effort.

Here is my suggestion for how to proceed (n.b. I realize that this suggestion increases the scope of the task considerably, but I'm going to share it anyway 🙂).

1: Tighten production and exclude analytics from the existing web proxies
  • Use the existing web proxies (install[1-6]001) for Production services (A) only
  • Restrict the domains accessible to these web proxy servers as per the discussion above, using the SRE developed tooling - excluding the GitHub, PyPi, Conda-forge etc. sources required by (B)
  • Define a firewall rule preventing access from the analytics vlan to the existing web proxies
2: Deploy a new highly-available web proxy service for analytics users
  • Deploy a pair of servers or VMs in eqiad with whatever mechanism is preferred (BGP, anycast, vrrp, pybal) to make this new web proxy service highly-available
  • Use this exclusively for the Legitimate user activity (B) use case above
  • Ensure that this service is only accessible from the analytics vlan and prevent production servers from using it.
  • Use the SRE tooling to create more permissive allowlists on these proxies, including the GitHub, PyPi, Conda-forge, r-studio etc. sources mentioned above to support (B)
3: Deploy enhanced network security monitoring for analytics users
  • Configure port mirroring on the switches or some kind of network tap to capture the North-South traffic passing through the new analytics web proxies
  • Deploy a network security monitoring host in eqiad (VM or physical server), which receives this captured traffic from the web proxies
  • I would personally use Security Onion here for simplicity - although we could replicate the functionality under Debian if we wanted.
  • Configure full packet capture with a rolling window of, say, 14 days' worth of network traffic on disk.
  • This solution would give us a great deal of visibility into the network traffic between the analytics vlan and the Internet, with a facility to investigate any incidents retrospectively.
  • We would also have tools such as Suricata and/or Zeek actively scanning the traffic and alerting if anything such as known malware signatures or C&C access attempts are matched.

Chiming in as a heavy user of the stat boxes. It's difficult for me to follow this conversation so I'm mainly asking for clarification right now on the current proposal. Thanks all for working through this.

stat machine -> Mediawiki etc. APIs

Going through proxies for internal resources, explained by @akosiaris in (for example) T300977#7700803
To help solve this, a better use of no_proxy in T278315 seems like a low-enough hanging fruit.

If I had a Python script on the stat machines that made a request to the Mediawiki APIs, would this block that request and tell me to use the other internal endpoints mentioned (and then it's up to me to make that fix)? Or does it allow the request to go through but try to make it a little less of a problem for other users (and separately we try to guide folks to use the internal endpoints)? Or something else?

The latter: when "no_proxy" is set, the library will not try to use the proxies for matching requests, routing directly to the endpoint (fewer intermediaries). If for some reason the library doesn't read "no_proxy", it will continue to work as it does now.
In other words: nothing will change.
Any kind of blocking of traffic towards internal hosts should be done only after a thorough analysis of the existing flows and a heads-up to everyone involved, so not anytime soon.
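
(A small Python illustration of the no_proxy behaviour described above; the proxy address follows the wikitech HTTP_proxy page and the domains are example values, not the final configuration.)

import os
import requests

# With proxy variables set, requests honours no_proxy: matching hosts bypass
# the proxy and are reached directly; everything else goes via the web proxy.
os.environ["https_proxy"] = "http://webproxy.eqiad.wmnet:8080"
os.environ["no_proxy"] = ".wikimedia.org,.wmnet"

# External domain, no no_proxy match: goes through the web proxy.
requests.get("https://pypi.org/simple/", timeout=10)

# Internal domain, matches .wikimedia.org: the proxy is bypassed and the
# request goes directly to the endpoint (fewer intermediaries).
requests.get("https://meta.wikimedia.org/w/api.php", timeout=10)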

stat machine -> any other external resource

One suggestion is for example:

  • SRE will take care of the tooling around squid ACL management, an example is already staged by John in https://gerrit.wikimedia.org/r/c/operations/puppet/+/753029/
  • Craft the initial allow list based on historical data (e.g. the last 3 months) and feedback from the data engineering team (e.g. on the gerrit change)
  • Once merged, users send CRs to add exceptions when needed
    • And can always SCP files if the exception has not been merged yet
  • Document the processes and best practices so they're easy to use for anyone

With your suggestion @ayounsi, I'm assuming this initial allow list would likely contain standard Python/R library endpoints (PyPi, conda-forge, Github, r-project/rstudio), Toolforge/Cloud VPS, and maybe a number of standard dataset repositories (zenodo, figshare, etc.). Then if e.g., I need a dataset that a researcher has hosted on their personal webpage, I have two options:

  • download it locally and scp it to the stat machine ("fast" in theory and useful for one-offs though not great for very large datasets + folks who are far from eqiad1)
  • or put in a request to add the endpoint (maybe fast but realistically takes a day or two at least?)

Exactly! It's open to suggestions of course.

@BTullis

I realize that this suggestion increases the scope if the task considerably

yup :) We unfortunately don't have the resources to implement something close to your proposal.
I prefer to focus on one set of versatile proxies rather than splitting them.
Managing two sets of proxy servers would increase the overall complexity while not adding much value. ACLs would be similar on each set of hosts, and the "if analytics do X, if prod do Y" logic can be done on a single machine.
Point (3) could benefit all kinds of proxies (and overall traffic in/out of our network), not only the "analytics" one.

If I may add my use case too, I would like to be able to restrict the access to the webproxies from the cumin hosts (cluster::management puppet role) and potentially other sensitive hosts. Ideally to an allow-list of URLs or something similar.

Change 753029 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:installserver::proxy: Add domain whitelist to proxy

https://gerrit.wikimedia.org/r/753029

If I may add my use case too, I would like to be able to restrict the access to the webproxies from the cumin hosts (cluster::management puppet role) and potentially other sensitive hosts. Ideally to an allow-list of URLs or something similar.

In my mind the production network would only allow .debian.org, which would work for the cumin hosts. Hosts that need additional access would need to have an explicit exception. I have updated https://gerrit.wikimedia.org/r/c/operations/puppet/+/753029 to:

  • create a global domain list which will be the minimum allowed list of domains (currently only includes .debian.org)
  • add a new variable to allow users to create new domain lists and apply them to a list of roles

This should allow us to have a rather strict default and then add holes for any hosts/roles that require more access. The implementation currently relies quite heavily on PuppetDB and needs more testing, but should act as a good PoC.
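
(A conceptual sketch, in Python rather than Puppet, of the model described above: a strict global minimum plus per-role lists merged into the effective allowlist for a host. The role names and domains are made up for illustration and are not the contents of the actual patch.)

# Strict global minimum, per the comment above; role-specific additions are
# hypothetical examples only.
GLOBAL_DOMAINS = {".debian.org"}
ROLE_DOMAINS = {
    "ci::master": {"registry.npmjs.org", "pypi.org", "rubygems.org", "github.com"},
    "rpki::validator": {".ripe.net", ".arin.net"},
}

def allowed_domains(role: str) -> set[str]:
    # Effective allowlist = global minimum plus whatever the role adds.
    return GLOBAL_DOMAINS | ROLE_DOMAINS.get(role, set())

print(sorted(allowed_domains("cluster::management")))  # global minimum only
print(sorted(allowed_domains("ci::master")))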

Change 879409 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] base::cache: drop wikimediafoundation.org from wikimedia_domains

https://gerrit.wikimedia.org/r/879409

Change 879418 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] P:environment: roll out no proxy config to all hosts

https://gerrit.wikimedia.org/r/879418

Change 879409 abandoned by Jbond:

[operations/puppet@production] base::cache: drop wikimediafoundation.org from wikimedia_domains

Reason:

Traffic would like to keep this domain in case we need to start hosting subdomains from it

https://gerrit.wikimedia.org/r/879409

If I may add my use case too, I would like to be able to restrict the access to the webproxies from the cumin hosts (cluster::management puppet role) and potentially other sensitive hosts. Ideally to an allow-list of URLs or something similar.

In my mind the production network would only allow .debian.org, which would work for the cumin hosts. Hosts that need additional access would need to have an explicit exception. I have updated https://gerrit.wikimedia.org/r/c/operations/puppet/+/753029 to:

  • create a global domain list which will be the minimum allowed list of domains (currently only includes .debian.org)
  • add a new variable to allow users to create new domain lists and apply them to a list of roles

This should allow us to have a rather strict default and then add holes for any hosts/roles that require more access. The implementation currently relies quite heavily on PuppetDB and needs more testing, but should act as a good PoC.

This would break a lot of workflows, but mainly I'm worried about docker image building/the deployment pipeline, although I'll admit sometimes I do download something in production from outside networks while doing debugging.

Was any analysis of the current proxy logs performed?

Also: a whitelist approach will *break* the self-service workflow of the deployment pipeline. Maybe it's not so bad, but I guess we'll see if we get to it. I would prefer having separate proxies for package updates and a more liberal proxy reachable from the CI servers and the analytics VLAN, not unlike what @BTullis suggested.

I would in theory support having a whitelist on the CI servers, but we'll need to allow reaching npm, rubygems, pypi, github - that is to say, the domains that host 99.9% of the hostile payloads someone might want to import into our production.

I would maintain that it's more urgent to provide an artifact repository for having local npm/pypi/go packages first (which would also mean what we use in production is at least auditable). Once we have that in place, 99% of the usages of the web proxy outside of debian.org would go away.

@Joe thanks for the input

This would break a lot of workflows,

It would be useful to see if we can capture all of these.

but mainly I'm worried about docker image building/the deployment pipeline,

The way the patch is currently constructed, we can have different whitelists for different hosts, so hosts where we build Docker images could have a more relaxed list, or even no blocking, whereas something like the PKI server could have the strictest ACL in place.

although I'll admit sometimes I do download something in production from outside networks while doing debugging.

It would be good to know what these are. I think that at this point anything would be better than nothing, so a fairly relaxed list to start with, one that just ensures we block obviously malicious stuff (e.g. ransom.haxxxed.to), could help. However, also thinking about where we may need to do the debugging, and where we really shouldn't be doing it, could help inform which hosts get which list.

Was any analysis of the current proxy logs performed?

Yes, this is ongoing and is one of the reasons we wanted to add the no_proxy rule in the first place.

Also: a whitelist approach will *break* the self-service workflow of the deployment pipeline. Maybe it's not so bad, but I guess we'll see if we get to it. I would prefer having separate proxies for package updates and a more liberal proxy reachable from the CI servers and the analytics VLAN, not unlike what @BTullis suggested.

The current patch allows us to impose different proxy whitelists for different roles, which achieves a similar end result to Ben's proposal without the need to add additional web proxies. Some type of packet capture system with analysis would also be great, but that's a bit of feature creep for this task.

The current very loose proposal is:

  • analytics: no blocking for now
  • rpki: needs an extended list to fetch ROAs
  • mx: needs access to ClamAV
  • other classes to be identified
  • everything else: allowed access to a known safe list
    • We also discussed making everything else the same as analytics in the first instance and instead gradually adding roles that should have the stricter rules

I would in theory support having a whitelist on the CI servers, but we'll need to allow reaching npm, rubygems, pypi, github - that is to say, the domains that host 99.9% of the hostile payloads someone might want to import into our production.

This is an arms race, and I agree most of the malware would come from the sites that we would end up allowing. However (as clarified by Ben), most malware is first downloaded from one of these resources and then tries to phone home to some other domain, which would be blocked, so it should stop some attacks from escalating. It will also be easier to spot such issues once we have removed some noise from the logs (things are already much better in that regard).

I would maintain that it's more urgent to provide an artifact repository for having local npm/pypi/go packages first (which would also mean what we use in production is at least auditable). Once we have that in place, 99% of the usages of the web proxy outside of debian.org would go away.

I agree that having these repositories would be a big win (cc @joanna_borun, it would be good to get this in the APP), however I don't think we have to wait for that to make improvements to the webproxies.

I would maintain that it's more urgent to provide an artifact repository for having local npm/pypi/go packages first (which would also mean what we use in production is at least auditable). Once we have that in place, 99% of the usages of the web proxy outside of debian.org would go away.

I'd suggest opening a tracking task

Was any analysis of the current proxy logs performed?

Here's a first pass I did a few weeks ago, over about 4 hours' worth of proxy traffic. We can see that a lot of it is internal traffic and shouldn't show up in the list. Some hosts would also need a "permit all" (CI, RPKI, WDQS?). So a progressive/step-by-step approach, as John suggested, is best here.

{P43795}
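
(A rough sketch of such a first pass, not the actual analysis behind P43795: counting external destination domains from the ECS-style squid events shown earlier in this task. Reading them from a JSON-lines dump is an assumption; in practice this would likely be a logstash aggregation on url.domain.)

import json
from collections import Counter

def top_domains(path, exclude_suffixes=(".wikimedia.org", ".wmnet", ".debian.org")):
    # Tally url.domain, skipping events with no domain plus internal and apt
    # traffic, so only candidate external domains for an allowlist remain.
    counts = Counter()
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            domain = event.get("url.domain", "-")
            if domain == "-" or domain.endswith(exclude_suffixes):
                continue
            counts[domain] += 1
    return counts.most_common(20)

for domain, hits in top_domains("squid-access.ndjson"):  # hypothetical dump file
    print(f"{hits:8d}  {domain}")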

Change 879418 merged by Jbond:

[operations/puppet@production] P:environment: roll out no proxy config to all hosts

https://gerrit.wikimedia.org/r/879418

I'm going to remove this task from the Backlog lane of the Research board given that there is no task for Research here, yet. Once prioritized, please reach out to us with a subtask and add Research back. We would be happy to look into prioritizing supporting you at that point.