
Toolhub container in staging cluster cannot reach meta.wikimedia.org to complete OAuth handshake
Closed, Resolved · Public · BUG REPORT

Description

Trying to complete an OAuth handshake from inside the staging cluster fails with a timeout:

Authentication failed: HTTPSConnectionPool(host='meta.wikimedia.org', port=443): Max retries exceeded with url: /w/rest.php/oauth2/access_token (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f98d83234a8>: Failed to establish a new connection: [Errno 110] Connection timed out'))

Event Timeline

https://github.com/python-social-auth/social-core/issues/146 reports the standard HTTP_PROXY envvar (https://docs.python-requests.org/en/master/user/advanced/#proxies) not working for some users in a venv runtime. More investigation is needed. Experiments would be easier with T290357: Maintenance environment needed for running one-off commands, but I will try to find other means in the near term.

Social-core delegates HTTP connection handling to requests. Requests in turn delegates envvar-based proxy configuration to urllib.request.getproxies(), which does the envvar checks. It is also possible to explicitly configure social-core to pass proxy configuration to requests via the SOCIAL_AUTH_PROXIES Django setting. That setting seems to be documented only in the code and not in any of the user docs that I can find. The Django setting should be a dict matching the requests proxies kwarg format. Something like:

SOCIAL_AUTH_PROXIES = {
    "http": "http://url-downloader.eqiad.wikimedia.org:8080"
}
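To see how the envvar fallback path resolves, here is a minimal stdlib sketch. It assumes a Linux-style environment where proxy discovery is envvar-driven; requests consults urllib.request.getproxies() only when no explicit proxies mapping (such as SOCIAL_AUTH_PROXIES) is supplied:

```python
import os
import urllib.request

# requests falls back to urllib.request.getproxies() for envvar-based
# proxy discovery when no explicit proxies mapping is passed in.
os.environ["http_proxy"] = "http://url-downloader.eqiad.wikimedia.org:8080"

proxies = urllib.request.getproxies()
print(proxies["http"])  # http://url-downloader.eqiad.wikimedia.org:8080
```

If the envvar is not visible to the runtime (the situation reported in the social-core issue above), the returned dict simply lacks the entry and requests connects directly, producing exactly the connection timeout seen in the task description.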

Introduced in social-core v3.4.0 by https://github.com/python-social-auth/social-core/commit/25ed3b6242e89f644b3d4a4d235496905b4bc9c1

url-downloader blocks access to internal networks; I /think/ you should be using some internal discovery name instead.

In T291447#7369589, @Majavah wrote:

url-downloader blocks access to internal networks,

I wondered about that too, but have also found that curl -v --proxy http://url-downloader.eqiad.wikimedia.org:8080 -I https://meta.wikimedia.org/w/api.php totally works from inside the eqiad.wmnet network, so I have doubts that the squid config does what it looks like it might.

I /think/ you should be using some internal discovery name instead.

I don't disagree on this. It seems like that would be a more elegant solution, but it also may be tricky to do with the social-core library. I am not yet certain if the library supports a concept of the URL to the OAuth service being different for end users (who need to be served a 302 redirect to the Authorization Server) and the backend (which needs to call the Authorization Server to exchange the Authorization Code for an Access Token). I will be looking into that too.

Change 722679 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] oauth: Add SOCIAL_AUTH_PROXIES setting for production use

https://gerrit.wikimedia.org/r/722679

Change 722679 merged by jenkins-bot:

[wikimedia/toolhub@main] oauth: Add SOCIAL_AUTH_PROXIES setting for production use

https://gerrit.wikimedia.org/r/722679

Change 722685 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Add config.public.SOCIAL_AUTH_PROXIES setting

https://gerrit.wikimedia.org/r/722685

Change 722687 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Set SOCIAL_AUTH_PROXIES

https://gerrit.wikimedia.org/r/722687

Change 722685 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Add config.public.SOCIAL_AUTH_PROXIES setting

https://gerrit.wikimedia.org/r/722685

Change 722687 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Set SOCIAL_AUTH_PROXIES

https://gerrit.wikimedia.org/r/722687

Ok. I went the long way around to test this, as can be seen by the pile of patches that it took. Ultimately, trying to complete the handshake fails in the same way, which I think means that @Majavah was correct in T291447#7369589 that the url-downloader proxy will not handle this use case.

So... two things:

The way to make calls to internal services in production is as follows:

  • Make the code connect to a specific SCHEME:IP:PORT while sending a different Host: header for any HTTP call
  • Set up the service proxy in your chart (you do that by enabling a list of discovery.listeners)
  • Configure the IP:PORT combination as localhost:$DISCOVERY_PORT, where $DISCOVERY_PORT can be found in the puppet hieradata for profile::services_proxy::envoy::listeners in hieradata/common/services/proxy.yaml

So for instance, if you want to call the MediaWiki API, you will typically set the SCHEME:IP:PORT to http://localhost:6501.
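A sketch of the resulting pattern in Python, using only the stdlib. The port and path are illustrative (the real discovery port must be looked up in hieradata/common/services/proxy.yaml), and no network call is made here; the point is the target + Host header split:

```python
import urllib.request

# Hypothetical discovery port for the local envoy listener fronting
# the MediaWiki API pool; the real value lives in puppet hieradata.
DISCOVERY_PORT = 6501

# Connect to localhost:$DISCOVERY_PORT, but carry the public hostname
# in the Host: header so MediaWiki routes the request to the right wiki.
req = urllib.request.Request(
    f"http://localhost:{DISCOVERY_PORT}/w/api.php",
    headers={"Host": "meta.wikimedia.org"},
)
print(req.get_header("Host"))  # meta.wikimedia.org
print(req.full_url)            # http://localhost:6501/w/api.php
```

The same shape works with requests via `requests.get(url, headers={"Host": ...})`; the tricky part, discussed below, is pushing this behavior through third-party libraries that build their own URLs.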

For OAuth there is an additional issue: because you make the call to localhost via plain HTTP but envoy proxies your request via HTTPS, you get back a Set-Cookie response that includes the Secure attribute. See https://phabricator.wikimedia.org/T264101#6646027.

bd808 changed the task status from Open to In Progress.Sep 22 2021, 7:38 PM
bd808 claimed this task.
bd808 triaged this task as High priority.
bd808 moved this task from Backlog to In Progress on the Toolhub board.

@Joe and I had a conference call to talk about reasonable next steps here. We are considering a two part solution of changes to unblock the initial deploy of Toolhub and follow up changes to make a more robust long term solution.

The core issue that needs to be addressed is that Toolhub is an app doing new things in new ways from inside the Kubernetes cluster, and it runs up against the expectations built into the cluster's network configuration. When egress rules are enabled for a k8s deployment, network traffic is greatly restricted by design, both for security (making it more difficult for a remote exploit of a service to make arbitrary network calls) and for stability (avoiding communication with the public CDN edge in favor of communicating with internal service pools).

For contacting the outside internet, the url-downloader.{codfw,eqiad}.wikimedia.org proxies are available. This should handle most of the traffic needs of Toolhub's crawler. These proxies are configured to disallow proxying to endpoints inside the wmnet networks (T291447#7369589) to help with the stability use case. Something is different about this restriction when accessing the proxy from a maintenance host (T291447#7369641), but that is orthogonal to the k8s egress needs of Toolhub.

For contacting "local" HTTP resources (*.wikimedia.org & the project wikis) from inside the k8s clusters, requests should be made to a dynamic service endpoint like api-rw.discovery.wmnet with the public name sent in a Host header like Host: meta.wikimedia.org. This target + Host header combination is fairly easy to implement in custom code. Things get a little trickier, however, for Toolhub, which is using some upstream general purpose libraries for things like OAuth authentication where we do not control the HTTP connection stack completely.

The quick and dirty fix that we are proposing as a first step is adding explicit egress rules to the Toolhub chart to allow direct egress to the "text" CDN edge. The "text" pool is the edge cache for nearly everything other than Commons media requests. Opening this egress should allow Toolhub to call https://meta.wikimedia.org/w/rest.php directly without traversing the url-downloader proxy. The downside is that it puts Toolhub's traffic onto the same gateway used by the public internet. This may impact the stability of Toolhub in the face of some network events (DoS attack, API disabled at the CDN edge in response to an active incident).

The deeper follow up fix will be implementing support for the target + Host header behavior needed to connect directly to the internal MediaWiki API pool. This can certainly be done with raw use of Python's requests library. It will require investigation to determine how deeply we will need to hack on Toolhub's integration with the social-core library that is handling the OAuth2 authentication handshakes for us.

Ultimately the goal will be for all outbound HTTP requests from Toolhub to either route through the url-downloader HTTP proxy (external internet, Cloud VPS/Toolforge) or to route to an internal service cluster and pass the appropriate Host header containing the public hostname for that service (meta.wikimedia.org, etc). Access to wikitech.wikimedia.org may be extra weird in that today that wiki is not hosted in the shared MediaWiki pool (T237773: Move Wikitech onto the production MW cluster).

I had a brief hope that I could use a requests proxy configuration to transparently provide the desired target + Host header behavior in social-core:

proxies = {
  "https://meta.wikimedia.org": "https://api-rw.discovery.wmnet",
}

It turns out, however, that deep in the heart of urllib3 this leads to a TLS-in-TLS tunnel configuration that will not work when the replacement endpoint is a plain TLS web service rather than an actual proxy.

Change 723006 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Add no_proxy envvar

https://gerrit.wikimedia.org/r/723006

Change 723007 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: text-lb egress + no_proxy

https://gerrit.wikimedia.org/r/723007

Change 723006 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Add no_proxy envvar

https://gerrit.wikimedia.org/r/723006

Change 723297 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Do not force envvars to uppercase

https://gerrit.wikimedia.org/r/723297

Change 723297 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Do not force envvars to uppercase

https://gerrit.wikimedia.org/r/723297

Change 723007 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: text-lb egress + no_proxy

https://gerrit.wikimedia.org/r/723007

Traffic is now passing from the staging cluster to meta! I still can't complete an OAuth2 handshake. The leg of exchanging the OAuth2 authorization code for a token is failing. I think that is because the hostname is baked into it on the server side, but I'm not 100% convinced of that yet. But it is at least a new problem to think about. ;)

Aklapper renamed this task from Toolhub container in staging clsuter cannot reach meta.wikimedia.org to complete OAuth handshake to Toolhub container in staging cluster cannot reach meta.wikimedia.org to complete OAuth handshake.Sep 28 2021, 4:11 PM

Change 725060 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Do not force cronjob envvars to uppercase

https://gerrit.wikimedia.org/r/725060

Change 725060 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Do not force cronjob envvars to uppercase

https://gerrit.wikimedia.org/r/725060