Toolforge: consider introducing a command line for creating reverse proxies
Open, Needs Triage, Public

Description

During Wikimedia-Hackathon-2023 we learned about more and more tools using the reverse proxy described at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes/Reverse_proxy

We could consider introducing a command line interface abstraction with the semantics required to maintain such reverse proxies.

Event Timeline

Is this related to having a frontend+backend type of application?

If so, this would align pretty well with the idea of being able to run multiple deployments/containers in the same tool and consolidating the client into something like a "toolforge app" command. That would allow running the backend and exposing it to the frontend, and exposing the frontend to the world (a superset of what the current webservice does).

Similar to the compute + network sections mentioned in the mail discussion here: https://lists.wikimedia.org/hyperkitty/list/cloud-admin@lists.wikimedia.org/thread/CHGYMEUYSIFU3AIVPXFKEFIDT7UK6YV2/

That could also help with running things like Celery workers, etc.

Is this related to having a frontend+backend type of application?

Yes and no. The reverse proxy can be used for many use cases. One of the most interesting is having the tool's webservice make requests to external services on the browser's behalf, to avoid CORS problems. This can indeed show up in frontend+backend applications, but not only in them. I think we can currently find both kinds of tools in Toolforge (frontend+backend and monoliths) making use of this reverse proxy.

Yes, I agree we should take into account the semantics of the platform as a whole before introducing this.

Can you elaborate on the use cases? I don't think I have a clear idea of what you mean. Examples of application designs that would use it, besides the frontend-backend one, would help.

T250922: MoeData causes visiting browser to load data from 3rd party sites is a concrete example of the reverse proxy to external service use case. Implementing reverse proxies there solved two separate problems:

  • Avoid CORS restrictions when attempting to call web service directly from browser.
  • Prevent our visitor's IP address from being shared with a 3rd party web service.
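On the second point: when the proxy makes the request server-side, the third party sees the proxy's address instead of the visitor's, but only as long as the proxy does not pass identifying headers along. A minimal sketch of that filtering step, assuming a hand-picked deny list (the header names below are an illustrative assumption, not what the Toolforge ingress actually does):

```python
from __future__ import annotations

# Headers that could reveal the visitor's identity to the third party.
# This deny list is an illustrative assumption, not exhaustive or official.
IDENTIFYING_HEADERS = {
    "cookie",
    "authorization",
    "referer",
    "forwarded",
    "x-forwarded-for",
    "x-real-ip",
    "x-client-ip",
}


def upstream_headers(incoming: dict[str, str]) -> dict[str, str]:
    """Return only the request headers that are safe to forward upstream."""
    return {
        name: value
        for name, value in incoming.items()
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() not in IDENTIFYING_HEADERS
    }
```

An allow list (forwarding only headers such as Accept) would be the stricter choice; the deny list is used here only to keep the sketch short.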

Thanks!

That last point is quite important.
Could this be worked around by the application itself instead? (If I understand correctly, we are mapping from a path under the tool domain to an external host, e.g. mytool.toolforge.org/scdn/.* -> i.scdn.co/.*)
For example, a Flask app can be used as a proxy to 3rd parties, with no need to rely on whatever implementation of ingress we have, right?
Currently we don't enable mod_proxy on lighttpd (afaik), but we could enable that and let the user sort it out there too, right? (Not sure if this is better, though.)
With buildpack-based images you can also (at least on some of them, php/scala for example) configure httpd/nginx, so if you don't want your app to handle the proxying you can rely on those too.
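To make the "application as proxy" idea concrete, here is a rough sketch using only the Python standard library rather than Flask, so it stays self-contained. The UPSTREAMS table, the function names and the timeout are hypothetical; only the /scdn/ -> i.scdn.co mapping comes from the example above.

```python
from __future__ import annotations

from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical routing table: local path prefix -> external base URL.
UPSTREAMS = {
    "/scdn/": "https://i.scdn.co/",
}


def upstream_url(path: str, query: str = "") -> str | None:
    """Translate a local request path into the external URL to fetch.

    Returns None when no prefix matches, so the caller can answer 404.
    """
    for prefix, base in UPSTREAMS.items():
        if path.startswith(prefix):
            # Keep "/" unescaped so nested upstream paths survive intact.
            url = base + quote(path[len(prefix):], safe="/")
            return f"{url}?{query}" if query else url
    return None


def app(environ, start_response):
    """Minimal WSGI proxy: fetch the mapped URL server-side, relay the body."""
    target = upstream_url(
        environ.get("PATH_INFO", ""), environ.get("QUERY_STRING", "")
    )
    if target is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"no upstream configured for this path\n"]
    try:
        with urlopen(target, timeout=10) as resp:
            body = resp.read()
            content_type = resp.headers.get(
                "Content-Type", "application/octet-stream"
            )
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        start_response("502 Bad Gateway", [("Content-Type", "text/plain")])
        return [b"upstream request failed\n"]
    start_response("200 OK", [("Content-Type", content_type)])
    return [body]


if __name__ == "__main__":
    # Local try-out only; a real tool would run under its usual web server.
    from wsgiref.simple_server import make_server

    make_server("127.0.0.1", 8000, app).serve_forever()
```

This buffers the whole upstream response in memory and does no caching, which is probably acceptable for small assets but not for large downloads.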

Silly example:

using the php buildpack and the following extras:

12:06 PM ~/Work/wikimedia/tools/MoeData  (master|● 3) 
dcaro@vulcanus$ git diff --cached
diff --git a/Procfile b/Procfile
new file mode 100644
index 0000000..b5fe852
--- /dev/null
+++ b/Procfile
@@ -0,0 +1 @@
+web: heroku-php-apache2 -C httpd.inc.conf public/
diff --git a/composer.json b/composer.json
new file mode 100644
index 0000000..0967ef4
--- /dev/null
+++ b/composer.json
@@ -0,0 +1 @@
+{}
diff --git a/httpd.inc.conf b/httpd.inc.conf
new file mode 100644
index 0000000..c4dfdc7
--- /dev/null
+++ b/httpd.inc.conf
@@ -0,0 +1,4 @@
+SSLProxyEngine on
+
+ProxyPass "/"  "https://www.wikipedia.org/"
+ProxyPassReverse "/"  "https://www.wikipedia.org/"

And especially if we allow multiple apps, users can implement API gateways and similar that way.

I understand that it's attractive to rely on the platform for that, but that just couples people more tightly to the current implementation, when they could relatively easily keep their application self-contained.

Unless we create a proper abstraction on top of it and maintain it, but given that there's already a workaround/solution that would unblock users, I'm reluctant to create another one.

While I like the idea of relying on existing features (like configuring a web server by hand), I do think we need to introduce some better tooling for this. For example webservice is currently the only way to manage Service objects so it would not be possible to set up a frontend-backend split without raw Kubernetes object manipulation.

I think it would be more helpful if, instead of focusing on which k8s features exist and how to expose them to users, we thought about what needs the users have and how to implement them on top of k8s.

That said, the above solution allows you to do the frontend-backend split, and using Flask (or whatever framework you use) as the proxy achieves the same thing. It's a common way of doing it when testing during local development, and for the size of the applications that we have it's more than enough imo.

Though I agree that an abstraction should be made, and probably that would be part of the move to "applications" instead of just "webservice", as when you allow the creation of multiple applications, we should create an easy way for the user to use the internal application with some sort of load balancing, and that probably should not be using a path for the public endpoint on the ingress proxy.

But for the specific case of cross-site issues with external pages/APIs I would still go with the custom app-side proxy (be that nginx/apache or the application itself doing the proxying). I think it's simple enough to be useful with little documentation until we get around to creating the higher-level abstraction, and easy enough for us (and the users) to maintain and test that it won't be a burden now or later.

That said, the above solution allows you to do the frontend-backend split, and using Flask (or whatever framework you use) as the proxy achieves the same thing. It's a common way of doing it when testing during local development, and for the size of the applications that we have it's more than enough imo.

Does it? How would you route traffic to the backend pod from the frontend without a Service object?

In the above case, yes: the user has two tools, one for the backend and one for the frontend, and it would be just a ProxyPass to https://<backend_tool>.toolforge.org.

In the current state, you can't start a second webservice without playing with k8s anyway (that would be the move to 'applications' that I mentioned above). You can try using a continuous job, but as you say, that would not create the Service either (as far as I know).

But for the specific case of cross-site issues with external pages/APIs I would still go with the custom app-side proxy (be that nginx/apache or the application itself doing the proxying). I think it's simple enough to be useful with little documentation until we get around to creating the higher-level abstraction, and easy enough for us (and the users) to maintain and test that it won't be a burden now or later.

The example from T250922: MoeData causes visiting browser to load data from 3rd party sites is exactly making a custom app-side proxy using nginx. Is this broadly an objection to using the nginx that is managed via Ingress objects? Do you also object to solutions like https://wikitech.wikimedia.org/wiki/User:BryanDavis/Kubernetes#Make_a_tool_redirect_to_another_tool_WITHOUT_running_a_webservice which implement 302 redirection using an Ingress object directly rather than by deploying a third HTTP service (Toolforge front proxy -> Ingress -> Service -> Pod containing HTTP 302 service)?

I'm not objecting to you (a power user, a Toolforge root, and someone who knows what they are doing) using custom hand-made Ingress objects to implement a redirection at the ingress proxy level.

What I'm reluctant to do is create a new widely available service/abstraction for it when, in most cases, there's already a workaround available that is not technically horrible for the few use cases that we have, especially taking into account that we are eventually going to introduce a way to tackle the most common use cases anyway (internal services).

Given the low demand (though that's hard to measure), the existence of "good enough" workarounds and the future support for most use cases, the cost of maintaining and developing such a service makes it, again from my point of view, a non-optimal effort.

This helps me contextualize your responses here, thank you. :)

I too am not sure of how frequently this sort of thing is needed. Anecdotally, I have seen a number of Toolforge hosted projects that are doing a frontend/backend split and using separate tool accounts to host each. Finding better ways to support those tools seems useful, but that doesn't necessarily have to take the form of some toolforge proxy ... abstraction. Figuring out how to use buildpacks to compile Vue/React/etc javascript down to static assets that can be served from the same webservice as the tool's API might prove to be more easily explainable and valuable to folks if the two ideas are competing for finite resources.
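As a rough illustration of the "static assets and API from the same webservice" idea, here is a standard-library sketch. It assumes the compiled frontend bundle ends up in a dist directory and that the API lives under /api/; both names are hypothetical, and the JSON reply is a placeholder rather than a real API.

```python
from __future__ import annotations

import json
import mimetypes
from pathlib import Path

# Hypothetical location of the compiled frontend bundle.
STATIC_ROOT = Path("dist")


def resolve_static(root: Path, request_path: str) -> Path | None:
    """Map a URL path to a file under root, refusing to escape the directory."""
    relative = request_path.lstrip("/") or "index.html"
    candidate = (root / relative).resolve()
    # Reject anything that resolves outside root or is not a regular file.
    if root.resolve() not in candidate.parents or not candidate.is_file():
        return None
    return candidate


def app(environ, start_response):
    """Serve /api/* as JSON and everything else from the static bundle."""
    path = environ.get("PATH_INFO", "/")
    if path.startswith("/api/"):
        # Placeholder API: echo which endpoint was requested.
        payload = json.dumps({"endpoint": path[len("/api/"):]}).encode()
        start_response("200 OK", [("Content-Type", "application/json")])
        return [payload]
    found = resolve_static(STATIC_ROOT, path)
    if found is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found\n"]
    content_type = mimetypes.guess_type(found.name)[0] or "application/octet-stream"
    start_response("200 OK", [("Content-Type", content_type)])
    return [found.read_bytes()]
```

Because the browser only ever talks to one origin in this layout, neither a second tool account nor a reverse proxy is needed for the frontend to reach the API.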