
[api-gateway] Add a python server to serve consolidated openapi docs
Closed, Resolved, Public

Description

Let's call it oapi-server. At a minimum, this will need:

  • The oapi-server code itself (just hello world to start)
  • A blubber pipeline to build the oapi-server image
  • A container for the oapi-server in deployment/chart/templates/nginx-deployment.yaml.tpl
  • A new location block added to deployment/chart/templates/nginx-config.yaml.tpl routing /oapi (or whatever we decide to call it) to the oapi-server
  • Updates to values.yaml to include oapi-server
  • Update gitlab CI to include oapi-server
  • Changes on the toolforge-deploy side?
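
As a rough idea of the "hello world" starting point, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the `/healthz` path, the port, the `OapiHandler` and `build_response` names, and the choice of `http.server` itself (the real oapi-server may well use a proper framework).

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_response(path: str) -> tuple[int, dict]:
    """Map a request path to (status code, JSON body).

    The paths are placeholders, not the final API of oapi-server.
    """
    if path == "/":
        return 200, {"message": "hello world"}
    if path == "/healthz":
        return 200, {"status": "ok"}
    return 404, {"error": "not found"}


class OapiHandler(BaseHTTPRequestHandler):
    """Serve the JSON bodies produced by build_response."""

    def do_GET(self) -> None:
        status, body = build_response(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def main(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Start the server and block until it is interrupted."""
    HTTPServer((host, port), OapiHandler).serve_forever()
```

Calling `main()` starts the server. Keeping the routing in a plain function makes it testable without opening a socket.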

Also:

  • Add python-specific hooks to existing pre-commit config
  • Add tox.ini

Then:

  • Add the logic to fetch & merge the oapi docs from the backend APIs
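
The merge half of that step could look roughly like the sketch below. It assumes each backend document has already been fetched and parsed into a dict, and that merging means combining `paths` and `components.schemas`. The `merge_openapi_docs` name, the default title and version, and the fail-on-conflict policy are all hypothetical choices, not what the merge request necessarily does.

```python
from copy import deepcopy


def merge_openapi_docs(
    docs: dict[str, dict],
    title: str = "Consolidated API",
    version: str = "0.0.1",
) -> dict:
    """Merge per-backend OpenAPI documents into a single document.

    `docs` maps a backend name to its parsed OpenAPI document.
    Duplicate paths, and schemas that share a name but differ in
    content, raise ValueError instead of being silently overwritten.
    """
    merged = {
        "openapi": "3.0.0",
        "info": {"title": title, "version": version},
        "paths": {},
        "components": {"schemas": {}},
    }
    for backend, doc in docs.items():
        for path, item in doc.get("paths", {}).items():
            if path in merged["paths"]:
                raise ValueError(f"path {path!r} from {backend!r} is already defined")
            merged["paths"][path] = deepcopy(item)

        schemas = doc.get("components", {}).get("schemas", {})
        for name, schema in schemas.items():
            existing = merged["components"]["schemas"].get(name)
            if existing is not None and existing != schema:
                raise ValueError(f"schema {name!r} from {backend!r} conflicts")
            merged["components"]["schemas"][name] = deepcopy(schema)
    return merged
```

Failing loudly on conflicts is one option; prefixing schema names with the backend name would be another.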

Details

Title: [oapi-spec] add oapi-server to gateway
Reference: repos/cloud/toolforge/api-gateway!17
Author: sstefanova
Source Branch: slavina/add-oapi-server
Dest Branch: main

Event Timeline

For the deployment we can reuse the one that's there already and add two containers, like we do in jobs-api for example. That also removes the need to have a service to communicate between them, as they would be running in the same pod.

For the CI, we might want to use the same as builds-api for example, that does pre-commit + blubber test. That would mean that we have to add a test variant to the blubber file too ([[https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/blob/main/.pipeline/blubber.yaml?ref_type=heads | like so]]). We don't use it yet in any other repo that runs python tests (e.g. jobs-api), so there's no base for it yet, but it should be easy to set up, and then we can consolidate all of them to do the same (pre-commit + blubber test, as opposed to only tox using a manually built image).

On the toolforge-deploy side, we will need to add any new values that might be needed (maybe not, or only for debugging reasons in the local.yaml).

dcaro triaged this task as High priority. Apr 11 2024, 8:33 AM

For the deployment we can reuse the same that's there already and add two containers

Do you mean add one container, so there would be two containers all together, nginx + oapi-server?

For the CI, we might want to use the same as builds-api for example, that does pre-commit + blubber test. That would mean that we have to add a test variant to the blubber file

Maybe we can start with just one variant for now, then add one for test once the basic plumbing is working?

For the deployment we can reuse the same that's there already and add two containers

Do you mean add one container, so there would be two containers all together, nginx + oapi-server?

Two containers in one pod yes.

For the CI, we might want to use the same as builds-api for example, that does pre-commit + blubber test. That would mean that we have to add a test variant to the blubber file

Maybe we can start with just one variant for now, then add one for test once the basic plumbing is working?

The issue is that we use pre-commit in some repos and only tox in others. If you want pre-commit + tox, this would be the first repo to do it.

For the deployment we can reuse the same that's there already and add two containers

Do you mean add one container, so there would be two containers all together, nginx + oapi-server?

Two containers in one pod yes.

I think it would be better to have two separate pods for isolation purposes. This way the system can schedule/restart/healthcheck them separately.

A k8s Service is just a few firewalling/NAT rules; it does not add much overhead, does it?

For the deployment we can reuse the same that's there already and add two containers

Do you mean add one container, so there would be two containers all together, nginx + oapi-server?

Two containers in one pod yes.

I think it would be better to have two separate pods for isolation purposes. This way the system can schedule/restart/healthcheck them separately.

A k8s Service is just a few firewalling/NAT rules; it does not add much overhead, does it?

It also complicates debugging, as now your pods will potentially run on different hosts, in different containers, across the network etc.

That also means having to ssl-ify the traffic (TLS traffic in-between the pods), so it's probably not the simplest option. We can split them once we have an idea of the flakiness of each with real traffic.

For the CI, we might want to use the same as builds-api for example, that does pre-commit + blubber test. That would mean that we have to add a test variant to the blubber file too ([[https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/blob/main/.pipeline/blubber.yaml?ref_type=heads | like so]]). We don't use it yet in any other repo that runs python tests (e.g. jobs-api), so there's no base for it yet, but it should be easy to set up, and then we can consolidate all of them to do the same (pre-commit + blubber test, as opposed to only tox using a manually built image).

Actually, we might want to use a pre-built image for pre-commit too. I'm playing with adding pre-caching of the pre-commit hooks we have to the image; that should considerably reduce the runtime and the external fetching of dependencies (which seems to be starting to be an issue for golangci-lint, see https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/jobs/239414).

It also complicates debugging, as now your pods will potentially run on different hosts, in different containers, across the network etc.

That also means having to ssl-ify the traffic (TLS traffic in-between the pods), so it's probably not the simplest option. We can split them once we have an idea of the flakiness of each with real traffic.

I think I misread the earlier comment, I thought you were suggesting to run this new oapi-server container within the api-gateway deployment pod. Sorry for the noise.

Slst2020 changed the task status from Open to In Progress. Apr 11 2024, 3:29 PM
Slst2020 updated the task description.
Slst2020 moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 08) board.

I think I misread the earlier comment, I thought you were suggesting to run this new oapi-server container within the api-gateway deployment pod. Sorry for the noise.

So, apparently I did not misread it, because the merge request just contains this architecture.

I don't think we should go this route. I don't like this pattern. We should create a separate pod/deployment for this new service.

The API gateway pod is in the hot path for all user-facing things in Toolforge. It should be isolated, this way the system can schedule/restart/healthcheck it separately.

All the concerns mentioned here or in the merge requests (TLS, debugging, separate repo with boilerplate, etc) are not blockers, nor are they complex to introduce for that matter. It is the same pattern we have for the rest of the system.

I think I misread the earlier comment, I thought you were suggesting to run this new oapi-server container within the api-gateway deployment pod. Sorry for the noise.

So, apparently I did not misread it, because the merge request just contains this architecture.

I don't think we should go this route. I don't like this pattern. We should create a separate pod/deployment for this new service.

The API gateway pod is in the hot path for all user-facing things in Toolforge. It should be isolated, this way the system can schedule/restart/healthcheck it separately.

All the concerns mentioned here or in the merge requests (TLS, debugging, separate repo with boilerplate, etc) are not blockers, nor are they complex to introduce for that matter. It is the same pattern we have for the rest of the system.

I think you are confusing the api gateway (nginx + code) with nginx only; both things are "the api gateway". And yes, I agree it should be resilient and treated carefully. I do disagree that the way to make it so is to split it and make it distributed.

dcaro moved this task from In Review to Done on the Toolforge (Toolforge iteration 09) board.

This is deployed already:

dcaro@tools-bastion-13:~$ curl -v --insecure https://api.svc.tools.eqiad1.wikimedia.cloud:30003/openapi.json -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 172.16.6.113:30003...
* Connected to api.svc.tools.eqiad1.wikimedia.cloud (172.16.6.113) port 30003 (#0)
* ALPN: offers h2,http/1.1
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [25 bytes data]
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
{ [76 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [462 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [80 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [52 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
} [8 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [52 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN: server accepted http/1.1
* Server certificate:
*  subject: CN=api-gateway.api-gateway.svc
*  start date: May  2 18:07:37 2024 GMT
*  expire date: May 23 18:07:37 2024 GMT
*  issuer: CN=api-gateway.api-gateway.svc
*  SSL certificate verify result: self-signed certificate (18), continuing anyway.
* using HTTP/1.1
} [5 bytes data]
> GET /openapi.json HTTP/1.1
> Host: api.svc.tools.eqiad1.wikimedia.cloud:30003
> User-Agent: curl/7.88.1
> Accept: */*
> 
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [297 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [297 bytes data]
* old SSL session ID is stale, removing
{ [5 bytes data]
< HTTP/1.1 200 OK
< Server: nginx/1.21.0
< Date: Wed, 15 May 2024 10:34:35 GMT
< Content-Type: application/json
< Content-Length: 27088
< Connection: keep-alive
< 
{ [16227 bytes data]
100 27088  100 27088    0     0  98349      0 --:--:-- --:--:-- --:--:-- 98861
* Connection #0 to host api.svc.tools.eqiad1.wikimedia.cloud left intact

You can explore the openapi definition at https://api-docs.toolforge.org
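
If you would rather script the check than read curl output, something like the following sketch could work. It uses only the standard library. The `fetch_openapi` and `summarize` names are made up for this example, and `verify_tls=False` mirrors curl's `--insecure`, which is needed here only because the endpoint presents a self-signed certificate.

```python
import json
import ssl
import urllib.request


def fetch_openapi(url: str, verify_tls: bool = True, timeout: float = 10.0) -> dict:
    """Download and parse an OpenAPI JSON document.

    Setting verify_tls=False skips certificate checks, like curl --insecure.
    """
    context = ssl.create_default_context()
    if not verify_tls:
        # check_hostname must be disabled before verify_mode can be CERT_NONE
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return json.load(response)


def summarize(doc: dict) -> dict:
    """Return the declared OpenAPI version and the number of paths."""
    return {"openapi": doc.get("openapi"), "paths": len(doc.get("paths", {}))}
```

For example, `summarize(fetch_openapi("https://api.svc.tools.eqiad1.wikimedia.cloud:30003/openapi.json", verify_tls=False))` would report the spec version and how many paths the consolidated document exposes, assuming you run it from a host that can reach that endpoint.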