toolforge lima-kilo: fails to install because api-gateway fails to deploy
Closed, Resolved, Public

Description

With the default local.yaml values, the lima-kilo install fails at the step that deploys api-gateway.

Log:

TASK [k8s : Deploy custom component into k8s] ***************************************************************************************************************
changed: [localhost] => (item=image-config)
changed: [localhost] => (item=cert-manager)
changed: [localhost] => (item=volume-admission)
failed: [localhost] (item=api-gateway) => {"ansible_loop_var": "item", "changed": true, "cmd": "./deploy.sh local", "delta": "0:00:10.484506", "end": "2023-06-05 16:56:50.933966", "item": {"deploy": "./deploy.sh local", "docker_build": false, "git_url": "https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway", "name": "api-gateway"}, "msg": "non-zero return code", "rc": 1, "start": "2023-06-05 16:56:40.449460", "stderr": "Warning: environments and releases cannot be defined within the same YAML part. Use --- to extract the environments into a dedicated part\nWarning: environments and releases cannot be defined within the same YAML part. Use --- to extract the environments into a dedicated part\nBuilding dependency release=api-gateway, chart=charts/api-gateway\nUpgrading release=api-gateway, chart=charts/api-gateway\n\nFAILED RELEASES:\nNAME\napi-gateway\nin ./helmfile.yaml: failed processing release api-gateway: command \"/usr/local/bin/helm\" exited with non-zero status:\n\nPATH:\n  /usr/local/bin/helm\n\nARGS:\n  0: helm (4 bytes)\n  1: upgrade (7 bytes)\n  2: --install (9 bytes)\n  3: api-gateway (11 bytes)\n  4: charts/api-gateway (18 bytes)\n  5: --create-namespace (18 bytes)\n  6: --namespace (11 bytes)\n  7: api-gateway (11 bytes)\n  8: --values (8 bytes)\n  9: /tmp/helmfile1661587238/api-gateway-api-gateway-values-5df657d897 (65 bytes)\n  10: --reset-values (14 bytes)\n  11: --history-max (13 bytes)\n  12: 10 (2 bytes)\n\nERROR:\n  exit status 1\n\nEXIT STATUS\n  1\n\nSTDERR:\n  Error: UPGRADE FAILED: failed to create resource: Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": failed to call webhook: Post \"https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s\": dial tcp 10.96.141.119:443: i/o timeout\n\nCOMBINED OUTPUT:\n  Error: UPGRADE FAILED: failed to create resource: Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": failed to call webhook: Post 
\"https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s\": dial tcp 10.96.141.119:443: i/o timeout", "stderr_lines": ["Warning: environments and releases cannot be defined within the same YAML part. Use --- to extract the environments into a dedicated part", "Warning: environments and releases cannot be defined within the same YAML part. Use --- to extract the environments into a dedicated part", "Building dependency release=api-gateway, chart=charts/api-gateway", "Upgrading release=api-gateway, chart=charts/api-gateway", "", "FAILED RELEASES:", "NAME", "api-gateway", "in ./helmfile.yaml: failed processing release api-gateway: command \"/usr/local/bin/helm\" exited with non-zero status:", "", "PATH:", "  /usr/local/bin/helm", "", "ARGS:", "  0: helm (4 bytes)", "  1: upgrade (7 bytes)", "  2: --install (9 bytes)", "  3: api-gateway (11 bytes)", "  4: charts/api-gateway (18 bytes)", "  5: --create-namespace (18 bytes)", "  6: --namespace (11 bytes)", "  7: api-gateway (11 bytes)", "  8: --values (8 bytes)", "  9: /tmp/helmfile1661587238/api-gateway-api-gateway-values-5df657d897 (65 bytes)", "  10: --reset-values (14 bytes)", "  11: --history-max (13 bytes)", "  12: 10 (2 bytes)", "", "ERROR:", "  exit status 1", "", "EXIT STATUS", "  1", "", "STDERR:", "  Error: UPGRADE FAILED: failed to create resource: Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": failed to call webhook: Post \"https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s\": dial tcp 10.96.141.119:443: i/o timeout", "", "COMBINED OUTPUT:", "  Error: UPGRADE FAILED: failed to create resource: Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": failed to call webhook: Post \"https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s\": dial tcp 10.96.141.119:443: i/o timeout"], "stdout": "Comparing release=api-gateway, chart=charts/api-gateway\napi-gateway, api-gateway-nginx-config, ConfigMap (v1) has changed:\n  
# Source: api-gateway/templates/nginx-config.yaml\n  apiVersion: v1\n  kind: ConfigMap\n  metadata:\n    name: api-gateway-nginx-config\n  data:\n    nginx.conf: |\n      worker_processes auto;\n\n      events {\n        worker_connections 1024;\n      }\n\n      # see \"Running nginx as a non-root user\" in https://hub.docker.com/_/nginx/\n      pid /tmp/nginx.pid;\n\n      http {\n        # More non-root stuff\n        client_body_temp_path /tmp/client_temp;\n        proxy_temp_path       /tmp/proxy_temp_path;\n        fastcgi_temp_path     /tmp/fastcgi_temp;\n        uwsgi_temp_path       /tmp/uwsgi_temp;\n        scgi_temp_path        /tmp/scgi_temp;\n\n        server {\n          listen 8443 ssl;\n\n          ssl_certificate        /etc/nginx/tls/server/tls.crt;\n          ssl_certificate_key    /etc/nginx/tls/server/tls.key;\n          ssl_client_certificate /var/run/secrets/kubernetes.io/serviceaccount/ca.crt;\n          ssl_verify_client      on;\n          ssl_protocols          TLSv1.2 TLSv1.3;\n          ssl_ciphers            HIGH:!aNULL:!MD5;\n\n          # For details see here: https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Toolforge_API_gateway#Authentication\n          proxy_ssl_certificate         /etc/nginx/tls/backend/tls.crt;\n          proxy_ssl_certificate_key     /etc/nginx/tls/backend/tls.key;\n          proxy_ssl_verify              on;\n          proxy_ssl_trusted_certificate /etc/nginx/tls/backend/ca.crt;\n          proxy_set_header              ssl-client-subject-dn $ssl_client_s_dn;\n\n          location = / {\n            return 200 \"This is the Toolforge API gateway!\\n\";\n          }\n\n\n-         location /jobs/ {\n-           proxy_pass \"https://jobs-api.jobs-api.svc:8443/\";\n-         }\n- \n        }\n\n        server {\n          listen 9000;\n          access_log off;\n\n          location = /healthz {\n            return 200 \"ok\";\n          }\n        }\n      }", "stdout_lines": 
["Comparing release=api-gateway, chart=charts/api-gateway", "api-gateway, api-gateway-nginx-config, ConfigMap (v1) has changed:", "  # Source: api-gateway/templates/nginx-config.yaml", "  apiVersion: v1", "  kind: ConfigMap", "  metadata:", "    name: api-gateway-nginx-config", "  data:", "    nginx.conf: |", "      worker_processes auto;", "", "      events {", "        worker_connections 1024;", "      }", "", "      # see \"Running nginx as a non-root user\" in https://hub.docker.com/_/nginx/", "      pid /tmp/nginx.pid;", "", "      http {", "        # More non-root stuff", "        client_body_temp_path /tmp/client_temp;", "        proxy_temp_path       /tmp/proxy_temp_path;", "        fastcgi_temp_path     /tmp/fastcgi_temp;", "        uwsgi_temp_path       /tmp/uwsgi_temp;", "        scgi_temp_path        /tmp/scgi_temp;", "", "        server {", "          listen 8443 ssl;", "", "          ssl_certificate        /etc/nginx/tls/server/tls.crt;", "          ssl_certificate_key    /etc/nginx/tls/server/tls.key;", "          ssl_client_certificate /var/run/secrets/kubernetes.io/serviceaccount/ca.crt;", "          ssl_verify_client      on;", "          ssl_protocols          TLSv1.2 TLSv1.3;", "          ssl_ciphers            HIGH:!aNULL:!MD5;", "", "          # For details see here: https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Toolforge_API_gateway#Authentication", "          proxy_ssl_certificate         /etc/nginx/tls/backend/tls.crt;", "          proxy_ssl_certificate_key     /etc/nginx/tls/backend/tls.key;", "          proxy_ssl_verify              on;", "          proxy_ssl_trusted_certificate /etc/nginx/tls/backend/ca.crt;", "          proxy_set_header              ssl-client-subject-dn $ssl_client_s_dn;", "", "          location = / {", "            return 200 \"This is the Toolforge API gateway!\\n\";", "          }", "", "", "-         location /jobs/ {", "-           proxy_pass 
\"https://jobs-api.jobs-api.svc:8443/\";", "-         }", "- ", "        }", "", "        server {", "          listen 9000;", "          access_log off;", "", "          location = /healthz {", "            return 200 \"ok\";", "          }", "        }", "      }"]}

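Apparently this has something to do with secret volumes: the kubelet initially fails to mount the TLS secrets because they are not found, although the secrets do exist when listed afterwards: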

arturo@nostromo:~/git/wmf/cloud/toolforge/api-gateway $ kubectl -n api-gateway describe pod/api-gateway-nginx-56847fb644-b5l26
[...]
Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Normal   Scheduled    2m1s               default-scheduler  Successfully assigned api-gateway/api-gateway-nginx-56847fb644-b5l26 to toolforge-control-plane
  Warning  FailedMount  2m (x2 over 2m)    kubelet            MountVolume.SetUp failed for volume "server-certs" : secret "api-gateway-server" not found
  Warning  FailedMount  117s (x4 over 2m)  kubelet            MountVolume.SetUp failed for volume "backend-certs" : secret "api-gateway-backend-server" not found
  Normal   Pulling      113s               kubelet            Pulling image "docker-registry.tools.wmflabs.org/nginx:latest"
  Normal   Pulled       109s               kubelet            Successfully pulled image "docker-registry.tools.wmflabs.org/nginx:latest" in 4.076066564s
  Normal   Created      109s               kubelet            Created container nginx
  Normal   Started      109s               kubelet            Started container nginx
arturo@nostromo:~/git/wmf/cloud/toolforge/api-gateway $ kubectl -n api-gateway get secrets
NAME                                TYPE                                  DATA   AGE
api-gateway-backend-server          kubernetes.io/tls                     3      75s
api-gateway-nginx-token-kmql8       kubernetes.io/service-account-token   3      80s
api-gateway-server                  kubernetes.io/tls                     3      78s
default-token-xn29d                 kubernetes.io/service-account-token   3      80s
sh.helm.release.v1.api-gateway.v1   helm.sh/release.v1                    1      80s

Manually deleting the pods solved the problem.

Event Timeline

This seems like a timing issue where cert-manager hasn't started up in time for the api-gateway chart to deploy.

I think this is correct. I tried today on a slower laptop and this problem did not show up.

Definitely a race condition.
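One way to narrow a race like this is to wait for the cert-manager webhook before deploying api-gateway. The following is only a minimal sketch, not the fix that was applied for this task. `wait_for` and `deploy_after_webhook` are hypothetical helper names, and the deployment name `cert-manager-webhook` is an assumption inferred from the service name in the error above. A completed rollout also does not strictly guarantee that the webhook service is reachable, so this reduces the window rather than closing it.

```shell
#!/bin/sh
# Hypothetical helper: run a command until it succeeds,
# up to ATTEMPTS times, sleeping DELAY seconds between tries.
# Usage: wait_for ATTEMPTS DELAY COMMAND [ARGS...]
wait_for() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Only sleep if another attempt is coming.
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        n=$((n + 1))
    done
    return 1
}

# Hypothetical wrapper: deploy only once the webhook rollout has completed.
# Assumption: the webhook runs as deployment/cert-manager-webhook
# in the cert-manager namespace.
deploy_after_webhook() {
    wait_for 30 5 kubectl -n cert-manager rollout status \
        deployment/cert-manager-webhook --timeout=10s || return 1
    ./deploy.sh local
}
```

Calling `deploy_after_webhook` from the api-gateway checkout would then replace the bare `./deploy.sh local` invocation. The retry wrapper also covers the case where the deployment object does not exist yet, which `kubectl rollout status` alone reports as an error.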

aborrero claimed this task.
aborrero moved this task from Next to Doing on the User-aborrero board.