[lima-kilo,builds] Fix local development setup
Closed, Resolved (Public)

Description

Currently the build service does not work correctly under the lima-kilo setup.

The current status is that the builds-builder is unable to verify access to harbor during the analyze step:

step-analyze: 2023-09-27T09:04:22.406397284Z ERROR: failed to initialize analyzer: validating registry read access: ensure registry read access to 172.19.0.1/tool-tf-test/tool-tf-test:latest
step-analyze:

There are several moving parts:

  • So far this has only been tested outside lima-kilo, using the installer and configuration in the builds-builder repository (so it seems it has never worked with the harbor deployed inside lima-kilo)
  • There are two relevant secrets:
    • basic-user-pass -> http user and password for tekton, always deployed
    • dockerconfig -> docker configuration to set up insecure registries, as we use harbor over http (because the lifecycle image does not support self-signed certificates)
  • The current lima-kilo setup deploys harbor on a non-standard port (8080), and the builds-builder and builds-api charts have no way to specify it (as it has not been needed yet)
  • The current lima-kilo setup uses vagrant to start a VM, where everything else is installed

Event Timeline

dcaro changed the task status from Open to In Progress. (Sep 27 2023, 8:41 AM)
dcaro claimed this task.
dcaro moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 00) board.

some differences:

The cluster that lima-kilo installs inside the VM uses containerd as its container runtime:

vagrant@bullseye:~$ kubectl get nodes -o wide
NAME                      STATUS   ROLES                  AGE     VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION    CONTAINER-RUNTIME
toolforge-control-plane   Ready    control-plane,master   7h44m   v1.23.17   172.18.0.2    <none>        Debian GNU/Linux 11 (bullseye)   5.10.0-22-amd64   containerd://1.7.1

while my local setup (where the build works) uses docker:

dcaro@urcuchillay$ kubectl get nodes -o wide
NAME       STATUS   ROLES                  AGE   VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION           CONTAINER-RUNTIME
minikube   Ready    control-plane,master   53d   v1.23.17   192.168.49.2   <none>        Ubuntu 22.04.2 LTS   6.4.14-200.fc38.x86_64   docker://24.0.4

Production uses docker too:

root@tools-k8s-control-4:/srv/git/toolforge-deploy# kubectl get nodes -o wide
NAME                  STATUS   ROLES                  AGE      VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION          CONTAINER-RUNTIME
tools-k8s-control-4   Ready    control-plane,master   175d     v1.22.17   172.16.3.156   <none>        Debian GNU/Linux 10 (buster)   4.19.0-23-cloud-amd64   docker://23.0.1
tools-k8s-control-5   Ready    control-plane,master   175d     v1.22.17   172.16.1.176   <none>        Debian GNU/Linux 10 (buster)   4.19.0-23-cloud-amd64   docker://23.0.1
...

toolsbeta too:

root@toolsbeta-test-k8s-control-5:~# kubectl get nodes -o wide
NAME                           STATUS   ROLES                  AGE      VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION          CONTAINER-RUNTIME
toolsbeta-test-k8s-control-4   Ready    control-plane,master   2y312d   v1.23.17   172.16.2.7     <none>        Debian GNU/Linux 10 (buster)   4.19.0-25-amd64         docker://23.0.3
toolsbeta-test-k8s-control-5   Ready    control-plane,master   2y307d   v1.23.17   172.16.0.79    <none>        Debian GNU/Linux 10 (buster)   4.19.0-25-amd64         docker://23.0.3

We might want to change eventually, but that's the current setup.

These are the logs from the nginx in front of harbor for a successful docker login:

172.19.0.1 - "GET /v2/ HTTP/1.1" 401 76 "-" "docker/24.0.6 go/go1.20.7 git-commit/1a79695 kernel/5.10.0-22-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/24.0.6 \x5C(linux\x5C))" 0.001 0.001 .
172.19.0.1 - "GET /service/token?account=robot%24tekton&client_id=docker&offline_token=true&service=harbor-registry HTTP/1.1" 200 897 "-" "docker/24.0.6 go/go1.20.7 git-commit/1a79695 kernel/5.10.0-22-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/24.0.6 \x5C(linux\x5C))" 0.019 0.019 .
172.19.0.1 - "GET /v2/ HTTP/1.1" 200 2 "-" "docker/24.0.6 go/go1.20.7 git-commit/1a79695 kernel/5.10.0-22-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/24.0.6 \x5C(linux\x5C))" 0.003 0.004 .

These are the logs for the failed build run:

172.19.0.1 - "POST /api/v2.0/projects HTTP/1.1" 409 91 "-" "WMCS toolforge-builds-api Go-http-client" 0.005 0.005 .  -> this is the builds-api trying to create the project, and getting conflict as it already exists
172.19.0.1 - "GET /v2/ HTTP/1.1" 401 76 "-" "Go-http-client/1.1" 0.000 0.001 . -> this is the login attempt, but no more logs from here
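
The difference between the two runs above is that the docker client follows the 401 on /v2/ with a request to /service/token, while the failed build never reaches the token endpoint. A minimal sketch of that comparison, assuming the access-log layout shown above (the function names are hypothetical, and the user agents and query string in the sample lines are abbreviated):

```python
import re

# Matches the leading part of the access-log lines quoted above:
#   <client> - "<METHOD> <target> HTTP/x.y" <status> ...
LINE_RE = re.compile(r'^(\S+) - "([A-Z]+) (\S+) HTTP/[\d.]+" (\d{3}) ')


def summarize(lines):
    """Return (method, path without query string, status) per parsable line."""
    out = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            _client, method, target, status = m.groups()
            out.append((method, target.split("?", 1)[0], int(status)))
    return out


def fetched_token(lines):
    """True if the client went on to request a token after the 401 challenge."""
    return any(path == "/service/token" for _m, path, _s in summarize(lines))


# Sample lines, abbreviated from the logs in this task.
ok_login = [
    '172.19.0.1 - "GET /v2/ HTTP/1.1" 401 76 "-" "docker/24.0.6" 0.001 0.001 .',
    '172.19.0.1 - "GET /service/token?account=robot%24tekton&service=harbor-registry HTTP/1.1" 200 897 "-" "docker/24.0.6" 0.019 0.019 .',
    '172.19.0.1 - "GET /v2/ HTTP/1.1" 200 2 "-" "docker/24.0.6" 0.003 0.004 .',
]
failed_build = [
    '172.19.0.1 - "POST /api/v2.0/projects HTTP/1.1" 409 91 "-" "WMCS toolforge-builds-api Go-http-client" 0.005 0.005 .',
    '172.19.0.1 - "GET /v2/ HTTP/1.1" 401 76 "-" "Go-http-client/1.1" 0.000 0.001 .',
]

print(fetched_token(ok_login))
print(fetched_token(failed_build))
```

So the analyzer receives the challenge but its token request never arrives at this nginx, which suggests looking at where the client sends that request.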

These are the logs when curl talks https to the http port:

172.19.0.1 - "\x16\x03\x01\x02\x00\x01\x00\x01\xFC\x03\x033%\x5C\x1E\xF0\xE0\xED\xF5\xE9\xC5\xF5\xAC@\x03\xC3\x15\xDE\x98\xFE,I\xF2\xAE\xA2\xA46\x5C\xDE\xDE\xEEq_ \x9A\xA8\x1E0+\xA4\x85DI\xE8wa1X\xED\xDC\xCE\x22\xAC\x5C\x8Bwd&\x18@\x84\x06\x19\xC3\xAC\x91\x00>\x13\x02\x13\x03\x13\x01\xC0,\xC00\x00\x9F\xCC\xA9\xCC\xA8\xCC\xAA\xC0+\xC0/\x00\x9E\xC0$\xC0(\x00k\xC0#\xC0'\x00g\xC0" 400 150 "-" "-" 0.001 - .
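
The binary request in that log line is a TLS ClientHello that nginx tried to read as plain HTTP, hence the 400. A small sketch of how such a payload can be recognised from its first bytes (a heuristic for illustration only; the function name is hypothetical):

```python
def looks_like_tls_client_hello(data: bytes) -> bool:
    """Heuristic check on the first bytes of a connection.

    A TLS record starts with the content type (0x16 = handshake) and a
    two-byte version whose major byte is 0x03, followed by a two-byte
    length. The first byte of the handshake message is its type
    (0x01 = ClientHello).
    """
    return (
        len(data) >= 6
        and data[0] == 0x16
        and data[1] == 0x03
        and data[5] == 0x01
    )


# First bytes of the request that nginx logged above.
logged = b"\x16\x03\x01\x02\x00\x01\x00\x01\xfc\x03\x03"

print(looks_like_tls_client_hello(logged))
print(looks_like_tls_client_hello(b"GET /v2/ HTTP/1.1\r\n"))
```

The failed build run above does not show this pattern, so the analyzer is not simply speaking https to the http port.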

This is with a modified setup, where I have harbor running on port 80:

vagrant@bullseye:~$ kubectl get secret -n image-build basic-user-pass -o yaml
apiVersion: v1
data:
  password: RHVtbXlyMGJvdHBhc3M=
  username: cm9ib3QkdGVrdG9u
kind: Secret
metadata:
  annotations:
    meta.helm.sh/release-name: builds-builder
    meta.helm.sh/release-namespace: builds-builder
    tekton.dev/docker-0: http://172.19.0.1
  creationTimestamp: "2023-09-26T09:08:40Z"
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: toolforge-build-service
  name: basic-user-pass
  namespace: image-build
  resourceVersion: "1127"
  uid: 5b2aa024-26a4-4f54-8651-1af3f060e13e
type: kubernetes.io/basic-auth
vagrant@bullseye:~$ kubectl get secret -n image-build dockerconfig -o yaml
apiVersion: v1
data:
  .dockerconfigjson: eyJpbnNlY3VyZS1yZWdpc3RyaWVzIjogWyJodHRwOi8vMTcyLjE5LjAuMSJdfQ==
kind: Secret
metadata:
  annotations:
    meta.helm.sh/release-name: builds-builder
    meta.helm.sh/release-namespace: builds-builder
  creationTimestamp: "2023-09-26T09:08:40Z"
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: toolforge-build-service
  name: dockerconfig
  namespace: image-build
  resourceVersion: "1125"
  uid: 98f20edf-49fe-4afb-9d29-83149087ecda
type: kubernetes.io/dockerconfigjson
vagrant@bullseye:~$ kubectl get deployment -n builds-api -o yaml
apiVersion: v1
items:
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "1"
      meta.helm.sh/release-name: builds-api
      meta.helm.sh/release-namespace: builds-api
      secret.reloader.stakater.com/reload: builds-api-certs,builds-api-passwords
    creationTimestamp: "2023-09-26T09:08:42Z"
    generation: 1
    labels:
      app.kubernetes.io/managed-by: Helm
      name: builds-api
    name: builds-api
    namespace: builds-api
    resourceVersion: "1762"
    uid: e385ee97-6cd4-4728-af31-bf9c751a31a1
  spec:
    progressDeadlineSeconds: 600
    replicas: 1
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        name: builds-api
    strategy:
      rollingUpdate:
        maxSurge: 25%
        maxUnavailable: 25%
      type: RollingUpdate
    template:
      metadata:
        creationTimestamp: null
        labels:
          name: builds-api
        name: builds-api
      spec:
        containers:
        - env:
          - name: DEBUG
            value: "true"
          - name: PORT
            value: "8000"
          - name: HARBOR_REPOSITORY
            value: http://172.19.0.1
          - name: HARBOR_USERNAME
            value: admin
          - name: BUILDER
            value: toolsbeta-harbor.wmcloud.org/toolforge/heroku-builder-classic:22
          - name: OK_BUILDS_TO_KEEP
            value: "1"
          - name: FAILED_BUILDS_TO_KEEP
            value: "2"
          envFrom:
          - secretRef:
              name: builds-api-passwords
          image: toolsbeta-harbor.wmcloud.org/toolforge/builds-api:image-0.0.96-20230914095229-353f3d3b
          imagePullPolicy: IfNotPresent
          name: api
          resources:
            limits:
              cpu: 500m
              memory: 100Mi
            requests:
              cpu: 500m
              memory: 100Mi
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            runAsGroup: 101
            runAsUser: 101
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /etc/api/certs
            name: toolforge-builds-api-certs
            readOnly: true
        - image: docker-registry.tools.wmflabs.org/nginx:latest
          imagePullPolicy: Always
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /v1/healthz
              port: 9000
              scheme: HTTP
            initialDelaySeconds: 3
            periodSeconds: 3
            successThreshold: 1
            timeoutSeconds: 1
          name: nginx
          ports:
          - containerPort: 8443
            name: https
            protocol: TCP
          - containerPort: 9000
            name: metrics
            protocol: TCP
          resources: {}
          securityContext:
            allowPrivilegeEscalation: false
            runAsGroup: 101
            runAsUser: 101
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /etc/nginx/api-gateway-ssl
            name: api-gateway-server-cert
            readOnly: true
          - mountPath: /etc/nginx/nginx.conf
            name: nginx-config
            readOnly: true
            subPath: nginx.conf
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: builds-api
        serviceAccountName: builds-api
        terminationGracePeriodSeconds: 30
        volumes:
        - name: toolforge-builds-api-certs
          secret:
            defaultMode: 420
            secretName: builds-api-certs
        - configMap:
            defaultMode: 420
            items:
            - key: nginx.conf
              path: nginx.conf
            name: nginx-config
          name: nginx-config
        - name: api-gateway-server-cert
          secret:
            defaultMode: 420
            secretName: builds-api-api-gateway-server
  status:
    availableReplicas: 1
    conditions:
    - lastTransitionTime: "2023-09-26T09:09:42Z"
      lastUpdateTime: "2023-09-26T09:09:42Z"
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
    - lastTransitionTime: "2023-09-26T09:08:42Z"
      lastUpdateTime: "2023-09-26T09:09:42Z"
      message: ReplicaSet "builds-api-64976bdf4f" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
    observedGeneration: 1
    readyReplicas: 1
    replicas: 1
    updatedReplicas: 1
kind: List
metadata:
  resourceVersion: ""
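
The values under `data` in the two secrets above are base64-encoded, as usual for Kubernetes secrets. A short sketch that decodes them, to make visible what each secret actually carries (the `decode` helper is just for illustration):

```python
import base64
import json


def decode(value: str) -> str:
    """Decode one base64 value from a Secret's `data` map."""
    return base64.b64decode(value).decode("utf-8")


# Values copied from the `kubectl get secret ... -o yaml` output above.
username = decode("cm9ib3QkdGVrdG9u")
dockerconfig = json.loads(
    decode("eyJpbnNlY3VyZS1yZWdpc3RyaWVzIjogWyJodHRwOi8vMTcyLjE5LjAuMSJdfQ==")
)

print(username)
print(dockerconfig)
```

Note that the decoded dockerconfig only lists the insecure registry; it has no `auths` section, so the credentials come exclusively from basic-user-pass.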

I added a copy of the analyze step to the tasks on my local setup, running the bash image with just a very long sleep script, then exec'd into it and was able to strace the analyze command.

I was able to see where it's getting the auth data from:

bash-5.1# apk add curl
...
bash-5.1# apk add strace
...
bash-5.1# curl -L https://github.com/buildpacks/lifecycle/releases/download/v0.16.0/lifecycle-v0.16.0+linux.x86-64.tgz > lifecycle.tgz
bash-5.1# tar xvzf lifecycle.tgz 
...
bash-5.1# strace -f lifecycle/analyzer -run-image=heroku/heroku:22-cnb -layers=/layers  -stack=/layers/stack.toml  -cache-image= -uid=1000    -gid=1000 192.168.1.101/tool-minikube-user/tool-minikube-user:latest
...
[pid   217] newfstatat(AT_FDCWD, "/tekton/home/.docker/config.json", {st_mode=S_IFREG|0600, st_size=157, ...}, 0) = 0
[pid   217] openat(AT_FDCWD, "/tekton/home/.docker/config.json", O_RDONLY|O_CLOEXEC) = 3

...
bash-5.1# cat /tekton/home/.docker/config.json
{"auths":{"http://192.168.1.101":{"username":"robot$tekton","password":"Dummyr0botpass","auth":"cm9ib3QkdGVrdG9uOkR1bW15cjBib3RwYXNz","email":"not@val.id"}}}

Note that at that point it does not actually have the insecure-registries configuration there, just the user-pass (tekton puts it from the basic-user-pass secret).
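
For reference, the `auth` field in that file is just base64 of `username:password`. The sketch below rebuilds an entry with the same shape as the config.json shown above; it is not tekton's actual code, the function name is hypothetical, and the `email` field is left out:

```python
import base64
import json


def docker_auth_entry(registry: str, username: str, password: str) -> dict:
    """Build a docker config.json structure holding one basic-auth entry."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {
        "auths": {
            registry: {
                "username": username,
                "password": password,
                "auth": token,
            }
        }
    }


cfg = docker_auth_entry("http://192.168.1.101", "robot$tekton", "Dummyr0botpass")
print(json.dumps(cfg))
```

The resulting `auth` value matches the one in the file, and, as noted above, nothing in this structure mentions insecure registries.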

I'll do the same on the vagrant VM and see what happens.

Hmm, it also opens that same file, but still fails:

117   openat(AT_FDCWD, "/tekton/home/.docker/config.json", O_RDONLY|O_CLOEXEC <unfinished ...>
117   <... openat resumed>)             = 3

bash-5.1# cat /tekton/home/.docker/config.json
{"auths":{"http://172.19.0.1":{"username":"robot$tekton","password":"Dummyr0botpass","auth":"cm9ib3QkdGVrdG9uOkR1bW15cjBib3RwYXNz","email":"not@val.id"}}}

This is the full unauthorized response. I'm suspicious of the bearer realm there; maybe the hostname is not set correctly on the harbor side?

11:49:34.283373 IP (tos 0x0, ttl 63, id 50919, offset 0, flags [DF], proto TCP (6), length 558)
    172.19.0.1.80 > 10.244.0.21.60024: Flags [P.], cksum 0xb93d (incorrect -> 0xf06d), seq 1:507, ack 95, win 509, options [nop,nop,TS val 607153706 ecr 4126740171], length 506: HTTP, length: 506
        HTTP/1.1 401 Unauthorized
        Server: nginx
        Date: Wed, 27 Sep 2023 11:49:34 GMT
        Content-Type: application/json; charset=utf-8
        Content-Length: 76
        Connection: keep-alive
        Docker-Distribution-Api-Version: registry/2.0
        Set-Cookie: sid=8a94b8e3add87ecac66655f0357abdc6; Path=/; HttpOnly
        Www-Authenticate: Bearer realm="http://localhost/service/token",service="harbor-registry"
        X-Request-Id: 4e37fa57-608a-41ad-8cd4-1905556b5d40

        {"errors":[{"code":"UNAUTHORIZED","message":"unauthorized: unauthorized"}]}

tcpdump from inside the container seems to show that it tries to connect to localhost port 80 and fails 🚫

11:49:34.283376 IP (tos 0x0, ttl 64, id 53296, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.0.21.60024 > 172.19.0.1.80: Flags [.], cksum 0xb743 (incorrect -> 0x3ebb), seq 95, ack 507, win 501, options [nop,nop,TS val 4126740172 ecr 607153706], length 0
11:49:34.284402 IP6 (flowlabel 0xcd77e, hlim 64, next-header TCP (6) payload length: 40) ::1.53474 > ::1.80: Flags [S], cksum 0x0030 (incorrect -> 0xbae9), seq 1954204323, win 65476, options [mss 65476,sackOK,TS val 4109340158 ecr 0,nop,wscale 7], length 0
11:49:34.284406 IP6 (flowlabel 0x28b33, hlim 64, next-header TCP (6) payload length: 20) ::1.80 > ::1.53474: Flags [R.], cksum 0x001c (incorrect -> 0x9f7d), seq 0, ack 1954204324, win 0, length 0
11:49:34.284719 IP (tos 0x0, ttl 64, id 29220, offset 0, flags [DF], proto TCP (6), length 60)
    127.0.0.1.43632 > 127.0.0.1.80: Flags [S], cksum 0xfe30 (incorrect -> 0xf6f8), seq 4095550085, win 65495, options [mss 65495,sackOK,TS val 1710305716 ecr 0,nop,wscale 7], length 0
11:49:34.284721 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    127.0.0.1.80 > 127.0.0.1.43632: Flags [R.], cksum 0xf069 (correct), seq 0, ack 4095550086, win 0, length 0

yep, that might be it:

vagrant@bullseye:~$ grep hostname .toolforge-lima-kilo/harbor/harbor/harbor.yml
hostname: localhost
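
Putting the pieces together: the 401 response tells the client where to fetch a token via the `realm` in the Www-Authenticate header, and that realm appears to follow harbor's `hostname` setting. With `localhost` there, a client running inside a pod ends up connecting to itself, which matches the resets seen in the tcpdump. A minimal sketch of that check (function names are hypothetical, and the parsing is simplified to quoted key="value" pairs):

```python
import ipaddress
import re
from urllib.parse import urlsplit


def parse_bearer_challenge(header: str) -> dict:
    """Parse a `Bearer key="value",...` challenge into a dict."""
    scheme, _, params = header.partition(" ")
    if scheme.lower() != "bearer":
        raise ValueError("not a Bearer challenge")
    return dict(re.findall(r'(\w+)="([^"]*)"', params))


def realm_points_to_loopback(header: str) -> bool:
    """True if the token realm can only be reached from the server's own host."""
    host = urlsplit(parse_bearer_challenge(header)["realm"]).hostname or ""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, so some other hostname.
        return False


# Header value copied from the 401 response above.
challenge = 'Bearer realm="http://localhost/service/token",service="harbor-registry"'

print(parse_bearer_challenge(challenge))
print(realm_points_to_loopback(challenge))
```

A realm that uses the address the pods can actually reach (172.19.0.1 in this setup) would not trip this check.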

we should really reuse the scripts that we already have in the builds-builder repo instead of trying to redo with ansible tasks...

I suspect that the dockerconfig might not be needed

thanks for working on this @dcaro, there really are too many moving parts involved. I was wondering: if we were to switch from using kind in lima-kilo to using minikube, could that solve many of our problems? Taking a glance at the things you discussed here, it seems like the way minikube does things is closer to our production setup than kind.

I'd say not especially, there's not much difference between them; the issues here were more related to ansible and misconfiguration of the builds-* charts than to kind/minikube stuff.

The only thing I have in favor of kind is that it's already working on lima-kilo, and adding minikube would take some time to change stuff. But if you have any specifics for one or the other please share (or if you want to tackle the changes on lima-kilo directly).

dcaro moved this task from In Review to Done on the Toolforge (Toolforge iteration 00) board.