
tegola-vector-tiles doesn't execute new tile pregeneration jobs
Closed, Resolved · Public

Description

Context

tegola-vector-tiles uses k8s cronjobs to execute tile pregeneration tasks in the background. Each job (pod) runs the actual container that does the pregeneration plus an envoy sidecar to proxy connections to the DB.
In the past we had an issue with envoy not exiting, which was fixed by this patch:
https://phabricator.wikimedia.org/T283159#7419042

Current issue

We currently have the same issue: one job is not exiting gracefully because of envoy, and this is blocking the pipeline from spawning new tasks.
It looks like a one-off issue, since the other jobs executed just fine. From kubectl on codfw:

tegola-vector-tiles-main-pregeneration-1636040700-5zblz   0/2     Completed   0          3d22h
tegola-vector-tiles-main-pregeneration-1636040700-8jd24   0/2     Completed   0          3d22h
tegola-vector-tiles-main-pregeneration-1636040700-bchmf   0/2     Completed   0          3d22h
tegola-vector-tiles-main-pregeneration-1636040700-gpbj6   0/2     Completed   0          3d22h
tegola-vector-tiles-main-pregeneration-1636040700-ltc2t   0/2     Completed   0          3d22h
tegola-vector-tiles-main-pregeneration-1636040700-s9pkr   0/2     Completed   0          3d22h
tegola-vector-tiles-main-pregeneration-1636041000-89c9b   0/2     Completed   0          3d22h
tegola-vector-tiles-main-pregeneration-1636041000-8n7tl   0/2     Completed   0          3d22h
tegola-vector-tiles-main-pregeneration-1636041000-j48hl   0/2     Completed   0          3d22h
tegola-vector-tiles-main-pregeneration-1636041000-kjdk8   0/2     Completed   0          3d22h
tegola-vector-tiles-main-pregeneration-1636041000-mqgh4   0/2     Completed   0          3d22h
tegola-vector-tiles-main-pregeneration-1636041000-xrntv   0/2     Completed   0          3d22h
tegola-vector-tiles-main-pregeneration-1636041300-6mnzh   0/2     Completed   0          3d21h
tegola-vector-tiles-main-pregeneration-1636041300-7t47f   0/2     Completed   0          3d21h
tegola-vector-tiles-main-pregeneration-1636041300-jwzbr   0/2     Completed   0          3d21h
tegola-vector-tiles-main-pregeneration-1636041300-kx99v   0/2     Completed   0          3d21h
tegola-vector-tiles-main-pregeneration-1636041300-t8n2m   0/2     Completed   0          3d21h
tegola-vector-tiles-main-pregeneration-1636041300-vsjx6   0/2     Completed   0          3d21h
tegola-vector-tiles-main-pregeneration-1636041600-89mv9   1/2     Error       0          3d21h
tegola-vector-tiles-main-pregeneration-1636041600-9grl6   0/2     Completed   0          3d21h
tegola-vector-tiles-main-pregeneration-1636041600-fb6rg   0/2     Completed   0          3d21h
tegola-vector-tiles-main-pregeneration-1636041600-fdcct   0/2     Completed   0          3d21h
tegola-vector-tiles-main-pregeneration-1636041600-ptwbl   0/2     Completed   0          3d21h
tegola-vector-tiles-main-pregeneration-1636041600-q92j2   0/2     Completed   0          3d21h

From the failing job's logs it looks like it's stuck because envoy is not working:

+ /srv/service/cmd/tegola/tegola --config /etc/tegola/config.toml cache seed tile-list /tmp/tegola-2Yy2FLlYsS/tilelist.txt
2021-11-04 16:00:29 [INFO] root.go:62: Loading config file: /etc/tegola/config.toml
2021-11-04 16:00:29 [INFO] config.go:306: loading local config (/etc/tegola/config.toml)
Error: could not register providers: Failed while creating connection pool: dial tcp [::1]:5432: connect: connection refused
could not register providers: Failed while creating connection pool: dial tcp [::1]:5432: connect: connection refused

exit_envoy
+ exit_envoy
+ echo 'Exit envoy pod'
Exit envoy pod
+ curl -X POST 127.0.0.1:1666/quitquitquit
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to 127.0.0.1 port 1666: Connection refused

This raises the following questions:

  • Can we fix this in the short term? (maybe by deleting the stuck job and waiting for new ones to spawn)
  • How can we make the process more reliable in the long term?

Event Timeline

From looking at the envoy logs from that particular Pod I'd assume that envoy was not up/ready when tegola tried to connect.
You should be able to wait for it to be up by checking for HTTP 200 on 127.0.0.1:9361/healthz
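
A minimal sketch of such a wait loop in the job script, assuming curl is available in the image (it is already used by exit_envoy); note that the port choice is questioned in the reply below, and the merged patch may do this differently:

# Hedged sketch: poll envoy's health endpoint before starting pregeneration.
# The port/path and retry budget here are assumptions, not the merged implementation.
wait_for_envoy() {
  local tries=30
  until curl -sf -o /dev/null http://127.0.0.1:9361/healthz; do
    tries=$((tries - 1))
    if [ "$tries" -le 0 ]; then
      echo 'envoy did not become ready in time' >&2
      return 1
    fi
    sleep 2
  done
}

wait_for_envoy || exit 1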

I'm not sure you are allowed to reach /healthz via that port, which is the public one and is restricted to /stats IIRC. They should probably use port 1666 (the pod-only admin port) instead.

It's configured as health check for that container, though.

Is this something that can be fixed at the k8s level, or should the job script be orchestrated to wait for some time for envoy to be ready and fail in case of a timeout?

Sorry for not being very clear on this. From the kubernetes perspective there is no dependency between containers in one pod, so all we can do is verify from the outside whether they are ready/healthy. Because of that, you would need to put some piece of code in your job script to check for the availability of envoy. You could of course also retry tcp/5432 a couple of times instead of doing the HTTP health check.
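
A rough sketch of the tcp/5432 variant, using bash's /dev/tcp so no extra tooling is needed in the image (the retry count and sleep interval are arbitrary assumptions):

# Hedged sketch: retry a plain TCP connect to envoy's postgres listener before seeding.
i=0
until (exec 3<>/dev/tcp/127.0.0.1/5432) 2>/dev/null; do
  i=$((i + 1))
  if [ "$i" -ge 30 ]; then
    echo 'DB proxy on 127.0.0.1:5432 never became reachable' >&2
    exit 1
  fi
  sleep 2
done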

Yeah, I was wondering if there is some sort of dependency between containers in a pod. Thanks, I will add a check to the script.

Meanwhile, can somebody help me force-exit the hung pod? I don't think I have access. We only allow one job instance to run (to avoid overlaps) and k8s won't let new jobs be spawned.

I think even manually POSTing to the running envoy's /quitquitquit admin endpoint would do the trick.
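
For reference, something along these lines should do it from a host with kubectl access (the namespace and envoy container name below are placeholders, not taken from this task; the Pod name is the errored one from the listing above):

# Hedged sketch: ask the sidecar's admin endpoint to shut envoy down.
kubectl -n <namespace> exec tegola-vector-tiles-main-pregeneration-1636041600-89mv9 \
  -c <envoy-container> -- curl -s -X POST http://127.0.0.1:1666/quitquitquit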

Change 737479 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] tegola-vector-tiles: Wait for DB before pregeneration

https://gerrit.wikimedia.org/r/737479

Change 737481 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/software/tegola@wmf/v0.14.x] tile-pregeneration: Wait for envoy to get ready

https://gerrit.wikimedia.org/r/737481

This particular situation may still re-occur if the check times out before envoy gets ready. In that case, /quitquitquit won't reach the envoy API (as it is not up) and, if envoy comes up at some point, this will leave your job Pod hanging. You could set activeDeadlineSeconds for your Jobs to make them fail if they don't finish in time (see https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup) to work around this.
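
For illustration, the field sits on the Job template of the CronJob; in practice it would be set through the deployment-charts values rather than patched by hand (the namespace and the 3600s value below are assumptions):

# Hedged sketch: cap Job runtime so a hung Pod fails instead of blocking the schedule.
kubectl -n <namespace> patch cronjob tegola-vector-tiles-main-pregeneration \
  --type merge -p '{"spec":{"jobTemplate":{"spec":{"activeDeadlineSeconds":3600}}}}'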

I've deleted the hanging job Pods in codfw and eqiad, hope that helps.

Change 737665 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] tegola-vector-tiles: Configure pregeneration retries

https://gerrit.wikimedia.org/r/737665

Change 737481 merged by jenkins-bot:

[operations/software/tegola@wmf/v0.14.x] tile-pregeneration: Wait for envoy to get ready

https://gerrit.wikimedia.org/r/737481

Unfortunately, deleting the pod didn't do the trick. From the kubernetes events:

2m30s       Warning   FailedNeedsStart   cronjob/tegola-vector-tiles-main-pregeneration   Cannot determine if job needs to be started: too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew

We've had this before and as a workaround we deleted the cronjob resource and recreated it.
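
Roughly, the workaround amounts to the following, with the recreate step being the service's normal chart deployment (the namespace is a placeholder):

# Hedged sketch: drop the stuck CronJob object, then re-deploy the chart to recreate it.
kubectl -n <namespace> delete cronjob tegola-vector-tiles-main-pregeneration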

Change 737665 merged by jenkins-bot:

[operations/deployment-charts@master] tegola-vector-tiles: Configure pregeneration retries

https://gerrit.wikimedia.org/r/737665

JMeybohm claimed this task.

As said on IRC, I've deleted the CronJob objects in all 3 clusters and @Jgiannelos re-deployed them with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/737665/

I'll boldly close this. Feel free to reopen if this situation re-occurs!

Change 737479 abandoned by Jgiannelos:

[operations/deployment-charts@master] tegola-vector-tiles: Wait for DB before pregeneration

Reason:

https://gerrit.wikimedia.org/r/737479