Replace ingress-nginx before upstream EOL date
Open, In Progress, High, Public

Description

The ingress-nginx project that we heavily rely on for handling Toolforge traffic is now in maintenance mode: https://github.com/kubernetes/ingress-nginx/issues/13002

We have several options:

  1. Do nothing.
  2. Migrate to some other ingress provider.
  3. Once it's ready, migrate to InGate, a work-in-progress Ingress/Gateway API controller by the same people who maintain ingress-nginx.
  4. Migrate to some other Gateway API implementation.

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
kubernetes: Remove ingress-nginx specific alerts | repos/cloud/toolforge/alerts!61 | taavi | main-I103ce32659250d1c56782534d4ca4de573821b09 | main
Undeploy ingress-nginx | repos/cloud/toolforge/toolforge-deploy!1222 | taavi | main-I92a669407df35312cf6a62bccacbd23a5fc84b0f | main
kubernetes: Stop manipulating Ingress objects | repos/cloud/toolforge/webservice-cli!100 | taavi | main-Ieebd5be6d70c7f3ede4674e5f565d9c88fa865cc | main
istio-gateway: Expose metrics endpoint | repos/cloud/toolforge/toolforge-deploy!1180 | taavi | main-I239dad25031471ccbb3be1fd2e3b98f13121245f | main

Related Objects

Event Timeline

https://www.kubernetes.dev/blog/2025/11/12/ingress-nginx-retirement/

To prioritize the safety and security of the ecosystem, Kubernetes SIG Network and the Security Response Committee are announcing the upcoming retirement of Ingress NGINX. Best-effort maintenance will continue until March 2026. Afterward, there will be no further releases, no bugfixes, and no updates to resolve any security vulnerabilities that may be discovered.

taavi renamed this task from "toolforge: Investigate ingress-nginx replacements" to "Replace ingress-nginx before upstream EOL date". Jan 15 2026, 12:41 PM
taavi raised the priority of this task from Medium to High.

It looks like InGate is not an option:

https://github.com/kubernetes-sigs/ingate/commit/7c0f10563c6ef12e97172b8ca40e1a2006f73a35

InGate is being retired (early 2026).
SIG Network and the Security Response Committee recommend that all users begin migration to Gateway API or another Ingress controller immediately. Many options are listed in the Kubernetes documentation: Gateway API and Ingress. Additional options may be available from vendors you work with.

dcaro changed the task status from Open to In Progress. Mar 3 2026, 3:25 PM
dcaro moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 25) board.

Mentioned in SAL (#wikimedia-cloud) [2026-03-18T12:00:34Z] <taavi> restarting existing web services to backfill HTTPRoute resources T392356

Change #1258948 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge: k8s: haproxy: Add option for sending traffic to Istio

https://gerrit.wikimedia.org/r/1258948

Change #1258948 merged by Majavah:

[operations/puppet@production] P:toolforge: k8s: haproxy: Add option for sending traffic to Istio

https://gerrit.wikimedia.org/r/1258948

Change #1258980 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge: k8s: haproxy: Use HTTP/1.1 for health checks

https://gerrit.wikimedia.org/r/1258980

Change #1258980 merged by Majavah:

[operations/puppet@production] P:toolforge: k8s: haproxy: Use HTTP/1.1 for health checks

https://gerrit.wikimedia.org/r/1258980

Change #1259000 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge: k8s: haproxy: Fix istio-gateway health checks

https://gerrit.wikimedia.org/r/1259000

Change #1259000 merged by Majavah:

[operations/puppet@production] P:toolforge: k8s: haproxy: Fix istio-gateway health checks

https://gerrit.wikimedia.org/r/1259000

Mentioned in SAL (#wikimedia-cloud) [2026-03-23T10:53:31Z] <taavi> send 5% of traffic to istio T392356

Mentioned in SAL (#wikimedia-cloud) [2026-03-23T11:16:07Z] <taavi> send 10% of traffic to istio T392356

Mentioned in SAL (#wikimedia-cloud) [2026-04-13T07:33:03Z] <taavi> bump istio traffic percentage 10% -> 25% T392356

Mentioned in SAL (#wikimedia-cloud) [2026-04-13T09:11:09Z] <taavi> bump istio traffic percentage 25% -> 50% T392356

Mentioned in SAL (#wikimedia-cloud) [2026-04-15T10:45:34Z] <taavi> bump istio traffic percentage 50% -> 75% T392356

Mentioned in SAL (#wikimedia-cloud) [2026-04-16T13:09:27Z] <taavi> bump istio traffic percentage 75% -> 100% T392356
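
The log entries above ramp the Istio share through 5%, 10%, 25%, 50%, 75% and 100%. They do not say how the split is implemented in HAProxy, so the following is only a generic sketch of a percentage-based split, not the actual Toolforge configuration. It assumes requests are bucketed by a stable hash of some key (here a hostname); the function names and the choice of key are hypothetical.

```python
import hashlib

# Percentages taken from the SAL entries above.
RAMP_STEPS = [5, 10, 25, 50, 75, 100]


def bucket(key: str) -> int:
    """Map a request key to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def pick_backend(key: str, istio_percent: int) -> str:
    """Return "istio" for roughly istio_percent% of keys, else "ingress-nginx".

    Assumption: hash-based bucketing. A real proxy may instead use
    per-connection weights, which would not be sticky per key.
    """
    if not 0 <= istio_percent <= 100:
        raise ValueError("istio_percent must be between 0 and 100")
    return "istio" if bucket(key) < istio_percent else "ingress-nginx"
```

With this kind of bucketing, any key routed to Istio at one step stays on Istio at every later step, which keeps a gradual rollout from flapping individual clients between backends.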

Change #1272714 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::k8s::haproxy: Remove old ingress nodes

https://gerrit.wikimedia.org/r/1272714

Change #1272714 merged by Majavah:

[operations/puppet@production] P:toolforge::k8s::haproxy: Remove old ingress nodes

https://gerrit.wikimedia.org/r/1272714

This unannounced breaking change has caught at least one tool unaware (T423652: uploadmap reverse proxy broken because of platform changes), and likely affects a few more Ingress objects that have no matching HTTPRoute objects.

I checked cloud-announce and was unable to find any message mentioning HTTPRoute.
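
One way to look for other affected tools is to compare the hosts served by Ingress objects with those served by HTTPRoute objects. The sketch below assumes two JSON list dumps in the shape produced by `kubectl get ... -o json` (an `items` array), and compares only `spec.rules[].host` against `spec.hostnames[]` within the same namespace. Path-level matching and Ingress rules without a host are ignored, so treat the result as a starting point rather than a complete audit. The function names are hypothetical.

```python
def ingress_hosts(ingress_list: dict) -> set:
    """Collect (namespace, host) pairs from a list of Ingress objects."""
    pairs = set()
    for item in ingress_list.get("items", []):
        namespace = item.get("metadata", {}).get("namespace", "")
        for rule in item.get("spec", {}).get("rules") or []:
            host = rule.get("host")
            if host:  # rules without a host are skipped in this sketch
                pairs.add((namespace, host))
    return pairs


def httproute_hosts(route_list: dict) -> set:
    """Collect (namespace, hostname) pairs from a list of HTTPRoute objects."""
    pairs = set()
    for item in route_list.get("items", []):
        namespace = item.get("metadata", {}).get("namespace", "")
        for host in item.get("spec", {}).get("hostnames") or []:
            pairs.add((namespace, host))
    return pairs


def unmatched_ingress_hosts(ingress_list: dict, route_list: dict) -> list:
    """Return sorted (namespace, host) pairs that have an Ingress but no HTTPRoute."""
    return sorted(ingress_hosts(ingress_list) - httproute_hosts(route_list))
```

Both inputs can be loaded with `json.load` from saved dumps and passed straight to `unmatched_ingress_hosts`.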

group_203_bot_3c0afd0d9fd9529f3b7bc7e69a4a3bce opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1224

maintain-kubeusers: bump to 0.0.194-20260421133553-217c468a

Mentioned in SAL (#wikimedia-cloud) [2026-04-21T14:06:10Z] <taavi> save backup of all ingress objects to ~taavi/ingresses-backup-2026-04-21.json T392356

Mentioned in SAL (#wikimedia-cloud) [2026-04-21T14:10:20Z] <taavi> delete all ingress objects from toolsbeta T392356

Mentioned in SAL (#wikimedia-cloud) [2026-04-23T08:08:18Z] <taavi> delete all ingress objects T392356

Change #1276596 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::prometheus: Stop monitoring ingress-nginx

https://gerrit.wikimedia.org/r/1276596

Change #1276596 merged by Majavah:

[operations/puppet@production] P:toolforge::prometheus: Stop monitoring ingress-nginx

https://gerrit.wikimedia.org/r/1276596

Mentioned in SAL (#wikimedia-cloud) [2026-04-23T09:12:52Z] <taavi> uninstall ingress-nginx-gen2 from the cluster T392356

Change #1276631 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/wmcs-cookbooks@main] toolforge: Remove support for ingress workers

https://gerrit.wikimedia.org/r/1276631

Change #1276631 merged by jenkins-bot:

[cloud/wmcs-cookbooks@main] toolforge: Remove support for ingress workers

https://gerrit.wikimedia.org/r/1276631

There was an issue in T424029 with a Python webservice (lingua-libre) where OAuth callbacks were failing with "upstream connect error or disconnect/reset before headers" after the Istio migration. It turned out that the X-Envoy-Peer-Metadata header injected by the sidecar was exceeding uWSGI's default buffer size. Adding buffer-size = 65535 to uwsgi.ini seems to resolve it. Leaving this here in case others hit the same thing.
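
To see why one large header can push a request over the limit, here is a rough size estimate in Python. It assumes uWSGI's default buffer-size of 4096 bytes and the uwsgi packet layout, in which each variable is stored as a 2-byte length plus key followed by a 2-byte length plus value. Only request headers are counted (CGI variables such as REQUEST_METHOD or PATH_INFO are left out, and all headers are given the HTTP_ prefix), so the figure is a lower bound. The function names are hypothetical.

```python
UWSGI_DEFAULT_BUFFER_SIZE = 4096  # uWSGI default for the buffer-size option
UWSGI_MAX_BUFFER_SIZE = 65535     # largest size a 16-bit packet length can express


def cgi_name(header: str) -> str:
    """Convert an HTTP header name to its CGI-style variable name."""
    return "HTTP_" + header.upper().replace("-", "_")


def estimated_vars_size(headers: dict) -> int:
    """Estimate the bytes the given headers occupy in a uwsgi vars packet."""
    total = 0
    for name, value in headers.items():
        key_bytes = cgi_name(name).encode("latin-1")
        value_bytes = value.encode("latin-1")
        # 2-byte length prefix for the key, 2-byte length prefix for the value
        total += 2 + len(key_bytes) + 2 + len(value_bytes)
    return total


def fits_buffer(headers: dict, buffer_size: int = UWSGI_DEFAULT_BUFFER_SIZE) -> bool:
    """Return True if the estimated packet fits in the given buffer size."""
    return estimated_vars_size(headers) <= buffer_size
```

For example, a request carrying a hypothetical 5000-byte X-Envoy-Peer-Metadata value fails `fits_buffer` at the default size but passes with `buffer_size=65535`, which is consistent with the workaround described above.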

Thanks. I don't think that header is very useful for most tools, so I'll see if we can disable it to avoid breaking more things.

Just to make sure: this issue only affects Toolforge, i.e., Cloud Services, right? Is ingress-nginx also used on the production Kubernetes cluster that serves production traffic?

@Arendpieter / FYI :