istio-envoy enforcing strict RFC 7231 compliance for 204 status responses
Open, Needs Triage | Public | BUG REPORT

Description

https://gitlab.wikimedia.org/toolforge-repos/wikibugs2/-/jobs/812913

Executing "step_script" stage of the job script 00:01
$ curl --http1.1 --fail --silent --show-error --max-time 30 -X POST -H "Authorization: Bearer $WEB_BEARER_TOKEN" https://wikibugs.toolforge.org/api/config/refresh
curl: (22) The requested URL returned error: 502 Bad Gateway

Event Timeline

From inside the k8s cluster:

tools.wikibugs@tools-bastion-14:~$ webservice php8.4 shell -- curl --http1.1 --fail --show-error --max-time 30 -v -X POST -H "Authorization: Bearer <REDACTED>" http://wikibugs:8000/api/config/refresh
< HTTP/1.1 204
< content-type: application/json
< content-length: 3
< date: Sat, 02 May 2026 19:48:52 GMT
<
* Connection #0 to host wikibugs left intact

From the bastion:

tools.wikibugs@tools-bastion-14:~$ curl --http1.1 --fail --show-error --max-time 30 -v -X POST -H "Authorization: Bearer <REDACTED>" https://wikibugs.toolforge.org/api/config/refresh
* Host wikibugs.toolforge.org:443 was resolved.
* IPv6: 2a02:ec80:a000:1::2bc
* IPv4: 172.16.18.169
*   Trying [2a02:ec80:a000:1::2bc]:443...
* ALPN: curl offers http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384 / X25519MLKEM768 / id-ecPublicKey
* ALPN: server accepted http/1.1
* Server certificate:
*  subject: CN=toolforge.org
*  start date: Apr  9 09:34:14 2026 GMT
*  expire date: Jul  8 09:34:13 2026 GMT
*  subjectAltName: host "wikibugs.toolforge.org" matched cert's "*.toolforge.org"
*  issuer: C=US; O=Let's Encrypt; CN=E8
*  SSL certificate verify ok.
*   Certificate level 0: Public key type EC/prime256v1 (256/128 Bits/secBits), signed using ecdsa-with-SHA384
*   Certificate level 1: Public key type EC/secp384r1 (384/192 Bits/secBits), signed using sha256WithRSAEncryption
*   Certificate level 2: Public key type RSA (4096/152 Bits/secBits), signed using sha256WithRSAEncryption
* Connected to wikibugs.toolforge.org (2a02:ec80:a000:1::2bc) port 443
* using HTTP/1.x
> POST /api/config/refresh HTTP/1.1
> Host: wikibugs.toolforge.org
> User-Agent: curl/8.14.1
> Accept: */*
> Authorization: Bearer <REDACTED>
>
* Request completely sent off
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
< HTTP/1.1 502 Bad Gateway
< content-length: 87
< content-type: text/plain
< date: Sat, 02 May 2026 19:50:49 GMT
< server: istio-envoy
< x-clacks-overhead: GNU Terry Pratchett
< strict-transport-security: max-age=31622400
< content-security-policy-report-only: default-src 'self' 'unsafe-eval' 'unsafe-inline' blob: data: filesystem: mediastream: *.toolforge.org wikibooks.org *.wikibooks.org wikidata.org *.wikidata.org wikimedia.org *.wikimedia.org wikinews.org *.wikinews.org wikipedia.org *.wikipedia.org wikiquote.org *.wikiquote.org wikisource.org *.wikisource.org wikiversity.org *.wikiversity.org wikivoyage.org *.wikivoyage.org wiktionary.org *.wiktionary.org *.wmcloud.org *.wmflabs.org wikimediafoundation.org mediawiki.org *.mediawiki.org wss://wikibugs.toolforge.org; report-uri https://csp-report.toolforge.org/collect;
< permissions-policy: browsing-topics=()
< report-to: {"group": "wm_nel", "max_age": 604800, "endpoints": [{"url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0"}]}
< nel: {"report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}
* The requested URL returned error: 502
<
* closing connection #0
curl: (22) The requested URL returned error: 502

@taavi Does it seem reasonably possible that istio is turning the 204 into a 502? The code and config here haven't changed for a very long time. If I talk directly to the service from inside the k8s cluster, I get the expected 204 response. If I call it from the bastion or from GitLab CI inside WMCS, I get the 502 response.

You can get the <REDACTED> bit via toolforge envvars show WEB_BEARER_TOKEN inside the wikibugs tool if you want to poke things yourself.

Could this be the return {}, 204 in api_config_refresh()? Quart sends that as a 3-byte JSON body, which Envoy may reject since 204 must not have a body per RFC 7231. return '', 204 should fix it.
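
To make the framing issue concrete, here is a minimal sketch (standard library only) that checks a response head against the HTTP/1.1 framing rules for 204. The rules I am relying on are RFC 7230 sections 3.3.1 and 3.3.2, which say a server must not send Transfer-Encoding or Content-Length on a 1xx or 204 response. The function names `parse_head` and `framing_problems` are made up for illustration, and nothing here claims to mirror Envoy's actual check. The sample input is the head captured from inside the cluster in the description.

```python
# Sketch: flag 1xx/204 responses whose headers announce a message body.
# parse_head and framing_problems are hypothetical helper names.

CAPTURED = """\
HTTP/1.1 204
content-type: application/json
content-length: 3
date: Sat, 02 May 2026 19:48:52 GMT
"""


def parse_head(raw: str) -> tuple[int, dict[str, str]]:
    """Split a raw response head into (status code, lower-cased headers)."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    status = int(lines[0].split()[1])
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so values such as dates stay intact.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers


def framing_problems(status: int, headers: dict[str, str]) -> list[str]:
    """Return one message per framing header that a 1xx/204 must not carry."""
    problems = []
    if status == 204 or 100 <= status < 200:
        for name in ("transfer-encoding", "content-length"):
            if name in headers:
                problems.append(f"{name}: {headers[name]} sent with status {status}")
    return problems


if __name__ == "__main__":
    status, headers = parse_head(CAPTURED)
    for problem in framing_problems(status, headers):
        print(problem)
```

Run against the captured head, this reports the content-length header, which is the part a strict proxy would plausibly object to.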

So... yeah. This looks to be the problem. The istio-envoy front proxy is apparently enforcing RFC 7231 conformance by returning a 502 response when the upstream service returns a 204 status with a 3-byte body ({}\n). I couldn't quickly figure out how to get the quart-schema response validation to actually return an empty body, so I changed the response code from 204 to 200, and my tests in the wikibugs-testing tool now pass. This kind of response validation at the proxy feels like a thing that is going to cause lots of random problems. It also makes me wonder whether T425172: Toolforge web egress with Istio/Envoy seems slow is related.

Both this and possibly T425172 look like fallout from T392356, since Envoy seems to enforce what ingress-nginx tolerated. For the empty 204, dropping @validate_response on the handler and returning Response(status=204) directly should bypass quart-schema's serialization, if you ever want to revisit the 200 workaround.

I agree that my webservice response of a 204 status with a 3-byte body is not strictly valid per RFC 7231. I also assert that everything involved is just fine with that except the new ingress service provided by istio-envoy. I believe that this istio-envoy behavior is unnecessary and likely to cause problems with other tools even if the validation is only occurring for 204 responses.
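
For anyone who wants to test how a given proxy or client treats this response without involving wikibugs, here is a minimal stand-in upstream using only the Python standard library. It is a sketch under assumptions: the paths `/with-body` and `/empty` and the names `Upstream`, `start_upstream`, and `raw_post` are invented for illustration, and it only imitates the status, headers, and 3-byte body shown in the description, not Quart itself.

```python
# Sketch: a throwaway upstream that answers POST with either a 204 plus a
# 3-byte body (like the wikibugs response) or a 204 with no body at all.
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Upstream(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_POST(self):
        self.send_response(204)
        if self.path == "/with-body":
            # Same framing as the captured response: content-length: 3.
            body = b"{}\n"
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # No Content-Length and no body.
            self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the console quiet


def start_upstream(port=0):
    """Serve on 127.0.0.1 in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), Upstream)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def raw_post(port, path):
    """POST over a raw socket and return (status line, bytes after the head)."""
    request = f"POST {path} HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n"
    with socket.create_connection(("127.0.0.1", port), timeout=5) as sock:
        sock.sendall(request.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    head, _, body = b"".join(chunks).partition(b"\r\n\r\n")
    return head.split(b"\r\n")[0], body


if __name__ == "__main__":
    server = start_upstream()
    port = server.server_address[1]
    for path in ("/with-body", "/empty"):
        print(path, *raw_post(port, path))
    server.shutdown()
    server.server_close()
```

Pointing a proxy at a fixed port (for example `start_upstream(8000)`) and requesting both paths through it would show whether only the `/with-body` variant is rejected. The sketch itself makes no claim about what any particular proxy will do.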

bd808 renamed this task from Post-merge webhook failing to istio-envoy enforcing strict RFC 7231 compliance for 204 status responses. May 14 2026, 3:58 PM
bd808 added a project: Toolforge.