- Upgrade base image to node 22 (To be done in: T393437: Provide nodejs22 base images for production)
- Make sure that tests are passing with node 22
- Deploy in staging
- Deploy in production
| Status | Assigned | Task |
|---|---|---|
| Resolved | KartikMistry | T394982 Migrate cxserver in production to node22 |
| Resolved | Nikerabbit | T404291 Allow proxy server to accept another valid http header instead of 'HOST' |
Event Timeline
Change #1183213 had a related patch set uploaded (by KartikMistry; author: KartikMistry):
[mediawiki/services/cxserver@master] WIP: Migrate cxserver to nodejs22
Change #1183213 merged by jenkins-bot:
[mediawiki/services/cxserver@master] Migrate cxserver to nodejs22
Change #1183761 had a related patch set uploaded (by KartikMistry; author: KartikMistry):
[operations/deployment-charts@master] cxserver: staging: Update to 2025-09-02-045916-production
Change #1183761 merged by jenkins-bot:
[operations/deployment-charts@master] cxserver: staging: Update to 2025-09-02-045916-production
Looks like we have issues with the staging APIs after deployment:
$ curl -vk https://staging.svc.eqiad.wmnet:4002/v2/page/en/gu/Hello_Kitty
* Uses proxy env variable no_proxy == 'wikipedia.org,wikimedia.org,wikibooks.org,wikinews.org,wikiquote.org,wikisource.org,wikiversity.org,wikivoyage.org,wikidata.org,wikiworkshop.org,wikifunctions.org,wiktionary.org,mediawiki.org,wmfusercontent.org,w.wiki,wikimediacloud.org,wmnet,127.0.0.1,::1'
* Trying 2620:0:861:102:10:64:16:55:4002...
* Trying 10.64.16.55:4002...
* Connected to staging.svc.eqiad.wmnet (10.64.16.55) port 4002 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
* CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
* subject: CN=cxserver-staging-tls-proxy-certs
* start date: Sep 2 10:41:00 2025 GMT
* expire date: Sep 3 10:41:00 2025 GMT
* issuer: C=US; L=San Francisco; O=Wikimedia Foundation, Inc; OU=SRE Foundations; CN=discovery
* SSL certificate verify ok.
> GET /v2/page/en/gu/Hello_Kitty HTTP/1.1
> Host: staging.svc.eqiad.wmnet:4002
> User-Agent: curl/7.74.0
> Accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< access-control-allow-origin: *
< access-control-allow-headers: accept, authorization, x-requested-with, content-type, x-wikimedia-debug
< access-control-expose-headers: etag
< x-xss-protection: 1; mode=block
< x-content-type-options: nosniff
< x-frame-options: SAMEORIGIN
< content-security-policy: default-src 'self'; object-src 'none'; media-src *; img-src *; style-src *; frame-ancestors 'self'
< x-content-security-policy: default-src 'self'; object-src 'none'; media-src *; img-src *; style-src *; frame-ancestors 'self'
< x-webkit-csp: default-src 'self'; object-src 'none'; media-src *; img-src *; style-src *; frame-ancestors 'self'
< date: Tue, 02 Sep 2025 11:50:32 GMT
< x-envoy-upstream-service-time: 2
< server: staging-tls
< transfer-encoding: chunked
<
* Connection #0 to host staging.svc.eqiad.wmnet left intact
404: MW API error from URL: http://localhost:6500/w/api.php?format=json&action=query&meta=siteinfo&siprop=general%7Cnamespaces%7Cnamespacealiases%7Cspecialpagealiases&formatversion=2: api_error
$ curl -vk https://staging.svc.eqiad.wmnet:4002/v2/suggest/sections/Gujarat/en/gu
* Uses proxy env variable no_proxy == 'wikipedia.org,wikimedia.org,wikibooks.org,wikinews.org,wikiquote.org,wikisource.org,wikiversity.org,wikivoyage.org,wikidata.org,wikiworkshop.org,wikifunctions.org,wiktionary.org,mediawiki.org,wmfusercontent.org,w.wiki,wikimediacloud.org,wmnet,127.0.0.1,::1'
* Trying 2620:0:861:102:10:64:16:55:4002...
* Trying 10.64.16.55:4002...
* Connected to staging.svc.eqiad.wmnet (10.64.16.55) port 4002 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
* CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
* subject: CN=cxserver-staging-tls-proxy-certs
* start date: Sep 2 10:41:00 2025 GMT
* expire date: Sep 3 10:41:00 2025 GMT
* issuer: C=US; L=San Francisco; O=Wikimedia Foundation, Inc; OU=SRE Foundations; CN=discovery
* SSL certificate verify ok.
> GET /v2/suggest/sections/Gujarat/en/gu HTTP/1.1
> Host: staging.svc.eqiad.wmnet:4002
> User-Agent: curl/7.74.0
> Accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Mark bundle as not supporting multiuse
< HTTP/1.1 504 Gateway Timeout
< content-length: 24
< content-type: text/plain
< date: Tue, 02 Sep 2025 11:55:21 GMT
< server: staging-tls
<
* Connection #0 to host staging.svc.eqiad.wmnet left intact
upstream request timeout
Full error for page API on staging:
{"@timestamp":"2025-09-02T11:52:21.666Z","ecs.version":"8.10.0","error":{"detail":"MW API error from URL: http://localhost:6500/w/api.php?format=json&action=query&meta=siteinfo&siprop=general%7Cnamespaces%7Cnamespacealiases%7Cspecialpagealiases&formatversion=2","message":"404: MW API error from URL: http://localhost:6500/w/api.php?format=json&action=query&meta=siteinfo&siprop=general%7Cnamespaces%7Cnamespacealiases%7Cspecialpagealiases&formatversion=2: api_error","name":"HTTPError","status":404,"type":"api_error"},"http":{"request":{"id":"3f46b84c-354d-4d0f-b040-df0b91ca8a4f","method":"GET"}},"log.level":"error","message":"Unhandled Promise Rejection","service":"cxserver","stack":"HTTPError: 404: MW API error from URL: http://localhost:6500/w/api.php?format=json&action=query&meta=siteinfo&siprop=general%7Cnamespaces%7Cnamespacealiases%7Cspecialpagealiases&formatversion=2: api_error\n at MWApiRequest.mwGet (file:///srv/service/lib/mw/MwApiRequest.js:118:10)\n at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n at async MWPageLoader.fetch (file:///srv/service/lib/mw/MWPageLoader.js:86:20)","url":{"path":"/v2/suggest/sections/Gujarat/en/gu"}}
Node 22 stabilizes the fetch API. It is now feature-compatible with the browser fetch API. This is generally good, but it also adds more restrictions on what a valid HTTP request can be. The header field we set to pass the Wikipedia domain to the wiki proxy is HOST (see the configuration). This is problematic because HOST is a forbidden header.
- https://developer.mozilla.org/en-US/docs/Glossary/Forbidden_request_header
- https://fetch.spec.whatwg.org/#forbidden-request-header
So Node's fetch API won't accept the HOST header. The wiki proxy will receive the request without the HOST header we set and will end up returning a 404 response.
How did I debug?
I asked Claude to write a simple wiki proxy that listens on port 6500, accepts URLs like http://localhost:6500/w/api.php?params and replaces the domain with the value of the 'HOST' header from the request.
Then I pointed cxserver to use that proxy. I observed that the HOST header is always localhost:6500, obeying the fetch spec. Whatever HOST we set in cxserver is not accepted. Then I changed the cxserver header to WHOST for testing. This header arrived correctly at the proxy and everything worked as expected.
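For readers who want to repeat this check, here is a minimal sketch of such a debugging proxy. It is not the script that was actually used in this task: the function names `buildUpstreamUrl` and `createDebugProxy` are hypothetical, and the proxy only reports what it would forward instead of contacting a wiki. Only port 6500, the `/w/api.php` path and the HOST/WHOST header names come from the comment above.

```javascript
const http = require('node:http');

// Rebuild the upstream wiki URL from the request path and the wiki domain
// that arrived in a header. (Hypothetical helper, for illustration only.)
function buildUpstreamUrl(path, wikiDomain) {
  return new URL(path, `https://${wikiDomain}`).toString();
}

// Create a server that reports which value arrived in the given header and
// which URL a real proxy would forward to. It never contacts the wiki.
function createDebugProxy(headerName = 'host') {
  return http.createServer((req, res) => {
    // Node lower-cases incoming header names.
    const received = req.headers[headerName.toLowerCase()];
    res.setHeader('content-type', 'application/json');
    if (!received) {
      res.statusCode = 400;
      res.end(JSON.stringify({ error: `missing ${headerName} header` }));
      return;
    }
    const upstream = buildUpstreamUrl(req.url, received);
    console.log(`${headerName}=${received} -> would forward to ${upstream}`);
    res.end(JSON.stringify({ received, upstream }));
  });
}
```

Running `createDebugProxy('host').listen(6500)` and pointing the client at http://localhost:6500 shows which Host value actually arrives. Switching to `createDebugProxy('whost')` corresponds to the WHOST experiment described above.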
What should be the fix?
We need to report this issue, fix it in the code of the proxy server to accept another valid http header instead of 'HOST'. then cxserver can pass it.
Change #1190253 had a related patch set uploaded (by Santhosh; author: Santhosh):
[mediawiki/services/cxserver@master] Use undic request when HOST header needed
It's arguably more complicated. Node.js implements the fetch() API by bundling in undici, mostly for the convenience of developers. Undici is very specific that it does not implement forbidden headers. Quoting from undici:
The Fetch Standard requires implementations to exclude certain headers from requests and responses. In browser environments, some headers are forbidden so the user agent remains in full control over them. In Undici, these constraints are removed to give more control to the user.
Which effectively means that we should NOT be seeing any such issues.
However, the Host header appears to be an exception to the above: it is not honored when passed by the client. But this IS NOT because of the forbidden headers of the fetch API. The offending commit is https://github.com/nodejs/undici/commit/470ee38145c5e6b367874b8b67f45143b67557c0, which came in through the PR https://github.com/nodejs/undici/pull/2322 to fix issue 2318. The author is explicit at https://github.com/nodejs/undici/issues/2318#issuecomment-1753483582 that this has nothing to do with Fetch API forbidden headers, but rather with RFC 9110, the latest RFC to define HTTP semantics.
As a personal note, I have to say that I find this specific interpretation of that section a bit aggressive.
We need to report this issue
It has been reported already. In the PR linked above, the first report is at https://github.com/nodejs/undici/pull/2322#issuecomment-1774692141, with a couple more comments offering points of view and feedback. A few days later, a new issue was filed at https://github.com/nodejs/node/issues/50305, and the same points of view were reiterated. Matteo Collina of the Node.js Technical Steering Committee is pretty explicit in saying that for such use cases it's better not to rely on fetch() but rather to use undici.request(). I quote as well, for posterity's sake:
Or are we back to using third-party HTTP clients for this? TL;DR, you'll have a much better production experience anyway. fetch() is significantly slower than undici.request() and almost on par with http.request(). Your use case seem specific enough that a custom client for Node.js is recommended.
This answer has been, perhaps too liberally, generalized a couple of comments below, at https://github.com/nodejs/node/issues/50305#issuecomment-1804657902. I quote again:
Following nodejs/undici#2369 (comment), using fetch is not advocated for backend development. Undici.request is the advised approach.
This DOES NOT come from a member of the Node.js team, just from one reporter of the issue. However, I should also note that it has not been refuted in close to 2 years. Not only that, but subsequent posts by others justify the choice.
fix it in the code of the proxy server to accept another valid http header instead of 'HOST'. then cxserver can pass it.
I think that's the wrong solution. The proper one is to not rely, in a complex backend environment, on an API that is clearly designed for browsers, unless the use case is very similar or, at least, very simple. This case clearly fits neither of those 2 escape hatches. I would suggest importing undici and relying instead on undici.request, as per Matteo's suggestion. It should be faster and flexible enough to allow this.
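To make the suggestion concrete, here is a minimal sketch of what passing the wiki domain through undici.request could look like. It is not the code from the cxserver patch: `buildProxyRequest` and `mwGet` are hypothetical names, and the `request` function is injected as a parameter so the sketch has no hard dependency (in practice it would be `require('undici').request`, assuming the undici package is installed). The proxy URL and the `/w/api.php` path are taken from the logs above.

```javascript
// Build the URL and options for a MediaWiki API call that goes through the
// local wiki proxy, with the target wiki passed in the Host header.
// (Hypothetical helper, for illustration only.)
function buildProxyRequest(proxyBase, wikiDomain, params) {
  const url = new URL('/w/api.php', proxyBase);
  url.search = new URLSearchParams(params).toString();
  return {
    url: url.toString(),
    options: { method: 'GET', headers: { host: wikiDomain } },
  };
}

// Perform the call with an injected undici-style request function, which
// resolves to an object with `statusCode` and a readable `body`.
async function mwGet(request, proxyBase, wikiDomain, params) {
  const { url, options } = buildProxyRequest(proxyBase, wikiDomain, params);
  const { statusCode, body } = await request(url, options);
  if (statusCode !== 200) {
    await body.text(); // drain the body so the connection can be reused
    throw new Error(`${statusCode}: MW API error from URL: ${url}`);
  }
  return body.json();
}
```

A call would then look like `mwGet(require('undici').request, 'http://localhost:6500', 'en.wikipedia.org', { format: 'json', action: 'query', meta: 'siteinfo' })`.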
Change #1190253 merged by jenkins-bot:
[mediawiki/services/cxserver@master] Use undici request instead of node fetch
Change #1191231 had a related patch set uploaded (by KartikMistry; author: KartikMistry):
[operations/deployment-charts@master] cxserver: staging: Update to 2025-09-25-051716-production
Change #1191231 merged by jenkins-bot:
[operations/deployment-charts@master] cxserver: staging: Update to 2025-09-25-051716-production
Mentioned in SAL (#wikimedia-operations) [2025-09-25T05:30:54Z] <kart_> staging: Updated cxserver to 2025-09-25-051716-production (T394982)
Change #1191249 had a related patch set uploaded (by KartikMistry; author: KartikMistry):
[operations/deployment-charts@master] cxserver: staging: Update to 2025-09-25-074241-production
Change #1191249 merged by jenkins-bot:
[operations/deployment-charts@master] cxserver: staging: Update to 2025-09-25-074241-production
Mentioned in SAL (#wikimedia-operations) [2025-09-25T07:58:02Z] <kart_> staging: Updated cxserver to 2025-09-25-074241-production (T394982)
Change #1191364 had a related patch set uploaded (by KartikMistry; author: KartikMistry):
[operations/deployment-charts@master] Update cxserver to 2025-09-25-074241-production
Change #1191364 merged by jenkins-bot:
[operations/deployment-charts@master] Update cxserver to 2025-09-25-074241-production
During testing, I found that the page API is broken.
For example, https://cxserver.wmflabs.org/v2/page/en/gu/Tokyo fails with an Internal Server Error.
Staging logs look like this:
RangeError: Maximum call stack size exceeded
at Function.assign (<anonymous>)
at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:50:36)
at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
I cannot reproduce this locally using the master branch of cxserver.
In the logs I would expect to see something like the line below. Because we don't, it must fail while outputting this line or before it:
{"@timestamp":"2025-10-02T12:56:26.511Z","ecs.version":"8.10.0","http":{"request":{"id":"ea22f5a8-2001-4f62-a372-73dd95389edd","method":"GET"}},"log.level":"debug","message":"Getting page en:Albert Einstein for fr","service":"cxserver-dev","url":{"path":"/v2/page/en/fr/Albert%20Einstein"}}
After npm update, I see that the winston module is updated, and cxserver on master seems to be working fine.
Change #1193821 had a related patch set uploaded (by KartikMistry; author: KartikMistry):
[operations/deployment-charts@master] cxserver: staging: Update to 2025-10-06-084053-production
Change #1193821 merged by jenkins-bot:
[operations/deployment-charts@master] cxserver: staging: Update to 2025-10-06-084053-production
Change #1193989 had a related patch set uploaded (by KartikMistry; author: KartikMistry):
[operations/deployment-charts@master] cxserver: Update to 2025-10-06-084053-production
Change #1193989 merged by jenkins-bot:
[operations/deployment-charts@master] cxserver: Update to 2025-10-06-084053-production
Mentioned in SAL (#wikimedia-operations) [2025-10-07T06:44:04Z] <kart_> Updated cxserver to 2025-10-06-084053-production (T394982, T403574)