
Migrate cxserver in production to node22
Closed, Resolved · Public · 8 Estimated Story Points

Description

Event Timeline

KartikMistry changed the task status from Open to In Progress.Aug 30 2025, 6:24 AM
KartikMistry claimed this task.
KartikMistry triaged this task as Medium priority.
KartikMistry updated the task description. (Show Details)
KartikMistry moved this task from Backlog to In Progress on the LPL Essential (2025 Jul-Oct) board.

Change #1183213 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[mediawiki/services/cxserver@master] WIP: Migrate cxserver to nodejs22

https://gerrit.wikimedia.org/r/1183213

Nikerabbit set the point value for this task to 4.

Change #1183213 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Migrate cxserver to nodejs22

https://gerrit.wikimedia.org/r/1183213

Change #1183761 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] cxserver: staging: Update to 2025-09-02-045916-production

https://gerrit.wikimedia.org/r/1183761

Change #1183761 merged by jenkins-bot:

[operations/deployment-charts@master] cxserver: staging: Update to 2025-09-02-045916-production

https://gerrit.wikimedia.org/r/1183761

Looks like we have issues with the staging APIs after deployment:

$ curl -vk https://staging.svc.eqiad.wmnet:4002/v2/page/en/gu/Hello_Kitty
* Uses proxy env variable no_proxy == 'wikipedia.org,wikimedia.org,wikibooks.org,wikinews.org,wikiquote.org,wikisource.org,wikiversity.org,wikivoyage.org,wikidata.org,wikiworkshop.org,wikifunctions.org,wiktionary.org,mediawiki.org,wmfusercontent.org,w.wiki,wikimediacloud.org,wmnet,127.0.0.1,::1'
*   Trying 2620:0:861:102:10:64:16:55:4002...
*   Trying 10.64.16.55:4002...
* Connected to staging.svc.eqiad.wmnet (10.64.16.55) port 4002 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: CN=cxserver-staging-tls-proxy-certs
*  start date: Sep  2 10:41:00 2025 GMT
*  expire date: Sep  3 10:41:00 2025 GMT
*  issuer: C=US; L=San Francisco; O=Wikimedia Foundation, Inc; OU=SRE Foundations; CN=discovery
*  SSL certificate verify ok.
> GET /v2/page/en/gu/Hello_Kitty HTTP/1.1
> Host: staging.svc.eqiad.wmnet:4002
> User-Agent: curl/7.74.0
> Accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< access-control-allow-origin: *
< access-control-allow-headers: accept, authorization, x-requested-with, content-type, x-wikimedia-debug
< access-control-expose-headers: etag
< x-xss-protection: 1; mode=block
< x-content-type-options: nosniff
< x-frame-options: SAMEORIGIN
< content-security-policy: default-src 'self'; object-src 'none'; media-src *; img-src *; style-src *; frame-ancestors 'self'
< x-content-security-policy: default-src 'self'; object-src 'none'; media-src *; img-src *; style-src *; frame-ancestors 'self'
< x-webkit-csp: default-src 'self'; object-src 'none'; media-src *; img-src *; style-src *; frame-ancestors 'self'
< date: Tue, 02 Sep 2025 11:50:32 GMT
< x-envoy-upstream-service-time: 2
< server: staging-tls
< transfer-encoding: chunked
< 
* Connection #0 to host staging.svc.eqiad.wmnet left intact
404: MW API error from URL: http://localhost:6500/w/api.php?format=json&action=query&meta=siteinfo&siprop=general%7Cnamespaces%7Cnamespacealiases%7Cspecialpagealiases&formatversion=2: api_error
$ curl -vk https://staging.svc.eqiad.wmnet:4002/v2/suggest/sections/Gujarat/en/gu
* Uses proxy env variable no_proxy == 'wikipedia.org,wikimedia.org,wikibooks.org,wikinews.org,wikiquote.org,wikisource.org,wikiversity.org,wikivoyage.org,wikidata.org,wikiworkshop.org,wikifunctions.org,wiktionary.org,mediawiki.org,wmfusercontent.org,w.wiki,wikimediacloud.org,wmnet,127.0.0.1,::1'
*   Trying 2620:0:861:102:10:64:16:55:4002...
*   Trying 10.64.16.55:4002...
* Connected to staging.svc.eqiad.wmnet (10.64.16.55) port 4002 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: CN=cxserver-staging-tls-proxy-certs
*  start date: Sep  2 10:41:00 2025 GMT
*  expire date: Sep  3 10:41:00 2025 GMT
*  issuer: C=US; L=San Francisco; O=Wikimedia Foundation, Inc; OU=SRE Foundations; CN=discovery
*  SSL certificate verify ok.
> GET /v2/suggest/sections/Gujarat/en/gu HTTP/1.1
> Host: staging.svc.eqiad.wmnet:4002
> User-Agent: curl/7.74.0
> Accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Mark bundle as not supporting multiuse
< HTTP/1.1 504 Gateway Timeout
< content-length: 24
< content-type: text/plain
< date: Tue, 02 Sep 2025 11:55:21 GMT
< server: staging-tls
< 
* Connection #0 to host staging.svc.eqiad.wmnet left intact
upstream request timeout

Full error for page API on staging:

{"@timestamp":"2025-09-02T11:52:21.666Z","ecs.version":"8.10.0","error":{"detail":"MW API error from URL: http://localhost:6500/w/api.php?format=json&action=query&meta=siteinfo&siprop=general%7Cnamespaces%7Cnamespacealiases%7Cspecialpagealiases&formatversion=2","message":"404: MW API error from URL: http://localhost:6500/w/api.php?format=json&action=query&meta=siteinfo&siprop=general%7Cnamespaces%7Cnamespacealiases%7Cspecialpagealiases&formatversion=2: api_error","name":"HTTPError","status":404,"type":"api_error"},"http":{"request":{"id":"3f46b84c-354d-4d0f-b040-df0b91ca8a4f","method":"GET"}},"log.level":"error","message":"Unhandled Promise Rejection","service":"cxserver","stack":"HTTPError: 404: MW API error from URL: http://localhost:6500/w/api.php?format=json&action=query&meta=siteinfo&siprop=general%7Cnamespaces%7Cnamespacealiases%7Cspecialpagealiases&formatversion=2: api_error\n    at MWApiRequest.mwGet (file:///srv/service/lib/mw/MwApiRequest.js:118:10)\n    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n    at async MWPageLoader.fetch (file:///srv/service/lib/mw/MWPageLoader.js:86:20)","url":{"path":"/v2/suggest/sections/Gujarat/en/gu"}}

Node 22 stabilizes the fetch API. It is now feature-compatible with the browser fetch API. This is generally good, but it also adds more restrictions on what a valid HTTP request can be. The header field we are setting to pass the Wikipedia domain to the wiki proxy is Host (see the configuration). This is problematic because Host is a forbidden header:

  1. https://developer.mozilla.org/en-US/docs/Glossary/Forbidden_request_header
  2. https://fetch.spec.whatwg.org/#forbidden-request-header

So, Node's fetch API won't accept the Host header. The wiki proxy will receive the request without the Host header and will end up returning a 404 response.

How did I debug?

I asked Claude to write a simple wiki proxy that listens on port 6500, accepts URLs like http://localhost:6500/w/api.php?params, and replaces the domain with the value of the Host header from the request.
Then I pointed cxserver at that proxy. I observed that the Host header is always localhost:6500, obeying the fetch spec: whatever Host we set in cxserver is not accepted. Then I changed the cxserver header to WHOST for testing. This header arrived correctly at the proxy, and everything worked as expected.

What should be the fix?

We need to report this issue, and fix it in the code of the proxy server to accept another valid HTTP header instead of 'HOST'; then cxserver can pass it.

Nikerabbit changed the task status from In Progress to Stalled.Sep 15 2025, 8:15 AM

Change #1190253 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] Use undici request when HOST header needed

https://gerrit.wikimedia.org/r/1190253

Node 22 stabilizes the fetch API. It is now feature-compatible with the browser fetch API. This is generally good, but it also adds more restrictions on what a valid HTTP request can be. The header field we are setting to pass the Wikipedia domain to the wiki proxy is Host (see the configuration). This is problematic because Host is a forbidden header:

  1. https://developer.mozilla.org/en-US/docs/Glossary/Forbidden_request_header
  2. https://fetch.spec.whatwg.org/#forbidden-request-header

So, Node's fetch API won't accept the Host header. The wiki proxy will receive the request without the Host header and will end up returning a 404 response.

It's arguably more complicated. Node.js implements the fetch() API by bundling undici, mostly for developer convenience. Undici is very specific that it does not implement forbidden headers. Quoting from its documentation:

The Fetch Standard requires implementations to exclude certain headers from requests and responses. In browser environments, some headers are forbidden so the user agent remains in full control over them. In Undici, these constraints are removed to give more control to the user.

Which effectively means that we should NOT be seeing any such issues.

However, the Host header appears to be an exception to the above and is in fact not honored when passed by the client; that, however, IS NOT because of the fetch API's forbidden headers. The offending commit is https://github.com/nodejs/undici/commit/470ee38145c5e6b367874b8b67f45143b67557c0, which landed via PR https://github.com/nodejs/undici/pull/2322, fixing issue #2318. The author is explicitly clear at https://github.com/nodejs/undici/issues/2318#issuecomment-1753483582 that this has nothing to do with fetch API forbidden headers, but rather with RFC 9110, the latest RFC defining HTTP semantics.

As a personal note, I have to say that I find this specific interpretation of that section a bit aggressive.

We need to report this issue

It has been reported already. In the PR linked above, the first report is here: https://github.com/nodejs/undici/pull/2322#issuecomment-1774692141, with a couple more comments offering points of view and feedback. A few days later, a new issue was filed at https://github.com/nodejs/node/issues/50305, and the same points of view were reiterated. Matteo Collina of the Node.js Technical Steering Committee is pretty explicit in saying that for such use cases it is better not to rely on fetch() but rather to use undici.request(). I quote as well for posterity's sake:

Or are we back to using third-party HTTP clients for this?

TL;DR, you'll have a much better production experience anyway. fetch() is significantly slower than undici.request() and almost on par with http.request(). Your use case seem specific enough that a custom client for Node.js is recommended.

This answer has been, perhaps too liberally, generalized a couple of comments below, at https://github.com/nodejs/node/issues/50305#issuecomment-1804657902. I quote again:

Following nodejs/undici#2369 (comment), using fetch is not advocated for backend development.

Undici.request is the advised approach.

This DOES NOT come from a member of the Node.js team, just one reporter on the issue. However, I should also note that it has not been refuted in close to two years. Not only that, but subsequent posts by others justify the choice.

fix it in the code of the proxy server to accept another valid HTTP header instead of 'HOST'; then cxserver can pass it.

I think that's the wrong solution. The proper one is to not rely, in a complex backend environment, on an API that is clearly designed for browsers, unless the use case is very similar or, at least, very simple. This is clearly neither of those two escape hatches. I would suggest importing undici and relying instead on undici.request, as per Matteo's suggestion. It should be faster and flexible enough to allow this.
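As a sketch of what that could look like (assuming undici is added as an explicit dependency via `npm install undici`; mwGet is an illustrative name, not cxserver's actual method):

```javascript
// Sketch only; assumes undici is installed. Node bundles undici for
// fetch(), but it must be installed to import directly.
let request;
try {
  ({ request } = require('undici'));
} catch {
  request = null; // undici not installed in this environment
}

async function mwGet(proxyUrl, wikiDomain) {
  // Unlike fetch(), undici.request() honours a caller-supplied Host
  // header, so the wiki proxy still learns which wiki is targeted.
  const { statusCode, body } = await request(proxyUrl, {
    method: 'GET',
    headers: { host: wikiDomain },
  });
  return { statusCode, data: await body.json() };
}
```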

Change #1190253 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Use undici request instead of node fetch

https://gerrit.wikimedia.org/r/1190253

Change #1191231 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] cxserver: staging: Update to 2025-09-25-051716-production

https://gerrit.wikimedia.org/r/1191231

Change #1191231 merged by jenkins-bot:

[operations/deployment-charts@master] cxserver: staging: Update to 2025-09-25-051716-production

https://gerrit.wikimedia.org/r/1191231

Mentioned in SAL (#wikimedia-operations) [2025-09-25T05:30:54Z] <kart_> staging: Updated cxserver to 2025-09-25-051716-production (T394982)

Change #1191249 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] cxserver: staging: Update to 2025-09-25-074241-production

https://gerrit.wikimedia.org/r/1191249

Change #1191249 merged by jenkins-bot:

[operations/deployment-charts@master] cxserver: staging: Update to 2025-09-25-074241-production

https://gerrit.wikimedia.org/r/1191249

Mentioned in SAL (#wikimedia-operations) [2025-09-25T07:58:02Z] <kart_> staging: Updated cxserver to 2025-09-25-074241-production (T394982)

Change #1191364 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2025-09-25-074241-production

https://gerrit.wikimedia.org/r/1191364

Change #1191364 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2025-09-25-074241-production

https://gerrit.wikimedia.org/r/1191364

During testing, I found that the page API is broken.

i.e. https://cxserver.wmflabs.org/v2/page/en/gu/Tokyo fails with an Internal Server Error

Staging logs look like this:

RangeError: Maximum call stack size exceeded
    at Function.assign (<anonymous>)
    at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:50:36)
    at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
    at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
    at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
    at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
    at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
    at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
    at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
    at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
    at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
    at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
    at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
    at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
    at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)
    at DerivedLogger.value (/srv/service/node_modules/winston/lib/winston/logger.js:67:18)

I cannot reproduce locally using master branch of cxserver.

In the logs I would expect to see something like the line below. Since we don't, it fails while outputting this line or before it:

{"@timestamp":"2025-10-02T12:56:26.511Z","ecs.version":"8.10.0","http":{"request":{"id":"ea22f5a8-2001-4f62-a372-73dd95389edd","method":"GET"}},"log.level":"debug","message":"Getting page en:Albert Einstein for fr","service":"cxserver-dev","url":{"path":"/v2/page/en/fr/Albert%20Einstein"}}


After an npm update, I see that the winston module is updated and cxserver on master seems to be working fine.

Change #1193821 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] cxserver: staging: Update to 2025-10-06-084053-production

https://gerrit.wikimedia.org/r/1193821

Change #1193821 merged by jenkins-bot:

[operations/deployment-charts@master] cxserver: staging: Update to 2025-10-06-084053-production

https://gerrit.wikimedia.org/r/1193821

Change #1193989 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] cxserver: Update to 2025-10-06-084053-production

https://gerrit.wikimedia.org/r/1193989

Change #1193989 merged by jenkins-bot:

[operations/deployment-charts@master] cxserver: Update to 2025-10-06-084053-production

https://gerrit.wikimedia.org/r/1193989

Nikerabbit changed the point value for this task from 4 to 8.