
Upgrade eventgate-wikimedia to node20
Closed, Resolved (Public)

Description

https://gitlab.wikimedia.org/repos/data-engineering/eventgate-wikimedia

Done is

All eventgate deployments are upgraded to Node 20:

  • all of the below in beta
  • eventgate-analytics
  • eventgate-logging-external
  • eventgate-analytics-external
  • eventgate-main

Event Timeline

Ottomata triaged this task as Medium priority.

I've been working on this since my head is in eventgate code as part of T382173: Enable Event Platform streams to opt out of collecting User-Agent data.

I hope to deploy both at the same time.

This is deployed and working in beta.

Change #1114795 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate - templatize module name, default to @eventgate/wikimedia

https://gerrit.wikimedia.org/r/1114795

Change #1114798 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics - upgrade to v1.10.0 and NodeJS 20

https://gerrit.wikimedia.org/r/1114798

Change #1114795 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate - templatize module name, default to @eventgate/wikimedia

https://gerrit.wikimedia.org/r/1114795

Status 2025-01-29

v1.10.0 has been released with node20. It is running in beta.

I'd like to deploy to some production eventgates, but I'm worried about the timing. I have one full work day left before next week's DPE offsite. Node upgrades might work right away, but any performance regressions could take days to manifest.

Let's put off deployment until we are back from the offsite.

Change #1114798 merged by Ottomata:

[operations/deployment-charts@master] eventgate-analytics - upgrade to v1.10.0 and NodeJS 20

https://gerrit.wikimedia.org/r/1114798

Attempted to deploy eventgate-analytics to staging today, but it looks like staging k8s is out of IP addresses?

0s          Warning   FailedCreatePodSandBox   pod/eventgate-production-6f5b4bbb64-rhrf4    (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "986549f25d453eb7cccba18c87774f564d4a2c2ffbffe4a0bab97d512ddfb4cf": plugin type="calico" failed (add): failed to request IPv4 addresses: Assigned 0 out of 1 requested IPv4 addresses; No more free affine blocks and strict affinity enabled

...Asking in IRC.

@tchin has discovered that nodejs20-slim does not contain the SSL package needed for TLS connections to Kafka. eventgate probably needs the same dependency.

https://gitlab.wikimedia.org/repos/data-engineering/eventstreams/-/merge_requests/16

Change #1120576 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] charts/eventgate - fix name of default node module to load

https://gerrit.wikimedia.org/r/1120576

Change #1120576 merged by Ottomata:

[operations/deployment-charts@master] charts/eventgate - fix name of default node module to load

https://gerrit.wikimedia.org/r/1120576

Change #1120604 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics - bump to v1.11.0 for node20

https://gerrit.wikimedia.org/r/1120604

Change #1120604 merged by Ottomata:

[operations/deployment-charts@master] eventgate-analytics - bump to v1.11.0 for node20

https://gerrit.wikimedia.org/r/1120604

Mentioned in SAL (#wikimedia-operations) [2025-02-18T17:30:09Z] <ottomata> upgrading eventgate-analytics in codfw to node20 (will let this simmer for a day before proceeding to eqiad) - T383814

eventgate-analytics in codfw is upgraded to node20.

I'll let this simmer for at least a day before proceeding in eqiad and with other eventgate deployments.

Mentioned in SAL (#wikimedia-operations) [2025-02-19T15:04:32Z] <ottomata> upgrading eventgate-analytics in eqiad to node20 - T383814

eventgate-analytics in eqiad is upgraded to node20.

Will proceed with the other eventgate instances next week.

Change #1122159 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-logging-external - upgrade to node20

https://gerrit.wikimedia.org/r/1122159

Mentioned in SAL (#wikimedia-operations) [2025-03-03T15:36:38Z] <ottomata> deploying eventgate-logging-external to bump to node20 - T383814

Change #1124144 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics-external - upgrade to node20

https://gerrit.wikimedia.org/r/1124144

Change #1122159 merged by Ottomata:

[operations/deployment-charts@master] eventgate-logging-external - upgrade to node20

https://gerrit.wikimedia.org/r/1122159

Uh, oops. I deployed eventgate-logging-external this morning, but didn't merge the patch that bumped the image version! It was basically a no op! Merging and proceeding.
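A check along these lines might catch a deploy that does not change the running image. This is only a sketch: the image names and tags below are hypothetical placeholders, and the list of running images is assumed to come from whatever tooling reports the container images of the deployed pods.

```python
def is_noop_deploy(running_images, expected_tag):
    """Return True if no running container image carries the expected tag.

    running_images: image references such as "registry.example/eventgate:some-tag"
    (hypothetical names, not the real registry paths).
    expected_tag: the tag the deploy was supposed to roll out.
    """
    # The tag is whatever follows the last ":" in the image reference.
    return not any(
        image.rsplit(":", 1)[-1] == expected_tag for image in running_images
    )


if __name__ == "__main__":
    # Hypothetical state after deploying without merging the image bump.
    images = ["registry.example/eventgate:node18-build"]
    print(is_noop_deploy(images, "node20-build"))
```

Running such a check right after the deploy, against the tag named in the patch, would flag the case where the chart change was never merged.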

Mentioned in SAL (#wikimedia-operations) [2025-03-03T16:38:04Z] <ottomata> deploying eventgate-logging-external to ACTUALLY bump to node20 - T383814

fgiunchedi subscribed.

> Uh, oops. I deployed eventgate-logging-external this morning, but didn't merge the patch that bumped the image version! It was basically a no op! Merging and proceeding.

I believe the latest deploy also removed x-geoip info/lookups from NEL logs, see also T387850: NEL logs are missing geoip information

Mentioned in SAL (#wikimedia-operations) [2025-03-04T16:01:15Z] <ottomata> eventgate-logging-external: rolling back to pre node 20 due to bug likely caused by T382173. -- T387850 , T383814

I rolled back. Let's find and fix the bug before deploying again.

Change #1131415 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-logging-external - upgrade to node20

https://gerrit.wikimedia.org/r/1131415

Change #1131415 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-logging-external - upgrade to node20

https://gerrit.wikimedia.org/r/1131415

Mentioned in SAL (#wikimedia-operations) [2025-03-27T15:28:10Z] <ottomata> upgrading eventgate-logging-external to node20 (using new per stream header enrich setting), first testing in staging. - T383814, T387908

Change #1124144 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-analytics-external - upgrade to node20

https://gerrit.wikimedia.org/r/1124144

Mentioned in SAL (#wikimedia-operations) [2025-03-27T17:34:37Z] <ottomata> upgrading eventgate-analytics-external to node20 - T383814

Change #1131792 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-main - upgrade to NodeJS 20

https://gerrit.wikimedia.org/r/1131792

Status update:

All eventgate deployments are upgraded except eventgate-main.

Let's do eventgate-main on Monday.

Mentioned in SAL (#wikimedia-operations) [2025-03-31T16:28:13Z] <ottomata> beginning eventgate-main upgrade to NodeJS 20 - T383814

Change #1131792 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-main - upgrade to NodeJS 20

https://gerrit.wikimedia.org/r/1131792

I deployed eventgate-analytics-external last Thursday, March 27.

Resource usage seems lower since then? TBD if it will last; perhaps memory usage will climb back to the Node 18 status quo. Since the 27th:

Average latency has dropped.

[Screenshot 2025-03-31 at 13.19.12.png]

Average and max CPU usage have dropped.

[Screenshot 2025-03-31 at 13.19.20.png]

[Screenshot 2025-03-31 at 13.19.23.png]

Average memory usage has dropped, from about 227MB to 160MB working set for the eventgate-analytics-external container.
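For scale, the relative size of that drop can be worked out from the two approximate figures above. The helper name is just for illustration, and the inputs are the rounded values quoted in this comment, not exact measurements.

```python
def relative_drop(before, after):
    """Fraction by which a metric fell, relative to its earlier value."""
    return (before - after) / before


if __name__ == "__main__":
    # Approximate working set in MB, before and after the Node 20 upgrade.
    drop = relative_drop(227, 160)
    print(f"{drop:.1%}")  # prints 29.5%
```

That is roughly a 30% reduction in working set, assuming the numbers hold.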

[Screenshot 2025-03-31 at 13.19.36.png]

Too soon to tell for eventgate-main, but at least so far nothing looks worse. eventgate-main is a little different, in that most requests use the hasty=false producer, which means latency should be generally higher than eventgate-analytics-external.
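To illustrate why the non-hasty path would be slower, here is a toy simulation, not eventgate code. It assumes that a hasty request is answered before the Kafka produce is acknowledged, while a hasty=false request waits for the ack. The function names, the 50 ms delay, and the 201/202 status codes are assumptions for this sketch.

```python
import asyncio

ACK_DELAY_SECONDS = 0.05  # assumed broker round trip, purely illustrative


async def produce(event, acked):
    """Pretend to produce to Kafka; the event counts as acked after a delay."""
    await asyncio.sleep(ACK_DELAY_SECONDS)
    acked.append(event)


async def handle(event, hasty, acked, pending):
    """Toy request handler returning an HTTP-like status code."""
    if hasty:
        # Fire and forget: respond without waiting for the ack.
        pending.append(asyncio.create_task(produce(event, acked)))
        return 202
    # hasty=false: the response is held until the produce is acknowledged.
    await produce(event, acked)
    return 201


async def demo():
    acked, pending = [], []
    hasty_status = await handle("a", True, acked, pending)
    acked_at_hasty_return = list(acked)
    guaranteed_status = await handle("b", False, acked, pending)
    acked_at_guaranteed_return = list(acked)
    await asyncio.gather(*pending)  # let background produces finish
    return (hasty_status, acked_at_hasty_return,
            guaranteed_status, acked_at_guaranteed_return)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this model the hasty response carries no produce latency at all, whereas every hasty=false response includes the full ack round trip, which is why eventgate-main latency is not directly comparable to eventgate-analytics-external.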

Change #1119206 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics remove canary release from staging

https://gerrit.wikimedia.org/r/1119206

Change #1119206 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-analytics remove canary release from staging

https://gerrit.wikimedia.org/r/1119206