Upgrade nodejs on testreduce1001
Closed, Resolved (Public)

Description

testreduce1001 is the host that Parsoid uses regularly for rt-testing (before deploying new versions of Parsoid on the train every week). While trying to run npm audit fix, we discovered that we cannot upgrade node libraries there (via npm install) because it has a really old version of node.

This is a request to update node on testreduce1001 to v16 at least (or whatever is feasible).

While we can find temporary workarounds for a week or so, we would ideally like this upgrade to be done at the earliest feasible date.

Event Timeline

Side note: don't run npm install, it has a high potential to get the machine owned/taken over or turned into a botnet agent :-]

To upgrade Node you pretty much need to upgrade the OS:

  • Debian Buster has NodeJS 10 and we don't have a backport package of NodeJS 12/14, etc.
  • Bullseye has NodeJS 12 (which is end of life), but SRE at some point provided 14 and 16 (via the apt components thirdparty/node14 and thirdparty/node16, which were requested for CI in T306996)

At a quick glance, the Jenkins CI images defined in integration/config are not using them, instead we use the tarballs (I am not sure why though).

Or you can run the workload in a container. SRE provides nodejs14-slim or nodejs16-slim, but I don't know whether those provide npm since they are intended to run production applications on Kubernetes. Then you still get an obsolete Docker version from Buster ...

testreduce is a VM, so we can easily spin up a new testreduce1002 VM running Bookworm, which would provide nodejs 18 and npm 9.
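The options mentioned so far can be summarized as a small lookup. This is only a sketch of the version information stated in the comments above (Buster: 10, Bullseye: 12 plus 14/16 via thirdparty components, Bookworm: 18); the function name and data layout are made up for illustration.

```python
# Node.js major version per Debian release, as stated in this thread.
# "extra" lists versions the thread says SRE provided via thirdparty apt components.
DEBIAN_NODE = {
    "buster": {"default": 10, "extra": []},
    "bullseye": {"default": 12, "extra": [14, 16]},
    "bookworm": {"default": 18, "extra": []},
}

RELEASE_ORDER = ["buster", "bullseye", "bookworm"]  # oldest to newest


def releases_providing(min_major, allow_extra=True):
    """Return the Debian releases that can provide Node.js >= min_major."""
    result = []
    for name in RELEASE_ORDER:
        info = DEBIAN_NODE[name]
        versions = [info["default"]]
        if allow_extra:
            versions += info["extra"]
        if max(versions) >= min_major:
            result.append(name)
    return result


if __name__ == "__main__":
    # The task asks for "v16 at least".
    print(releases_providing(16))
    print(releases_providing(16, allow_extra=False))
```

With the thirdparty components excluded, only Bookworm satisfies the "v16 at least" request, which matches the proposal to build a new Bookworm VM.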

Thanks Moritz. We would also need to copy over the test database when you create the new VM. It is not catastrophic if not done since we can always reinitialize the test set, but it would save us some trouble and give us a baseline for continuing our tests.

Ack, I'll look into it tomorrow. For the new VM I'd simply reuse the current specs of testreduce1001, so 4 CPU cores, 6G RAM and 40G disk.

Regarding disk space, the current server has 50gb: "/dev/vdb1 51289548 33209356 15442424 69% /srv/data".
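For readers less used to raw df output, here is a rough sketch of how the quoted line breaks down. It assumes the numbers are 1K blocks (the df default) in the usual order total, used, available; the helper names are invented for this example.

```python
import math


def parse_df_line(line):
    """Parse one data line of `df` output (assumed to be in 1K blocks)."""
    device, total_kib, used_kib, avail_kib, use_pct, mount = line.split()
    return {
        "device": device,
        "total_kib": int(total_kib),
        "used_kib": int(used_kib),
        "avail_kib": int(avail_kib),
        "reported_use_pct": int(use_pct.rstrip("%")),
        "mount": mount,
    }


def kib_to_gib(kib):
    """Convert KiB to GiB."""
    return kib / (1024 * 1024)


def use_percent(used_kib, avail_kib):
    """Use% as used / (used + available), rounded up to a whole percent.

    Assumption: this mirrors how GNU df derives its Use% column.
    """
    return math.ceil(100 * used_kib / (used_kib + avail_kib))


if __name__ == "__main__":
    row = parse_df_line("/dev/vdb1 51289548 33209356 15442424 69% /srv/data")
    print(round(kib_to_gib(row["total_kib"]), 1), "GiB total")
    print(round(kib_to_gib(row["used_kib"]), 1), "GiB used")
    print(round(kib_to_gib(row["avail_kib"]), 1), "GiB available")
    print(use_percent(row["used_kib"], row["avail_kib"]), "% used")
```

Under these assumptions roughly 32 GiB of a 49 GiB volume is in use, which is why a 40G disk would not hold the existing data.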

I would say 50gb is probably the minimum we need and 60gb would probably give us a bit more cushion. Let me know if I need to do anything to change the 40gb default to 60gb. I had also filed T296051 at one point, but I think I am going to decline it right now since, in practice, we now have a workflow where we are able to clear out old results periodically.

Ack. We don't have any preset machine types, if you need 60G, you'll just get 60G :-)

Change 954010 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add testreduce1002

https://gerrit.wikimedia.org/r/954010

Change 954010 merged by Muehlenhoff:

[operations/puppet@production] Add testreduce1002

https://gerrit.wikimedia.org/r/954010

Change 954221 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add parsoid::testreduce role to testreduce1002

https://gerrit.wikimedia.org/r/954221

Change 954221 merged by Muehlenhoff:

[operations/puppet@production] Add parsoid::testreduce role to testreduce1002

https://gerrit.wikimedia.org/r/954221

@ssastry testreduce1002.eqiad.wmnet is now running the role, but the Puppet run doesn't fully complete:

The git::clone for the integration/visualdiff repository fails:

Notice: /Stage[main]/Visualdiff/Git::Clone[integration/visualdiff]/Exec[git_clone_integration/visualdiff]/returns: fatal: destination path '/srv/visualdiff' already exists and is not an empty

This isn't surprising since the Puppet class installs testreduce/testrun.ids and the pngs/ directory to it: https://github.com/wikimedia/operations-puppet/blob/production/modules/visualdiff/manifests/init.pp

This was never seen with the old setup since only the checkout existed initially and the /srv/visualdiffs/pngs and /srv/visualdiffs/testreduce directories were added later. Shall we just move them to a different directory?
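The failure mode can be illustrated with a small check: git clone accepts a destination that does not exist yet or is an empty directory, and refuses anything else. The sketch below is not taken from the Puppet module; the function name is hypothetical and it only mimics that precondition.

```python
import os
import tempfile


def can_clone_into(path):
    """Return True if `git clone` would accept `path` as its destination.

    git accepts a path that does not exist yet, or an existing empty directory.
    """
    if not os.path.exists(path):
        return True
    return os.path.isdir(path) and not os.listdir(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as dest:
        print(can_clone_into(dest))  # empty directory
        # Simulate another resource creating a subdirectory before the clone runs.
        os.mkdir(os.path.join(dest, "pngs"))
        print(can_clone_into(dest))  # no longer empty
```

On a fresh host the Puppet-managed files land in the target directory before the clone runs, so the second case applies.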

visualdiff role isn't needed for testreduce100* .. not sure why it is part of the puppet roles.

It may just be a carry-over from the times when we ran both kinds of tests on scandium ... but that has changed since. We run visualdiff tests on a cloud VM now. So, if everything else is installed, it might be good to go. I should copy over /srv/data from testreduce1001 to testreduce1002 once it is set up (unless you've already done this).

Change 954682 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove visualdiff client/server from testreduce role

https://gerrit.wikimedia.org/r/954682

It may just be carry over from the times when we ran both kinds of tests on scandium ... but that has changed since. We run visualdiff tests on a cloud VM now. So, if everything else is installed, it might be good to go.

I created https://gerrit.wikimedia.org/r/c/operations/puppet/+/954682/ which would remove these two Puppet classes from the testreduce server: https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/manifests/parsoid/vd_client.pp and https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/manifests/parsoid/vd_server.pp , does that look okay to you?

I should copy over /srv/data from testreduce1001 to testreduce1002 once it is setup (unless you've already done this).

I can do that, but I'll first reimage the server after https://gerrit.wikimedia.org/r/c/operations/puppet/+/954682/ is applied (to ensure we don't carry over cruft which confuses us later)

Change 954682 merged by Muehlenhoff:

[operations/puppet@production] Remove visualdiff client/server from testreduce role

https://gerrit.wikimedia.org/r/954682

testreduce1002 has been reimaged and now Puppet runs cleanly. @ssastry Let me know if you miss anything.

If all is well, you can go ahead and copy over /srv/data from testreduce1001 (or I can do it tomorrow, either works for me). One last missing bit is that we need to update some grants. I'd loop in one of our DBAs for that. And we possibly need to move the DB content from 1001, or is it ephemeral?

/srv/data has the db content from 1001. I would ideally like that copied over. And yes, need db grants (similar to that on testreduce1001). I also need to be able to run npm install on this server (similar to testreduce1001).

Change 955325 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] testreduce: Setup rsync for data transfer

https://gerrit.wikimedia.org/r/955325

Change 955325 merged by Muehlenhoff:

[operations/puppet@production] testreduce: Setup rsync for data transfer

https://gerrit.wikimedia.org/r/955325

/srv/data has the db content from 1001. I would ideally like that copied over.

I synced the data over from 1001 to 1002:/srv/data/mysql. mariadb won't start with the imported data: the mariadb version moved from 10.3 to 10.11, and there appear to be some changes which prevent reusing the on-disk format directly with the new version (see "journalctl -u mariadb"). Shall we loop in the DBAs for some ideas on how best to migrate the existing data?

I also need to be able to run npm install on this server (similar to testreduce1001).

The sudo rules are the same as for the old system, that should work just fine already. If any command names changed between Debian 10 and 12, let me know and we can adapt the rules.

Sounds good. But while we wait, would mysqldump on testreduce1001 and initializing the db from the dump on testreduce1002 do the trick?
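A logical dump and restore along those lines might look roughly like the sketch below. It only assembles and prints the commands and does not run anything. The database name rt_testing and the dump path are placeholders that do not come from this task, and --single-transaction assumes the tables are InnoDB.

```python
import shlex


def dump_command(database, outfile):
    """argv for a logical dump, to be run on the old host."""
    return [
        "mysqldump",
        "--single-transaction",  # consistent snapshot; assumes InnoDB tables
        "--routines",            # include stored procedures and functions
        "--result-file=" + outfile,
        database,
    ]


def restore_command(database):
    """argv for loading a dump on the new host; the dump is fed on stdin."""
    return ["mysql", database]


if __name__ == "__main__":
    db = "rt_testing"               # hypothetical database name
    dump = "/srv/data/rt_dump.sql"  # hypothetical dump location
    print(shlex.join(dump_command(db, dump)))
    print(shlex.join(restore_command(db)) + " < " + shlex.quote(dump))
```

A logical dump sidesteps the on-disk format difference between 10.3 and 10.11 because the new server rebuilds its own data files from SQL statements.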

The sudo works, but looks like npm install fails because of network blocks.

I created T345831 for the database migration.

Can you paste the failing npm command and its output?

npm WARN old lockfile FetchError: request to https://registry.npmjs.org/@babel%2fhelper-validator-identifier failed, reason: connect ETIMEDOUT 2606:4700::6810:1e22:443
npm WARN old lockfile     at ClientRequest.<anonymous> (/usr/share/nodejs/minipass-fetch/lib/index.js:130:14)
npm WARN old lockfile     at ClientRequest.emit (node:events:513:28)
npm WARN old lockfile     at TLSSocket.socketErrorListener (node:_http_client:496:9)
npm WARN old lockfile     at TLSSocket.emit (node:events:525:35)
npm WARN old lockfile     at emitErrorNT (node:internal/streams/destroy:151:8)
npm WARN old lockfile     at emitErrorCloseNT (node:internal/streams/destroy:116:3)
npm WARN old lockfile     at process.processTicksAndRejections (node:internal/process/task_queues:82:21)
npm WARN old lockfile  Could not fetch metadata for @babel/helper-validator-identifier@7.12.11 FetchError: request to https://registry.npmjs.org/@babel%2fhelper-validator-identifier failed, reason: connect ETIMEDOUT 2606:4700::6810:1e22:443
npm WARN old lockfile     at ClientRequest.<anonymous> (/usr/share/nodejs/minipass-fetch/lib/index.js:130:14)
npm WARN old lockfile     at ClientRequest.emit (node:events:513:28)
npm WARN old lockfile     at TLSSocket.socketErrorListener (node:_http_client:496:9)
npm WARN old lockfile     at TLSSocket.emit (node:events:525:35)
npm WARN old lockfile     at emitErrorNT (node:internal/streams/destroy:151:8)
npm WARN old lockfile     at emitErrorCloseNT (node:internal/streams/destroy:116:3)
npm WARN old lockfile     at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
npm WARN old lockfile   code: 'ETIMEDOUT',
npm WARN old lockfile   errno: 'ETIMEDOUT',
npm WARN old lockfile   syscall: 'connect',
npm WARN old lockfile   address: '2606:4700::6810:1e22',
npm WARN old lockfile   port: 443,
npm WARN old lockfile   type: 'system'
npm WARN old lockfile }
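One way to read the log above: every failure is a connect timeout to the same address and port. A minimal sketch for pulling that out of such output follows; the regular expression and function name are written for this example and only handle the "connect ETIMEDOUT address:port" form seen here.

```python
import ipaddress
import re

# Matches e.g. "connect ETIMEDOUT 2606:4700::6810:1e22:443".
# The port is whatever follows the last colon.
TIMEOUT_RE = re.compile(r"connect ETIMEDOUT (?P<address>\S+):(?P<port>\d+)")


def timed_out_endpoints(log_text):
    """Return the set of (address, port) pairs that timed out on connect."""
    return {
        (m.group("address"), int(m.group("port")))
        for m in TIMEOUT_RE.finditer(log_text)
    }


if __name__ == "__main__":
    sample = (
        "npm WARN old lockfile FetchError: request to "
        "https://registry.npmjs.org/@babel%2fhelper-validator-identifier "
        "failed, reason: connect ETIMEDOUT 2606:4700::6810:1e22:443"
    )
    for address, port in timed_out_endpoints(sample):
        version = ipaddress.ip_address(address).version
        print("IPv%d %s port %d" % (version, address, port))
```

Here npm tried to reach the registry directly over IPv6 on port 443 and never got a connection, which would be consistent with the host having no direct outbound access.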

@ssastry It seems like you're maybe missing the proxy setting? Can you please retry with --proxy http://url-downloader.wikimedia.org:8080?

I just tried a

npm --proxy http://url-downloader.wikimedia.org:8080 install upper-case

and it worked fine for me

Aha .. okay! That did the trick. Thanks for getting us this far @MoritzMuehlenhoff.

Now, the only thing left is getting the mysql grants. The current failure is:

Sep 12 22:07:09 testreduce1002 nodejs[2595576]: Error: connect ENOENT /run/mysqld/mysqld.sock

@Ladsgroup It is not a problem if the database cannot be migrated over. We can reinit the database with a fresh set of test pages; it is probably time to do a reset anyway. But getting the grants is probably the blocker now. It would be great to get that granted sooner rather than later! I'll take care of initializing the database, etc. once that is done.

Change 957251 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] mariadb: Add grants for testreduce1002

https://gerrit.wikimedia.org/r/957251

Regarding the database used for rt-testing: T266509 documents what was done for testreduce1001. The db there was local and all of us were given grants for the local db. Does that help resolve the situation? @MoritzMuehlenhoff: is this something you can handle? Or do DBAs need to be involved for a local db and grants for it? /cc @Ladsgroup

Change 963392 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[operations/puppet@production] parsoid-rt-client: Reduce worker pool to 24 clients

https://gerrit.wikimedia.org/r/963392

Change 963392 merged by Muehlenhoff:

[operations/puppet@production] parsoid-rt-client: Reduce worker pool to 24 clients

https://gerrit.wikimedia.org/r/963392

Change 963413 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[operations/puppet@production] parsoid-rt-client: Further reduce worker pool to 16 clients

https://gerrit.wikimedia.org/r/963413

Change 963413 merged by Muehlenhoff:

[operations/puppet@production] parsoid-rt-client: Further reduce worker pool to 16 clients

https://gerrit.wikimedia.org/r/963413

Change 957251 abandoned by Ladsgroup:

[operations/puppet@production] mariadb: Add grants for testreduce1002

Reason:

The db in m5 is not being used at all. We should just simply drop everything.

https://gerrit.wikimedia.org/r/957251

Change 963996 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Automatically restart parsoid-rt if it crashes

https://gerrit.wikimedia.org/r/963996

Change 963996 merged by Muehlenhoff:

[operations/puppet@production] testreduce: Auto-restart parsoid-rt server/client and mariadb on failures

https://gerrit.wikimedia.org/r/963996

Change 964560 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[operations/puppet@production] parsoid-rt-client: Increase worker pool to 20 clients

https://gerrit.wikimedia.org/r/964560

Change 964560 merged by Muehlenhoff:

[operations/puppet@production] parsoid-rt-client: Increase worker pool to 20 clients

https://gerrit.wikimedia.org/r/964560

Change 965163 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Failover testreduce to testreduce1002

https://gerrit.wikimedia.org/r/965163

Change 965679 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] testreduce: Set innodb_buffer_pool_size to 4.6G

https://gerrit.wikimedia.org/r/965679

Change 965679 merged by Muehlenhoff:

[operations/puppet@production] testreduce: Set innodb_buffer_pool_size to 4.6G

https://gerrit.wikimedia.org/r/965679

Change 964105 had a related patch set uploaded (by Arlolra; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] Migrate rt-testing to testreduce1002

https://gerrit.wikimedia.org/r/964105

Change 964105 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Migrate rt-testing to testreduce1002

https://gerrit.wikimedia.org/r/964105

Change 965163 merged by Muehlenhoff:

[operations/dns@master] Failover testreduce to testreduce1002

https://gerrit.wikimedia.org/r/965163

Change 988684 had a related patch set uploaded (by Arlolra; author: Arlolra):

[operations/dns@master] Switch testreduce to 1002

https://gerrit.wikimedia.org/r/988684

Change 988684 merged by Ssingh:

[operations/dns@master] Switch testreduce to 1002

https://gerrit.wikimedia.org/r/988684

Mentioned in SAL (#wikimedia-operations) [2024-01-08T19:04:29Z] <sukhe> running authdns-update for CR 988684: T345220

Mentioned in SAL (#wikimedia-operations) [2024-01-08T19:27:38Z] <taavi> make puppet re-generate empty envoy config file on testreduce1002 T345220

https://parsoid-rt-tests.wikimedia.org/ now looks correct

I guess the last step here is to decommission 1001

Ack, I'll remove testreduce1001 tomorrow.

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: testreduce1001.eqiad.wmnet

  • testreduce1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

Change 989452 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove puppet references to testreduce1001

https://gerrit.wikimedia.org/r/989452

Change 989452 merged by Muehlenhoff:

[operations/puppet@production] Remove puppet references to testreduce1001

https://gerrit.wikimedia.org/r/989452

Change 989454 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Move Puppet 7 config towards the testreduce role

https://gerrit.wikimedia.org/r/989454

Change 989454 merged by Muehlenhoff:

[operations/puppet@production] Move Puppet 7 config towards the testreduce role

https://gerrit.wikimedia.org/r/989454

MoritzMuehlenhoff claimed this task.

testreduce1002 is now working fine and testreduce1001 has been decommissioned, closing.