Move testreduce away from scandium to a separate Buster Ganeti VM
Closed, ResolvedPublic

Description

Based on an IRC conversation with @ssastry, turning this into a task.

scandium currently runs Stretch (like the Parsoid servers themselves). In addition, scandium hosts testreduce, which is written in Node.js (unlike current Parsoid, which is PHP). The Parsoid parts need to remain on Stretch as long as the production Parsoid servers run Stretch.

scandium runs the node10 component and npm from stretch-backports, but stretch-backports is going away soon, and the last version on stretch-backports is no longer compatible with the node10 component: various dependencies clash.

testreduce does not need to run on the same host as Parsoid. Given that Buster has Node 10 and npm by default, we should create a separate Ganeti instance for testreduce; we can then reimage scandium to get a clean Stretch state again (or remove all Node traces).

Potentially this would also allow the visual-diff tests to run on the new Ganeti instance, but we need to check the performance characteristics.

Details

Project                      Branch      Lines +/-  Subject
operations/puppet            production  +2 -0
operations/puppet            production  +1 -1
operations/puppet            production  +6 -6
operations/puppet            production  +3 -11
operations/puppet            production  +20 -6
operations/puppet            production  +1 -1
operations/puppet            production  +0 -1
operations/puppet            production  +0 -10
mediawiki/services/parsoid   master      +60 -0
operations/puppet            production  +1 -16
operations/puppet            production  +2 -2
operations/puppet            production  +2 -2
operations/puppet            production  +20 -6
operations/puppet            production  +11 -6
operations/puppet            production  +2 -6
operations/puppet            production  +9 -3
operations/puppet            production  +12 -0
operations/puppet            production  +2 -0
operations/puppet            production  +1 -0
operations/puppet            production  +3 -1
operations/puppet            production  +23 -13
operations/puppet            production  +1 -1
operations/puppet            production  +7 -0
operations/puppet            production  +1 -13
Event Timeline

Change 620757 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid::testreduce: add rt_client profile to role

https://gerrit.wikimedia.org/r/620757

Change 620757 merged by Dzahn:
[operations/puppet@production] parsoid::testreduce: add rt_client profile to role

https://gerrit.wikimedia.org/r/620757

Change 620763 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid::testreduce: add rt_server profile to role

https://gerrit.wikimedia.org/r/620763

Change 620763 merged by Dzahn:
[operations/puppet@production] parsoid::testreduce: add rt_server profile to role

https://gerrit.wikimedia.org/r/620763

@ssastry @cscott rt_client and rt_server have been added to testreduce1001.eqiad.wmnet.

[testreduce1001:~] $ sudo systemctl status parsoid-rt
● parsoid-rt.service - parsoid-rt: Testreduce HTTP service for Parsoid roundtrip testing
   Loaded: loaded (/lib/systemd/system/parsoid-rt.service; static; vendor preset: enabled)
   Active: active (running) since Mon 2020-08-17 20:26:38 UTC; 8s ago

It fails after a little while though because it does not have access to the database yet.

Unable to connect to database, error: Error: ER_ACCESS_DENIED_ERROR: Access denied for user 'testreduce'@'10.64.48.40'

Thanks!

It fails after a little while though because it does not have access to the database yet.

I suppose I need to get a DBA to give testreduce1001 access to the testreduce database. Should I file a separate ticket for it? Or should we track that here?

I was about to create a subtask for that. I got it.

I just saw the DB appears to be running on localhost, not on a cluster, fwiw.

I just saw the DB appears to be running on localhost, not on a cluster, fwiw.

Actually, we don't really need the db running on the production server. It is totally fine to spin up mysql on testreduce1001 too! So, whatever makes sense from your POV.

Change 620787 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: add profile parsoid::testing to role

https://gerrit.wikimedia.org/r/620787

Change 620787 merged by Dzahn:
[operations/puppet@production] testreduce: add profile parsoid::testing to role

https://gerrit.wikimedia.org/r/620787

Change 620792 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: add Hiera keys for API URL and turn off monitoring

https://gerrit.wikimedia.org/r/620792

Change 620792 merged by Dzahn:
[operations/puppet@production] testreduce: add Hiera keys for API URL and turn off monitoring

https://gerrit.wikimedia.org/r/620792

Change 620796 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid::testing: add support for buster by adding mariadb-client

https://gerrit.wikimedia.org/r/620796

Change 620796 merged by Dzahn:
[operations/puppet@production] parsoid::testing: add support for buster by adding mariadb-client

https://gerrit.wikimedia.org/r/620796

Dzahn changed the task status from Open to Stalled.Aug 17 2020, 11:00 PM

mariadb-client has been installed (added buster support by using that instead of outdated mysql-client package before)

added missing API URL

disabled monitoring like on scandium

created subtask to adjust mysql privileges to allow testreduce1001 to connect to the testreduce db on m5-master.

Change 615831 abandoned by Dzahn:
[operations/puppet@production] parsoid: remove vd_server and vd_client from parsoid::testing role

Reason:
turns out that vd_server/client and rt_server/client are not separated after all and now both are on the new machine on buster

https://gerrit.wikimedia.org/r/615831

This is stalled on T260627 but otherwise should be good to go.

Dzahn changed the task status from Stalled to Open.Aug 19 2020, 6:40 PM

@ssastry The parsoid-rt service is now running on testreduce1001 and does not stop anymore because it can talk to the database now. Database is still on m5 like before.

[testreduce1001:~] $ sudo systemctl status parsoid-rt
● parsoid-rt.service - parsoid-rt: Testreduce HTTP service for Parsoid roundtrip testing
   Loaded: loaded (/lib/systemd/system/parsoid-rt.service; static; vendor preset: enabled)
   Active: active (running) since Wed 2020-08-19 17:58:07 UTC; 43min ago

@ssastry Please let me know what else you need on the new instance.

Change 621337 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid/testreduce: add paramater to control parsoid-rt service

https://gerrit.wikimedia.org/r/621337

Change 621337 merged by Dzahn:
[operations/puppet@production] parsoid/testreduce: add parameter to control parsoid-rt service

https://gerrit.wikimedia.org/r/621337

Mentioned in SAL (#wikimedia-operations) [2020-08-19T19:20:40Z] <mutante> testreduce1001 - re-enabled puppet, confirmed parsoid-rt service was now stopped properly by puppet while it runs as before on scandium, the previous parsoid-testing host. switching it over is now a Hiera one-liner. (T257906)

Change 621555 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: add parameters to control client services

https://gerrit.wikimedia.org/r/621555

Change 621555 merged by Dzahn:
[operations/puppet@production] testreduce: add parameters to control client services

https://gerrit.wikimedia.org/r/621555

Mentioned in SAL (#wikimedia-operations) [2020-08-20T18:18:02Z] <mutante> testreduce1001 - rt_client and vd_client now properly stopped by puppet T257906

I am not sure what exactly needs to be done here as the next step. I have created a VM and both rt client/server and vd client/server are on it. Also added the ability to stop both clients.

So:

  • create a separate Ganeti instance for testreduce
  • apply rt client/server
  • apply vd_client/server
  • reimage scandium to get a clean Stretch state again (or we remove all Node traces)
  • Potentially this would also allow the visual-diff tests to run on the new Ganeti instance, but we need to check the performance characteristics (?)

@ssastry Can we go ahead with the "reimage scandium" part? How about the performance of vd tests?

Change 634383 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid: stop using nodejs parsoid on scandium

https://gerrit.wikimedia.org/r/634383

@Dzahn I was supposed to verify that rt testing works from that new server but haven't done it yet. Lost track of this. I'll kick off the next test run from that server. After we shift over all testing there, you can reimage scandium.

Change 635063 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Add a script to kick off rt testing

https://gerrit.wikimedia.org/r/635063

Change 635613 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[operations/puppet@production] Update update_parsoid.sh script for use on testreduce1001

https://gerrit.wikimedia.org/r/635613

Change 635648 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid/testing: disable vd client and server

https://gerrit.wikimedia.org/r/635648

Change 635648 merged by Dzahn:
[operations/puppet@production] parsoid/testing: disable vd client and server

https://gerrit.wikimedia.org/r/635648

Mentioned in SAL (#wikimedia-operations) [2020-10-21T21:38:23Z] <mutante> testreduce1001 assigned 2 more GBs of RAM - rebooting (T257940, T257906)

Change 635653 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid/testreduce: disable vd server/client

https://gerrit.wikimedia.org/r/635653

Change 635653 merged by Dzahn:
[operations/puppet@production] parsoid/testreduce: disable vd server/client

https://gerrit.wikimedia.org/r/635653

Change 635613 merged by Dzahn:
[operations/puppet@production] Update update_parsoid.sh script for use on testreduce1001

https://gerrit.wikimedia.org/r/635613

Change 635063 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Add a script to kick off rt testing

https://gerrit.wikimedia.org/r/635063

Change 637003 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] remove scandium from scap dsh group

https://gerrit.wikimedia.org/r/637003

Change 637003 merged by Dzahn:
[operations/puppet@production] remove scandium from scap dsh group

https://gerrit.wikimedia.org/r/637003

Change 634383 merged by Dzahn:
[operations/puppet@production] parsoid: stop using nodejs parsoid on scandium

https://gerrit.wikimedia.org/r/634383

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

scandium.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202010281905_dzahn_29023_scandium_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['scandium.eqiad.wmnet']

Of which those FAILED:

['scandium.eqiad.wmnet']

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

scandium.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202010281907_dzahn_30979_scandium_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['scandium.eqiad.wmnet']

Of which those FAILED:

['scandium.eqiad.wmnet']

Change 637034 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: assign insetup role to scandium, reimaging fails with prod role

https://gerrit.wikimedia.org/r/637034

Change 637034 merged by Dzahn:
[operations/puppet@production] site: assign insetup role to scandium, reimaging fails with prod role

https://gerrit.wikimedia.org/r/637034

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

scandium.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202010282159_dzahn_905_scandium_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['scandium.eqiad.wmnet']

and were ALL successful.

scandium has been reimaged (Stretch, like before), after merging https://gerrit.wikimedia.org/r/634383

It still has nodejs installed, though, and shows some warnings about failing to git clone the visualdiff repo.

Change 637070 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid: add parameters to skip installing nodejs on test hosts

https://gerrit.wikimedia.org/r/637070

Change 637070 merged by Dzahn:
[operations/puppet@production] parsoid: allow skipping installation of nodejs on test hosts

https://gerrit.wikimedia.org/r/637070

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

scandium.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202010282358_dzahn_1471_scandium_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['scandium.eqiad.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2020-10-29T01:41:10Z] <mutante> scandium reimaged a second time after making puppet changes to ensure nodejs/npm is NOT installed anymore (T257906)

scandium is back up again. Unfortunately, even after the puppet changes above and a second reinstall, the code still pulled in the nodejs package and tried (but failed) to get npm.

More puppet work will be needed. Will continue tomorrow.

Change 637582 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid::testing: remove vd_client/server and rt_client/server

https://gerrit.wikimedia.org/r/637582

Subbu is running tests on scandium from testreduce1001. Puppet is disabled on scandium for that reason.

Tomorrow I'm planning to merge the change above and reimage one last time. Then we should be done here.

Change 637582 merged by Dzahn:
[operations/puppet@production] parsoid::testing: remove vd_client/server and rt_client/server

https://gerrit.wikimedia.org/r/637582

Change 637740 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/parsoid-testing: update comments, apply insetup role

https://gerrit.wikimedia.org/r/637740

Change 637740 merged by Dzahn:
[operations/puppet@production] site/parsoid-testing: update comments, apply insetup role

https://gerrit.wikimedia.org/r/637740

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

scandium.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202010301722_dzahn_979_scandium_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['scandium.eqiad.wmnet']

and were ALL successful.

Change 637745 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/parsoid-testing: reapply testing role to scandium

https://gerrit.wikimedia.org/r/637745

Change 637745 merged by Dzahn:
[operations/puppet@production] site/parsoid-testing: reapply testing role to scandium

https://gerrit.wikimedia.org/r/637745

scandium has been reimaged.

It is now just an mw appserver plus: git clone of parsoid repo, nginx for test site on port 8001, mysql client/config and firewall holes for 8001/8142.

There is no more nodejs or npm installed.

I think that resolves the ticket.

@ssastry @Muehlenhoff Let me know if you see anything missing. Claiming resolved for now.

npm install is not working on testreduce1001

As reported by subbu, 'npm install' was not working on testreduce1001. It hung after "loadIdealTree:loadAllDepsIntoIdealTree: sill install loadIdealTree"

like https://stackoverflow.com/questions/50522376/npm-install-hangs-on-loadidealtreeloadalldepsintoidealtree-sill-install-loadid which lists different approaches to fix it.

Did these things which fixed the issue:

Things done on testreduce1001 to fix 'npm install' in /srv/parsoid.

https://stackoverflow.com/questions/50522376/npm-install-hangs-on-loadidealtreeloadalldepsintoidealtree-sill-install-loadid

sudo npm config set registry http://registry.npmjs.org/ --global
sudo npm cache clear --force
/srv/parsoid-testing] $ rm package-lock.json
sudo npm config set proxy http://webproxy.eqiad.wmnet:8080 --global
sudo npm config set https-proxy http://webproxy.eqiad.wmnet:8080 --global

^ all of these can be done with sudo and --global or without sudo for the current user.
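
For the per-user variant, `npm config set` stores each setting as a `key=value` line in the user's `~/.npmrc`. As a rough sketch of what the settings above amount to, the helper below prints those lines; the function name `npmrc_proxy_lines` is hypothetical and not part of any existing tooling, and the default proxy URL is simply the one used in the commands above.

```shell
#!/bin/sh
# Hypothetical helper: print the npmrc entries that correspond to the
# registry/proxy settings listed above. Nothing is written to disk.
npmrc_proxy_lines() {
    # Optional first argument overrides the proxy URL (assumed default
    # is the eqiad web proxy from the commands above).
    proxy_url="${1:-http://webproxy.eqiad.wmnet:8080}"
    printf 'registry=%s\n' 'http://registry.npmjs.org/'
    printf 'proxy=%s\n' "$proxy_url"
    printf 'https-proxy=%s\n' "$proxy_url"
}

# Show the entries for the default proxy.
npmrc_proxy_lines
```

The output can be compared against an existing `~/.npmrc` before changing anything; appending it blindly could duplicate keys that are already set.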

Now there is the next issue:

npm ERR! Error while executing:
npm ERR! /usr/bin/git ls-remote -h -t https://github.com/arlolra/negotiator.git
npm ERR! 
npm ERR! undefined
npm ERR! exited with error code: 128

npm ERR! A complete log of this run can be found in:
npm ERR!     /home/dzahn/.npm/_logs/2020-11-23T20_23_11_312Z-debug.log

Meanwhile, /srv/parsoid no longer exists on testreduce1001; /srv/parsoid-testing is used instead. I ran "npm install-test" to check the current status.

There was still a problem connecting to port 443 on github.com:

npm ERR! /usr/bin/git ls-remote -h -t https://github.com/cscott/service-runner.git
npm ERR!
npm ERR! fatal: unable to access 'https://github.com/cscott/service-runner.git/': Failed to connect to github.com port 443: Connection timed out

But when I set https_proxy manually, this issue was solved:

[testreduce1001:/srv/parsoid-testing] $ https_proxy="http://webproxy.eqiad.wmnet:8080" npm install
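
The one-off command above could be wrapped in a small helper so the proxy URL does not have to be retyped. This is only a sketch, not something taken from the puppet repo: the function name `with_webproxy` and the `WEBPROXY_URL` variable are hypothetical. It assumes that git, which npm calls for GitHub-hosted dependencies, honours the standard `https_proxy`/`http_proxy` environment variables, as the successful run above suggests.

```shell
#!/bin/sh
# Hypothetical helper: run a command with the web proxy set for that
# command only, leaving the calling shell's environment untouched.
WEBPROXY_URL="${WEBPROXY_URL:-http://webproxy.eqiad.wmnet:8080}"

with_webproxy() {
    # env(1) applies the variables to the child process only.
    env https_proxy="$WEBPROXY_URL" http_proxy="$WEBPROXY_URL" "$@"
}

# Example invocation (not executed here):
#   with_webproxy npm install
```

Scoping the variables to a single command avoids routing unrelated traffic from the same shell session through the proxy.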

Next there were errors that make was missing, so I installed that (puppet patch coming).

As far as I can tell it should be usable now for the parsoid team.

Change 647352 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid/testreduce: ensure make is installed on testreduce host

https://gerrit.wikimedia.org/r/647352

Change 647352 merged by Dzahn:
[operations/puppet@production] parsoid/testreduce: ensure make is installed on testreduce host

https://gerrit.wikimedia.org/r/647352

As part of this work, scandium puppet code was split into two pieces: (a) retain app-server config on scandium (b) migrate all node.js parsoid-rt & parsoid-vd code to testreduce1001.

In an IRC conversation with @Dzahn, we concluded that we missed a couple of puppetized pieces in the process.

The config files in /etc/testreduce/*parsoid-rt* and /etc/testreduce/*parsoid-vd* should still be puppetized as before. But, they should now apply to testreduce1001 (not scandium since scandium no longer runs any node.js code).

/etc/testreduce does not exist at all on scandium, so that doesn't seem to be a puppetization issue.

The mysql config changed today is in /etc/my.cnf which is separate from that.

Looks like the config files in /etc/testreduce/ are puppetized already!

I manually edited the config file /etc/testreduce/parsoid-rt.settings.js to change the host to "localhost", but in about 30 mins, it got changed back to m5-master.eqiad.wmnet which indicates a puppet run reverted it back to its original setting. I verified this two times. Both times, the change got reverted to its original setting.

The config files in /etc/testreduce/*parsoid-rt* and /etc/testreduce/*parsoid-vd* should still be puppetized as before.

This is the case. The templates for these files are in puppet/modules/testreduce/templates:

parsoid-rt-client.config.js.erb  parsoid-rt.settings.js.erb       parsoid-vd.settings.js.erb
parsoid-rt.config.yaml.erb       parsoid-vd-client.config.js.erb

But, they should now apply to testreduce1001 (not scandium since scandium no longer runs any node.js code).

This is also true, a test change like the one below shows how scandium is not affected by changes to these templates:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/669980/

https://puppet-compiler.wmflabs.org/compiler1003/28454/

Looks like the config files in /etc/testreduce/ are puppetized already!
I verified this two times. Both times, the change got reverted to its original setting.

Yes, confirmed. And that is what was requested, right?

So this is resolved?

Dzahn reassigned this task from Dzahn to ssastry.

Let me know if you see anything else that is missing here.