Move testreduce away from scandium to a separate Buster Ganeti VM
Open, Medium, Public

Description

Based on an IRC conversation with @ssastry, turning this into a task.

scandium currently runs Stretch (like the Parsoid servers themselves). In addition, scandium hosts testreduce, which is written in Node.js (unlike current Parsoid, which is PHP). The Parsoid parts need to remain on Stretch as long as the production Parsoid servers run Stretch.

scandium runs the node10 component and npm from stretch-backports, but stretch-backports is going away soon, and the last npm version on stretch-backports is no longer compatible with the node10 component: various dependencies clash.

testreduce does not need to run on the same host as Parsoid. Given that Buster ships Node 10 and npm by default, we should create a separate Ganeti instance for testreduce. We can then either reimage scandium to get a clean Stretch state again or remove all Node traces from it.

Potentially this would also allow the visual-diff tests to run on the new Ganeti instance, but we need to check the performance characteristics.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jul 14 2020, 11:18 AM

Change 612568 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Remove apt pin for stretch-backports for npm

https://gerrit.wikimedia.org/r/612568

ssastry updated the task description. · Jul 14 2020, 4:44 PM
ssastry edited subscribers, added: ssastry; removed: SubrahamanyamVarma.

Change 612568 merged by Muehlenhoff:
[operations/puppet@production] Remove apt pin for stretch-backports for npm

https://gerrit.wikimedia.org/r/612568

Dzahn added a comment. · Jul 16 2020, 9:26 PM

Change 613278 merged by Dzahn:
[operations/puppet@production] parsoid: create new role to install just testreduce

https://gerrit.wikimedia.org/r/613278

Change 613306 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] visualdiff: ensure /srv/visualdiff/testreduce exists

https://gerrit.wikimedia.org/r/613306

Change 613306 merged by Dzahn:
[operations/puppet@production] visualdiff: ensure /srv/visualdiff/testreduce exists

https://gerrit.wikimedia.org/r/613306

Change 613309 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] visualdiff: update git branch from ruthenium to scandium

https://gerrit.wikimedia.org/r/613309

Mentioned in SAL (#wikimedia-operations) [2020-07-16T22:15:58Z] <mutante> testreduce1001 manually git clone 'scandium' branch of integration/visualdiff into /srv/visualdiff (T257906)

Joe assigned this task to Dzahn. · Jul 20 2020, 7:05 AM
Joe added a subscriber: Joe.

Assigning to Daniel as he's actively working on this.

Change 613309 merged by Dzahn:
[operations/puppet@production] visualdiff: update git branch from ruthenium to scandium

https://gerrit.wikimedia.org/r/613309

Mentioned in SAL (#wikimedia-operations) [2020-07-23T18:21:47Z] <mutante> testreduce1001 - rm -rf /srv/testreduce and run puppet to re-clone testreduce to it from the scandium branch (T257906)

Merging the change above was a noop on scandium. I did not manually touch it so far, so the git repo at /srv/testreduce is unchanged and on the ruthenium branch there. Puppet does not care about the branch and is running fine. When setting up a new instance it will automatically clone from the scandium branch.

It also fixed the puppet run on the new ganeti VM testreduce1001.eqiad.wmnet which was the point of it.

@ssastry @Muehlenhoff Current status is:

  • there is now testreduce1001.eqiad.wmnet
  • all the parsoid-* admin group members should be able to SSH to it
  • It uses a new puppet role that, so far, installs only the vd_server and vd_client.
  • The testreduce parsoid-vd service is running.
  • /srv/visualdiff has the contents of the testreduce repo on the scandium branch.
  • VM is on buster and npm and node 10 are installed.
ii  npm                                  5.8.0+ds6-4+deb10u1          all          package manager for Node.js

ii  nodejs                               10.21.0~dfsg-1~deb10u1       amd64        evented I/O for V8 javascript - runtime executable
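
The two `dpkg -l` lines above can be checked mechanically, for example when confirming that a newly built Buster VM really carries Node 10. The following is a minimal Python sketch and not part of the task itself: the helper names `parse_dpkg_lines` and `upstream_major` are made up for illustration, and the parser assumes the plain column layout shown above (status, name, version, architecture, description).

```python
import re
from typing import Dict


def parse_dpkg_lines(output: str) -> Dict[str, str]:
    """Map package name -> version for installed ('ii') lines of `dpkg -l` output."""
    versions: Dict[str, str] = {}
    for line in output.splitlines():
        fields = line.split()
        # Installed packages start with status 'ii', then name, version, architecture.
        if len(fields) >= 4 and fields[0] == "ii":
            versions[fields[1]] = fields[2]
    return versions


def upstream_major(version: str) -> int:
    """Return the leading integer of a Debian version string."""
    upstream = version.split(":", 1)[-1]  # drop an optional Debian epoch
    match = re.match(r"\d+", upstream)
    if match is None:
        raise ValueError(f"no leading number in version {version!r}")
    return int(match.group())


SAMPLE = """\
ii  npm                                  5.8.0+ds6-4+deb10u1          all          package manager for Node.js
ii  nodejs                               10.21.0~dfsg-1~deb10u1       amd64        evented I/O for V8 javascript - runtime executable
"""

installed = parse_dpkg_lines(SAMPLE)
print(upstream_major(installed["nodejs"]))  # 10
```

The sketch only reads text that has already been captured; it does not call `dpkg` itself.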

Are other profiles besides vd_client and vd_server needed?

Change 615831 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid: remove vd_server and vd_client from parsoid::testing role

https://gerrit.wikimedia.org/r/615831

wkandek triaged this task as Medium priority. · Jul 23 2020, 10:56 PM
wkandek moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.

ssastry added a comment. · Edited · Aug 5 2020, 3:27 AM

> Merging the change above was a noop on scandium. I did not manually touch it so far, so the git repo at /srv/testreduce is unchanged and on the ruthenium branch there. Puppet does not care about the branch and is running fine. When setting up a new instance it will automatically clone from the scandium branch.

Thanks!

>   • there is now testreduce1001.eqiad.wmnet
>   • all the parsoid-* admin group members should be able to SSH to it
>   • It uses a new puppet role that, so far, installs only the vd_server and vd_client.
>   • The testreduce parsoid-vd service is running.
>   • /srv/visualdiff has the contents of the testreduce repo on the scandium branch.
>   • VM is on buster and npm and node 10 are installed.

I had a question. Can I use npm install on this VM? Or would I still need to check in node modules into git and use that to fetch the node modules on this server?

> Are other profiles besides vd_client and vd_server needed?

testreduce codebase is used for regular roundtrip testing (parsoid-rt and parsoid-rt-client) as well as visual diff testing (parsoid-vd & parsoid-vd-client). From what I understood chatting with @Muehlenhoff, node 10 and npm aren't going to be available if scandium is reimaged which means round trip testing will not run there either. So, the primary use of testreduce1001 would be to run rt testing services there. Once we get these 2 services migrated there and we can verify we can run these tests from there, we are good to go and at that time, we can uninstall parsoid-rt* services from scandium. So, scandium will end up being just a Parsoid/PHP test server. parsoid-rt-client on testreduce1001 will then contact Parsoid on scandium for rendering pages.

As for visual diff testing, right now we have migrated our visual diff use cases to a cloud VM (parsoid-qa-01.eqiad.wmflabs). But, as I was mentioning to @Muehlenhoff, if it turns out we can allocate more Ganeti VMs and they are more powerful than the WMCS servers, we might decide to switch from WMCS to production VMs in case it lets us run tests on a larger set of production wiki pages. There is no urgency on this. We would like to run these two test modes on different servers since they are both resource intensive.

> testreduce codebase is used for regular roundtrip testing (parsoid-rt and parsoid-rt-client) as well as visual diff testing (parsoid-vd & parsoid-vd-client). From what I understood chatting with @Muehlenhoff, node 10 and npm aren't going to be available if scandium is reimaged which means round trip testing will not run there either. So, the primary use of testreduce1001 would be to run rt testing services there. Once we get these 2 services migrated there and we can verify we can run these tests from there, we are good to go and at that time, we can uninstall parsoid-rt* services from scandium. So, scandium will end up being just a Parsoid/PHP test server. parsoid-rt-client on testreduce1001 will then contact Parsoid on scandium for rendering pages.

Exactly: the node10 packages we build for stretch are now incompatible with the npm backport that was once uploaded to stretch-backports (and never updated after the initial upload). It still works on scandium since it is already installed, but it would fail to install if we reimaged scandium. Hence the split-off of the testing infrastructure to a Buster VM (testreduce1001), which has nodejs and npm natively. In the end scandium would otherwise be identical to a standard Parsoid host (I'd suggest that we actually reimage it once testreduce1001 is split off, to properly clean out the old cruft and validate that it works as expected).

cscott added a subscriber: cscott. · Aug 5 2020, 3:09 PM

@ssastry one minor wrinkle to keep in mind is that to start an rt test run you need to update files on both scandium and the rt test server (checking out a specific git hash on scandium, telling the rt test server to kick off a test of that new hash). It will be a little more awkward doing this on two different hosts; we should probably build some checks into the system so that scandium can report to the rt server whether it is currently running the hash which the rt server thinks it is.

And just to elaborate on subbu's "resource intensive" comment above -- rt testing is on our critical path every week to deploying parsoid on the train. So it is important that it execute as quickly as possible, which is why we've given it dedicated resources. The slowest part of that process is done by scandium, though, and the client/server code is relatively lightweight. For visual diff testing, the client is more heavyweight (an automated browser).

> @ssastry one minor wrinkle to keep in mind is that to start an rt test run you need to update files on both scandium and the rt test server (checking out a specific git hash on scandium, telling the rt test server to kick off a test of that new hash). It will be a little more awkward doing this on two different hosts; we should probably build some checks into the system so that scandium can report to the rt server whether it is currently running the hash which the rt server thinks it is.

Yup, noted. While this scenario can play out even now on scandium, it is much less likely since we do the steps in the right order: update-code, then update-test-hash (which we could presumably have unified in a single script). Ideally, we could run a command on testreduce1001 that runs the update_parsoid.sh script on scandium, and we could then script the ordering into a single script on testreduce1001. But for now, we can just document this ordering and consider adding sanity checks if required.
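
The ordering and the sanity check discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how testreduce or update_parsoid.sh work: `hashes_match` and `start_rt_run` are hypothetical names, and the three callables stand in for whatever mechanism actually updates the code on scandium, reports the hash it is running, and registers the test hash on the rt server.

```python
from typing import Callable


def hashes_match(expected: str, running: str, min_len: int = 7) -> bool:
    """True if two git hashes name the same commit, allowing one to be abbreviated."""
    a = expected.strip().lower()
    b = running.strip().lower()
    # Refuse very short prefixes, which are too ambiguous to trust.
    if len(a) < min_len or len(b) < min_len:
        return False
    return a.startswith(b) or b.startswith(a)


def start_rt_run(
    target_hash: str,
    update_code: Callable[[str], None],      # hypothetical: check out the hash on the Parsoid host
    get_running_hash: Callable[[], str],     # hypothetical: ask the Parsoid host what it runs
    update_test_hash: Callable[[str], None], # hypothetical: tell the rt server to test the hash
) -> None:
    """Run the steps in the documented order: update code, verify, then update the test hash."""
    update_code(target_hash)
    running = get_running_hash()
    if not hashes_match(target_hash, running):
        raise RuntimeError(
            f"Parsoid host runs {running!r}, expected {target_hash!r}; not starting the run"
        )
    update_test_hash(target_hash)
```

Because the verification sits between the two updates, a mismatch stops the run before the rt server is told about the new hash.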

Change 619888 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid/testreduce: add a service_ensure parameter, stop on new server

https://gerrit.wikimedia.org/r/619888

Change 619888 merged by Dzahn:
[operations/puppet@production] parsoid/testreduce: add a service_ensure parameter, stop on new server

https://gerrit.wikimedia.org/r/619888

Dzahn added a comment. · Aug 17 2020, 5:38 PM

So to summarize: The vd_client/vd_server that are on testreduce1001 should NOT be on it and instead the rt_client/rt_server should be on it? So exactly the other way around.

> So to summarize: The vd_client/vd_server that are on testreduce1001 should NOT be on it and instead the rt_client/rt_server should be on it? So exactly the other way around.

Neither should be on scandium, since they both require nodejs v10, which won't be available on scandium. But you should first add parsoid-rt and parsoid-rt-client to testreduce1001, and only after we get that working there should we remove them from scandium.

As for parsoid-vd-client and parsoid-vd services, yes, you can remove them from testreduce1001. We can separately talk about whether we want to continue running in WMCS or use a testreduce1002 for example. This part is lower priority for now.

Change 620757 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid::testreduce: add rt_client profile to role

https://gerrit.wikimedia.org/r/620757

Change 620757 merged by Dzahn:
[operations/puppet@production] parsoid::testreduce: add rt_client profile to role

https://gerrit.wikimedia.org/r/620757

Change 620763 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid::testreduce: add rt_server profile to role

https://gerrit.wikimedia.org/r/620763

Change 620763 merged by Dzahn:
[operations/puppet@production] parsoid::testreduce: add rt_server profile to role

https://gerrit.wikimedia.org/r/620763

Dzahn added a comment. · Aug 17 2020, 8:28 PM

@ssastry @cscott rt_client and rt_server have been added to testreduce1001.eqiad.wmnet.

[testreduce1001:~] $ sudo systemctl status parsoid-rt
● parsoid-rt.service - parsoid-rt: Testreduce HTTP service for Parsoid roundtrip testing
   Loaded: loaded (/lib/systemd/system/parsoid-rt.service; static; vendor preset: enabled)
   Active: active (running) since Mon 2020-08-17 20:26:38 UTC; 8s ago

It fails after a little while though because it does not have access to the database yet.

Unable to connect to database, error: Error: ER_ACCESS_DENIED_ERROR: Access denied for user 'testreduce'@'10.64.48.40'
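
An error line like the one above already names the user and client address that the missing grant would have to cover. As a rough sketch (the function name `parse_access_denied` is made up here, and the pattern assumes the exact message wording shown in the log line), the pair can be pulled out in Python like this:

```python
import re
from typing import Optional, Tuple

# Matches the user and host quoted in a MySQL/MariaDB access-denied message.
_DENIED = re.compile(
    r"ER_ACCESS_DENIED_ERROR: Access denied for user '([^']*)'@'([^']*)'"
)


def parse_access_denied(message: str) -> Optional[Tuple[str, str]]:
    """Return (user, host) from an access-denied message, or None if it is not one."""
    match = _DENIED.search(message)
    if match is None:
        return None
    return match.group(1), match.group(2)


LOG_LINE = (
    "Unable to connect to database, error: Error: ER_ACCESS_DENIED_ERROR: "
    "Access denied for user 'testreduce'@'10.64.48.40'"
)

print(parse_access_denied(LOG_LINE))  # ('testreduce', '10.64.48.40')
```

The host part is the address the database server saw the connection come from, which is what a privilege adjustment needs to reference.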

Thanks!

> It fails after a little while though because it does not have access to the database yet.

I suppose I need to get a DBA to give testreduce1001 access to the testreduce database. Should I file a separate ticket for it? Or should we track that here?

Dzahn added a comment. · Aug 17 2020, 8:32 PM

I was about to create a subtask for that. I got it.

Dzahn added a comment. · Aug 17 2020, 8:37 PM

I just saw the DB appears to be running on localhost, not on a cluster, fwiw.

> I just saw the DB appears to be running on localhost, not on a cluster, fwiw.

Actually, we don't really need the db running on the production server. It is totally fine to spin up mysql on testreduce1001 too! So, whatever makes sense from your POV.

Change 620787 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: add profile parsoid::testing to role

https://gerrit.wikimedia.org/r/620787

Change 620787 merged by Dzahn:
[operations/puppet@production] testreduce: add profile parsoid::testing to role

https://gerrit.wikimedia.org/r/620787

Change 620792 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: add Hiera keys for API URL and turn off monitoring

https://gerrit.wikimedia.org/r/620792

Change 620792 merged by Dzahn:
[operations/puppet@production] testreduce: add Hiera keys for API URL and turn off monitoring

https://gerrit.wikimedia.org/r/620792

Change 620796 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid::testing: add support for buster by adding mariadb-client

https://gerrit.wikimedia.org/r/620796

Change 620796 merged by Dzahn:
[operations/puppet@production] parsoid::testing: add support for buster by adding mariadb-client

https://gerrit.wikimedia.org/r/620796

Dzahn changed the task status from Open to Stalled. · Aug 17 2020, 11:00 PM

  • mariadb-client has been installed (this adds buster support by replacing the outdated mysql-client package used before)
  • added the missing API URL
  • disabled monitoring, as on scandium
  • created a subtask to adjust mysql privileges to allow testreduce1001 to connect to the testreduce db on m5-master.

Change 615831 abandoned by Dzahn:
[operations/puppet@production] parsoid: remove vd_server and vd_client from parsoid::testing role

Reason:
turns out that vd_server/client and rt_server/client are not separated after all and now both are on the new machine on buster

https://gerrit.wikimedia.org/r/615831

Dzahn added a comment. · Aug 19 2020, 6:37 PM

This is stalled on T260627 but otherwise should be good to go.

Dzahn changed the task status from Stalled to Open. · Aug 19 2020, 6:40 PM
Dzahn added a comment. · Aug 19 2020, 6:42 PM

@ssastry The parsoid-rt service is now running on testreduce1001 and does not stop anymore because it can talk to the database now. Database is still on m5 like before.

[testreduce1001:~] $ sudo systemctl status parsoid-rt
● parsoid-rt.service - parsoid-rt: Testreduce HTTP service for Parsoid roundtrip testing
   Loaded: loaded (/lib/systemd/system/parsoid-rt.service; static; vendor preset: enabled)
   Active: active (running) since Wed 2020-08-19 17:58:07 UTC; 43min ago
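
For anyone who wants to check this state from a script rather than by eye, the `Active:` line of the status output above is easy to parse. The Python sketch below is an illustration only; `active_state` is a hypothetical helper, and it works on captured text rather than calling systemd.

```python
import re
from typing import Optional, Tuple

# Captures the state and the parenthesised sub-state from the "Active:" line.
_ACTIVE = re.compile(r"^\s*Active:\s+(\S+)\s+\(([^)]*)\)", re.MULTILINE)


def active_state(status_output: str) -> Optional[Tuple[str, str]]:
    """Return (state, sub_state) from `systemctl status` text, or None if absent."""
    match = _ACTIVE.search(status_output)
    if match is None:
        return None
    return match.group(1), match.group(2)


STATUS = """\
● parsoid-rt.service - parsoid-rt: Testreduce HTTP service for Parsoid roundtrip testing
   Loaded: loaded (/lib/systemd/system/parsoid-rt.service; static; vendor preset: enabled)
   Active: active (running) since Wed 2020-08-19 17:58:07 UTC; 43min ago
"""

print(active_state(STATUS))  # ('active', 'running')
```

In a shell script, `systemctl is-active parsoid-rt` gives the same answer without any parsing.
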

Dzahn added a comment. · Aug 19 2020, 6:43 PM

@ssastry Please let me know what else you need on the new instance.

Change 621337 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] parsoid/testreduce: add paramater to control parsoid-rt service

https://gerrit.wikimedia.org/r/621337

Change 621337 merged by Dzahn:
[operations/puppet@production] parsoid/testreduce: add parameter to control parsoid-rt service

https://gerrit.wikimedia.org/r/621337

Mentioned in SAL (#wikimedia-operations) [2020-08-19T19:20:40Z] <mutante> testreduce1001 - re-enabled puppet, confirmed parsoid-rt service was now stopped properly by puppet while it runs as before on scandium, the previous parsoid-testing host. switching it over is now a Hiera one-liner. (T257906)

Change 621555 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: add parameters to control client services

https://gerrit.wikimedia.org/r/621555

Change 621555 merged by Dzahn:
[operations/puppet@production] testreduce: add parameters to control client services

https://gerrit.wikimedia.org/r/621555

Mentioned in SAL (#wikimedia-operations) [2020-08-20T18:18:02Z] <mutante> testreduce1001 - rt_client and vd_client now properly stopped by puppet T257906