Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests
Closed, Declined
Public

Description

Right now, Parsoid's roundtrip tests run on ruthenium.eqiad.wmnet. As part of the test runs, we also collect performance metrics (parse latencies, serialization latencies, etc.). However, we cannot reliably use these metrics right now because the load on ruthenium varies quite a lot depending on what else is running on it.

Ruthenium currently runs the following:

  • roundtrip testing clients (which is roughly equivalent to what happens in production)
  • the roundtrip testreduce server (which doles out test jobs to the rt-testing clients)
  • mysql server (hosts the testreduce database)
  • visual diff testing testreduce server
  • visual diff testing clients which then spawn phantomjs processes (which once in a while get stuck)
  • on occasion, @GWicke runs some dumper scripts

On top of this, the testreduce server code that provides the web interface for our roundtrip and visual diff test results uses mysql queries that haven't been tuned for a long time, and once in a while they put a heavy load on the mysql server.

On occasion, when all the above are running at the same time, swap space is zero or close to zero.
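A minimal sketch of how the swap condition described above could be checked on a Linux host, assuming the standard `/proc/meminfo` text format. The function name `swap_free_fraction` and the sample values are hypothetical and are not part of the testreduce code.

```python
def swap_free_fraction(meminfo_text):
    """Return SwapFree / SwapTotal parsed from /proc/meminfo-style text.

    Returns None when no swap is configured (SwapTotal is 0 or missing).
    """
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])  # sizes are reported in kB
    total = values.get("SwapTotal", 0)
    if total == 0:
        return None
    return values.get("SwapFree", 0) / total


# Hypothetical sample, not measured on ruthenium.
SAMPLE = """MemTotal:       16384000 kB
SwapTotal:       1000000 kB
SwapFree:          50000 kB
"""

fraction = swap_free_fraction(SAMPLE)
```

On a real host the input would come from reading `/proc/meminfo`; a test run could be skipped or flagged when the fraction falls below a chosen threshold, so that latencies recorded under memory pressure are not treated as benchmarks.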

To make a long story short, we are requesting a dedicated spare server on which we would run nothing but the roundtrip testing clients. That would let us isolate parsing and serialization requests without having them compete with everything else. All the other things above can continue to run on ruthenium. We could then use the performance metrics gathered during roundtrip tests as a semi-reliable indicator of how a code change might affect performance in production once it is deployed there.

Event Timeline

ssastry raised the priority of this task from to Needs Triage.
ssastry updated the task description.
ssastry set Security to None.
Dzahn triaged this task as Medium priority. Oct 22 2015, 10:24 PM
Dzahn added a project: hardware-requests.
Dzahn subscribed.

I added the hardware-requests tag which should start the process

I have a slight concern about moving things off ruthenium.eqiad.wmnet and onto other systems, since when I log in to ruthenium.eqiad.wmnet it immediately informs me:
ruthenium is a Unpuppetised system for testing (role::testsystem)
So are these jobs that you wish to move to a new host fully puppetized and easily moved, or do they require special permissions on the new host?

Apart from the approval to move this to another system (which will involve more than just my review), I'm hopping in to provide the hardware specifications and details:

ruthenium is an R610, purchased on 2011-08-03. It has dual Intel Xeon X5647 (4 cores each @ 2.93GHz), 16GB DDR3 Synchronous 1333 MHz RAM (4GB * 4, so only 4 of the 12 banks populated), and dual 1TB disks.

Since it seems to be simply memory bound, I'm creating an on-site sub-task to check if we have any spare 4GB DIMMs that match the spec above. If so, we may be able to upgrade ruthenium's memory, possibly eliminating the swap space issue (more memory, less swap, etc...)
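The headroom implied by the figures above can be worked out as a small sketch. It assumes every empty bank could take a matching 4GB DIMM, which the actual R610 population rules may not allow; the function name `memory_upgrade_options` is hypothetical.

```python
def memory_upgrade_options(total_banks, populated_banks, dimm_gb):
    """Return (current GB, empty banks, maximum GB with same-size DIMMs)."""
    current_gb = populated_banks * dimm_gb
    empty_banks = total_banks - populated_banks
    # Assumption: all banks can be filled with DIMMs of the same size.
    max_gb = total_banks * dimm_gb
    return current_gb, empty_banks, max_gb


# Figures quoted for ruthenium: 12 banks, 4 populated, 4GB per DIMM.
current_gb, empty_banks, max_gb = memory_upgrade_options(12, 4, 4)
```

With the quoted figures this gives 16GB installed, 8 empty banks, and an upper bound of 48GB if enough matching spares exist.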

@ssastry: Would you agree that raising the memory on the existing system may negate your need to migrate away from it? Please advise.

They are not puppetized. It would require nodejs and some upstart configs to be migrated over. I suppose that if I don't already have permissions to do this myself, I would need special permissions. @GWicke had set that up on ruthenium.

More memory will mitigate the problem somewhat but not entirely eliminate it, since this machine has only 8 cores in total. So, to allow for other cpu-bound processes like the dumper scripts that @GWicke might run, mysql, the testreduce servers, and the visual diff tests, we would have to reduce the # of rt-testing clients and also be careful about scheduling visual diff and rt-tests at the same time. Reducing the # of rt-testing clients increases the wait time before we can verify that a code change is deployable.

At some point in the next several months, we also anticipate doing more automated visual diff tests (to get more authoritative numbers about rendering parity between Parsoid and PHP parser; or test other PHP parser changes like Tidy replacements). These will be more resource intensive since they run phantomjs + image comparison over large images. So, I anticipate that it would be better to have these 2 kinds of tests either run on separate servers or be scheduled at different times to eliminate resource contention between them.

So, to summarize, the memory bump would be good in and of itself and can mitigate the issue in the short term. But ideally, a separate server for rt-testing and perf benchmarking would be good.

@ssastry: So if it isn't puppetized, you'd need to have the same full sudo root level permissions on the box as well? Is there a reason the services cannot be fully puppetized (and code-reviewed), since this would be running on a dedicated production host?

Typically, we prefer not to launch production level services without having puppet manage the service. So part of the basic requirements for getting a bare metal server allocated is asking about this. If it isn't done, and cannot be done, I'll discuss the request with @mark. However, requests are a lot easier to get approved when we can see everything the server will be doing in puppet manifests.

@RobH I didn't set up ruthenium and I am not familiar with what is involved with puppetization but will investigate.

RobH claimed this task.

So this request has been pending more information for approval.

As this service is already running on bare metal (sharing space on ruthenium), it should have all the available resources (as it's not even labs restricted) to get its service fully documented/puppetized.

This task has sat awaiting that info since November. I'm going to decline this request at this time (as it has not had any further updates.) If/when this service is ready for migration to a new server, simply reopen this task.

Thanks!

Puppetization is more or less done and ruthenium has also been re-imaged to jessie (T122328). So, I am reopening this ticket to get access to a physical machine so we can offload the rt testing clients from ruthenium to the new server. If there are no servers available, I'll check with labs folks to use their physical machine in labs setup.

Isn't this now going to be handled by bare metal in labs? Please advise.

I think it needs @Andrew to give access to the labs-test machine to subbu.

I am meeting with @Andrew tomorrow to discuss this and related matters .. how convenient that we both live in Minneapolis. :-) I'll update this ticket after that.

We now have bare metal labs hardware that we'll configure for visual diff testing. Once that is set up, I am going to move all the visual diff services from ruthenium over to that server. This should free up ruthenium, which we can then use only for round trip testing and nothing else. That should eliminate the need for other dedicated hardware.

@RobH: Do we still have the option of the RAM upgrade that you said was possible in T116090#1769532? That might enable us to run more test clients and have the test runs finish faster. But, this is not a critical or urgent request. Once I hear from you, I can either repurpose this ticket or open a separate ticket for it.

We shouldn't have used it for anything else I'm aware of. I'd suggest a new task to request it though, since this is now a very long, confusing task of shifting requirements. You can of course reference this task in that via quote, but just re-purposing this seems bad.

Since a new system isn't required, I'm resolving this task.