Moving performance testing tools out of AWS

Some time ago I did the final work to move our web performance tooling out of Amazon Web Services (AWS). It's been a long journey. Let me tell you how it all started.

But first, what is the performance tooling? It's a set of tools that automate web browsers: they navigate to Wikipedia and measure the performance of Wikipedia pages. When we measure performance we want the server and environment to be as stable as possible. The more stable the server where we deploy our tools, the smaller the performance regressions we can pick up.

The AWS Background

In 2015 I deployed our first self-hosted version of the Open Source web performance tool WebPageTest (T109666). At that time, WebPageTest was Windows only and we didn't want to host Windows machines in our own hosting, so we started to use AWS. Hosting outside of our data center also made sense, since we wanted to test performance the way our users experience it. Our users do not live in our data center. We started out with a couple of Amazon Elastic Compute Cloud servers and used Amazon object storage for hosting the HTML reports from the tool.

A couple of years later (2018 in T189184), the tool evolved and moved to Linux. I did a lot of testing to make Firefox work in the new setup. We continued to use AWS because it was easy to just spin up a couple of Linux servers.

We also started to host our own time series database outside of our own hosting, to make it easy to add metrics from servers running elsewhere. Have you guessed where it ended up? On AWS, because it was easy.

We continued to work on our performance testing, and @Gilles and I started to try out front end performance testing using replay proxies (T176361). A replay proxy records the web traffic and replays it locally. That way you measure the front end performance without the disturbance of the internet. That worked out great, and we also tried out different hosting providers to see where we would get the most stable metrics; at that point it was AWS. With the replay proxy tests, we added a couple more servers.

We also tried running tests on bare metal servers. In retrospect, I didn't have the right skills at the time to configure the servers correctly, and AWS gave us better results, so we continued to use AWS.

The AWS Problem

Why did we want to move out of AWS? The numbers were good compared to other cloud providers. However, we still had problems. The Netflix team has talked about how, even if you deploy the same EC2 instance type, performance can vary by 10-20% between instances. You can read more about the variance in Brendan Gregg's blog post. We also tried to measure the variance ourselves, and we could see that we got different performance on the same EC2 type. That meant we needed to make sure that a test always ran on a specific EC2 instance. If it ran on another one, we would see a change in our performance metrics.

We have also seen that performance changes over time and that we get different performance at different times (T289516). That makes it harder to find small performance regressions.

The New Setup

A couple of years ago I re-evaluated running tests on bare metal servers, and it worked out great this time. We can get stable metrics on bare metal servers, and by tuning and configuring the server we can get the same performance all the time.

I started to move all of our testing out of AWS to bare metal (T311981 and T345595). The price for the servers was also cheaper, so the move was a win/win. The new provider re-uses old servers from other customers: instead of buying new hardware, we can reuse machines that were originally bought for other purposes. That's better from an ecological perspective.

A couple of weeks ago I took the last step and moved our time series database to another host (T362020), and instead of Amazon object storage we now store the performance results on a disk on a server.

The new setup runs on bare metal servers. That is cheaper and gives us more reliable performance metrics. For services that do not need that stability we use Cloud VPS.

The Result

By moving the test servers to a cheaper provider, we increased the number of test servers while lowering the total price. With more test servers, we can do more tests and more runs.

I used the extra test runs to implement the Mann-Whitney U test (T351929). By running more tests and using Mann-Whitney, we know whether a performance regression is statistically significant. That way we can make sure that we only alert on real regressions, which decreases the number of false alerts and the time spent investigating them. I'll write more about how we use Mann-Whitney U in my next blog post.
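To illustrate the idea, here is a minimal self-contained sketch (not our actual implementation, and with made-up timing numbers) of using a one-sided Mann-Whitney U test to decide whether a batch of new runs is significantly slower than a baseline before alerting. It uses the normal approximation to the U distribution, which is reasonable for the sample sizes you get from repeated test runs.

```python
import math

def mann_whitney_p(baseline, current):
    """Approximate one-sided p-value for 'current is slower than baseline'.

    Uses the normal approximation to the Mann-Whitney U distribution.
    """
    n1, n2 = len(baseline), len(current)
    # U counts pairs where the current run is slower (ties count as 1/2).
    u = sum((b < c) + 0.5 * (b == c) for b in baseline for c in current)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    # Upper-tail probability of the standard normal distribution.
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

def is_regression(baseline, current, alpha=0.01):
    """Alert only when the slowdown is statistically significant."""
    return mann_whitney_p(baseline, current) < alpha

# Hypothetical firstVisualChange timings (ms) from repeated runs.
baseline = [1200, 1180, 1210, 1195, 1205, 1190, 1185, 1200]
slower   = [1300, 1280, 1310, 1295, 1305, 1290, 1285, 1300]
noisy    = [1205, 1185, 1215, 1190, 1210, 1195, 1180, 1206]

print(is_regression(baseline, slower))  # True: a real shift in the metric
print(is_regression(baseline, noisy))   # False: within run-to-run noise
```

The point is the last two lines: a consistent shift across runs triggers an alert, while ordinary run-to-run noise does not, even though individual runs differ.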

Written by Peter on Aug 12 2024, 7:44 AM.
Software Engineer, Wikimedia Foundation

Event Timeline

Uh! Thanks!

Are you aware of anything else that is on AWS (excluding Wikimedia Enterprise)?

(I'm asking to contribute in this page: https://meta.wikimedia.org/wiki/FLOSS-Exchange/Matrix)

@valerio.bozzolan no, I don't know. It seems like there was one account that was specific to the performance team; I couldn't see other usage.

By running more tests and using Mann Whitney we know if a performance regression is of statistical significance. That way we can make sure that we only alert on real regressions. That decreases the number of false alerts and time spent investigating regressions.

Excellent work!