
Evaluate deploying synthetic testing on bare metal
Closed, ResolvedPublic

Assigned To
Authored By
Peter
Aug 29 2018, 6:40 AM
Referenced Files
F35013676: dns-change.jpg
Mar 21 2022, 9:21 AM
F35013672: Screenshot 2022-03-21 at 09.33.06.png
Mar 21 2022, 9:21 AM
F35013702: Screenshot 2022-03-21 at 10.01.26.png
Mar 21 2022, 9:21 AM
F35013668: Screenshot 2022-03-21 at 09.31.53.png
Mar 21 2022, 9:21 AM
F35013637: Screenshot 2022-03-21 at 09.08.08.png
Mar 21 2022, 9:21 AM
F35013657: Screenshot 2022-03-21 at 09.20.05.png
Mar 21 2022, 9:21 AM
F35006709: Screenshot 2022-03-15 at 09.14.58.png
Mar 15 2022, 8:48 AM
F35006711: Screenshot 2022-03-15 at 09.14.13.png
Mar 15 2022, 8:48 AM

Description

We have a problem today with unstable metrics for our synthetic testing, and the instability depends on the server/cloud setup that we are using.

Background

When @Gilles and I were working on implementing a replay proxy in T176361, we tried out many different server solutions: running on AWS, GC, DO, Cloud VPS and bare metal. The solution that gave us the most stable metrics was AWS, so we deployed our Browsertime/WebPageReplay setup there.

However, we have seen that AWS instances differ in stability and can change over time (let me find that task later). We usually have an instability of about 33 ms in first visual change (when something is first painted on the screen), but sometimes it increases and we need to test out new AWS servers and redeploy.

What to do

I want us to try out running on bare metal again and get help with configuring it (trying out different kernels, setting everything up for the best performance over time). It really annoys me that we didn't get good stability the last time we tested on bare metal.

To run the tests, the server needs Docker and needs to be able to send the metrics to Graphite. We need to run the tests for each setup at least a couple of days to know that the metrics are stable.

When we run the tests we start the browser (Firefox/Chrome) and record a video of the screen using FFmpeg. We then analyze the video to find when elements appear on the screen.
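The tests are driven by sitespeed.io/Browsertime. As a rough sketch (not our exact configuration; the URL, image tag, Graphite host and namespace below are placeholders), a single run in Docker that records a video, calculates visual metrics and ships the result to Graphite could look like this:

```
# Sketch only: one run in Docker, with video recording, visual metrics
# and Graphite reporting. Host, namespace and image tag are placeholders;
# in practice we pin a specific sitespeed.io version.
docker run --rm -v "$(pwd):/sitespeed.io" sitespeedio/sitespeed.io:latest \
  "https://en.wikipedia.org/wiki/Barack_Obama" \
  -b chrome -n 11 \
  --video --visualMetrics \
  --graphite.host graphite.example.org \
  --graphite.namespace sitespeed_io.baremetal
```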

Why is this important?

The more stable our metrics are, the smaller the regressions we can find.

Event Timeline

Peter renamed this task from Evaluate to Evaluate deploying synthetic testing on bare metal.Aug 29 2018, 7:11 AM

Let me work on this. I'm gonna rent a machine for a month and run tests there. We should run:

  • Desktop tests inside Docker against Wikipedia using Chrome
  • Desktop tests inside Docker against Wikipedia using Chrome and WebPageReplay
  • Emulated mobile tests inside Docker against Wikipedia using Chrome and WebPageReplay
  • Emulated mobile tests inside Docker against Wikipedia using Chrome
  • Desktop tests directly on the OS
  • Emulated mobile tests directly on the OS
  • Emulated mobile tests directly on the OS with throttled CPU

That way we can compare with our current AWS tests + a Mac mini M1.
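To make the difference between the variants concrete: the Docker variants go through the sitespeed.io container, while the "directly on the OS" variants call Browsertime installed on the host, and the throttled variant adds Chrome's CPU throttling. A rough sketch, with flag names from memory rather than our exact scripts:

```
# Inside Docker (sketch; image tag and URL are examples):
docker run --rm -v "$(pwd):/sitespeed.io" sitespeedio/sitespeed.io:latest \
  "https://en.wikipedia.org/wiki/Sweden" -b chrome -n 11

# Directly on the OS, emulated mobile with Chrome CPU throttling x8
# (Browsertime installed via npm; flag names from memory, verify against the docs):
browsertime "https://en.wikipedia.org/wiki/Sweden" -b chrome -n 11 \
  --chrome.mobileEmulation.deviceName "Moto G4" \
  --chrome.CPUThrottlingRate 8
```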

Today I was able to push the tests to a new, clean server. At the moment we test the following URLs:

Desktop

Mobile

The server runs Ubuntu 20. I've disabled auto updates on it and installed Docker, Node.js, Chrome/Firefox and the dependencies needed for analysing the video. The data is sent to our Graphite instance under new keys (baremetal and baremetadocker).
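For reference, the host setup is roughly the following (the package list is approximate, not the exact recipe I used):

```
# Approximate host setup on Ubuntu 20.04 (sketch, not the exact commands used)
sudo apt-get update
sudo apt-get install -y docker.io nodejs npm ffmpeg imagemagick python3
# Chrome and Firefox are installed separately from their own packages.
```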

I use the exact same configuration as on AWS except that we only run Chrome tests at the moment.

I'm gonna keep it running this weekend and then have a look on Monday to see if something needs to be tuned.

I need to add one more test where we use WebPageReplay without Docker.

Change 764311 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Update to 23.0.1.

https://gerrit.wikimedia.org/r/764311

Change 764311 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Update to 23.0.1.

https://gerrit.wikimedia.org/r/764311

I added a dashboard:
https://grafana.wikimedia.org/d/RuYD0Af7z/bare-metal-vs-cloud and will continue to work on that.

@dpifke: On the new server (running Ubuntu 20) I've disabled auto updates in /etc/apt/apt.conf.d/20auto-upgrades. Do you think there's something else that should be done? I could see that there's more variation in DNS lookup than in our other tests, so I tried configuring the Google DNS servers (the server was using the hosting company's own) to see if I could spot a difference, and then Cloudflare's. I plan to tune/fix things this week, then clean up the metrics and run with the same setup for two weeks.
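For reference, disabling the automatic upgrades and switching resolvers looks roughly like this on Ubuntu 20.04 (the resolver addresses are just the public Google/Cloudflare ones I tried):

```
# /etc/apt/apt.conf.d/20auto-upgrades – turn off unattended upgrades
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";

# /etc/systemd/resolved.conf – use Google DNS with Cloudflare as fallback,
# instead of the hosting company's resolvers, then restart systemd-resolved
[Resolve]
DNS=8.8.8.8 8.8.4.4
FallbackDNS=1.1.1.1
```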

Curious why we're using Ubuntu here instead of Debian?

I can do some tuning, remove unnecessary packages, etc. but for a prototype this is probably fine as-is.

Ubuntu because I want to use Chrome instead of Chromium. I run the tests both inside a Docker container and directly on the machine, to tick off that box too and see if there is any difference. When Gilles and I tried years ago, we got more stable performance inside the container, but I'm not sure if that is still true.

The latency was set lower for the WebPageReplay tests on bare metal, so I just increased it so that the metrics will match. I also saw that we did fewer runs on bare metal, so I increased those too.

Today I increased the latency for the WebPageReplay tests; somehow I had used old settings. With the new ones we test the exact same thing.
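For the record, the WebPageReplay runs use the replay support in the sitespeed.io container, where replay mode and the added latency are set with environment variables. Roughly like this (the latency value here is just an example, not necessarily the one we settled on):

```
# Sketch: WebPageReplay inside the sitespeed.io container. REPLAY enables the
# local replay proxy and LATENCY adds a fixed delay towards it (example value).
docker run --rm -e REPLAY=true -e LATENCY=100 \
  -v "$(pwd):/sitespeed.io" sitespeedio/sitespeed.io:23.0.1 \
  "https://en.wikipedia.org/wiki/Barack_Obama" -b chrome -n 11
```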

I also added tests for Firefox on the new machine so we can keep track of that too.

Adding a couple of screenshots to show the unstable TTFB and the unstable DNS; I'm gonna use them in the upcoming blog post.

dns-2-waterfall.png (570×2 px, 529 KB)

dns-1-waterfall.png (574×2 px, 554 KB)

dns-change-2.png (736×1 px, 284 KB)

dns-change.png (768×1 px, 361 KB)

Been going through the metrics this morning and it looks like everything runs smoothly. I'm gonna push one change today so we also pick up the time spent fetching the main document (DNS/SSL etc.), so I can graph the data in more detail. Then my plan is to let it run and analyse the data after a couple of weeks.

Change 765509 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Update container with support for main document timings.

https://gerrit.wikimedia.org/r/765509

Change 765509 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Update container with support for main document timings.

https://gerrit.wikimedia.org/r/765509

I think we have enough data, so I will start adding screenshots and some analysis in the coming days that I can later convert into a blog post.

First up is the CPU benchmark. With the benchmark we can see how fast the CPU on each machine actually is. Here we compare running desktop (with no CPU throttling) and emulated mobile, where we throttle with Chrome CPU throttling x8.

Screenshot 2022-03-04 at 11.17.16.png (1×2 px, 949 KB)

Screenshot 2022-03-04 at 11.19.38.png (1×2 px, 909 KB)

Screenshot 2022-03-04 at 11.19.28.png (1×2 px, 953 KB)

Here we can see that the cloud machine is faster than the bare metal one.

And then let's look at the standard deviation for the CPU benchmark. Here we look at all the runs per page. We want as low a standard deviation as possible.

Screenshot 2022-03-04 at 11.24.21.png (1×2 px, 1 MB)

Screenshot 2022-03-04 at 11.23.45.png (1×2 px, 1 MB)

Screenshot 2022-03-04 at 11.23.58.png (1×2 px, 1 MB)

Screenshot 2022-03-04 at 11.24.11.png (1×2 px, 1 MB)

And then looking only at the desktop standard deviation (where the CPU is not throttled):

Screenshot 2022-03-04 at 11.35.28.png (1×2 px, 1020 KB)

Screenshot 2022-03-04 at 11.34.59.png (1×2 px, 1 MB)

Screenshot 2022-03-04 at 11.35.19.png (1×2 px, 1 MB)

Screenshot 2022-03-04 at 11.35.10.png (1×2 px, 1 MB)

And then also the emulated mobile standard deviation:

Screenshot 2022-03-04 at 11.38.41.png (1×2 px, 1009 KB)

Screenshot 2022-03-04 at 11.38.31.png (1×2 px, 1 MB)

Screenshot 2022-03-04 at 11.38.19.png (1×2 px, 1 MB)

Screenshot 2022-03-04 at 11.38.10.png (1×2 px, 1 MB)

Comparing using WebPageReplay is the easiest, since the tests run in an isolated environment. Looking at first visual change:

Screenshot 2022-03-04 at 12.31.32.png (996×2 px, 982 KB)

Screenshot 2022-03-04 at 12.31.04.png (1×2 px, 1 MB)

These are the exact same tests. We can see that some URLs are much more unstable on the cloud host than on the bare metal server.

Looking into the standard deviation using WebPageReplay on the bare metal server, I found something interesting: on the 28th of February the standard deviation increased:

Screenshot 2022-03-04 at 13.03.03.png (1×2 px, 1 MB)

First I looked into the server, thinking maybe something was going on there, but it looked OK.

Checking the results, I could see three out of eleven runs had faster first visual change than the rest:

Screenshot 2022-03-04 at 13.01.01.png (136×2 px, 34 KB)

Then I could see that the same runs had a different number of requests:

Screenshot 2022-03-04 at 13.01.24.png (128×1 px, 77 KB)

And comparing the HAR I could see that these four requests are the extra ones:

Screenshot 2022-03-04 at 12.58.59.png (672×2 px, 513 KB)

Looking at the cloud server we do not see the same behaviour; there we have the same number of requests across the board:

Screenshot 2022-03-04 at 13.10.28.png (122×2 px, 86 KB)

Here are two good examples running against WebPageReplay on the dedicated server vs the cloud.

webpagereplay1.png (988×3 px, 363 KB)

webpagereplay2.png (1×3 px, 329 KB)

I've got one more week with the server. I disabled the Firefox tests and increased the number of runs of the tests that run without WebPageReplay, so we get one week of tests where we can compare running with and without WebPageReplay (with that change they do the same number of runs).

For tests running without WebPageReplay the stability is best measured with the delta of first visual change and TTFB; that way we take away some of the instability of the network.
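In Graphite that delta can be graphed by subtracting the two series, for example with diffSeries (the metric paths below are placeholders, not our real keys):

```
# Placeholder metric paths – only to show the idea of graphing
# first visual change minus TTFB per tested page.
diffSeries(
  sitespeed_io.baremetal.pageSummary.$page.firstVisualChange.median,
  sitespeed_io.baremetal.pageSummary.$page.ttfb.median
)
```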

Screenshot 2022-03-15 at 09.14.58.png (1×2 px, 1020 KB)

Screenshot 2022-03-15 at 09.13.51.png (1×2 px, 1000 KB)

Screenshot 2022-03-15 at 09.14.13.png (1×2 px, 1 MB)

Today I closed down the server and will continue with the blog post.

Summary

I've been testing performance by running tests on our AWS cloud instance and comparing it to a bare metal server that I rented from Glesys. The specification of the server (running Ubuntu so that we could use Chrome):

Motherboard: X9SCD-F
CPU: E3-1230 (8 CPUs)
RAM: 1×8 GB DDR3
Storage: 1×480 GB SSD

I chose that server since it was one of the cheaper ones. One thing that differs from our cloud server is that this machine has double the number of CPUs, but I had a hard time finding a better-matching server. However, we can compare the servers with the CPU benchmark metric, where we run JavaScript in the browser and measure how long it takes. In the graph you can see that the cloud server runs that JavaScript faster than the bare metal server. Running on desktop the metrics are closer to each other; on emulated mobile (where we slow down the CPU) you can see that the cloud instance is faster.

Screenshot 2022-03-21 at 09.08.08.png (870×2 px, 666 KB)

That is good to know: we can be sure that we are not running a much faster machine on bare metal; instead it is actually slower.

We run the exact same tests on both servers, with the same number of iterations, and then I compare the metrics and the standard deviation. I've focused on running the tests on Chrome since its metrics are already more stable. The idea was to run the tests for almost a month so we could pick up variations that could happen on the cloud and the bare metal server.

CPU benchmark stability

The first thing to look at is how stable the CPU benchmark metric is. If the server can focus on the tests, the standard deviation should be low; lower is better. Here's the standard deviation of the CPU metric when we test the Barack Obama page. The benchmark runs after the test is done, so it shouldn't matter which page we test, and I've checked that we have the same pattern on all pages we test:

Screenshot 2022-03-21 at 09.20.05.png (1×2 px, 1 MB)

Here you can see that we have a lower standard deviation on the bare metal server across the board. One interesting thing is that on the cloud machine, the standard deviation when running the tests against WebPageReplay is lower than for the rest. I think it should be the same as the deviation for running in Docker without WebPageReplay, so I wonder why. If we continue to run tests on the cloud, we should look into that.

TTFB stability

Another important thing is having a stable time to first byte (TTFB). For WebPageReplay tests that shouldn't be a problem, since we run all the tests locally and use a local proxy. For other tests we use tc (traffic control) to try to get the same connectivity for all the tests. However, depending on the actual connection and the DNS setup, TTFB can still differ between tests.
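Under the hood that traffic shaping is plain tc/netem on the network interface; conceptually something like this (the interface name and delay are example values, and in practice the test runner sets this up):

```
# Add a fixed delay on the outgoing interface so every run sees the same
# latency (eth0 and 100ms are example values), and remove it afterwards.
sudo tc qdisc add dev eth0 root netem delay 100ms
sudo tc qdisc del dev eth0 root
```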

Looking at the high level, we can see that we have more outliers on the bare metal server:

Screenshot 2022-03-21 at 09.31.53.png (1×2 px, 1 MB)

Zooming in and removing outliers, it looks like this:

Screenshot 2022-03-21 at 09.33.06.png (1×2 px, 1013 KB)

Here we can see that the standard deviation is often lower for bare metal, BUT we sometimes get this higher deviation. I also want to add that when I started the tests, the standard deviation on the bare metal server was worse. It was using the hosting company's DNS provider, and when I changed that it started to look better. Here's a graph showing the standard deviation of TTFB before and after the change:

dns-change.jpg (698×2 px, 176 KB)

Our cloud provider gets a smaller deviation, and I think that if we move to a bare metal server we should focus work on getting a more stable metric (and focus on DNS).

Both the CPU benchmark and TTFB are metrics we want to be stable, so that they can give us a stable First Visual Change and Largest Contentful Paint. If the first metrics are unstable, these other metrics will also be unstable.

First visual change using WebPageReplay

One of the metrics we use for alerts is first visual change. Let's look at how that differs between our tests when we use WebPageReplay. Here we run the tests the exact same way; on the left we have the metrics from the bare metal server and on the right the metrics from the cloud instance.

webpagereplay1.png (988×3 px, 363 KB)

webpagereplay2.png (1×3 px, 329 KB)

You can look at the dashboards and the "range" in the table and see that we get more stable metrics on the bare metal server.

First visual change against Wikipedia

We also run tests directly against Wikipedia. These metrics can differ because we will pick up variations when something is going on with our Wikipedia servers or the internet in general. But it can of course also be instability on the test server. Look at these first visual changes:

Screenshot 2022-03-21 at 10.01.26.png (1×2 px, 1 MB)

Here you can see that something happened for the main page in the bare metal server tests, and for the cloud tests you can see a couple of spikes.

Another way to compare is using the delta of first visual change and time to first byte. We take the first visual change and subtract the time to first byte; that way we remove the variability in time to first byte.

Let's look at both desktop and emulated mobile tests. I've zoomed in to one week because we had some disturbance on the cloud instance:

Screenshot 2022-03-15 at 09.14.58.png (1×2 px, 1020 KB)

Screenshot 2022-03-15 at 09.13.51.png (1×2 px, 1000 KB)

Here you can see that almost all URLs get better metrics on bare metal.

Zooming out and looking at the desktop tests (we didn't get the emulated mobile tests running correctly in the beginning, so I can't zoom out on them), you can see that the delta is much more stable on the bare metal server.

Screenshot 2022-03-15 at 09.14.13.png (1×2 px, 1 MB)

My conclusion

We had tests running for almost a month that gave us more stable metrics on bare metal vs the cloud instance, both when using WebPageReplay and when testing directly against Wikipedia.

Our WebPageReplay tests would benefit a lot from running on a bare metal server. The good thing is that since we run those tests locally, we could host that server on our own hosting, and we could lower the thresholds on the alerts so that we can alert on a 33 ms change.

It also looks like it would be beneficial to run the direct Wikipedia tests on a bare metal server. It would need some work on our side (can we get more stable DNS times, should we move to delta metrics in our alerts?). That server, or those servers, should be hosted somewhere else.

Evaluation is done.