
Run performance tests using local proxy
Closed, Resolved · Public

Description

As a follow-up to T168526, where we ran performance tests against Mahimahi as a proxy to get stable values, we want to try doing the same on Labs. If we can do that, we can easily scale our testing to reach our end goal: testing on commits and catching performance regressions earlier.

The modified version of Mahimahi isn't open source yet, but since we have tested it out, there's another alternative: WebPageReplay. I tried it out during the offsite and it is much easier to get working. The proxy is mostly an implementation detail, so we can safely start with WebPageReplay and add support for Mahimahi later if we want to.

In this task we should:

  • Make it easy to run tests collecting SpeedIndex/visual completeness on Labs by installing WebPageReplay/FFmpeg/Browsertime, and document the setup (see the sketch after this list).
  • Run tests continuously throughout the day, see how much jitter we get, fine-tune, and see how many runs we need per URL to get stable values.
  • Create the follow-up tasks to finish our main goal: tests on commits.
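
A rough sketch of what that installation could look like on a Debian-based Labs instance (package names, repository layout, and build steps are assumptions and are not documented in this task):

# FFmpeg is used to record the browser screen during a run
sudo apt-get update && sudo apt-get install -y ffmpeg
# Browsertime is distributed via npm
sudo npm install -g browsertime
# The Go port of WebPageReplay lives in the Chromium catapult repository
git clone https://github.com/catapult-project/catapult.git
cd catapult/web_page_replay_go && go build -o wpr src/wpr.go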

Event Timeline


I've spent some time over the last few days creating Docker images for WebPageReplay and Mahimahi. The Mahimahi one doesn't work for me yet; I'll look more today/tomorrow, or else maybe you can try it out when you get back, @Gilles. I'll clean up the Docker container for WebPageReplay today. Yesterday I pushed it to AWS, doing 11 runs for three desktop URLs and three mobile URLs.

The metrics look great IMHO. SpeedIndex changed yesterday when we pushed Jimmy's banner, and that affected all metrics everywhere we measure, but that is OK for now:

https://grafana.wikimedia.org/dashboard/db/webpagereplay

Looking at First Visual Change for Obama, the span is 1691-2091 (23%) on our regular testing; with WebPageReplay it is 2100-2133 (1.5%). But let this run for a week and we can compare more.

I'm running WebPageReplay on AWS (c4.large) now, testing three desktop and three mobile URLs with 100 ms latency, and pushing the metrics to https://grafana.wikimedia.org/dashboard/db/webpagereplay

I've installed Docker and bttostatsv to send the metrics to Graphite:

It runs every hour via crontab, and one run looks like this:

#!/bin/bash
# Replay-test the Barack Obama article and push the metrics to statsv
LATENCY=100
RUNS=11
URL=https://en.wikipedia.org/wiki/Barack_Obama
# Run Browsertime + WebPageReplay in Docker (NET_ADMIN is needed for the traffic shaping)
docker run --cap-add=NET_ADMIN --shm-size=1g --rm -v "$(pwd)":/browsertime -e RUNS=$RUNS -e LATENCY=$LATENCY soulgalore/browsertime-webpagereplay --resultDir result $URL
# Convert the Browsertime JSON into statsv metrics and send them
bttostatsv result/browsertime.json browsertime.enwiki.desktop.anonymous.replay.100.BarackObama https://www.wikimedia.org/beacon/statsv
sleep 3
# Clean up the result directory before the next run
sudo rm -fR result
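
The hourly scheduling is then just a crontab entry along these lines (the script path and log file are hypothetical):

# Run the replay test script at the top of every hour and keep a log
0 * * * * /home/ubuntu/replay-obama-desktop.sh >> /home/ubuntu/replay.log 2>&1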

The Docker container for the modified mahimahi is here
https://github.com/soulgalore/browsertime-replays/tree/master/mahimahi

Right now neither delay nor replay is working. When replaying I get:

Domain: en.wikipedia.org
ALT NAMElogin.wikimedia.org
Generating a 2048 bit RSA private key
.................+++
...............................+++
writing new private key to '/tmp/private.AX7X33'
-----
Using configuration from /tmp/cacfg.wFM35m
unable to load CA private key
139805697707736:error:0906D06C:PEM routines:PEM_read_bio:no start
line:pem_lib.c:701:Expecting: ANY PRIVATE KEY
Died on std::runtime_error: `/usr/bin/openssl ca -batch -config
/tmp/cacfg.wFM35m -policy signing_policy -extensions signing_req -out
/tmp/certificate.JLSVxU -infiles /tmp/csr.vk26Bw': process exited with
failure status 1

OK, now I've had it running for a while, so I feel comfortable that it works really well. I've been running 11 runs with WebPageReplay and 11 runs without WebPageReplay (but there we collect the trace log from Chrome, so that could change the metrics), and comparing them to our continuous runs on WebPageTest.

First Visual Change Facebook

WebPageTest: 1560 -> 1790 (14.7%)
11 runs Browsertime: 1366 -> 1433 (4.9%)
11 runs WebPageReplay: 2100 -> 2133 (1.6%)

First Visual Change Obama

WebPageTest: 1888 -> 2091 (10.7%)
11 runs Browsertime: 1633 -> 1733 (6.1%)
11 runs WebPageReplay: 2100 -> 2133 (1.6%)

First Visual Change Sweden

WebPageTest: 1693 -> 1827 (7.9%)
11 runs Browsertime: 1500 -> 1600 (6.7%)
11 runs WebPageReplay: 2167 -> 2200 (1.5%)

Speed Index Facebook

WebPageTest: 1684 -> 1873 (11%)
11 runs Browsertime: 1404 -> 1475 (5.0%)
11 runs WebPageReplay: 2115 -> 2149 (1.6%)

Speed Index Obama

WebPageTest: 1992 -> 2254 (13%)
11 runs Browsertime: 1737 -> 1827 (5.1%)
11 runs WebPageReplay: 2124 -> 2165 (1.9%)

Speed Index Sweden

WebPageTest: 1798 -> 1997 (11%)
11 runs Browsertime: 1570 -> 1671 (6.4%)
11 runs WebPageReplay: 2186 -> 2223 (1.7%)

And the same holds for emulated mobile. Replaying gives us much more stable values.

I had a go at using WebPageReplay with Firefox, but no luck. I got help from Mozillians on how they set the proxy for mitmproxy:

--firefox.preference network.proxy.type:1
--firefox.preference network.proxy.http:127.0.0.1
--firefox.preference network.proxy.http_port:8080
--firefox.preference network.proxy.ssl:127.0.0.1
--firefox.preference network.proxy.ssl_port:8081

and also tested iptables to redirect the traffic:

iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j DNAT --to-destination 127.0.0.1:8080
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 443 -j DNAT --to-destination 127.0.0.1:8081

But it doesn't work for me (I do 11 runs in the Docker container, close the network, and it fails). But I'm missing the part where Chrome doesn't redirect traffic for localhost; could that be the problem? Do you have any idea, @Gilles? It would be cool to run the same URLs for Firefox too.

Assuming that the figures you have were measured at 60 fps, you seem to have reached the maximum granularity possible (2 frames, i.e. 33.33 ms), which is incredible. In terms of stability it can't get better than that, short of increasing the recording's fps. As a result, I don't think it's worth comparing alternatives again just in the hope of better stability, unless we have other reasons to use them. Firefox support would definitely be a valid one.

What's probably missing for Firefox is DNS. The trick for Chrome is to tell it that everything resolves to those addresses. I think webpagereplay has 8080 and 8081 act as web servers, not as proxies, which is why the techniques you've tried don't work at all.

I think things are set up this way for webpagereplay because Chrome's host-resolver-rules option allows one to override host resolution beyond just DNS. I think what we need to make Firefox work with webpagereplay is a DNS server that tells Firefox every website is 127.0.0.1, and a way to override what Firefox uses as its DNS server (I believe it relies on the OS for that). Mahimahi uses dnsmasq to achieve this. And finally, we need to run webpagereplay on ports 80 and 443, because overriding DNS results alone doesn't let you change the port the browser uses for HTTP and HTTPS by default. I don't think that's a big deal, because we don't need ports 80 and 443 for anything else on those machines (we just need to make sure they're available). Anyway, it's worth trying dnsmasq for this.
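
A rough sketch of that dnsmasq approach (standard dnsmasq flags; pointing the resolver at loopback and running webpagereplay on 80/443 are assumptions, not something verified in this task):

# Answer every DNS query with 127.0.0.1 and only listen on loopback
sudo dnsmasq --no-daemon --listen-address=127.0.0.1 --bind-interfaces --address='/#/127.0.0.1'
# Point the OS resolver (and therefore Firefox) at the local dnsmasq
echo 'nameserver 127.0.0.1' | sudo tee /etc/resolv.conf
# webpagereplay would then need to listen on ports 80 and 443 so the browser's
# default HTTP/HTTPS ports hit the replay server directly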

Experimenting with mitmproxy might be a better bet than webpagereplay for cross-browser support (it would probably make the setup simpler anyway). Unlike webpagereplay, which acts as a web server, mitmproxy acts as a proxy, which makes things a whole lot easier since all browsers have settings for a custom proxy.

@RobH is digging up a spare server with an SSD we can test this on, to verify the SSD/spinning disk theory we think is giving AWS its edge at the moment. T179968

Gilles renamed this task from "Run performance tests on VPS/Labs using local proxy" to "Run performance tests using local proxy". Nov 9 2017, 1:09 PM

The bare metal server with SSDs is now available, it's lawrencium.eqiad.wmnet

I've added a desktop test for the Barack Obama article running on lawrencium every 15 minutes with the same settings as your docker image, sending metrics to https://grafana.wikimedia.org/dashboard/db/webpagereplay

AWS still wins in terms of stability, by a wide margin. @Peter did you see a stability difference when running the code directly vs inside a docker container?

I can't get mitmproxy to work with Firefox run by browsertime. I'm not sure what the culprit is; it's as if the firefox.preference options are ignored and Firefox doesn't proxy the traffic at all. Is --firefox.preference well tested in browsertime?

I've co-opted the "webpagereplay" machine on WMCS and started running browsertime + mitmproxy on it, still with the same settings as everything else, sending metrics to https://grafana.wikimedia.org/dashboard/db/webpagereplay

> AWS still wins in terms of stability, by a wide margin. @Peter did you see a stability difference when running the code directly vs inside a docker container?

I'm not 100% sure. I think Docker produced somewhat better numbers, but I think the only way to know is to run them side by side. I do run the exact same setup without the proxy in Docker on AWS, and the proxy gives us much better values.

Looking at the numbers for lawrencium, they're pretty OK, right (at least better than what we have had before), but not as good as AWS.

As for mitmproxy, it looks unusable for us, so we could stop testing it, right (great that you set it up, though!)?

For Firefox, I'm 100% sure we did get it to work with Mahimahi when we tried it in the summer, but I haven't seen the exact config we used in the Phabricator tasks; it could be that we missed adding it. Let me verify that the proxy for Firefox really works.

Yep, you are right. The problem is somehow in how Selenium sets up the proxy with Firefox. The preferences are set OK, but I guess you then also need to configure the proxy in Selenium. For us that means we need to add:

--proxy.https 127.0.0.1:8081 --proxy.http 127.0.0.1:8080

I tried it without a proxy running and then the URL didn't even load (i.e. the config made a difference).

mitmproxy actually gives slightly better stability than webpagereplay did on that WMCS machine (a 218 SpeedIndex min-max range vs 278 over a 24-hour period). Even just for Chrome it's a contender to webpagereplay that we ought to test in the same Docker AWS environment and on lawrencium.

How can I make the selenium thing work? Modify browsertime's code?

To get it to work with Browsertime, just skip setting the Firefox preferences and run with (changing to the ports you are using):

--proxy.https 127.0.0.1:8081 --proxy.http 127.0.0.1:8080
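
For context, a full command could then look like this sketch (the browser, iteration count, and URL are assumptions taken from earlier comments; it assumes mitmproxy is already replaying on those ports):

# Run Firefox through a locally replaying mitmproxy
browsertime -b firefox \
  --proxy.http 127.0.0.1:8080 --proxy.https 127.0.0.1:8081 \
  -n 11 --video --speedIndex \
  https://en.wikipedia.org/wiki/Barack_Obama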

Aha nice. Yep we should try that then.

I think that's what we did with Mahimahi, if I remember correctly.

Yeah, this works, except for the root CA. I see that setting the Selenium capability acceptInsecureCerts to true might help. The syntax for the option to set custom capabilities in browsertime is deliberately undocumented. What's the proper format to use it?

This was what we got from Mozilla but I guess you saw that already? https://searchfox.org/mozilla-central/source/testing/talos/talos/mitmproxy/mitmproxy.py#94

> selenium capability acceptInsecureCerts.

We don't expose setting Selenium capabilities. Actually, accepting insecure certs has never worked consistently across browsers (as far as I know), so you need to handle it per browser instead. Chrome used to have a command-line switch you could use, but I think that was disabled in a later version. For Firefox, I think things changed when they moved to Geckodriver (was that Firefox 45?). I don't have the details in my head, but I've seen users who have had problems getting it to work.

They use the AutoConfig feature, which is an enterprise thing. I tried setting that up at the OS level by putting the files in the right places, but that doesn't seem to work. I can't find any examples of people using that with Selenium.

What everyone suggests when I search for selenium + firefox + mitmproxy is to open Firefox manually, add the mitmproxy root CA there, and then you have a file you can copy to your Firefox profile. The problem is that browsertime seems to build the profile from scratch with Selenium. Short of saving the profile generated by browsertime, modifying it, and loading the modified profile instead of generating a new one every time, I'm out of options.

Adding an option to browsertime to point to a firefox profile would help, I think. That'd probably be the most future-proof thing to do here.

It seems to have been fixed in Marionette (that Geckodriver talks to): https://bugzilla.mozilla.org/show_bug.cgi?id=1103196

So then the hack would be:

--options.selenium.capabilities acceptInsecureCerts=true

I need to test it first, I'm not sure about the format either.

Yep, at the moment we don't support pointing to a personal profile, only adding new preferences, because we override some CSS that gets in the way when recording a video. But I guess we could change that.

No, that format is not right; it takes a Map. I haven't used it before.

Yeah I couldn't figure out the format to pass a map from the command line

Yep that will not work. I made a PR to fix acceptInsecureCerts https://github.com/sitespeedio/browsertime/pull/399

It works! First working replayed runs with Firefox:

[2017-11-14 08:47:50] 95 requests, 1321.54 kb, backEndTime: 224ms (±0.72ms), firstVisualChange: 3.09s (±29.23ms), DOMContentLoaded: 2.86s (±27.56ms), Load: 6.38s (±203.87ms), speedIndex: 3198 (±27.98), visualComplete85: 3.09s (±29.23ms), lastVisualChange: 7.67s (±148.70ms), rumSpeedIndex: 1899 (±13.85) (11 runs)
[2017-11-14 08:47:50] Wrote data to result
Succesfully sent metrics

carltondance

Great, @Gilles! Do you run it in Docker, or should I prepare a version (if so, where do I get your setup)? Also, have you grabbed a HAR file and checked that everything looks OK (order of responses etc.)?

It's not in Docker form; you can go ahead and make that, it should be very simple. Make sure that you use the latest mitmproxy; what comes with Debian and Ubuntu is very outdated (no HTTP/2 support, etc.).

You can find the shell scripts in my home directory on webpagereplay.webperf.eqiad.wmflabs

I haven't checked the HAR files, but what mitmproxy outputs as requests are being replayed looks right.
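
For reference, the record/replay flow with mitmdump looks roughly like this (a sketch; flag names are from the mitmproxy 2.x CLI and the archive filename is hypothetical, so check mitmdump --help for your version):

# Record the page once while a browser visits it, writing all flows to a file
mitmdump --wfile obama.flows
# Replay from the recording, killing any request that is not in the archive
# instead of letting it go out to the internet
mitmdump --server-replay obama.flows --replay-kill-extra --no-pop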

Installed 2.0.2.

I think I still have some problem with the certs for Chrome, but for both Chrome & Firefox I get logs like:

server_playback: killed non-replay request https://meta.wikimedia.org/w/index.php?title=Special:BannerLoader&campaign=WAM_2017&banner=WAM_2017&uselang=en&debug=false

That doesn't sound right, do you see the same in the log?

When I test locally (in Docker with Firefox), my first run out of five is way off:

Screen Shot 2017-11-14 at 1.52.36 PM.png (880×2 px, 226 KB)

Maybe there's some config missing, I'll check.

Also, the download times seem to differ a lot between runs. I don't remember that being the case for WebPageReplay, but let me check that again.

I see that I'm getting a couple, but not the same ones:

server_playback: killed non-replay request https://en.wikipedia.org/w/load.php?debug=false&lang=en&modules=schema.Popups&skin=vector&version=1i02syx
server_playback: killed non-replay request https://en.wikipedia.org/w/load.php?debug=false&lang=en&modules=schema.Popups&skin=vector&version=1i02syx

Note that I specified an option to kill any requests that weren't part of the recording; you could also let them pass through and they'll go out to the internet. That would make things less stable, though.

I imagine that this is a problem with the recording stopping too early. Both the banner and the Popups EventLogging schema are most likely loaded with low priority.

The first time we run browsertime, when it's not throttled, it might close the browser "too early", before those low-priority modules have had a chance to load. But once you throttle the connection on the replay, the modules have time to be requested, but they aren't in the recording.

I think the solution would be to keep browsertime/the browser running for longer on the first run, to make sure that the recording captures all the low-priority modules.
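
One way to do that would be via browsertime's page-complete check, making the recording run wait well past the load event (a sketch; the 10-second threshold is an arbitrary assumption):

# Keep the browser open until ~10 s after loadEventEnd during the recording run,
# so low-priority modules (banners, EventLogging schemas) make it into the archive
browsertime --pageCompleteCheck 'return (function() { var t = window.performance.timing; return t.loadEventEnd > 0 && (Date.now() - t.loadEventEnd) > 10000; })();' https://en.wikipedia.org/wiki/Barack_Obama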

Hmm, it doesn't seem to work for me. Can you check, @Gilles, that your HAR looks OK?

When I run with 100 ms latency with WebPageReplay the waterfall looks like this:

Screen Shot 2017-11-14 at 2.10.21 PM.png (910×2 px, 307 KB)

But for Firefox and MITM:

Screen Shot 2017-11-14 at 2.09.51 PM.png (802×1 px, 257 KB)

something is really slow for me, hmm.

By default Browsertime closes after onLoadEventEnd + 2 seconds. The

--pageCompleteCheck "return true;"

is probably only needed for WebPageReplay, right? (WebPageReplay changes the JavaScript Date etc. that is normally used in the complete check.)

Let me increase that for my test run, thanks.

Indeed, I'm seeing the same thing with a lot of images being stuck before being served, just for Firefox. Chrome is fine with mitmproxy. This would suggest the bad behavior comes from Firefox or its Selenium driver. Or maybe that's just how Firefox behaves when throttled this way? Or when it's connecting through a proxy?


Hmm, I think it could be that the HAR exporter reports a really early start for each request and then just keeps it in the blocked state? But the blocking seems crazy long. I have the same setup running without a proxy, but with connection throttling, and there the blocking only happens when connecting to a new domain: https://results.sitespeed.io/en.wikipedia.org/2017-11-13-17-25-29/pages/en.wikipedia.org/wiki/Barack_Obama/ - to me at least it seems to be added by the proxy.

I asked Mozilla if it worked out well for them; let me ask again in a day or two when they are sober after the 57 release.

I'll halt the work on mitmproxy; I've asked Mozilla whether they have seen the same thing we have.

Today we do 11 runs on WebPageReplay; I'll change it to 7 now and see if we can keep the same great numbers.

Changed. The runs that start at 10 CET will use 7 runs. If that works out we can test 5 too; then we can squeeze in more URLs within an hour.

I couldn't see any change for desktop (7 runs); mobile looks like this:

Screen Shot 2017-11-17 at 5.59.04 AM.png (990×2 px, 416 KB)

You can most easily see when I made the change in First Visual Change (when the blue "Sweden" line starts to change). Still, it is only 0.04 seconds, but let's go back to 7 for now.

I've also tried 400 ms latency, as suggested by @Krinkle.

I've added 400 ms for Barack Obama and Sweden for now and will check the log to see how long it takes.

Using 400 ms, each run takes 2.5 minutes, which is too long. I've disabled Sweden but will keep running Obama for a while.

It's possible that to do this right we'll need multiple machines. The bottleneck is the time it takes to complete the slowed-down requests. Maybe we could try an instance cheaper than c4.large and see if we get the same stability? We haven't tried that, i.e. seeing what the cheapest kind of AWS instance is that can support this level of stability.

That's true, I can try that today. I'm disabling the 400 ms test because it sometimes collided with the other runs, so multiple runs happened at the same time. It looked like this:

Screen Shot 2017-11-20 at 7.56.34 AM.png (536×2 px, 185 KB)

I'll try an m3.medium (the old default for WebPageTest). It's hard to navigate through the different prices/sizes on AWS.

It runs the exact same tests now, with the same setup on both c4.large and m3.medium; I've added graphs at the bottom of https://grafana.wikimedia.org/dashboard/db/webpagereplay?refresh=15m&orgId=1

m3.medium doesn't give us the same precise stability as c4.large.

Screen Shot 2017-11-20 at 8.52.24 PM.png (984×2 px, 273 KB)

I will close it down later today.

I'll make one last change: I'll try increasing the number of runs and check whether SpeedIndex gets more stable (First Visual Change already is).

I also just increased the video to 60 fps on AWS (we haven't tried that with WebPageReplay).

Abha (from Walmart Labs) shared that they "run sitespeed.io inside a isolated docker container inside a VM which is very tightly controlled from resources standpoint. We monitor the resources continuously using sysbench. This gave us a standard benchmarks to start with." I can reach out to her to see if she can give me more info.

60 fps made it unstable.

Screen Shot 2017-11-22 at 9.00.07 PM.png (1×2 px, 350 KB)

I'll revert tomorrow and do a summary of all testing.

I've switched back to 30 fps, but I was wrong: for mobile (where we record a small screen) the metrics get better.
Before the first red line we have 30 fps, then it switches to 60 fps. The variation in SpeedIndex is much smaller at 60 fps for mobile:

Screen Shot 2017-11-23 at 9.34.12 AM.png (938×1 px, 216 KB)

For desktop it looks like this:

Screen Shot 2017-11-23 at 9.35.41 AM.png (792×1 px, 148 KB)

I would say that c4.large is underpowered for desktop at 60 fps, but it works great for mobile. The metrics at 30 fps are still really good on desktop, so let's run them at different frame rates.

I've updated the container with latest stable Browsertime (released last night) and changed emulated mobile to use 60 fps.

Cool. I was going to dismiss Fargate since, to set up connectivity with tc, you need to do it with Docker networks or set the right privileges on the host machine, BUT we just do it on localhost for our tests, so that should work.
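
For reference, the latency shaping itself boils down to something like this (a sketch using tc/netem on loopback with the 100 ms value used in these tests; this is also why the container needs NET_ADMIN):

# Add 100 ms of delay on the loopback interface, where the replay server listens
sudo tc qdisc add dev lo root netem delay 100ms
# ...run the browsertime iterations against the replayed page...
# Remove the shaping again when done
sudo tc qdisc del dev lo root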

This task has changed over time since we couldn't use our own cloud. How do you think we should move on, @Gilles? I've set up alerts for the new tests using Browsertime/WebPageReplay, and for alerting this will help us in two ways: we will alert closer to when the real issue happens, since we don't need to take 24-hour averages as we do today, and the metrics are more stable so we can find smaller regressions. But we will miss tests with Firefox, if we cannot get that DNS magic to work?

I notice that they didn't announce pricing for the i3 instances...but based on their specs, they're going to be quite expensive. Fargate seems like an interesting option, though...

@chasemp I think we need your help here to guide us in the right direction. Let me do a summary:

The long-term goal for us is to run performance tests on commits (opening a browser, accessing a URL, collecting metrics). The current solution uses a Docker container with Chrome, FFmpeg (which records the browser screen so it can be analyzed, for example to get when the first pixel is painted) and WebPageReplay (which records the page and replays it to the browser so we get the same content when we do multiple runs, and then adds a network filter to slow it down). We then do X runs and take the median of each metric. When we do this on AWS the metrics are pretty constant. For example, testing our desktop site while recording a video at 30 fps gives a diff of 33 ms on a c4.large instance. For mobile (a smaller screen) we can use 60 fps to get even better numbers.

If we do the same on VPS or bare metal servers we don't get the same stable metrics. Running on AWS is fine for replacing the testing we already do today with more stable metrics, but for the long-term goal we should be able to move out of AWS.

@Peter did you always run the AWS test inside Docker? That's the one thing we haven't tried on Cloud VPS and bare metal. If you did and it didn't change stability before/after Docker I don't think it's worth looking into, though.

Also, we could try Google's cloud, I believe we were given free credit there or something.

@Gilles no, I ran without Docker in the initial tests where AWS outperformed everything; however, I'm not 100% sure that we got the exact same stable metrics as we get now.

@Gilles and I got Firefox working a couple of days ago. Gilles did a hack with dnsmasq: https://github.com/gi11es/browsertime-replays/tree/ff54-dnsmasq/webpagereplay and I got it working with the Firefox preference network.dns.forceResolve. It was introduced in Firefox 55, but we were running 54 since https://github.com/firebug/har-export-trigger was broken in 55. Running on 57 it works (as long as we turn off getting a HAR).
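
The preference-based variant is roughly the following (a sketch; flag names are from browsertime's CLI, and it assumes the replay server answers locally on the standard ports):

# Force Firefox (55 or later) to resolve every hostname to the local replay server,
# and skip HAR collection since the HAR exporter was broken at the time
browsertime -b firefox \
  --firefox.preference network.dns.forceResolve:127.0.0.1 \
  --skipHar \
  https://en.wikipedia.org/wiki/Barack_Obama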

The problem we have is that we've seen some strange blocking in the waterfall in FF 54. We've seen the exact same thing when using mitmproxy to replay (and reported it to Mozilla):

blocked.png (1×1 px, 413 KB)

We wonder whether the problem is Firefox or the HAR export plugin. The only way to know is to be able to get the HAR from Firefox (there "should" be a plugin landing in Q4). Another alternative is to turn on the HTTP log and parse it (WPT does that), but I think that should be a real last resort.

I've added a Google Compute Engine VM of similar cost to a c4.large, running the tests on BarackObama with 7 runs, for desktop and mobile: https://grafana.wikimedia.org/dashboard/db/webpagereplay?refresh=15m&orgId=1&from=now-7d&to=now

Looks good so far, we'll have to see what it looks like after the weekend.

One thing to consider: with the amount of free credit we have, if we make a 1-year commitment for the VM, I think we can run 2 of those for a year on the free credit (or one for 2 years).

All 3 ways of running Docker containers on Google Cloud are now set up and reporting to: https://grafana.wikimedia.org/dashboard/db/webpagereplay?refresh=15m&orgId=1&from=now-7d&to=now

Right now this is all running on n1-highcpu-4 VM types. If the results after a few days are underwhelming, we can try a bigger VM type. I just picked that type because it focuses on powerful CPUs and has a dollar cost on par with c4.large.

@Peter running the following on a new VM just now:

sudo docker run --cap-add=NET_ADMIN --shm-size=1g --rm -v "$(pwd)":/browsertime -e GRAPHITE_KEY=browsertime.gce.n1-highcpu-8.enwiki.desktop.anonymous.replay.100.BarackObama -e RUNS=$RUNS -e LATENCY=$LATENCY soulgalore/browsertime-webpagereplay:statsv --resultDir result --cacheClearRaw --videoParams.framerate 30 $URL

browsertime bails with:

[2017-12-11 09:57:57] Running chrome for url: https://en.wikipedia.org/wiki/Barack_Obama
[2017-12-11 09:57:57] Changing network interfaces needs sudo rights.
[2017-12-11 09:57:57] Error running browsertime TypeError: Path must be a string. Received [ 'result', 'result' ]
    at assertPath (path.js:28:11)
    at Object.resolve (path.js:1186:7)
    at new StorageManager (/usr/src/app/lib/support/storageManager.js:40:27)
    at Engine.run (/usr/src/app/lib/core/engine.js:94:28)
    at /usr/src/app/bin/browsertime.js:63:21
From previous event:
    at run (/usr/src/app/bin/browsertime.js:62:6)
    at Object.<anonymous> (/usr/src/app/bin/browsertime.js:137:1)
    at Module._compile (module.js:635:30)
    at Object.Module._extensions..js (module.js:646:10)
    at Module.load (module.js:554:32)
    at tryModuleLoad (module.js:497:12)
    at Function.Module._load (module.js:489:3)
    at Function.Module.runMain (module.js:676:10)
    at startup (bootstrap_node.js:187:16)
    at bootstrap_node.js:608:3

> @chasemp I think we need your help here to guide us in the right direction. Let me do a summary:
>
> The long-term goal for us is to run performance tests on commits (opening a browser, accessing a URL, collecting metrics). The current solution uses a Docker container with Chrome, FFmpeg (which records the browser screen so it can be analyzed, for example to get when the first pixel is painted) and WebPageReplay (which records the page and replays it to the browser so we get the same content when we do multiple runs, and then adds a network filter to slow it down). We then do X runs and take the median of each metric. When we do this on AWS the metrics are pretty constant. For example, testing our desktop site while recording a video at 30 fps gives a diff of 33 ms on a c4.large instance. For mobile (a smaller screen) we can use 60 fps to get even better numbers.
>
> If we do the same on VPS or bare metal servers we don't get the same stable metrics. Running on AWS is fine for replacing the testing we already do today with more stable metrics, but for the long-term goal we should be able to move out of AWS.

(sorry for the delay I have been afk)

The only specific thing I did for the testing instance on VPS was to make sure it landed on a hypervisor that has SSDs, but there are definitely other points of constraint. We have been squashing CPU-hogging processes in a few instances as of late. We could move the testing instance to be more or less isolated on its own hypervisor to give the most practical testing isolation. Continuing that long term would be a conversation about resourcing, etc., but it may be the sanest thing upfront for sure. What does 'bare metal' mean in this context? If nothing except AWS gives consistent numbers, I'm curious about the point of constraint that is throwing things off.

We ran this on spare bare metal servers Ops sourced for us (one without SSDs, one with SSDs) and it still didn't give us test stability anywhere near AWS.

> We ran this on spare bare metal servers Ops sourced for us (one without SSDs, one with SSDs) and it still didn't give us test stability anywhere near AWS.

That's super interesting. I don't have any revealing thoughts atm, though. The only common thread there is on-prem vs off-prem, where off-prem is possibly more consistent (but, I would expect, also not as performant?).

Here's the catch: all the work being done is local to the machine. We record HTTP requests talking to the internet, but then we replay them entirely locally when the measurements are made. A lot of stuff runs (browser, ffmpeg, web servers), and you'd expect a bare metal machine with bigger specs than a c4.large to be more consistent than AWS, but no. I really have no idea what might be different. Linux kernel options? A more recent generation of CPU? With so many processes involved it's really a black box; we don't know what makes it considerably more consistent across runs on AWS.

I'm wondering about the 95% cases in all three: Cloud VPS, AWS, and metal. I thought about it for a while, and the best guess I have is that the median is more consistent in the AWS case due to reservations. I would potentially expect metal to be the most performant and potentially variable depending on circumstances, Cloud VPS to be the least performant and probably wildly variable atm, and AWS to be middling in performance but the most consistent, if this is the case. On reflection I can understand why median and consistency are the most important baseline here. We could explore running the test cases with the same intentional resource limits in all three cases and see if that brings the results in line.

@chasemp When you say "due to reservations", are you referring to underlying resource reservations (eg, core pinning on the underlying CPUs)? Or are you referring to reservations in the AWS reserved instance sense?

Something else to think about: 1 hour spent investigating how to make this work internally costs about the same as 1 month of running in the known working AWS environment. I'm not sure that there's much marginal benefit to continuing to push on the Cloud VPS/bare metal options, given that there are no security/privacy concerns with this.

(That changes if there are secondary benefits, eg figuring out how to make cloud vps/bare metal work for this workload will have spillover positive effects for other projects or other similar types of work.)

> @chasemp When you say "due to reservations", are you referring to underlying resource reservations (eg, core pinning on the underlying CPUs)? Or are you referring to reservations in the AWS reserved instance sense?

Resource reservations within the context of the hypervisor and the instance itself. I would be curious whether the new bare metal offering within AWS has similar fluctuations. We could similarly experiment with resource isolation for the test cases.

There are definitely cases where, if privacy and end-user interaction aren't a concern, AWS is an economical choice afaict. We are never going to be able to build and maintain a device farm, just as an example :)

I think that if we figured out the cause, we might be able to make things even more stable than our best current setup. But I agree that we've already spent a lot of resources on this issue and have a working setup with no practical issues since there's no PII; I don't want to keep investigating this beyond the current quarter.

@chasemp do you expect Google Cloud to have the same properties as AWS in that respect? Because I'm benchmarking GCE at the moment for this since I got a bunch of free credit there.

> @chasemp do you expect Google Cloud to have the same properties as AWS in that respect? Because I'm benchmarking GCE at the moment for this since I got a bunch of free credit there.

If I were taking bets I would say yes, but I have only used GKE and don't have much GCE experience.

I've tried both GKE and GCE and couldn't reproduce the stability of AWS with underlying machines of similar power. I'm pretty mystified by these significantly better results only happening on AWS.

Adding this so we remember it: we've been testing with a viewport of 1200x960 for WebPageReplay. With our current WebPageTest setup we use 1024x768.
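
For reference, the viewport is set with a browsertime flag (a sketch with the two values mentioned here):

# Current WebPageReplay tests
browsertime --viewPort 1200x960 https://en.wikipedia.org/wiki/Barack_Obama
# Matching the existing WebPageTest setup would instead be
browsertime --viewPort 1024x768 https://en.wikipedia.org/wiki/Barack_Obama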

I also tried a larger viewport (1920x1080) on an AWS c4.large. It works fine for Firefox 57, but for Chrome 63 I can see that the metrics become more unstable.

With metrics stability at least twice as bad on GCE as what we get on AWS, I think this concludes our experiment with Google Cloud. I'll shut down the VMs and we can keep those credits for something else.

I've just realized that Lawrencium stopped working, or didn't work at all in its latest incarnation. I think it's an issue with the need to go through a proxy on prod servers, but so far I'm not finding a way to fix it. Not sure it's worth the effort, though, with GCE being a disappointment.

Fixed the Lawrencium + Docker setup. I don't have much hope about it performing anywhere near AWS, but we'll see...

Unsurprisingly, Lawrencium with Docker is still nowhere near as stable as AWS.

@akosiaris you can get Lawrencium back, our experiment has run its course. Our best guess is that it's something in Amazon's hypervisor/custom kernel secret sauce that makes this stuff run so well there. Even Google Cloud wasn't any good in comparison.

OK, cool. Let me know when you've had time to read through the docs at https://wikitech.wikimedia.org/wiki/Performance/WebPageReplay and have verified that you can access it, @Gilles, and then I think we can close this task.

I've made small additions to the wiki page, retrieved the PEM and logged into the machine. All good.