
Move to the new Linux-based WPT agents in AWS
Closed, ResolvedPublic

Description

Move WebPageTest instance to Linux

We should test the Linux versions of WebPageTest now that they are available on AWS.

History/Timeline

  • May 17 2017 - We created this task with the intention to move to Linux
  • May 18 2017 - Initial tests with Chrome and Firefox (Firefox "should" work). Filed upstream issues for the bugs we found
  • Aug 25 2017 - Initial *stable* release of WebPageTest 17.08, with full support for Chrome and Firefox.
  • Sep 4 2017 - Reported a batch of upstream issues that make Firefox unusable
  • Sep 18 2017 - Tests take a very long time to finish on Linux, but we don't know why :(
  • Jan 17 2018 - WebPageTest.org moves to Linux
  • Feb 1 2018 - Finally a breakthrough with the slow tests on Linux. The auto-scaling functionality isn't working on Linux. It sometimes kills an agent before it has finished its work and does not start a new agent immediately.
  • Feb 4 2018 - Firefox TTFB is sometimes unrealistically high, making Firefox unusable.
  • Feb 10 2018 - There is a lot more variance in metrics on Linux than on Windows.
  • Feb 26 2018 - Chrome sometimes reports a first visual change that is too early for authenticated users or second view
  • Mar 2 2018 - New way of setting connectivity on Linux, which reduces the variance in metrics.

Tasks

  • Set up a Linux instance
  • Let it run for a couple of days and verify that the metrics are OK - they aren't perfect but at least OK to move on
  • Decide a date when we will remove the Windows version (!)
  • Update docs or add a task describing what needs to be changed and what should be expected when updating (speed changes, name changes etc)
  • Inform reading, portal and wikidata that we will make the change (dashboards need to be changed)
  • Update our dashboards.
  • Kill the Windows instance, remove the Jenkins job
  • Update https://wikitech.wikimedia.org/wiki/WebPageTest
  • Clean up Graphite and remove old keys (but keep the Windows ones until we get the yearly stats).

Event Timeline


There are only Chrome agents running on the new Linux version in the settings; let me check if Firefox works (no need to start testing until we have Firefox).

I'll start by spinning up the new instances and checking that Chrome and Firefox work, and then we can compare metrics.

Finally got things through. For Chrome it looks like this: http://wpt.wmftest.org/result/170518_52_CA/ (compare with http://wpt.wmftest.org/result/170518_AA_BK/ on Windows).

The Linux agent is then running on m3.medium (compared to c3.large on Windows).

Firefox doesn't work (http://wpt.wmftest.org/result/170518_3P_CB/). I guess it could be that the AMI isn't prepared for FF yet. At WebPageTest.org only Chrome is configured.

I've been running all Chrome tests we have on a new instance. My plan is to move over the second view/login tests first since we had a problem with TTFB on Windows (it looks good now on Linux).

Desktop

First view: http://wpt.wmftest.org/results.php?test=170518_7N_CP&medianRun=fastest&medianMetric=SpeedIndex
Second view desktop: http://wpt.wmftest.org/results.php?test=170518_J9_D9&medianRun=fastest&medianMetric=SpeedIndex
Login: http://wpt.wmftest.org/results.php?test=170518_8T_D0&medianRun=fastest&medianMetric=SpeedIndex

Emulated mobile

Here we have a problem: the viewport is not correct (check the screenshots). I filed a bug on GitHub for that.

First view: http://wpt.wmftest.org/results.php?test=170518_BF_CZ&medianRun=fastest&medianMetric=SpeedIndex
Second view: http://wpt.wmftest.org/results.php?test=170518_HZ_D6&medianRun=fastest&medianMetric=SpeedIndex
Login: http://wpt.wmftest.org/results.php?test=170518_F0_CY&medianRun=fastest&medianMetric=SpeedIndex

You can see that we have a gap in all waterfalls for the new instance and the CPU usage is high, but that could be OK; let us run for a while and see if the metrics are stable. I don't want to spend too much time on it until we can test Firefox, since Firefox has taken the most resources for us before.

The gap that we have isn't OK, so let me try on a larger instance. When I run Chrome on a $40 Linux box on Digital Ocean we don't get those gaps, but that one has 2 CPUs (whatever that means when comparing AWS/DO).
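One way to put a number on "stable" while the instance runs for a while is the coefficient of variation (standard deviation divided by mean) of a metric such as SpeedIndex across runs. This is only a minimal sketch: the function names and the 5% threshold are assumptions made for illustration and do not come from WebPageTest or from our setup.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return stdev / mean for a list of metric values (e.g. SpeedIndex per run)."""
    if len(values) < 2:
        raise ValueError("need at least two runs to measure variation")
    average = mean(values)
    if average == 0:
        raise ValueError("mean is zero, coefficient of variation is undefined")
    return stdev(values) / average


def is_stable(values, threshold=0.05):
    """Hypothetical rule of thumb: stable if the runs vary by at most 5%."""
    return coefficient_of_variation(values) <= threshold
```

Feeding it the same metric collected from the Windows and the Linux agents would give two numbers that can be compared directly.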

Peter triaged this task as Medium priority. May 30 2017, 9:37 AM

Let me test this again on a larger instance; I think we need c4.large. When I have done the tests and it works out, we should think about no longer paying by the hour and instead getting two reserved instances on AWS, which would cost the same as running one instance paid by the hour as we do now.

I've manually updated the server with git pull origin master, and after fixing two conflicts I got the latest version and the mobile viewport finally works! About Firefox: it should be almost finished (see https://github.com/WPO-Foundation/webpagetest/issues/878#issuecomment-310098880), so let's check when I'm back from vacation in August. We can close the issue when we know that Firefox works.

Had another try (it is fixed in the WPT code), but the AWS instance isn't updated with the latest version (I still get the same error on a new instance), so I guess the new version isn't auto-updated.

When we get this working we can add one location and let it run there for a while so we can see that it is OK. I haven't tested it running over a longer period, only made a couple of runs.

I forgot to add that I tested earlier this week and Firefox works now (at least working as in running), so that's pretty cool. A first step to test it out would be to set up a new agent and run the same tests under another key, so we can watch that everything is OK.

I finally got an agent and server working locally on my Mac so I could send my first PR fixing the HAR in Firefox. I'll document my setup on Wikitech.

I'll continue to test locally and verify that the functionality we use works, and then spin up a new agent on Linux and do some more testing on that.

I've been testing the Firefox version locally this week and it seems OK. I got some hiccups with SpeedIndex/visual metrics, but I think that could be Mac OS X related. I've filed an issue for adding Firefox to the ec2 instances https://github.com/WPO-Foundation/webpagetest/issues/930 (I had missed that they lacked Firefox).

It's been updated now, so I'll update the server (by making sure we get the latest /var/www/webpagetest/www/settings/ec2_locations.ini from GitHub and setting c4.large as the default).

It worked now (spinning up a Linux instance with FF) so I'll continue to verify that the data seems ok next week.

The things I've seen so far testing Firefox on Linux:

  • Content type Other is really high; compare with Chrome and we can see that something is wrong:

FF: http://wpt.wmftest.org/results.php?test=170904_T4_7F&medianRun=fastest&medianMetric=SpeedIndex
Chrome: http://wpt.wmftest.org/results.php?test=170904_Y4_79&medianRun=fastest&medianMetric=SpeedIndex

Screen Shot 2017-09-04 at 10.57.35 AM.png (938×1 px, 302 KB)

Screen Shot 2017-09-04 at 11.01.19 AM.png (912×1 px, 310 KB)

http://wpt.wmftest.org/result/170904_Q4_7D/1/details/#waterfall_view_step1
Screen Shot 2017-09-04 at 11.04.09 AM.png (700×1 px, 265 KB)

I'll file issues on GitHub and see if I can fix some of them, and continue to test with second view tests.

Next problem: when you log in a user in Firefox we get an extra request. Look at the second request in http://wpt.wmftest.org/result/170904_0B_9H/1/details/#waterfall_view_step1

Screen Shot 2017-09-04 at 11.40.33 AM.png (722×1 px, 265 KB)

This is what it looks like on Windows (and how it should be).

Screen Shot 2017-09-04 at 11.41.31 AM.png (604×1 px, 238 KB)

Most of the bugs are fixed, but we still have "Internal FF URLs are picked up (https://tracking-protection.cdn.mozilla.net/...)" - https://github.com/WPO-Foundation/wptagent/issues/40 that blocks me from more testing.

The internal URLs are disabled now, so I can move on with the testing.

This is kind of worrying. I'm testing https://gerrit.wikimedia.org/r/#/c/378658/ - 9 URLs per script, one run per script (3 scripts). So it should test 27 URLs. Testing the first 9 takes 6 min but then something happens: either the tests don't get through or the full test run takes over an hour. I'll file a couple of upstream bugs during the day. I think the problem is caused by a "smart" check that verifies that the CPU usage is not too high before we start the next test.

I've added https://github.com/WPO-Foundation/wptagent/issues/56 for the problem that the tests take so long, and https://github.com/WPO-Foundation/wptagent/issues/57 to ask for the best practice for updating the browser versions (right now they are locked to the ones installed when the AWS image was created).

https://github.com/WPO-Foundation/wptagent/issues/56 is a known issue. We can help debug it by turning on logging, but at the moment the WebPageTest wrapper API does not support that, so I'll start by seeing if I can just add it. Then I need to rerun the tests, collect all the logs and analyze them.

I've started on this again, but instead of letting WPT handle the hosts, we'll deploy the agent ourselves. I want to have it up and running and then let it run tests for a while. We can use it to test Firefox 56 vs Firefox 57 if I get it to work.
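To make the size of the slowdown concrete, the numbers above can be worked through as plain arithmetic. This is only a back-of-the-envelope sketch: it assumes every script takes as long as the first one, and it uses 60 minutes as a lower bound for "over an hour".

```python
URLS_PER_SCRIPT = 9
SCRIPTS = 3
MINUTES_FOR_FIRST_SCRIPT = 6  # measured for the first 9 URLs

# Total number of URLs that one full test run covers.
total_urls = URLS_PER_SCRIPT * SCRIPTS

# Assumption: each script takes as long as the first one did.
expected_minutes = MINUTES_FOR_FIRST_SCRIPT * SCRIPTS

# "Over an hour", so 60 minutes is only a lower bound.
observed_minutes_at_least = 60
slowdown_at_least = observed_minutes_at_least / expected_minutes

print(f"{total_urls} URLs, expected about {expected_minutes} min, "
      f"observed at least {slowdown_at_least:.1f}x slower")
```

So the full run is more than three times slower than the first script suggests it should be.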

Next step: We should run it on AWS, but not with the automatic deploy. We should create a Linux instance (there's an Ubuntu install script, so I think we should use Ubuntu). Then we start the agent so it connects to our WebPageTest server (we probably need to reconfigure our settings on the server; I had problems with that the last time I tried). We make sure to start the agent with full logging, and then locally we can fire 10 tests from https://github.com/wikimedia/wpt-reporter, compare the times and keep the log so we can attach it to the issue on GitHub. Maybe we can do a PR when we find the problem.
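The "fire 10 tests and compare the times" step could be sketched roughly as below. Note that this does not use wpt-reporter; it builds requests for WebPageTest's HTTP API (runtest.php with f=json) directly, and the helper names, the location string and the tested page are hypothetical placeholders. The network call itself is left out so the sketch stays self-contained.

```python
from statistics import median
from urllib.parse import urlencode


def runtest_url(server, page_url, location, api_key=None, runs=1):
    """Build a runtest.php request URL that asks for a JSON response."""
    params = {
        "url": page_url,
        "location": location,  # must match a location configured on the server
        "runs": runs,
        "fvonly": 1,  # first view only
        "f": "json",
    }
    if api_key:
        params["k"] = api_key
    return f"{server.rstrip('/')}/runtest.php?{urlencode(params)}"


def summarize_durations(seconds):
    """Summarize how long each submitted test took from submit to finish."""
    ordered = sorted(seconds)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "median": median(ordered),
        "max": ordered[-1],
    }
```

Submitting the same URL 10 times, recording the wall-clock time for each test and passing the list to summarize_durations would show whether a few tests take far longer than the median, which is the pattern worth attaching to the upstream issue together with the agent log.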

Let me start with this tomorrow, would be nice to have it fixed and then start the new year on Linux :)

@Peter Awesome -- let me know if an extra set of hands/eyes would be helpful.

@Imarlier yes, I'll ping you early next week if I don't get it going. I'll try to reproduce with the Docker containers locally first (I actually didn't try that before).

Hmm, our Docker instructions don't work anymore. Note to myself to update them when I get them to work.

The WebPageTest Docker containers aren't tagged per release, so our old instructions for getting them up and running don't work anymore (at least for me). I've filed an issue upstream: https://github.com/WPO-Foundation/webpagetest/issues/1069

Krinkle renamed this task from Test the new Linux based AWS instance(s) to Test the new Linux-based WPT agents in AWS. Jan 17 2018, 8:58 AM

WebPageTest.org is now running on Linux (https://twitter.com/patmeenan/status/951234346458984454), so I think we can move on, even though we have had problems. It's better to just run it side by side with the current version.

Imarlier renamed this task from Test the new Linux-based WPT agents in AWS to Move to the new Linux-based WPT agents in AWS. Jan 18 2018, 5:39 PM

Change 406283 had a related patch set uploaded (by Phedenskog; owner: Phedenskog):
[integration/config@master] Run tests on a Linux agent for WebPageTest

https://gerrit.wikimedia.org/r/406283

I think we should push the change as soon as we think it is OK, so we can verify that it works. Then we can check if we can run it all the way through Q4 alongside the Windows version? I'm just really curious to know if it will work.

Change 406283 merged by jenkins-bot:
[integration/config@master] Run tests on a Linux agent for WebPageTest

https://gerrit.wikimedia.org/r/406283

Change 407067 had a related patch set uploaded (by Phedenskog; owner: Phedenskog):
[integration/config@master] Remove testing IE test from the WebPageTest Linux instance.

https://gerrit.wikimedia.org/r/407067

Change 407067 merged by jenkins-bot:
[integration/config@master] Remove testing IE test from the WebPageTest Linux instance.

https://gerrit.wikimedia.org/r/407067

It seems like the problem we have been having is caused by the AWS auto-scaling (which we use for Windows). It doesn't seem to work for Linux: instead of keeping the instance alive, it kills it after an hour or so.

Let me try to deploy a static instance: https://github.com/WPO-Foundation/webpagetest-docs/blob/master/user/Private%20Instances/ec2_agents.md

Change 407630 had a related patch set uploaded (by Phedenskog; owner: Phedenskog):
[integration/config@master] Change Linux instance name for WebPageTest Linux version

https://gerrit.wikimedia.org/r/407630

Change 407630 merged by jenkins-bot:
[integration/config@master] Change Linux instance name for WebPageTest Linux version

https://gerrit.wikimedia.org/r/407630

Change 413343 had a related patch set uploaded (by Phedenskog; owner: Phedenskog):
[performance/WebPageTest@master] Test only first runs for our three test URLs

https://gerrit.wikimedia.org/r/413343

Change 413347 had a related patch set uploaded (by Phedenskog; owner: Phedenskog):
[integration/config@master] WebPageTest: Use new Linux instance name

https://gerrit.wikimedia.org/r/413347

Change 413347 merged by jenkins-bot:
[integration/config@master] WebPageTest: Use new Linux instance name

https://gerrit.wikimedia.org/r/413347

Change 413343 merged by jenkins-bot:
[performance/WebPageTest@master] Test only first runs for our three test URLs

https://gerrit.wikimedia.org/r/413343

Change 423653 had a related patch set uploaded (by Phedenskog; owner: Phedenskog):
[integration/config@master] WebPageTest: Remove the Windows test agent

https://gerrit.wikimedia.org/r/423653

Change 423655 had a related patch set uploaded (by Phedenskog; owner: Phedenskog):
[performance/WebPageTest@master] Remove Windows agent

https://gerrit.wikimedia.org/r/423655

Change 423655 merged by jenkins-bot:
[performance/WebPageTest@master] Remove Windows agent

https://gerrit.wikimedia.org/r/423655

Change 423653 merged by jenkins-bot:
[integration/config@master] WebPageTest: Remove the Windows test agent

https://gerrit.wikimedia.org/r/423653

I created T192522 as a follow-up for when we have collected the yearly stats from the old machine. I think we can say this task is done and then do the cleanup after we have collected the metrics at the end of the quarter.