
Move to the new Linux-based WPT agents in AWS
Closed, ResolvedPublic

Description

Move WebPageTest instance to Linux

We should test the Linux versions of WebPageTest now that they are available on AWS.

History/Timeline

  • May 17 2017 - We created this task with the intention to move to Linux
  • May 18 2017 - Initial tests with Chrome and Firefox (Firefox "should" work). Filed upstream issues for the bugs we found
  • Aug 25 2017 - Initial *stable* release of WebPageTest (17.08) with full support for Chrome and Firefox.
  • Sep 4 2017 - Reported a batch of upstream issues that make Firefox unusable
  • Sep 18 2017 - Tests take a very long time to finish on Linux, but we don't know why :(
  • Jan 17 2018 - WebPageTest.org moves to Linux
  • Feb 1 2018 - Finally a breakthrough with the slow tests on Linux. The auto-scaling functionality isn't working on Linux: it sometimes kills an agent before it has finished its work and does not start a new agent immediately.
  • Feb 4 2018 - Firefox TTFB is sometimes unrealistically high, making Firefox unusable.
  • Feb 10 2018 - There is a lot more variance in metrics on Linux than on Windows.
  • Feb 26 2018 - Chrome sometimes reports a first visual change that is too early for authenticated users or second view.
  • Mar 2 2018 - New way of setting connectivity on Linux, which reduces the variance in metrics.

Tasks

  • Set up a Linux instance
  • Let it run for a couple of days and verify that the metrics are ok - they aren't perfect but at least ok to move on
  • Decide a date when we will remove the Windows version (!)
  • Update docs or add a task describing what needs to be changed and what should be expected when updating (speed changes, name changes etc)
  • Inform reading, portal and wikidata that we will make the change (dashboards need to be changed)
  • Update our dashboards.
  • Kill the Windows instance, remove the Jenkins job
  • Update https://wikitech.wikimedia.org/wiki/WebPageTest
  • Clean up Graphite and remove old keys (but keep the Windows ones until we get the yearly stats).

Event Timeline

Peter added a comment.May 17 2017, 9:38 PM

There are only Chrome agents configured for the new Linux version in the settings, so let me check if Firefox works (no need to start testing until we have Firefox).

Peter moved this task from Inbox to Doing on the Performance-Team board.May 18 2017, 7:42 AM

I'll start by spinning up the new instances and checking that Chrome and Firefox work, and then we can compare metrics.

Finally got things through. For Chrome it looks like this: http://wpt.wmftest.org/result/170518_52_CA/ (compare with http://wpt.wmftest.org/result/170518_AA_BK/ on Windows).

We are then running on m3.medium (compared to c3.large).

Firefox doesn't work (http://wpt.wmftest.org/result/170518_3P_CB/). I guess it could be that the AMI isn't prepared for FF yet. At WebPageTest.org only Chrome is configured.

I've been running all Chrome tests we have on a new instance. My plan is to move over the second view/login tests first since we had a problem with TTFB on Windows (it looks good now on Linux).

Desktop

First view: http://wpt.wmftest.org/results.php?test=170518_7N_CP&medianRun=fastest&medianMetric=SpeedIndex
Second view desktop: http://wpt.wmftest.org/results.php?test=170518_J9_D9&medianRun=fastest&medianMetric=SpeedIndex
Login: http://wpt.wmftest.org/results.php?test=170518_8T_D0&medianRun=fastest&medianMetric=SpeedIndex

Emulated mobile

Here we have a problem: the viewport is not correct (check the screenshots). I filed a bug on GitHub for that.

First view: http://wpt.wmftest.org/results.php?test=170518_BF_CZ&medianRun=fastest&medianMetric=SpeedIndex
Second view: http://wpt.wmftest.org/results.php?test=170518_HZ_D6&medianRun=fastest&medianMetric=SpeedIndex
Login: http://wpt.wmftest.org/results.php?test=170518_F0_CY&medianRun=fastest&medianMetric=SpeedIndex

You can see that we have a gap in all waterfalls for the new instance and the CPU usage is high, but that could be ok; let us run for a while and see if the metrics are stable. I don't want to spend too much time on it until we can test Firefox, since it has taken the most resources for us before.

Peter added a comment.May 23 2017, 9:44 AM

The gap that we have isn't ok, let me try on a larger instance. When I run Chrome on a $40 box in Linux on Digital Ocean we don't get those gaps, but that one has 2 CPUs (whatever that means when comparing AWS/DO).

Peter triaged this task as Medium priority.May 30 2017, 9:37 AM

Let me test this again on a larger instance, I think we need c4.large. When I've done the tests and it works out, we should think about skipping paying by the hour and instead getting two reserved instances on AWS; that will be the same price as running one instance paying by the hour as we do now.

Peter added a comment.Jun 21 2017, 2:09 PM

On a c4.large the tests look better (this will work fine for us):
http://wpt.wmftest.org/result/170621_F4_DS/
http://wpt.wmftest.org/result/170621_56_DT/

The mobile tests still have a too-large viewport:
http://wpt.wmftest.org/result/170621_8Y_DX/1/screen_shot/#step_1

Peter added a comment.Jun 21 2017, 3:36 PM

I've manually updated the server with git pull origin master, and after fixing two conflicts I got the latest version and the mobile viewport finally works! About Firefox: it should almost be finished (see https://github.com/WPO-Foundation/webpagetest/issues/878#issuecomment-310098880), so let's check when I'm back from vacation in August. We can close the issue when we know that Firefox works.

Peter added a comment.Jul 3 2017, 2:16 PM

Had another try (it is fixed in the WPT code) but the AWS instance isn't updated with the latest (I still get the same error on a new instance) so I guess the new version isn't auto updated.

Peter added a comment.Jul 5 2017, 4:02 AM

When we get this working we can add one location and let it run there for a while so we see that it is ok. I haven't tested it running over a longer period, I've only made a couple of runs.

Peter added a comment.Aug 23 2017, 4:49 PM

I forgot to add that I tested earlier this week and Firefox works now (at least working as in running), so that's pretty cool. A first step to test it out would be to set up a new agent and run the same tests under another key so we can watch that everything is ok.

Peter added a comment.Aug 28 2017, 5:53 AM

I finally got an agent and server working locally on my Mac so I could send my first PR fixing the HAR in Firefox. I'll document my setup on Wikitech.

I'll continue to test locally and verify that the functionality that we use works, and then spin up a new agent on Linux and do some more testing on that.

Peter added a comment.Sep 1 2017, 11:27 AM

I've been testing the Firefox version locally this week and it seems ok. I got some hiccups with SpeedIndex/visual metrics but I think that could be Mac OS X related. I've filed an issue for adding Firefox to the ec2 instances https://github.com/WPO-Foundation/webpagetest/issues/930 (I had missed that they lacked Firefox).

Peter added a comment.Sep 1 2017, 7:29 PM

It's been updated now, so I'll update the server (by making sure we get the latest /var/www/webpagetest/www/settings/ec2_locations.ini from Github and setting c4.large as default).

Peter added a comment.Sep 1 2017, 7:45 PM

It worked now (spinning up a Linux instance with FF) so I'll continue to verify that the data seems ok next week.

Peter added a comment.EditedSep 4 2017, 9:05 AM

The things I've seen so far testing Firefox on Linux:

  • Content type Other is really high, compare with Chrome and we can see that something is wrong:

FF: http://wpt.wmftest.org/results.php?test=170904_T4_7F&medianRun=fastest&medianMetric=SpeedIndex
Chrome: http://wpt.wmftest.org/results.php?test=170904_Y4_79&medianRun=fastest&medianMetric=SpeedIndex


http://wpt.wmftest.org/result/170904_Q4_7D/1/details/#waterfall_view_step1

I'll add issues at Github and see if I can fix some of them and continue to test with second view tests.

Peter added a comment.EditedSep 4 2017, 9:43 AM

Next problem: When you log in a user in Firefox we get an extra request, look at the second request: http://wpt.wmftest.org/result/170904_0B_9H/1/details/#waterfall_view_step1

This is what it looks like on Windows (and how it should be).

Peter added a comment.Sep 5 2017, 5:36 PM

Most of the bugs are fixed, but we still have "Internal FF URLs are picked up (https://tracking-protection.cdn.mozilla.net/...)" - https://github.com/WPO-Foundation/wptagent/issues/40 that blocks me from more testing.

Peter added a comment.Sep 6 2017, 3:52 PM

The internal URLs are disabled now, so I can move on with the testing.

This is kind of worrying. I'm testing https://gerrit.wikimedia.org/r/#/c/378658/ - 9 URLs per script, one run per script (3 scripts), so it should test 27 URLs. Testing the first 9 takes 6 min but then something happens: either I don't get them through or the full test run takes over an hour. I'll file a couple of upstream bugs during the day. I think the problem is caused by a "smart" check that verifies that the CPU usage is not too high before we start the next test.
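To make the size of the slowdown concrete, here is a small back-of-the-envelope sketch in Python. The counts and timings come from the comment above; the assumption that test time should scale linearly with the number of URLs is mine, and "over an hour" is treated as a lower bound of 60 minutes.

```python
# Figures from the comment above.
urls_per_script = 9
scripts = 3
runs_per_script = 1

total_urls = urls_per_script * scripts * runs_per_script

# The first 9 URLs took about 6 minutes.
minutes_for_first_batch = 6
minutes_per_url = minutes_for_first_batch / urls_per_script

# ASSUMPTION: with no throttling, time grows linearly with the URL count.
expected_minutes = total_urls * minutes_per_url

# "Over an hour" is only a lower bound for what was observed.
observed_minutes_lower_bound = 60
slowdown = observed_minutes_lower_bound / expected_minutes

print(f"URLs to test: {total_urls}")
print(f"Expected duration if timing were linear: {expected_minutes:.0f} min")
print(f"Observed slowdown: at least {slowdown:.1f}x")
```

If the linear assumption roughly holds, the remaining 18 URLs are taking several times longer per URL than the first 9, which fits the idea of the agent waiting between tests rather than the tests themselves being slow.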

I've added https://github.com/WPO-Foundation/wptagent/issues/56 for the problem that the tests take such a long time, and https://github.com/WPO-Foundation/wptagent/issues/57 asking for the best practice for updating the browser versions (right now they are locked to the ones from when the AWS image was created).

Peter added a comment.Sep 19 2017, 7:13 AM

https://github.com/WPO-Foundation/wptagent/issues/56 is a known issue. We can help debug it by turning on logging, but at the moment the WebPageTest wrapper API doesn't support that, so I'll start by seeing if I can just add it. Then I need to rerun the tests, collect all the logs and analyze them.

I've started on this again, but instead of letting WPT handle the hosts, we'll deploy the agent ourselves. I want to have it up and running and then let it run tests for a while. We can use it to test Firefox 56 vs Firefox 57 if I get it to work.

Imarlier removed a subscriber: Imarlier.
Imarlier added a subscriber: Imarlier.
Peter added a comment.Dec 14 2017, 3:19 PM

Next step: We should run it on AWS but not with the automatic deploy. We should create a Linux instance (there's an Ubuntu install script so we should use Ubuntu, I think). Then start the agent so it connects to our WebPageTest server (we probably need to reconfigure our settings on the server; I had problems with that the last time I tried). Then we make sure we start the agent with full logging, and then locally we can fire 10 tests from https://github.com/wikimedia/wpt-reporter, compare the times and keep the log so we can attach it to the issue on GitHub. Maybe we can do a PR when we find the problem.
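A rough sketch of the "fire 10 tests and compare the times" step, written in Python against the WebPageTest HTTP API (runtest.php) instead of wpt-reporter, so it is not how the plan above would actually be run. The server name is the one used in this task; the location label, API key and page URL are hypothetical placeholders, and the script only builds the request URLs and summarizes durations that you have measured yourself.

```python
from statistics import mean, median
from urllib.parse import urlencode

# Server used in this task. Everything passed in below is a placeholder.
WPT_SERVER = "http://wpt.wmftest.org"


def build_runtest_url(page_url, location, api_key, runs=1):
    """Build a runtest.php URL that submits a first-view test and asks for JSON."""
    query = urlencode({
        "url": page_url,
        "location": location,
        "runs": runs,
        "fvonly": 1,   # first view only
        "f": "json",   # JSON response instead of a redirect
        "k": api_key,
    })
    return f"{WPT_SERVER}/runtest.php?{query}"


def summarize(durations_s):
    """Summarize wall-clock durations (in seconds) for a batch of tests."""
    return {
        "count": len(durations_s),
        "mean": mean(durations_s),
        "median": median(durations_s),
        "max": max(durations_s),
    }


if __name__ == "__main__":
    # Hypothetical location label and key, for illustration only.
    for i in range(10):
        print(i + 1, build_runtest_url(
            "https://en.wikipedia.org/wiki/Main_Page",
            "hypothetical-linux-location:Chrome",
            "HYPOTHETICAL_API_KEY",
        ))
```

Feeding the measured submit-to-finish time of each of the 10 tests into `summarize` should make it easy to see whether a few tests are outliers or whether every test is slow, which is useful context to attach to the upstream issue together with the agent log.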

Peter added a comment.Dec 14 2017, 3:28 PM

Let me start with this tomorrow, would be nice to have it fixed and then start the new year on Linux :)

@Peter Awesome -- let me know if an extra set of hands/eyes would be helpful.

Peter added a comment.Dec 15 2017, 8:24 AM

@Imarlier yes, I'll ping you early next week if I don't get it going. I'll try to reproduce with the Docker containers locally first (I actually didn't try that before).

Peter added a comment.Dec 15 2017, 8:36 AM

Hmm, our Docker instructions don't work anymore. Note to myself to update them when I get them to work.

Peter added a comment.Dec 15 2017, 2:13 PM

The WebPageTest Docker containers aren't tagged per release, so our old instructions for getting it up and running aren't working anymore (at least for me). I've added an issue upstream: https://github.com/WPO-Foundation/webpagetest/issues/1069

Peter added a comment.Dec 18 2017, 6:55 AM

Added upstream issues:
https://github.com/WPO-Foundation/wptagent/issues/70
https://github.com/WPO-Foundation/wptagent/issues/71

I'll have a go at that video problem, see if I can fix it.

Krinkle renamed this task from Test the new Linux based AWS instance(s) to Test the new Linux-based WPT agents in AWS.Jan 17 2018, 8:58 AM
Peter added a comment.Jan 17 2018, 9:02 AM

WebPageTest.org is now running on Linux https://twitter.com/patmeenan/status/951234346458984454 so I think we can move on, even though we had the problems. It's better to just run it side by side with the current version.

Imarlier renamed this task from Test the new Linux-based WPT agents in AWS to Move to the new Linux-based WPT agents in AWS.Jan 18 2018, 5:39 PM

Change 406283 had a related patch set uploaded (by Phedenskog; owner: Phedenskog):
[integration/config@master] Run tests on a Linux agent for WebPageTest

https://gerrit.wikimedia.org/r/406283

Peter added a comment.Jan 26 2018, 2:37 PM

I think we should push the change as soon as we think it is ok, so we can verify that it works. Then we can check if we can run it all the way through Q4 along with the Windows version? I'm just really curious to know if it will work.

@Peter Agreed, let's get it out there.

Change 406283 merged by jenkins-bot:
[integration/config@master] Run tests on a Linux agent for WebPageTest

https://gerrit.wikimedia.org/r/406283

Change 407067 had a related patch set uploaded (by Phedenskog; owner: Phedenskog):
[integration/config@master] Remove testing IE test from the WebPageTest Linux instance.

https://gerrit.wikimedia.org/r/407067

Change 407067 merged by jenkins-bot:
[integration/config@master] Remove testing IE test from the WebPageTest Linux instance.

https://gerrit.wikimedia.org/r/407067

Peter added a comment.Feb 2 2018, 6:43 AM

It seems like the problem we've been having is caused by the auto-scaling on AWS (that we use for Windows). It doesn't seem to work for Linux: instead of keeping the instance alive, it kills it after 1 hour or so.

Let me try to deploy a static instance: https://github.com/WPO-Foundation/webpagetest-docs/blob/master/user/Private%20Instances/ec2_agents.md

Change 407630 had a related patch set uploaded (by Phedenskog; owner: Phedenskog):
[integration/config@master] Change Linux instance name for WebPageTest Linux version

https://gerrit.wikimedia.org/r/407630

Peter updated the task description. (Show Details)Feb 2 2018, 2:02 PM

Change 407630 merged by jenkins-bot:
[integration/config@master] Change Linux instance name for WebPageTest Linux version

https://gerrit.wikimedia.org/r/407630

Peter updated the task description. (Show Details)Feb 4 2018, 3:52 PM
Peter updated the task description. (Show Details)Feb 4 2018, 6:30 PM

Change 413343 had a related patch set uploaded (by Phedenskog; owner: Phedenskog):
[performance/WebPageTest@master] Test only first runs for our three test URLs

https://gerrit.wikimedia.org/r/413343

Change 413347 had a related patch set uploaded (by Phedenskog; owner: Phedenskog):
[integration/config@master] WebPageTest: Use new Linux instance name

https://gerrit.wikimedia.org/r/413347

Peter updated the task description. (Show Details)Feb 22 2018, 11:14 AM

Change 413347 merged by jenkins-bot:
[integration/config@master] WebPageTest: Use new Linux instance name

https://gerrit.wikimedia.org/r/413347

Change 413343 merged by jenkins-bot:
[performance/WebPageTest@master] Test only first runs for our three test URLs

https://gerrit.wikimedia.org/r/413343

Peter updated the task description. (Show Details)Feb 25 2018, 12:04 PM
Peter updated the task description. (Show Details)
Peter updated the task description. (Show Details)Mar 6 2018, 9:46 AM

Change 423653 had a related patch set uploaded (by Phedenskog; owner: Phedenskog):
[integration/config@master] WebPageTest: Remove the Windows test agent

https://gerrit.wikimedia.org/r/423653

Change 423655 had a related patch set uploaded (by Phedenskog; owner: Phedenskog):
[performance/WebPageTest@master] Remove Windows agent

https://gerrit.wikimedia.org/r/423655

Change 423655 merged by jenkins-bot:
[performance/WebPageTest@master] Remove Windows agent

https://gerrit.wikimedia.org/r/423655

Change 423653 merged by jenkins-bot:
[integration/config@master] WebPageTest: Remove the Windows test agent

https://gerrit.wikimedia.org/r/423653

Peter updated the task description. (Show Details)Apr 19 2018, 7:41 AM
Peter closed this task as Resolved.Apr 19 2018, 7:53 AM

I created T192522 as a follow-up for when we have collected the yearly stats from the old machine. I think we can say this task is done and then do the cleanup after we have collected the metrics at the end of the quarter.