
Alerts for scripting and SpeedIndex with Browsertime/WebPageReplay
Closed, Resolved · Public

Assigned To
Authored By
Peter
Apr 13 2018, 6:53 AM

Description

We had an outage on the Browsertime/WebPageReplay server from 2018-04-11 16:xx to 2018-04-12 12:xx. When I logged in to the server I could see that there were no tests running and that the /tmp folder (where we log all the information for the runs) was empty. I started the tests again and there is an increase in a lot of metrics:

  • Firefox: Start render/Speed Index skyrocketed and we got higher mdev

Screen Shot 2018-04-13 at 8.41.36 AM.png (1×2 px, 341 KB)

  • Chrome: we could see an increase in scripting/layout but almost nothing in the visual metrics

Screen Shot 2018-04-13 at 8.43.51 AM.png (914×1 px, 123 KB)

  • Emulated mobile: an increase in most CPU metrics, and alerts on SpeedIndex and start render:

hepp.png (2×1 px, 266 KB)

So did something happen on the server with the restart that is increasing most metrics, or is there a real change? I couldn't spot anything on WebPageTest but will look more into it today. We also increased the percentage of users that get page previews to 10 during that time, but that shouldn't affect us, right?

Event Timeline

I have a feeling there's something going on on the server. I'll create a new one during the day and deploy there. I will also change things so that we log to another dir and the tests start on server restart.

Looking at metrics from my other test server, I cannot see a change at all:

Barack Obama on Firefox (that machine has 4 CPUs, though):

Screen Shot 2018-04-13 at 9.10.55 AM.png (950×2 px, 218 KB)

And CPU metrics on emulated mobile (no change at all):

Screen Shot 2018-04-13 at 9.13.50 AM.png (792×2 px, 151 KB)

The CPU usage seems higher on the instance after the stop:

Screen Shot 2018-04-13 at 9.25.12 AM.png (410×540 px, 73 KB)

I have created a new instance and let it send metrics to another key structure for now; I need to keep it running for a couple of hours to know more.

What made the server restart? Sounds like it did it on its own, which could suggest faulty hardware.

There's no trace of the reboot reason in syslog. This looks like when the reboot happened:

Apr 11 18:01:56 ip-172-31-61-226 kernel: [6255936.480068] docker0: port 1(veth39a13cc) entered forwarding state
Apr 11 18:03:09 ip-172-31-61-226 systemd[1]: Stopping Authenticate and Authorize Users to Run Privileged Tasks...
Apr 11 18:03:09 ip-172-31-61-226 systemd[1]: Stopping ACPI event daemon...

And then the system starts up again at 18:04:18.
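Since syslog carries no explicit reboot reason here, one way to locate the outage window is to scan for unusually large gaps between consecutive syslog timestamps. A minimal sketch of that idea (classic syslog lines carry no year, so it has to be supplied, and the 60-second threshold is an arbitrary assumption):

```python
from datetime import datetime

def parse_syslog_time(line, year=2018):
    """Extract the timestamp from a classic syslog line (no year in the format)."""
    # e.g. "Apr 11 18:03:09 ip-172-31-61-226 systemd[1]: ..."
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

def find_gaps(lines, min_gap_seconds=60):
    """Return (previous_line, next_line, gap_seconds) tuples wherever
    consecutive syslog entries are further apart than min_gap_seconds --
    a rough way to spot reboots or outages."""
    gaps = []
    times = [(parse_syslog_time(l), l) for l in lines]
    for (t1, l1), (t2, l2) in zip(times, times[1:]):
        gap = (t2 - t1).total_seconds()
        if gap >= min_gap_seconds:
            gaps.append((l1, l2, gap))
    return gaps
```

On a systemd machine, `journalctl --list-boots` or `last -x reboot` are quicker ways to get the boot history, when the journal survives the restart.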

I talked to a friend that runs everything on AWS and he said that this kind of thing happens. I got the other instance up and running (it took some time); let's see what those metrics look like.

The old instance runs 4.4.0-1054-aws and the new one I installed runs 4.4.0-1052-aws (an older version). So far when I checked the new one, the metrics are back to what they were before (checking CPU time spent in Chrome). But it will be easier to see when we have more data.

It adds metrics under browsertime.enwiki-test; it looks the same for FF. I'll check tonight or tomorrow to see if there's a difference.


A number of bigger AWS users (Netflix, etc.) will actually launch 2 or 3 instances every time they need a new one, run a set of benchmarks on each of them, and kill the slowest ones. See https://www.youtube.com/watch?&v=pYbgcDfM2Ts (the whole video is really worth a watch, but the relevant part is at https://youtu.be/pYbgcDfM2Ts?t=1575)
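The launch-several-and-keep-the-fastest approach is simple to sketch. Assuming you already have a benchmark score per instance (lower is better; the instance IDs and scores below are made up), the selection step is just a sort:

```python
def pick_best_instances(benchmark_scores, keep=1):
    """Given {instance_id: benchmark_seconds} (lower is better),
    return (instances_to_keep, instances_to_terminate)."""
    ranked = sorted(benchmark_scores, key=benchmark_scores.get)
    return ranked[:keep], ranked[keep:]

# Hypothetical benchmark results for three freshly launched instances:
keep, kill = pick_best_instances({"i-aaa": 41.2, "i-bbb": 39.8, "i-ccc": 55.1})
```

The `kill` list would then be fed to something like `aws ec2 terminate-instances`.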

Also, not that we're at the point where this matters, but this is a really solid breakdown of AWS instance tuning: https://www.slideshare.net/brendangregg/how-netflix-tunes-ec2-instances-for-performance

Thanks @Imarlier, almost no tuning has been done so far, so there is a lot to do there. But I wonder if it wouldn't be better to try to tune bare metal instead, with some help? We did get the most stable metrics on AWS, but if we are going to spend time tuning it more, maybe it's better to do it on our own servers instead? I mean, the exact metric values don't matter for us; what's important is that the metrics are stable, and we get that with more control.

I got the same behavior on the new instance (very high variance on FF), but also a difference of 50 ms in painting for Obama in Chrome (let me add some stats later today), so the instances seem to differ.

I've installed two extra servers so it's easier for us to keep track of differences in the metrics; I'll set up a new dashboard for that when we have more data.

Peter renamed this task from Increase/alerts for scripting and SpeedIndex to Alerts for scripting and SpeedIndex with Browsertime/WebPageReplay. Apr 16 2018, 7:31 PM

We have 3 days of metrics now, and that should be enough. I can actually see a difference in stability. Let me first do a summary:

I've deployed three extra c4.large instances on AWS, installed the same software on all servers following https://wikitech.wikimedia.org/wiki/Performance/WebPageReplay#First_time_install and then let them send metrics to Graphite under 4 different keys (we already had one instance up and running):

  • enwiki
  • enwiki-test
  • enwiki-test2
  • enwiki-test3

The enwiki instance is the one we've been running and the one that was restarted some time ago.
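Sending the same metrics under a separate key per instance is straightforward with Graphite's plaintext protocol (one "path value timestamp" line per datapoint, normally sent to port 2003). A sketch, using one of the key prefixes above and a hypothetical metric name:

```python
import time

def graphite_line(prefix, metric, value, timestamp=None):
    """Format one datapoint for Graphite's plaintext protocol:
    "<path> <value> <epoch>\n"."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{prefix}.{metric} {value} {timestamp}\n"

# Hypothetical SpeedIndex datapoint for the enwiki-test instance:
line = graphite_line("browsertime.enwiki-test", "chrome.Obama.SpeedIndex", 1200, 1523600000)
```

Each instance keeps its own prefix, so the same test produces four parallel series that can be graphed side by side.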

Here are the graphs of first visual change and speed index for four URLs tested with Chrome:

allinstances.png (4×2 px, 1 MB)

The enwiki-test instance has the smallest deviation on all URLs except Metalloid, but it is small there too. I'll check the mobile URLs too, so we have a summary for those.

For emulated mobile the numbers are closer to each other. That instance isn't always the best, but the difference is really small:

emulated.png (3×2 px, 799 KB)

I've switched to the one with the most stable metrics, closed down two of the others, and will keep the first server running a couple of hours just to make sure everything works OK.
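Picking "the one with the most stable metrics" boils down to comparing deviations per key. A minimal sketch (the SpeedIndex samples below are invented for illustration):

```python
from statistics import stdev

def most_stable(samples_by_key):
    """Return the Graphite key whose samples have the smallest
    standard deviation, i.e. the most stable instance."""
    return min(samples_by_key, key=lambda k: stdev(samples_by_key[k]))

# Hypothetical SpeedIndex samples (ms) per instance key:
samples = {
    "enwiki": [1000, 1300, 900, 1250],
    "enwiki-test": [1050, 1060, 1045, 1055],
}
```

In practice median deviation (mdev) or interquartile range can be swapped in for `stdev` if outliers are a concern.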

A summary: for Chrome, the instance somehow made the metrics more unstable; changing to another instance made the metrics stable again.

While the server was down, we also pushed a change that made our on-page scripting take longer. That didn't affect the metrics for Chrome, but for Firefox it made the metrics more unstable, because the first paint happens either before or after the JavaScript is parsed (https://phabricator.wikimedia.org/T160315#4148868).
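The Firefox instability described above is a bimodality effect: first paint lands in one of two clusters depending on whether it fires before or after the JavaScript parse, and the mixed distribution has a much larger deviation than either cluster alone. An illustration with made-up numbers:

```python
from statistics import stdev

# Hypothetical first-paint samples (ms): a tight cluster when paint fires
# before the JS parse, another tight cluster when it fires after.
paint_before_js = [400, 410, 405]
paint_after_js = [900, 910, 905]

# Each cluster on its own is stable ...
print(stdev(paint_before_js))                   # 5.0
# ... but the mix of both modes looks very unstable.
print(stdev(paint_before_js + paint_after_js))
```

So the per-run numbers can all be reasonable while the aggregated deviation still triggers alerts.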