@Volans cool, I see your point. It's better to be prepared before we start instead of fixing things when they break :) For the WebPageTest annotations we should plan on adding a job to delete them after X days or whatever time. We could have something like that for the alerts too.
Timings for FF57 went down 200-500 ms after removing the calling home. I've also removed trace logging for Chrome in the graphs; that shaved off 200 ms.
Thanks @fgiunchedi . We will not start on the WebPageTest task before we have a solution that we all agree on.
I've changed the name to mediaWikiLoad (the delta, I think we said that when we started) and then just skipped start/end, or do you see a need for them @Krinkle ?
Sorry, the push was for the WPT agent; I've now pushed a PR for the Windows version.
I've seen this https://bugzilla.mozilla.org/show_bug.cgi?id=1418013 and Pat pushed a change yesterday, but there's no change in the metrics. I'm building a new container now with some updated preferences to see if the metrics change. If not, let's wait and see how we should disable the TP. I'm not completely sure about the impact, though, since we get the same result when we test two URLs.
Using 400 ms, each run takes 2.5 minutes = too long. I've disabled Sweden but will keep running Obama for a while.
I've added 400 ms for Barack Obama and Sweden for now and will check in the log how long it takes.
I couldn't see any change for desktop (7 runs); mobile looks like this:
You can see when I made the change most easily in First Visual Change (when the blue "Sweden" line starts to change). Still, it is 0.04 seconds, but let's go back to 7 for now.
It looks like the second view has the potential to be much faster, but it is hidden in the unstable values. Let me change the tests on the other server and see what it looks like.
Also adding what we have on second view for Firefox (second view = hit one page, then go to another page):
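For reference, a second view run like that can be scripted with Browsertime's --preURL flag, which visits one URL before measuring the next one in the same browser session. A minimal sketch (the URLs are just examples):

  # Prime the browser with one page, then measure the next one
  browsertime --preURL https://en.wikipedia.org/wiki/Barack_Obama \
    https://en.wikipedia.org/wiki/Sweden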
Changed. The runs that start at 10 CET will use 7 runs. If that works out we can test 5 too; then we can squeeze in more URLs within an hour.
Today we do 11 runs on WebPageReplay; I'll change it to 7 now and see if we can keep the same great numbers.
I'll halt the work on mitmproxy; I've asked Mozilla if they have seen the same thing as we have.
Wed, Nov 15
The automatic annotations from WebPageTest we could purge every two weeks or so if it's an issue; we very rarely need to go back further in time. The annotations that we create manually we need to keep.
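If we end up needing that purge job, a sketch of what it could look like against the Grafana annotations HTTP API (assuming a Grafana version that supports listing and deleting annotations; host, API key and the tag are placeholders):

  # Delete "webpagetest"-tagged annotations older than two weeks
  CUTOFF=$(( ($(date +%s) - 14*24*3600) * 1000 ))
  for id in $(curl -s -H "Authorization: Bearer $GRAFANA_API_KEY" \
      "https://grafana.example.org/api/annotations?tags=webpagetest&to=$CUTOFF" \
      | jq '.[].id'); do
    curl -s -X DELETE -H "Authorization: Bearer $GRAFANA_API_KEY" \
      "https://grafana.example.org/api/annotations/$id"
  done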
Make sure mediaWikiLoadEnd or mediaWikiLoadComplete is working
Yep, it is a mismatch in navtiming2: what is sent from the user is mediaWikiLoadComplete, but the code expects mediaWikiLoadEnd and mediaWikiLoadStart (and then calculates the newly named mediaWikiLoad). Let me fix that.
@Krinkle Yep, totally agree, let's skip that.
Tue, Nov 14
Hmm, I think it could be that the HAR exporter reports the start really early for each request and then just keeps it in the blocked state? But the blocking seems crazy long. I have the same setup running without the proxy but with a throttled connection, and there the blocking only happens when connecting to a new domain: https://results.sitespeed.io/en.wikipedia.org/2017-11-13-17-25-29/pages/en.wikipedia.org/wiki/Barack_Obama/ - to me at least it seems to be added by the proxy.
By default Browsertime closes the browser after onLoadEventEnd + 2 seconds. The
--pageCompleteCheck "return true;"
flag overrides that, so the run is treated as done as soon as the check first runs.
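A minimal sketch of a full invocation (the run count and URL are just examples):

  browsertime --pageCompleteCheck "return true;" -n 5 \
    https://en.wikipedia.org/wiki/Barack_Obama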
Hmm, it doesn't seem to work for me. Can you check, @Gilles, that your HAR looks ok?
When I test locally (in Docker & FF) my first out of five runs is wicked:
Hmmmm, for the 99th percentile, I wonder why load times decrease just before they increase?
Great @Gilles! Do you run in Docker, or should I prepare a version (if so, where do I get your setup)? Also, have you grabbed a HAR file and checked that everything looks ok (order of responses etc.)?
Mon, Nov 13
Yep that will not work. I made a PR to fix acceptInsecureCerts https://github.com/sitespeedio/browsertime/pull/399
No, that format is not right; it takes a Map. I haven't used it before.
Yep, at the moment we don't support pointing to a personal profile, only adding new preferences, because we override some CSS that gets in the way when recording a video. But I guess we could change that.
It seems to have been fixed in Marionette (that Geckodriver talks to): https://bugzilla.mozilla.org/show_bug.cgi?id=1103196
This was what we got from Mozilla but I guess you saw that already? https://searchfox.org/mozilla-central/source/testing/talos/talos/mitmproxy/mitmproxy.py#94
I think that's what we did with mahimahi, if I remember correctly.
To get it to work with Browsertime, just skip setting Firefox preferences and run (changing to the port you are using).
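Something like this, assuming Browsertime's --proxy.http/--proxy.https options and mitmproxy's default port 8080 (host and port are placeholders):

  # Let Selenium configure the proxy instead of raw Firefox preferences
  browsertime -b firefox \
    --proxy.http 127.0.0.1:8080 --proxy.https 127.0.0.1:8080 \
    https://en.wikipedia.org/wiki/Barack_Obama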
Yep, you are right. The problem is somehow in how Selenium sets up the proxy with Firefox. The preferences are set ok, but I guess you then need to configure it in Selenium too. For us that means we need to add the proxy configuration there as well.
AWS still wins in terms of stability, by a wide margin. @Peter did you see a stability difference when running the code directly vs inside a docker container?
Hi sorry I was on leave last week and I think we missed this one.
@Krinkle would love your input on this before I move on: https://wikitech.wikimedia.org/wiki/Measure_Performance
That seems pretty straightforward as long as we can have the access: http://docs.grafana.org/http_api/annotations/
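A sketch of what creating an annotation could look like with that API (host, API key, timestamp and text are placeholders):

  # Create an annotation via the Grafana HTTP API
  curl -s -X POST https://grafana.example.org/api/annotations \
    -H "Authorization: Bearer $GRAFANA_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"time": 1510704000000, "tags": ["deploy"], "text": "Deployed new agent"}'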
When we upgrade to Grafana 4.6.x we can test out the new annotation API in Grafana = we can open up access only to Grafana instead of using the Graphite API.
I think we can do this when we move to the new first paint that @Gilles implemented for Asia. We should make that one the default and keep the old loadTimes as a backup (loadTimes is getting deprecated, so we should also make sure that we don't break on it).
We have more traffic now on 62:
Sun, Nov 5
I had a go at using WebPageReplay for Firefox, but no luck. I got help from Mozillians on how they set the proxy for mitmproxy:
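The exact preferences they shared aren't captured here, but the standard Firefox proxy preferences look roughly like this when passed through Browsertime (a sketch; host and port are placeholders, mirroring what the Talos mitmproxy.py linked above does):

  browsertime \
    --firefox.preference network.proxy.type:1 \
    --firefox.preference network.proxy.http:127.0.0.1 \
    --firefox.preference network.proxy.http_port:8080 \
    --firefox.preference network.proxy.ssl:127.0.0.1 \
    --firefox.preference network.proxy.ssl_port:8080 \
    https://en.wikipedia.org/wiki/Barack_Obama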
Ok, now I've had it running for a while, so I feel comfortable that it works really well. I've been running 11 runs with WebPageReplay and 11 runs without WebPageReplay (but there we collect the trace log from Chrome, so that could change the metrics) and comparing it to our continuous run on WebPageTest.
Fri, Nov 3
The Docker container for the modified mahimahi is here
I'm running WebPageReplay on AWS (c4.large) now, testing three desktop and three mobile URLs with 100 ms latency, and pushing the metrics to https://grafana.wikimedia.org/dashboard/db/webpagereplay
Thu, Nov 2
Wow @Krinkle! I'm super impressed with how you investigated this issue and how you so thoroughly described it. I think we can take your description and make it into a blog post on our blog.
I've spent some time the last few days creating Docker images for WebPageReplay and mahimahi. The mahimahi one doesn't work for me yet; I will look more today/tomorrow, or else you, @Gilles, can maybe try it out when you get back. I'll clean up the Docker container for WebPageReplay today. I pushed it to AWS and it's been running 11 runs for three desktop URLs and three mobile URLs over the last day.
Tue, Oct 31
I checked the time spent in different phases, and for all three pages the time spent in painting increased by 50-100%:
Mon, Oct 30
From today's meeting: redirecting and unload are reported incorrectly; we need to report the zeros from the client too to make medians etc. correct.
Thu, Oct 26
Hmm, the numbers aren't so impressive. Does the server use SSDs? :)
Wed, Oct 25
I pinged Nolan on the original issue: https://github.com/WPO-Foundation/webpagetest/issues/560
Tue, Oct 24
Yes, but the stability changed because the download was super fast without any throttling, right? Grab a HAR file with and without, check them out, and see what kind of download speed each has.
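One quick way to compare them is pulling the per-request receive times out of the HARs with jq (a sketch; the file names are placeholders):

  # Mean receive (download) time per request, in ms
  for f in with-proxy.har without-proxy.har; do
    printf '%s mean receive ms: ' "$f"
    jq '[.log.entries[].timings.receive] | add / length' "$f"
  done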
Let's take this when there's an AWS AMI with everything prepared.
fyi if you build from master, I fixed the throttling some time ago, so you can use:
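Roughly something like this (assuming the tc-based connectivity engine and the cable profile; the URL is just an example):

  browsertime --connectivity.engine tc --connectivity.profile cable \
    https://en.wikipedia.org/wiki/Barack_Obama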
Mon, Oct 23
But it seems there are still some gaps, at least; I will look into it more.
I've started this again, but instead of letting WPT handle the hosts, we'll deploy the agent ourselves. I wanna have it up and running and then test run for a while. We can use it to test Firefox 56 vs Firefox 57 if I get it to work.
Yep we can start with that, and then add WebPageReplay or mahimahi.
Fri, Oct 20
It's here now:
adding @Krinkle for input!
I think the change here is because right now we show the median, and the DNS median time is 0. If I change to mean/p-high instead, I see the metrics are reported. So how do we wanna do this?
Chrome has had it for a while. I'd suspect a regression like this might be caused by other changes in Chrome 62.
Thu, Oct 19
This is what it looked like when I deployed 62 stable on Linux:
Yep, it happens on Linux too. On 62 beta (I don't remember exactly which beta) it was the same behaviour as 61, but on stable it is what we see on WPT. On OS X I still get the same thing on 62:
Arghh wait a minute, I see it on Linux too, let me look more into it.
Running Browsertime on Ubuntu, testing manually on OS X.
I wanna add it's only Chrome 62 on Windows. I don't get the same on Chrome 62 on OS X or Linux.
I'll add all the graphs here later today for historical reasons.
Ok, it seems that on Chrome for WebPageTest the request https://en.wikipedia.org/w/load.php?debug=false&lang=en&modules=jquery%2Cmediawiki%7Cmediawiki.legacy.wikibits&only=scripts&skin=vector&version=0v61to4 is request number 7 when it's slow; before, it was number 53 on the Obama page.
Yep it was 10fps: https://twitter.com/patmeenan/status/920953070946865152
I've asked Pat if it is 30 or 10 for the old Windows agent to be sure.
I said yesterday that I maybe don't see better values with WebPageReplay on c4.large, but I'm not sure; let's see. I've spun up another c4.large instance in the same location as our WebPageTest instance. I've been testing the Obama page with 5 and 11 runs, with and without WebPageReplay, plus testing using only 10 fps in the video (WPT Windows has 10, the new Linux agent has 30 I think). First I tested with the same connectivity settings (cable as defined by WPT), but there was a big diff in SpeedIndex (Linux sooo much faster), so I've added latency on Linux.
Wed, Oct 18
This works now as far as I can see.
Tue, Oct 17
There's a fix that rolled out a couple of days ago. I went through a couple of runs and it looks ok; let's keep this open a day more and I will look at some runs.
Mon, Oct 16
Last weekend I added Browsertime with and without WebPageReplay on another AWS instance to test Obama. I'm not 100% sure replaying adds value in all cases and want better metrics on that. Today I also added the Facebook page, since that seems to have more stable metrics, so we don't only test one URL. I've been fine-tuning the throttling to get it in the area of what we have with WebPageTest and wanna keep it running for a week now, so we have a long period of metrics.
Fri, Oct 13
I've reported it upstream: https://github.com/WPO-Foundation/webpagetest/issues/951
Thu, Oct 12
The big deviation running without WebPageReplay could be because of a bug in the throttle code. When I tried it before, throttling was only turned on for the 2nd/3rd run. The easiest way to spot it is to download the HAR files, compare the first run with runs 3-4, and check the latency. I just worked around it by keeping the download speed/latency constant between runs and turning throttling off in Browsertime.
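A sketch of that workaround, assuming tc/netem on the host and Browsertime's throttling disabled via the native profile (device name, delay and URL are placeholders):

  # Pin latency once on the host instead of re-applying it per run
  sudo tc qdisc add dev eth0 root netem delay 100ms
  browsertime --connectivity.profile native -n 7 \
    https://en.wikipedia.org/wiki/Barack_Obama
  sudo tc qdisc del dev eth0 root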