Fri, Feb 21
Ok, there are a couple of things I want to start with:
- Since this will run on many different low-end devices, I think we should instrument/collect metrics from real users so that we can find potential problems in the future. There's an API called the User Timing API that we can use to internally measure our code; it's been around since Firefox 30-something, so it should work. From the outside I would measure API calls, parse times of content etc. But you know the app better, so you may have other input (see the sketch after this list). I'm thinking we could integrate it with https://github.com/wikimedia/wikipedia-kaios/pull/131.
- It's important that you feel that everything is ready when we start to test. Do you know when you will be? :)
- Since there's no good way to automate performance testing on KaiOS devices (at least none that I've found), I think the easiest way is just to compare. So we should compare the current mobile version, for example https://en.m.wikipedia.org/wiki/Barack_Obama, with the KaiOS version, coming from search, direct hit etc. Sounds ok?
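To make the first point concrete, here's a minimal sketch of User Timing instrumentation. performance.mark/measure/getEntriesByName are the real APIs; loadArticleSection and sendMetric are hypothetical names, just to show the pattern:

```javascript
// Minimal sketch of User Timing instrumentation.
// loadArticleSection() and sendMetric() are hypothetical placeholders.
performance.mark('section-load-start')

loadArticleSection(sectionId).then(() => {
  performance.mark('section-load-end')
  performance.measure('section-load', 'section-load-start', 'section-load-end')

  // Read the measure back and report its duration to our metrics endpoint.
  const [entry] = performance.getEntriesByName('section-load', 'measure')
  sendMetric('section-load', entry.duration)
})
```

A nice side effect is that marks and measures also show up in the devtools timeline, so the same instrumentation helps both local debugging and real-user collection.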
Thu, Feb 20
Great, let's do it @hueitan! I'll work on a short list of things that I will add to the task tomorrow, so you have some time to read through.
Wed, Feb 19
@AMuigai I can start doing the performance review (we discussed it in the performance team's weekly meeting). I think the best way to do it is together with one of you: then I can share how and what I test, and you can share how the app is built so I don't focus on the wrong things. That way I learn more about the app and one of you learns more about performance? :)
Tue, Feb 18
Hi @AMuigai, I want to check if it's ready for testing?
Added https://wikitech.wikimedia.org/wiki/Performance/Add_metrics. Let's skip console.timeStamp.
Mon, Feb 17
Did it with help from @dpifke
I've been running the new setup for some days and I got one false alert. Check out the graphs:
Fri, Feb 14
I've been trying a new approach the last couple of days on another machine and I think it looks ok now. Instead of taking the average or median over X hours for the graph, I changed it so we keep the last value. That way we will immediately get an alert when the metrics go up.
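To illustrate why (with made-up numbers): an average over the window dilutes a sudden regression, while the last value trips the alert right away.

```javascript
// Hypothetical data points in ms; the last run regressed badly.
const values = [310, 305, 312, 308, 900]
const LIMIT = 500

const average = values.reduce((sum, v) => sum + v, 0) / values.length
const lastValue = values[values.length - 1]

console.log('average-based alert fires:', average > LIMIT)  // false (~427 ms)
console.log('last-value alert fires:', lastValue > LIMIT)   // true (900 ms)
```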
Wed, Feb 12
We have had downtime on the tests
Mon, Feb 10
This is an issue on emulated mobile:
Fri, Feb 7
I've added the new runner machine and added an example of how to add new URLs to test.
I verified that we have an alert on AWS and it looks ok. Let's hope it works the next time :|
Wed, Feb 5
Tue, Feb 4
This has been fixed upstream. I can see tests going through, but let me verify that the queue is fully finished before I close the issue.
Restarting the agent doesn't work; it seems it's stuck. I tried to stop it and it took a long time. I started it again and followed https://github.com/WPO-Foundation/webpagetest/issues/929, but all the jobs are still in the queue.
I've restarted the agent and removed all tests in the queue, and I'm waiting to see if it helped.
Mon, Feb 3
Also check if we can add it directly to WPT or if it is an upstream task for that.
Wed, Jan 29
Tue, Jan 28
This was done as part of T219496
With the new Chrome 79 problem (T240723) I think we should avoid it. We can rethink this when we run on new machines in the future.
Moving to the new setup we don't need this. Firefox has their own online and Chrome works fine in the browser (Firefox soon too).
We have kind of implemented this by having more runs testing enwiki and fewer runs testing the others.
Jan 28 2020
I went through the logs on the three servers and the only errors now are from editing a page (which doesn't work because the server is blocked) and from testing other web sites that we soon will use to compare ourselves against others.
I've made a hack to fix this for now; in the future we could have one base configuration file that we extend, once for Firefox and once for Chrome (something like the sketch below). Let's do that the next time we need it.
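Something like this is what I have in mind (a sketch only; the option names are made up, not our real config keys):

```javascript
// Hypothetical sketch: one shared base config, extended per browser,
// instead of maintaining two nearly identical files.
const base = {
  iterations: 5,
  connectivity: 'cable',
}

const chromeConfig = { ...base, browser: 'chrome' }
const firefoxConfig = { ...base, browser: 'firefox' }

module.exports = { chromeConfig, firefoxConfig }
```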
This was fixed when upgrading to Chrome 80. See https://bugs.chromium.org/p/chromium/issues/detail?id=1035305#c20 and https://bugs.chromium.org/p/chromium/issues/detail?id=1041421. There were some things going on in Chrome that made the browser freeze.
Jan 27 2020
Since I turned it off there's nothing more to do here for us; let me check with Mozilla if they have any recommendations.
I removed the use of the Gecko profiler. First view:
We also test that page standalone and it looks like this:
Upgrading fixed the issue, so I will upgrade the rest of the servers too.
Jan 23 2020
My guess is that we somehow use different configs for the browser when we record and when we replay. But I haven't been able to nail down the problem, and it seems an old issue is back. A couple of years ago, emulated mode started to resize the screen in Chrome just before the load ended (I'll need to find that issue), and when I tried now with different device emulations, I could see that again :(
This looks like it was fixed by upgrading, but I want to run it over the weekend to be 100% sure, and then we can upgrade the rest of the servers. I also need to dig into the changelog and understand WHY it works now.
I think the problem is images served as image/webp - yep, I'm pretty sure that's the problem.
I've been trying to fix this again. I can see that we record 24 requests for the Facebook page on mobile but replay 26. So something is wrong.
We moved this to git and the new setup so it is fixed.
This was fixed when we rolled up the new setup before Christmas.
Yep it works now. It needed some new code from Mozilla.
It would be super useful to get your eyes on this @dpifke! First as a sanity check to see what it looks like now, and then we could set up a new instance that you can play around with. We could just set up the tests we run for enwiki and send the metrics to another Graphite namespace.
I think this should be part of a bigger effort where we collect more metrics for our synthetic testing machines. @dpifke is that something you would like to do?
There are a couple of things we could check to know if it works: Is the Docker container up and running? How long has it been running? It has happened in the past that a container gets stuck, but we spin up a new one for every test, so if a container runs for more than X minutes, we know something is wrong (see the sketch below).
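As a sketch of what that check could look like (Node.js; the 30-minute limit is a placeholder, and it assumes the docker CLI is available on the runner):

```javascript
// Hypothetical health check: alert if any running container is older
// than MAX_MINUTES, since we start a fresh container for every test.
const { execSync } = require('child_process')

const MAX_MINUTES = 30 // placeholder limit

// Names of currently running containers.
const names = execSync("docker ps --format '{{.Names}}'", { encoding: 'utf8' })
  .trim()
  .split('\n')
  .filter(Boolean)

for (const name of names) {
  // StartedAt is ISO 8601; strip nanoseconds so Date.parse handles it.
  const startedAt = execSync(
    `docker inspect --format '{{.State.StartedAt}}' ${name}`,
    { encoding: 'utf8' }
  ).trim().replace(/\.\d+Z$/, 'Z')

  const ageMinutes = (Date.now() - Date.parse(startedAt)) / 60000
  if (ageMinutes > MAX_MINUTES) {
    console.error(`ALERT: ${name} has been running for ${ageMinutes.toFixed(0)} min`)
  }
}
```

This could run from cron (or the alerting agent) and alert on any output.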
I'm running a test for the enwiki tests and upgrading to the latest container.
Jan 22 2020
I had a look at my own instance and could see the same pattern there:
This only happens when we test desktop. Here we are looking at the max value for backendtime ("should" be stable). This is Barack Obama using replay desktop:
It looks like this:
It seems to only happen for Chrome. I wonder if it's something with the certificate? Last week I tried the net log from Chrome but had no luck; I can try again. With the current setup there isn't any more info we can get from the browser.
Jan 21 2020
From the trace log:
It seems it is categorized as SSL time:
I added alerts for 200 kb. So if all three pages (on mobile or desktop) increase by 200 kb, we will fire an alert. Let's try that limit for a while.
I've prepared everything but haven't added the actual alert:
I think you can see when I did the switch to 80:
Jan 20 2020
This was reported upstream and they could reproduce it. It only happens on Linux. However, in the next release (72) the performance improved. There's nothing more we can do about it.
Got some updates: the Chromedriver team says it's in Chrome and they cannot get any more information from the driver. I'll keep track of it and see if we can spot anything in the trace log from Chrome when it happens.
We got 80 up and running but no change. I'll keep it there until someone on the Chrome team has time to comment on the issue.
We have some CPU time metrics for the URLs we test today (like time spent parsing HTML, styling etc); check out https://grafana.wikimedia.org/d/000000059/webpagereplay-drilldown?orgId=1&var-base=sitespeed_io&var-path=emulatedMobileReplay&var-group=en_m_wikipedia_org&var-page=_wiki_Barack_Obama&var-browser=chrome&var-connectivity=100&var-function=median and scroll down to the CPU section. That is collected on an AWS server where we slow down the CPU to try to look more like a mobile phone. It's not perfect, but it will work until we can use real phones.
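For reference, the slowdown itself can be done through the Chrome DevTools Protocol. A minimal sketch with the chrome-remote-interface package (the 4x rate is just an example, not necessarily what our runner uses):

```javascript
// Sketch: throttle Chrome's CPU via the DevTools Protocol so a fast
// AWS server behaves more like a mid-range phone.
const CDP = require('chrome-remote-interface')

async function run() {
  // Assumes Chrome was started with --remote-debugging-port=9222.
  const client = await CDP()
  const { Emulation, Page } = client

  await Page.enable()
  // Slow the CPU down 4x (example value; tune against a real device).
  await Emulation.setCPUThrottlingRate({ rate: 4 })

  await Page.navigate({ url: 'https://en.m.wikipedia.org/wiki/Barack_Obama' })
  await Page.loadEventFired()
  await client.close()
}

run().catch(console.error)
```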
Jan 17 2020
Great, thanks @Pcoombe!
It's the same on mobile:
Attaching an optimized jpg (74% quality) at 30 kb and one (81% quality) at 35 kb. I use https://imageoptim.com to compress them.
The problem image is https://upload.wikimedia.org/wikipedia/donate/a/a7/WLE_banner_2010.png: 863 kb, no cache time.
Jan 16 2020
Yes, something is sometimes wrong with SSL. I've tried to reproduce it locally but no luck. It seems to happen in about one out of five runs, and the median is still ok when it happens.
I will try to reproduce this locally with the Chrome net log turned on.
Looking at different traces I could see the difference is in SSL setup. For the fast one it looks like this:
I pushed the beta of 80 yesterday and it looks like this:
Jan 15 2020
Jan 14 2020
Updated the documentation.
Jan 13 2020
I don't think we will fix this; it's too much work at the moment.