Sun, May 2
Fri, Apr 30
Tue, Apr 27
Since they are moving the phones I'm not gonna do anything more.
❯ ps -ef | grep 5037
0 1920 1 0 7:53AM ?? 0:01.29 adb -L tcp:5037 fork-server server --reply-fd 5
501 2317 10076 0 7:55AM ttys000 0:00.00 grep 5037
I changed the three servers for WebPageReplay and the Graphite server.
This is First Visual Change on the Obama page. The vertical blue line is when Firefox 88 was rolled out:
Last night it was auto updated.
That release increased the time to download the base page by 1 second in our tests:
I've added it for WebPageTest and will roll it out for the rest later today.
Mon, Apr 26
Let me fix it on the other servers.
Looking at the log I could see:
[2021-04-24 06:45:56] ERROR: Could not upload to S3 RequestTimeTooSkewed: The difference between the request time and the current time is too large. at Request.extractError (/usr/src/app/node_modules/aws-sdk/lib/services/s3.js:718:35)
And then the local data was not removed.
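A minimal sketch of how the clock skew behind RequestTimeTooSkewed could be checked before uploading. It assumes the server time is taken from an HTTP Date header and that the tolerance is 15 minutes (which is what I believe S3 uses); the function names are hypothetical and not part of our upload code.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Assumption: S3 rejects requests whose timestamp is more than 15 minutes off.
MAX_SKEW = timedelta(minutes=15)


def clock_skew(local_now: datetime, server_date_header: str) -> timedelta:
    """Absolute difference between the local clock and an HTTP Date header."""
    server_now = parsedate_to_datetime(server_date_header)
    return abs(local_now - server_now)


def is_too_skewed(local_now: datetime, server_date_header: str,
                  max_skew: timedelta = MAX_SKEW) -> bool:
    """True if the local clock is further off than the allowed tolerance."""
    return clock_skew(local_now, server_date_header) > max_skew


# Hypothetical example: the local clock is 20 minutes ahead of the server.
local = datetime(2021, 4, 24, 7, 5, 56, tzinfo=timezone.utc)
print(is_too_skewed(local, "Sat, 24 Apr 2021 06:45:56 GMT"))
```

If a check like this fails, the local data should be kept until the clock is fixed and the upload succeeds.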
This happened again. Let me look for the root cause.
Sat, Apr 24
Fri, Apr 23
These are the current metrics:
Let me tune the settings on Monday. Let's start with WebPageTest and WebPageReplay.
I created a dashboard where we have the metric for all tools: https://grafana.wikimedia.org/d/N-K4xrXGk/synthetic-testing-calibration?orgId=1
Wed, Apr 21
Let's wait until WebPageTest is updated, then I'll go through them. I checked today but no luck yet; I think the update script maybe runs once a day.
Tue, Apr 20
WebPageTest hasn't updated to 88 yet.
This was my fault from the beginning: I set a directory that Chrome didn't have write privileges to inside the container.
Fri, Apr 9
This is back to normal. Look at serverResponseTime (responseEnd - requestStart):
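For reference, a small sketch of how that metric is derived from Navigation Timing values. The dictionary layout and the sample numbers are hypothetical; only the formula (responseEnd - requestStart) comes from the comment above.

```python
def server_response_time(timing: dict) -> float:
    """serverResponseTime in ms: responseEnd - requestStart."""
    return timing["responseEnd"] - timing["requestStart"]


# Hypothetical Navigation Timing values in milliseconds.
sample = {"requestStart": 120.0, "responseEnd": 310.5}
print(server_response_time(sample))
```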
Yes, version 89 is correct. Let me re-install my Raspberry and see how it works out!
Thu, Apr 8
Oops, I didn't act fast enough to have a look and now the data is gone. Looking at the current tests it looks like the difference is 0.07 seconds between runs for First Visual Change and that is ok I think. That Ajax URL is still slower than the rest.
We don't need to spend time on this; let's do it if we get the same in the new setup.
All of this was caused by the purging. I disabled the purging for now.
It seems to correlate with when we show an empty banner: https://meta.wikimedia.org/w/index.php?title=Special:BannerLoader&campaign=impression_test_clear&banner=impression_test2&uselang=en&debug=false
This has self healed and doesn't seem to be an issue anymore:
Waiting on input from legal to see what and how we can run it.
This solved itself on WebPageTest.
Aha! I can try that, do you have exact instructions? Do you need to have the phones rooted for that?
I changed this a couple of days ago to have a policy of keeping the metrics for 2 years. It's not that many metrics and we collect them once a day.
Wed, Apr 7
Apr 6 2021
Looking at a specific run it looks like this:
Apr 5 2021
Apr 1 2021
Testing the Obama page on three phones at the same time increases the standard deviation:
TTFB: 1.67s (±63.00ms), firstPaint: 3.22s (±169.00ms)
TTFB: 1.64s (±22.00ms), firstPaint: 3.30s (±136.00ms)
TTFB: 1.83s (±178.00ms), firstPaint: 3.46s (±311.00ms)
And then running one phone that is not rooted:
TTFB: 1.63s (±24.00ms), firstPaint: 2.94s (±58.00ms)
TTFB: 1.63s (±28.00ms), firstPaint: 2.98s (±47.00ms)
TTFB: 1.65s (±26.00ms), firstPaint: 2.96s (±51.00ms)
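To compare setups like the ones above, the summary lines can be parsed and the standard deviations averaged per metric. This is only a sketch: the regular expression is my assumption about the line format (it also accepts lines without the ± sign), and the function names are hypothetical.

```python
import re
from statistics import mean

# Matches e.g. "TTFB: 1.63s (±24.00ms" ; the ± sign is optional.
PATTERN = re.compile(r"(\w+):\s*([\d.]+)s\s*\(±?([\d.]+)ms")


def parse_summary(line: str) -> dict:
    """Return {metric: (value_in_seconds, stddev_in_ms)} for one summary line."""
    return {name: (float(value), float(stddev))
            for name, value, stddev in PATTERN.findall(line)}


def mean_stddev(lines: list, metric: str) -> float:
    """Average the reported standard deviation (ms) of one metric over all lines."""
    return mean(parse_summary(line)[metric][1] for line in lines)


# The three runs on one phone that is not rooted, copied from above.
not_rooted = [
    "TTFB: 1.63s (±24.00ms), firstPaint: 2.94s (±58.00ms)",
    "TTFB: 1.63s (±28.00ms), firstPaint: 2.98s (±47.00ms)",
    "TTFB: 1.65s (±26.00ms), firstPaint: 2.96s (±51.00ms)",
]
print(mean_stddev(not_rooted, "TTFB"), mean_stddev(not_rooted, "firstPaint"))
```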
Mar 31 2021
Running one rooted phone looks like this:
TTFB: 1.62s (19.00ms), firstPaint: 2.96s (51.00ms)
TTFB: 1.66s (54.00ms), firstPaint: 2.95s (69.00ms)
TTFB: 1.64s (19.00ms), firstPaint: 2.95s (63.00ms)
I've been trying with three devices (all Moto G5, one rooted) and using throttle 4g. Something with gnirehtet run (run once per device) either doesn't work or I don't understand it: it seems like not all the phones are tethering. However, starting one instance with gnirehtet autorun does the job.
Mar 30 2021
I've been running this locally to find the real problem: running at home, connecting my Mac through an Ethernet connection and then reverse tethering, running like this:
I've been trying to reproduce the issues locally by running gnirehtet and throttling as 4g. Doing eleven runs, I usually get a standard deviation between 75 and 120 ms. But I also get those "long runs" where the standard deviation is 700-800 ms and one or two of the runs have a TTFB of 3-4 seconds instead of 160 ms. I haven't been able to see anything more yet though.
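A minimal sketch of how those "long runs" could be flagged automatically. The TTFB values below are hypothetical and only illustrate the pattern (ten normal runs and one slow one), and the cutoff of five times the median is an arbitrary assumption.

```python
from statistics import median, stdev


def summarize_runs(ttfb_ms: list, outlier_factor: float = 5.0) -> dict:
    """Return the median, the sample standard deviation and the runs whose
    TTFB is more than outlier_factor times the median."""
    med = median(ttfb_ms)
    outliers = [t for t in ttfb_ms if t > outlier_factor * med]
    return {"median": med, "stdev": stdev(ttfb_ms), "outliers": outliers}


# Hypothetical TTFB values (ms) for eleven runs, one of them a "long run".
runs = [160, 150, 170, 165, 155, 3500, 160, 158, 162, 168, 152]
summary = summarize_runs(runs)
print(summary["median"], round(summary["stdev"]), summary["outliers"])
```

Since the median barely moves when one run is slow, comparing against it makes the slow runs stand out even when they inflate the standard deviation.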
Actually, SimpleRT is abandoned, so we should not use it.
We had a lot of logs on that machine. I added a max log size SystemMaxUse=50M in /etc/systemd/journald.conf and restarted. Let me know, @dpifke, if there's a better way or something else I should do!
Mar 26 2021
I saw this earlier today: https://github.com/WPO-Foundation/wptagent/issues/407 saying that Chrome (maybe) started to do more requests in the background at the same time as the tests. I don't have my rooted phone with me today so I couldn't check, but I have added a 30 s sleep before we start our tests so the browser can settle. It reminds me that the Mozilla team actually sleeps for 20 seconds before starting their tests, since Firefox downloads the white/black lists on each browser start.
This worked out really well: