Did a quick check and it loads a lot of images. I think you are perfectly right @Gilles that we should lazy load images on Desktop too. http://wpt.wmftest.org/result/180419_RD_DR/4/details/#waterfall_view_step1
I've switched to the one with the most stable metrics, shut down two of the others, and will keep the first server for a couple of hours just to make sure everything works OK.
For emulated mobile the numbers are closer to each other. That instance isn't always the best, but the difference is really small:
We have 3 days of metrics now; that should be enough. I can actually see a difference in stability. Let me first do a summary:
I created T192522 as a follow-up when we collected the yearly stats from the old machine. I think we can say this task is done and do the cleanup after we've collected the metrics at the end of the quarter.
Tue, Apr 17
Thanks @Krinkle for merging. I added a panel at the bottom of the drilldown as a first step:
Mon, Apr 16
https://grafana.wikimedia.org/dashboard/db/webpagereplay-multiple-instances will have more later on.
I've installed two extra servers so it's easier for us to keep track of differences in the metrics. I'll set up a new dashboard for that when we have more data.
Yep, I asked before about the Windows version, but let me ask again for the agent. I've been wanting a Git version, a product version, and a changelog for a while.
@Gilles no, I didn't touch it on the 30th as far as I remember. I try to add an annotation for every change I do so it's easier to remember. But checking the commit log for wptagent, there was a change: https://github.com/WPO-Foundation/wptagent/commit/217c36fea9fb534b570421e19e0f3148186bf2fe
with the Fully Loaded metric. It looks like we missed requests before, so the new way is correct.
Thanks @Imarlier, almost no tuning has been done so far, so there is a lot to do there. But I wonder if it wouldn't be better to try to tune bare metal instead, with some help? We did get the most stable metrics on AWS, but if we're going to spend time tuning it more, maybe it's better to do that on our own servers instead? I mean, the exact metric values don't matter for us; what's important is that the metrics are stable, and we get that with more control?
Fri, Apr 13
It adds metrics under browsertime.enwiki-test, and it looks the same for FF. I'll check tonight or tomorrow to see if there's a difference.
The old instance runs 4.4.0-1054-aws and the new one I installed runs 4.4.0-1052-aws (an older version). So far, when I checked the new one, the metrics are back to what they were before (checking CPU time spent in Chrome). But it will be easier to see when we have more data.
Thanks, I updated the docs.
I talked to a friend who runs everything on AWS and he said that this kind of thing happens. I got the other instance up and running (it took some time); let's see what those metrics look like.
I have created a new instance and let it send metrics to another key structure for now; I need to keep it running for a couple of hours to know more.
The CPU usage seems higher on the instance after the stop:
I've got a feeling there's something going on on the server. I'll create a new one during the day and deploy there. I'll also change it so we log to another dir and start on server restart.
Waiting on new tags on the Docker hub for WPT.
Thu, Apr 12
Tested and it works.
Mon, Apr 9
Fixed earlier this week.
Fri, Apr 6
Pat Meenan told me he could set up a Moto G4 for our private use to test out; that would be a good idea I think.
Got an answer that the only way to fix it automatically is to use the auto scaling (which isn't working). It makes me wonder if we should run the tests differently: the Linux machines spin up much faster than Windows, so maybe we should burst the tests (send them all at once), scale up a couple of machines, and then kill them when they are finished.
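To make the burst idea concrete, here's a rough sketch using the AWS SDK for Node; the AMI id, instance type, and region are placeholders, not our actual setup:

```js
// Rough sketch of the burst approach: spin up a batch of Linux agents at
// once, then terminate them when the queued tests are done.
const AWS = require('aws-sdk');
const ec2 = new AWS.EC2({ region: 'us-east-1' }); // hypothetical region

async function burstAgents(count) {
  const result = await ec2.runInstances({
    ImageId: 'ami-00000000',  // hypothetical agent AMI
    InstanceType: 'c5.large', // hypothetical instance type
    MinCount: count,
    MaxCount: count,
  }).promise();
  return result.Instances.map((i) => i.InstanceId);
}

async function killAgents(instanceIds) {
  await ec2.terminateInstances({ InstanceIds: instanceIds }).promise();
}
```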
Good, let me try that. I've added it to Browsertime now and I'm adding it to the list for WebPageTest too.
@Gilles great! Do you want to contact Mozilla about it? We don't see the same variance for Chrome.
Thu, Apr 5
I killed the instances, recreated new ones with the new AMI IDs, and it works (we have a big queue though).
Trying the simplest solution by disabling the calculation on the agent in the settings.ini:
I could change the configuration to make this go away (by letting the Visual Metrics pick up the viewport). It will go out the next time we do an update.
There's a couple of things going on here:
This is another good example with a lot of variance for Firefox (https://en.wikipedia.org/wiki/Cephalopod_size): https://www.webpagetest.org/result/180405_76_1c5de8b142ba1ed466b268022c724722/
Yep, we could do that; we need to find an article that gives stable metrics in Firefox though. Did a quick test run https://www.webpagetest.org/result/180405_76_1c5de8b142ba1ed466b268022c724722/ - First Visual Change goes from 1.5 s to 2.4 s :( It's a good article for more feedback to Mozilla though.
Tue, Apr 3
I've disabled the Firefox alerts for 7 days, since turning on the MOZ log increased the metrics, so if the alerts fire they are false alerts.
Deployed earlier today; I'll watch the alerts and graphs over the coming days and make some adjustments. All Chrome Visual Metrics will be faster with this release because of two things: Browsertime now stores trace logs/screenshots between runs (before, they were stored after all runs), and the navigation logic for when to start a test has been fine-tuned. Before, there could be latency between switching the background to white and navigating the browser to the URL (they happened in two different WebDriver commands), but now it is one and the same command.
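Roughly, the idea of the navigation change in WebDriver terms (just a sketch, not the actual Browsertime code):

```js
// Do the white-background switch and the navigation in one executeScript
// call instead of two separate WebDriver commands, so no latency can sneak
// in between them. `driver` is a selenium-webdriver session.
function navigate(driver, url) {
  return driver.executeScript(
    'document.documentElement.style.background = "#fff";' +
    'window.location.href = arguments[0];',
    url
  );
}
```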
The dashboard is updated: https://grafana.wikimedia.org/dashboard/db/wikidata-webpagetest
FYI: I'm changing the dashboard today. The Windows agent went crazy on 31/3 and the metrics are too high.
I've updated the alerts now so they use Linux. Looks good now. The Windows agent is still broken though.
The alerts are still using the Windows machine; the rest of our setup is using Linux. It seems like the Windows machine went crazy, look at this:
Thu, Mar 29
I talked with @Imarlier yesterday and the way forward is to just decide on the metrics to start with, and then we can continue to refine them. I'll make a first proposal and you all @Krinkle @Gilles @aaron @Imarlier can edit/change/make suggestions:
Wed, Mar 28
@Gilles I want to try it out, what's the easiest way to do it?
Tue, Mar 27
I've pushed the tests into the crontab on the WebPageTest agent too. We could add/change the URLs, but let's do that in another task.
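For reference, the crontab entries look roughly like this (the wrapper script and paths here are hypothetical):

```
# Run the WebPageTest tests every hour and keep a log for debugging.
0 * * * * /home/wpt/run-webpagetest-tests.sh >> /var/log/wpt/tests.log 2>&1
```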
Mon, Mar 26
I've tried out different preferences and stopped more types of background requests (setting network.captive-portal-service.enabled to false). I haven't been able to track down any more requests in the MOZ log, so it is probably OK for now.
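In user.js form the prefs look like this (the captive portal one is the change described above; the OCSP one is the fix from earlier):

```js
// Stop Firefox background traffic that pollutes the test runs.
user_pref("network.captive-portal-service.enabled", false); // no captive-portal probes
user_pref("security.OCSP.enabled", 0); // no OCSP requests (must be the integer 0)
```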
Fri, Mar 23
This is driving me crazy :( I've verified that Firefox is using the proxy (by removing internet access when replaying the data), and it works. However, the Firefox metrics have high variance both with and without the proxy. Running Chrome without setting any connectivity and not using the proxy, the variance is 100 ms, and the same or less when we use the proxy. For Firefox, the variance in firstVisualChange is 1000 ms both with and without the proxy. I wonder if there's something in the Docker setup that can explain it? When I check the HAR file and compare it with a run on the same connectivity, the HARs are mostly identical except that the timings differ by 1000 ms (so the HAR doesn't help). I'll test whether I see the same without Docker and also collect the MOZ log, but I think what's missing is a trace log of what Firefox is doing. So far I've tried rolling back to stable 58, disabling IPv6, and increasing the file limit on Linux, but no luck.
Thu, Mar 22
I've done a test switching to the same Firefox preferences that WebPageTest uses instead of the defaults that Geckodriver uses (+ some extras), and the median SpeedIndex looks a lot better:
Wed, Mar 21
Hmm, I've been checking individual runs, and for Firefox, when we do 11 runs, the variance between runs can be 500 ms (for Chrome it is 100 ms), so I wonder if it is some configuration issue? We turned off OCSP and the block/allow list updates, but maybe something more needs to be disabled.
Hmm, something is wrong with navtiming2. @Krinkle and I looked it through last night and I cannot make the numbers match. The idea with navtiming2 is that we should either send all metrics or no metrics at all.
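A sketch of that all-or-nothing rule (the metric names here are illustrative, not the actual navtiming2 schema):

```js
// Collect every Navigation Timing value first, and only report if all of
// them are present and sane; otherwise send nothing at all.
function collectNavTiming() {
  const t = performance.timing;
  const metrics = {
    responseStart: t.responseStart - t.navigationStart,
    domInteractive: t.domInteractive - t.navigationStart,
    loadEventEnd: t.loadEventEnd - t.navigationStart,
  };
  const valid = Object.values(metrics).every((v) => Number.isFinite(v) && v >= 0);
  return valid ? metrics : null; // null means: send no metrics at all
}
```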
Mar 20 2018
Here's another example. These changes made First Visual Change go down 400 ms.
Mar 19 2018
Looks good @Krinkle! When we've moved to Linux and collected the "final" metrics for Windows, we can do a major cleanup again.
Yep it seems to have been changed in WebPageTest:
Hmm, has this changed upstream, or is it a statsv thing? We only have Last Interactive (again); the First Interactive and Time to Interactive metrics are gone.
This works on the test server (where we test against the Swedish URLs), so we can update it on the server that tests for the alerts when we do the next update.
Should be OK now, but let's keep this open until after the next run so we are 100% sure.
Thanks for spotting it @Gilles! I'm disabling the failing URLs for now.
That connection problem was only temporary (if you haven't fixed anything, @Gilles?). The problem is that https://www.wikipedia.beta.wmflabs.org/ isn't redirected anymore (the portal team tests that page):
Mar 14 2018
It turns out that Firefox handles the preference values 0, "0", and false differently in this case. Only the actual integer 0 turns off OCSP, so I need to do a code change to fix that and then build a new Docker container and deploy.
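The gotcha in code form, a sketch of the kind of coercion needed (not the actual commit):

```js
// Firefox only disables OCSP for the real integer 0; the string "0" or the
// boolean false leave it enabled, so numeric-looking pref values coming in
// as strings need to be coerced before they are set.
function coercePrefValue(value) {
  if (typeof value !== 'string') return value; // already correctly typed
  if (value === 'false') return false;
  if (value === 'true') return true;
  const asNumber = Number(value);
  return value !== '' && !Number.isNaN(asNumber) ? asNumber : value; // "0" -> 0
}
```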
This is fixed by disabling OCSP ('security.OCSP.enabled': 0). I'll make the changes later today and verify that it works correctly on the server (it worked when I tried it on my local machine).
Adding some results. Running Firefox with 11 runs and the HAR plugin turned on:
Mar 13 2018
I think we can push RUMSpeedIndex; what I would also like is if we could pick it up in WebPageTest. I think we can just add a global JS metric on the WebPageTest server so it will always collect it? I can have a go later today.
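A sketch of what that global custom metric could look like on the WebPageTest server, assuming the rum-speedindex.js helper is injected on the page (WebPageTest custom metrics are named JS snippets that return a value):

```
[RUMSpeedIndex]
return (typeof RUMSpeedIndex === 'function') ? RUMSpeedIndex() : undefined;
```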