Have you had a chance to look at it today, now that the change has been deployed?
Actually, that setting is ineffective: setting it to UTC doesn't remove the localization of the dates in the alert history. This is starting to look like an upstream bug.
You can turn it off for yourself in the Grafana preferences:
Tue, Apr 25
Actually, as the wiki page states, the traffic will switch back on May 1 and May 2, and that's what will affect the latency in WPT.
May 3rd 14:00 UTC: https://wikitech.wikimedia.org/wiki/Switch_Datacenter
@Peter can you record another high-quality video? I believe this gets deployed later today. With the switchover there's extra latency right now for WPT that wasn't there when you recorded your reference video.
Mon, Apr 24
@JKatzWMF gave me a detailed response about the success criteria:
I don't think it's possible to tell Grafana something like that in the alert configuration, though, is it? I.e. to only alert if the criteria have matched N times in a row.
What you're quoting are essays, not research. As reputable as N & N might be in the design world, the articles you're linking to are still opinion pieces and the numbers they provide are pulled out of thin air, with no indication as to how they ended up with those values. A best guess from someone famous is still a best guess.
Fri, Apr 21
Seems to work, I was able to post a comment, which was blocked for me before. Thanks!
I'm sorry to say, but what operating systems do and what other websites do is completely irrelevant. You put way too much trust in them doing the right thing and having verified it, without linking to any comprehensive study of a feature exactly like this one. Due diligence is very inconsistent for UX design on big web properties, and people make mistakes.
Thu, Apr 20
Thanks! The performance team, i.e. the following wikitech usernames:
And the effect looks the same as when this last happened, during the DDoS-related traffic rerouting.
I believe it's the additional latency from these locations caused by the switchover. The start of it coincides exactly with when @BBlack switched traffic over (ahead of the actual switchover).
Sure, I think we're done now that we have the email alerts.
Wed, Apr 19
According to the SAL (https://wikitech.wikimedia.org/wiki/Server_Admin_Log), Brandon already switched some traffic around at that time in preparation for the switchover. That's the explanation. It's similar to what happened when traffic was moved to mitigate the last DDoS attempt.
I don't see any similar rise in Navigation Timing; this seems to be WPT-specific.
Thu, Apr 13
Seems like this was deployed back in January?
Lowering the priority of this. Since some commands have unbounded subcommand memory use, process restarts will always happen, and they are now being properly retried by nginx.
After further investigation, the retry mechanism is working properly; the only requests that end up served back as errors are ones that failed twice on different upstreams (such as what happens when all instances get restarted at once).
OK, something that should have been obvious but that I only just noticed: it's a bad idea to restart all Thumbor instances at once. They all start up at the same time and are all unavailable to handle requests during the startup period, which means the retry does happen, but it lands on an equally busy instance that is also starting up. We should avoid restarting all Thumbor instances at once in the future and do a graceful rolling restart if possible, roughly as sketched below. I'll create a low-prio task for that.
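Something along these lines would do it (the systemd instance names and health-check URL are assumptions, not the real service config):

```python
import subprocess
import time

import requests

# Assumed instance layout: one thumbor@<port> systemd unit per port.
PORTS = range(8800, 8816)
HEALTH_URL = "http://localhost:%d/healthcheck"  # hypothetical endpoint


def wait_healthy(port, timeout=60):
    # Poll the instance until it serves requests again, or give up.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(HEALTH_URL % port, timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass
        time.sleep(1)
    return False


for port in PORTS:
    # Restart one instance at a time, so the pool never loses more than
    # one instance and nginx's retry always finds a live upstream.
    subprocess.check_call(["systemctl", "restart", "thumbor@%d" % port])
    if not wait_healthy(port):
        raise SystemExit("thumbor@%d did not come back up" % port)
```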
We actually have only one performance test, as far as I can see. It uses Chrome, which makes it "expensive", but that's justified since it logs in. There do indeed seem to be significant savings to be had on the availability tests ops has set up, though.
Wed, Apr 12
Tue, Apr 11
Sure thing, it's safe for any wiki to change that setting to 250px.
Mon, Apr 10
It's safe to deploy
SpeedLine does its Speed Index calculation based on screenshots, though, doesn't it? It must be a visual thing anyway, since they optionally offer the filmstrip. Would that require tracelogs?
Sat, Apr 8
Mon, Apr 3
We need to address the flapping alert (Difference in size authenticated).
Thu, Mar 30
Fix available here: https://gerrit.wikimedia.org/r/#/c/345608/
Tested and confirmed as a hotfix on thumbor1001.
I think I've found what's causing it: requests' stream option. Maybe the chunks are too small? Either way, CPU maxes out and everything is slow, unlike when reading without streaming. The default chunk size for requests might be bigger than what we use for Thumbor, because my little requests script, modified to use stream=True, takes 90 seconds.
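A minimal way to compare the two paths (the URL is a placeholder, and the real Thumbor chunk size would need checking):

```python
import time

import requests

URL = "https://example.org/large-test-file"  # placeholder; any large file works

# Plain GET: requests buffers the whole body in one go.
start = time.time()
body = requests.get(URL).content
print("no streaming: %.1fs, %d bytes" % (time.time() - start, len(body)))

# Streaming GET with a deliberately small chunk size, mimicking the suspected slow path.
start = time.time()
resp = requests.get(URL, stream=True)
body = b"".join(resp.iter_content(chunk_size=512))
print("stream=True:  %.1fs, %d bytes" % (time.time() - start, len(body)))
```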
Using requests directly is fast, unlike python-swiftclient fetching the same thing:
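A comparison script along these lines shows it (the auth URL, credentials and object names below are placeholders, not the real Swift config):

```python
import time

import requests
import swiftclient.client

# Placeholder credentials; the real ones live in Thumbor's Swift loader config.
conn = swiftclient.client.Connection(
    authurl="https://ms-fe.example.org/auth/v1.0",
    user="account:user",
    key="secret",
)

start = time.time()
headers, body = conn.get_object("wikipedia-commons-local-public.xy", "x/y/Example.jpg")
print("swiftclient: %.2fs for %d bytes" % (time.time() - start, len(body)))

start = time.time()
resp = requests.get(
    "https://ms-fe.example.org/v1/AUTH_xxx/wikipedia-commons-local-public.xy/x/y/Example.jpg"
)
print("requests:    %.2fs for %d bytes" % (time.time() - start, len(resp.content)))
```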
Apparently it's a recent addition to Python's SSL support:
Using the script at the end of https://security.stackexchange.com/questions/52150/identify-ssl-version-and-cipher-suite, it seems like Python is using the second one:
Turning that option off doesn't seem to help. I've searched around the web and people mention that some ciphers are a lot slower than others, and which one gets picked can depend on the order in which the client tries them.
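Something along these lines could force a specific cipher suite to test that theory (the cipher string and URL are made up):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context


class CipherAdapter(HTTPAdapter):
    # Transport adapter restricting the TLS ciphers the client offers.

    def __init__(self, ciphers, **kwargs):
        self._ciphers = ciphers
        super(CipherAdapter, self).__init__(**kwargs)

    def init_poolmanager(self, *args, **kwargs):
        # Build an SSL context limited to the given ciphers and hand it to urllib3.
        kwargs["ssl_context"] = create_urllib3_context(ciphers=self._ciphers)
        return super(CipherAdapter, self).init_poolmanager(*args, **kwargs)


session = requests.Session()
# Made-up cipher; the idea is to time the same fetch with different ciphers forced.
session.mount("https://", CipherAdapter("ECDHE-RSA-AES128-GCM-SHA256"))
print(session.get("https://ms-fe.example.org/").status_code)
```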
There is hope:
Thumbor shoots up to 100% CPU usage. python-swiftclient must be doing the HTTPS communication very inefficiently or something. This will require debugging.
The bulk of the temp files is definitely gone. I'm seeing very few images, which are probably ones currently being processed. I do see a handful of lingering files prefixed with gs_, which, based on their name, I imagine are created by Ghostscript. Some of them are empty and a lot of them have the same size of 10088448 bytes. They contain binary data and the file utility's MIME sniffing doesn't recognize them.
Doesn't seem to work right:
I'll actually enable them once there's ELK integration. Right now, added on top of the existing log entries, they would be too verbose.
I've uploaded the patch so you can see exactly what I'm talking about. It's the version I mention in my last comment.
I've tried simply touching /vagrant/tmp/RELOAD from the role, and adding config.vm.provision :mediawiki_reload if mwv.reload? at the end of the Vagrant.configure('2') block, but it doesn't pick it up until the next vagrant provision (when it actually hits the existing mwv.reload check at the top).
A very in-depth article about preload and how it gets prioritized: https://medium.com/reloading/preload-prefetch-and-priorities-in-chrome-776165961bbf
We should revisit this with link rel="preload", which also supports media queries and can be sent as an HTTP header. In fact, I'm pretty sure that https://gerrit.wikimedia.org/r/#/c/215061/2 had all we needed, but it was mistakenly using rel="prefetch" where we needed preload. It had the right parameters for preload and everything.
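For reference, the header form would be something like this (the path and media query are made up):

```
Link: </images/lead-small.jpg>; rel=preload; as=image; media="(max-width: 640px)"
```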