Do you expect a delay before the SSL cert change affects people? If that's the case, then we can certainly see a regression ramping up between the 2 dots.
This is the data for cp3064 over that period.
I think it makes more sense to expose new navtiming metrics with Prometheus instead, especially for things like this that require slicing data by a new dimension.
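For illustration, here's roughly what that could look like with the Python prometheus_client library. This is just a sketch, not the actual navtiming code; the metric name, labels, handler, and port are made up for the example:

```
# Minimal sketch: exposing a navigation-timing metric with labels so it can be
# sliced by a new dimension (e.g. browser family) without a schema change.
from prometheus_client import Histogram, start_http_server

# Hypothetical metric and label names, chosen for illustration only.
FIRST_PAINT = Histogram(
    'navtiming_first_paint_seconds',
    'First paint time reported by the Navigation Timing beacon',
    labelnames=['browser_family', 'platform'],
    buckets=(0.25, 0.5, 1, 2, 4, 8),
)

def handle_beacon(event):
    # Each label combination becomes its own time series in Prometheus,
    # so dashboards can slice by browser_family directly.
    FIRST_PAINT.labels(
        browser_family=event.get('browser_family', 'unknown'),
        platform=event.get('platform', 'unknown'),
    ).observe(event['first_paint_ms'] / 1000.0)

if __name__ == '__main__':
    start_http_server(9230)  # port is arbitrary for this sketch
```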
Mon, Nov 18
Assigning this to you, as it sounds like a very likely root cause of the sizeable performance regression.
Sun, Nov 17
@ema have you been switching text caches frequently enough that it could explain this? It seems like the higher level is the new normal, which would suggest a performance degradation coming from ATS.
What might be happening if frontend timing increases so drastically while the backend stays stable? Edit stashing not working properly?
Sat, Nov 16
Stats on users with Chrome Mobile browsing desktop pages:
Thu, Nov 14
New public key:
Corresponding patch: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/550852/
Digging further into the sub-graphs of the Search Console, I found these potential one-time events that could be on our end and could have contributed to the loss in overall URL classification:
Here are the 2 graphs on top of each other for Mobile:
Can you figure out which code these long tasks are coming from?
Tue, Nov 12
Impact of that test in Europe:
Oh ok, perfect. Yes, let's do that.
Further data from our own performance metrics suggests that this is likely a Chrome 77/78 problem. A similar first paint regression seems to happen on desktop Chrome over that period:
As far as I can see, CrUX data doesn't contain browser version. So we won't be able to verify that theory without Google's help. I'll wait until I have access to the search console to get all the details before I file a Chrome bug about this.
On codfw everything seems fine. On both eqiad and codfw arclamp-log restarted on Oct 25, which is when the log files stopped being written on eqiad.
Tue, Nov 5
Just the tasks for the follow-up work, so that I can subscribe to them. But it sounds like they haven't been filed yet?
Mon, Nov 4
Fri, Oct 25
Indeed, nice find! Adding -sstdout=%stderr fixes the issue.
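For context, a rough sketch (Python, not Thumbor's actual command line) of what a Ghostscript invocation could look like with that flag; the device, resolution, and paths here are illustrative:

```
# Render one PDF page to JPEG on stdout. Without -sstdout=%stderr,
# Ghostscript's own messages can be interleaved with the image bytes
# and corrupt the output that the next tool (e.g. ImageMagick) consumes.
import subprocess

def render_pdf_page(pdf_path, page=1, dpi=150):
    cmd = [
        'gs', '-q', '-dBATCH', '-dNOPAUSE', '-dSAFER',
        '-sstdout=%stderr',        # keep Ghostscript chatter out of the image data
        '-sDEVICE=jpeg',
        '-r{}'.format(dpi),
        '-dFirstPage={}'.format(page),
        '-dLastPage={}'.format(page),
        '-sOutputFile=-',          # write the JPEG to stdout
        pdf_path,
    ]
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout  # clean JPEG bytes; any errors are in result.stderr
```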
Thu, Oct 24
No worries, it happens. Just keep that principle in mind for MediaWiki config in general: a wiki-specific value replaces the default, it's not added to it. I don't think there's any other big gotcha for surveys.
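To illustrate the replace-not-merge behaviour, here's a tiny Python sketch of the lookup semantics (the survey names are made up; the real mechanism lives in InitialiseSettings.php, in PHP):

```
# Illustrative only: a per-wiki entry wins outright over 'default',
# it is NOT merged with or appended to the default list.
wgEnabledQuickSurveys = {
    'default': ['perceived-performance-survey'],
    'ruwiki': ['reader-demographics-survey'],   # hypothetical override
}

def surveys_for(wiki):
    # The wiki-specific value replaces the default wholesale.
    return wgEnabledQuickSurveys.get(wiki, wgEnabledQuickSurveys['default'])

print(surveys_for('ruwiki'))  # only the override, not the default survey
```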
By the looks of the survey dashboard, the issue might have gone away, but it's probably because another change to InitialiseSettings.php was just deployed: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/545816/
I've reviewed the data; it's exactly what we had requested and looks completely safe to release. @Nuria can you review these 2 files (takes 5 minutes, really) by Nov 4?
Assigning this to you @Krinkle so you can do more digging on the ResourceLoader side of things.
Yep, confirmed, I see the config now that the startup module has been generated by another server. I've checked on one of the misbehaving app servers and the file on the FS is up to date. I suspect this has to do with PHP not picking up the new version of the file.
I'm still not seeing it in mw.config.get( 'wgEnabledQuickSurveys' ) on ruwiki; I don't know what's up with that. Trying some things out on mwmaint1002, which should pick it up.
Ok, found the issue. @Isaac mistakenly overrode the survey definition for ruwiki by rolling out another survey there: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/539183/
I don't see anything related in Common.js/Common.css, so that might not be what happened.
@stjn did the ruwiki community block the survey on Sep 26? If you have a link to the corresponding community discussion for posterity, that would be helpful. There's no need to force the survey down people's throats as I'm not actively working on it. It was a nice-to-have to get an ongoing human performance metric, but I understand if people collectively got sick of seeing it.
Given the very limited bandwidth I have as manager, it seems unrealistic that I will get around to follow-up work on
I will try looking at this in my spare time, but can't promise anything. We need to figure out who's going to maintain Thumbor going forward.
It seems like the ghostscript command used by Thumbor outputs some errors to stdout that end up in the generated JPG, making it invalid; ImageMagick is subsequently unable to resize that JPG into the desired thumbnail.
It is indeed unusual for this to apply to specific pages of a small PDF, even more so for a PDF automatically generated by Google (which means it was probably generated by a Linux FLOSS software stack).
A graph of unique IPs for POSTs would be useful to compare to here, or at least a breakdown of bot vs non-bot requests.
Wed, Oct 23
Also, could it be related to T235872? And was there by any chance a small DDoS-like event around that time?
@elukey do you need something from the perf team for this task?
@Elitre it should be its own task, since it's a PDF failing to render and this task is about an SVG.
Thanks, I will look for another volunteer on the perf team today, as things have come up for the rest of the week and I don't have enough time left to do this properly.
@Fsalutari has put together a sanitised version of the dataset(s) according to what we agreed on. It's available as df_releasable_1.csv and df_releasable_2.csv under /home/fsalutari on stat1004. I probably won't get a chance to look at it until the week of November 10, but I figured I'd share it here in case you want to look at it before I get a chance to, @Nuria