Wed, Jul 3
This needs a thumbor-plugins Debian package update with the latest code, plus a deployment.
Sat, Jun 29
Thu, Jun 27
At a glance, on a given proxy the same object doesn't occur multiple times in a row. But the same destination object server has timeouts for several objects in a row, over a period of a few seconds.
Alternatively, we could just stick to the JPG we've already generated when a WebP is requested and the cwebp command fails. That would probably make for simpler code and be foolproof against other unforeseen situations where cwebp fails. What's still unclear is why things hang when the cwebp command errors out.
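A rough sketch of that fallback, purely illustrative (the function name, cwebp path and timeout are assumptions, not the actual thumbor-plugins code):

```python
import subprocess

CWEBP = '/usr/bin/cwebp'  # assumed path to the binary

def render_webp_or_fallback(jpeg_path, webp_path, timeout=10):
    """Try converting the already-rendered JPG to WebP; on any cwebp
    failure (non-zero exit, timeout, missing binary) serve the JPG instead."""
    try:
        subprocess.run(
            [CWEBP, jpeg_path, '-o', webp_path],
            check=True,
            timeout=timeout,        # avoid hanging on a stuck cwebp process
            capture_output=True,
        )
        return webp_path, 'image/webp'
    except (subprocess.SubprocessError, OSError):
        # YCCK JPEGs and other unforeseen cwebp failures land here
        return jpeg_path, 'image/jpeg'
```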
It seems like cwebp doesn't like the YCCK color space that file uses:
The error rate hasn't gone down at all; now we're just getting errors that time out at 1s instead of 0.5s...
Cool, the remaining painting is certainly due to the section that appears, but it doesn't block the click event handling, which keeps the page responsive.
I've seen this one stuck in poolcounter throttling for a while; it's definitely quite hot. I'll check whether it's renderable at all. Usually images that can't render get throttled by the failure throttle rather than by poolcounter.
Tue, Jun 25
So I should do that for that list? Are you ok with me requesting peering from all of these ASes?
Remember that x-cache headers are read from right to left. Trying this out right now with a cleared cache and cleared local storage, I get the following headers for all assets on the page:
It does for text: when your IP gets hashed to a specific Varnish frontend, you get all "small" requests from it, i.e. anything below the size threshold for being stored in Varnish frontends.
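As a toy illustration of that routing idea (the host names, threshold and hash choice are made up; the real hashing happens in the load-balancing/Varnish layer, not in code like this):

```python
import hashlib

FRONTENDS = ['cp3030', 'cp3040', 'cp3050']   # hypothetical cache hosts
SIZE_THRESHOLD = 256 * 1024                  # assumed "small object" cutoff

def frontend_for(client_ip):
    # The same client IP always hashes to the same frontend, so that
    # frontend ends up serving all of that client's small objects.
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return FRONTENDS[int(digest, 16) % len(FRONTENDS)]

def cacheable_in_frontend(object_size):
    return object_size < SIZE_THRESHOLD
```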
A lot of files fail to render for various reasons, and end up as 429s because we don't want to constantly retry rendering a failing file; that would be a waste of resources. See the failure throttling documentation here: https://wikitech.wikimedia.org/wiki/Thumbor#Throttling
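To illustrate the mechanism (not the actual Thumbor code; the in-process counter store, limit and TTL below are assumptions):

```python
import time

FAILURE_LIMIT = 4      # assumed: failures tolerated before we start 429ing
FAILURE_TTL = 3600     # assumed: how long (seconds) a file stays throttled

_failures = {}         # file path -> (failure count, window start timestamp)

def should_throttle(path):
    """True means we respond 429 instead of re-attempting the render."""
    count, since = _failures.get(path, (0, 0))
    if time.time() - since > FAILURE_TTL:
        return False
    return count >= FAILURE_LIMIT

def record_failure(path):
    count, since = _failures.get(path, (0, 0))
    now = time.time()
    if now - since > FAILURE_TTL:
        # previous failure window expired, start a new one
        count, since = 0, now
    _failures[path] = (count + 1, since)
```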
Now that the AS report is collecting more data, I've manually compiled a list of ASes we could directly peer with (and don't yet), having checked that we have at least one IX in common.
Looking at June 16 - now. Shape of the ramp-up during that period:
Hive queries used, for reference: P8650
The results are in, looking at loadEventEnd from yesterday 14:00 GMT until now.
Yes, it's the overlay at the bottom of the screen that appears when clicking a reference link. It's affecting everyone, not just beta users.
Mon, Jun 24
75 is widely deployed now
Config value to increase that timeout: https://github.com/openstack/swift/blob/4ee9545805f52ff0da5c56ab04abf6f053b31a50/etc/proxy-server.conf-sample#L148
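For reference, this is roughly how it would look in the proxy config. A minimal sketch, assuming the setting in question is conn_timeout (whose sample default of 0.5s matches the timeouts we were seeing); the value is illustrative:

```
# /etc/swift/proxy-server.conf (illustrative excerpt, not our actual config)
[app:proxy-server]
# Timeout when connecting to backend object servers; the sample default is 0.5s.
conn_timeout = 1.0
```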
I suspect it's an issue at the Swift level, possibly a capacity problem with the added thumbnail miss load. I'm seeing these errors happening frequently on the Swift proxy:
We have a higher volume of thumbnail rendering than usual due to the deployment of T216339: Normalize thumbnail request URLs in Varnish to avoid cachebusting, which results in twice as many 503s.
Fri, Jun 21
This has caused a spike in thumbor thumbnailing requests: deduplication makes some objects hotter, pushing them past the webp hotness threshold.
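Purely as an illustration of the mechanism (how hotness is actually tracked isn't shown here; the counter and threshold below are assumptions):

```python
_hits = {}                      # object name -> request count (illustrative only)
WEBP_HOTNESS_THRESHOLD = 100    # assumed value

def should_generate_webp(object_name):
    # Deduplicated URLs concentrate hits on fewer objects, so counts climb
    # faster and more objects cross the threshold.
    _hits[object_name] = _hits.get(object_name, 0) + 1
    return _hits[object_name] >= WEBP_HOTNESS_THRESHOLD
```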
Never mind, this was the Vagrant patch... I'm going to make the production one now.
It can wait. Basically I want to figure out where we're at with that patch: what's actually deployed and running.
@jijiki @fgiunchedi Have Swift proxies been restarted since https://gerrit.wikimedia.org/r/#/c/mediawiki/vagrant/+/489021/ was merged?
@aaron your concern has been addressed now: the Varnish-level thumbnail URL normalization is live. We can now proceed with the header-based expiry plan.
First smart crop test on stat1005 successful (face detection):
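For context, the core of the face-detection step with python-opencv looks roughly like this (a standalone sketch, not the code run on stat1005; the cascade path and parameters are assumptions):

```python
import cv2

# Haar cascade shipped with OpenCV; the exact path varies per distro (assumption)
CASCADE = '/usr/share/opencv/haarcascades/haarcascade_frontalface_default.xml'

def detect_focus_region(image_path):
    """Return a bounding box (x, y, w, h) covering detected faces, to be used
    as the crop focal area; None if no face is found."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cv2.CascadeClassifier(CASCADE).detectMultiScale(
        gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Union of all detected face rectangles
    x1 = min(x for x, y, w, h in faces)
    y1 = min(y for x, y, w, h in faces)
    x2 = max(x + w for x, y, w, h in faces)
    y2 = max(y + h for x, y, w, h in faces)
    return x1, y1, x2 - x1, y2 - y1
```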
Thu, Jun 20
Wed, Jun 19
We should start getting reports once eswiki and ruwiki have moved onto wmf.10 tomorrow.
It's now being recorded:
@jijiki I also need python-opencv installed on that host, thanks :)
Whether we'll be able to revisit this largely depends on the second bug (events occurring on resize) being backported to 76, which is being discussed right now.
Tue, Jun 18
Similar issue found in lazy-loaded references: T226025: Expensive viewport size access in Reference Drawers
Now, I think I've got the most common slow events covered. There is a long tail and always more to fix, of course.
The next "big" offenders (many order of magnitude smaller than the previous one, which has 976,911 occurrences to date) in terms of slow click handlers are:
loadEventEnd seems to have regressed around the time the change was deployed. In the week-over-week comparison you can see that the curves diverge past that point, increasing the difference:
Looking at the newest data, there might be a slight preference towards the high-priority top image on mobile (86.45% satisfaction vs 85.5%, sample size 39,880). On desktop there's no visible difference.
Filed a task about the most common offender for long click event processing: T225946: [SPIKE 8hrs] Determine Appropriate Action for Mobile Frontend lazy-loading images performance issues
The upstream bugfixes have been committed:
I've done all the analysis that I could think of for this one. It works and it's a pretty good metric, but for us it's not as good as FP/FCP and domInteractive in terms of correlation with user perception.