Fri, Aug 18
I don't think anyone criticized the value of the feature, it's a great feature! But launching it in this state to all users in production was too rushed. Opt-in would have been more sensible, and this issue could have been discovered and worked on before deploying it to all enwiki users. Bandwidth is very expensive for some people, and a 60% increase in anonymous first-pageview JS weight is considerable. Things written by staff and volunteers alike get rolled back for similar reasons all the time; it's never personal.
Thu, Aug 17
Running the stress test again, requesting about 2000 uncached thumbnails of the same image with a concurrency of 200 requests, I got:
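For reference, a stress test along these lines could be scripted as below; this is a minimal sketch, with the thumbnail URL template, image name and size range as placeholders rather than the actual inputs used.

```python
# Minimal sketch of this kind of stress test: ~2000 uncached thumbnails of one
# image, 200 concurrent requests. The URL template and size range below are
# placeholders, not the real test inputs.
import concurrent.futures
import requests

URL = "https://upload.wikimedia.org/wikipedia/commons/thumb/x/xx/Example.jpg/{width}px-Example.jpg"

def fetch(width):
    # Each distinct width is a distinct (and therefore likely uncached) thumbnail.
    return requests.get(URL.format(width=width), timeout=60).status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=200) as pool:
    statuses = list(pool.map(fetch, range(100, 2100)))

print({code: statuses.count(code) for code in sorted(set(statuses))})
```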
Works fine now: https://en.wikipedia.org/wiki/File:PlayMagnus.webp
Wed, Aug 16
I've filed a revert for the 1-connection-per-backend setting: https://gerrit.wikimedia.org/r/#/c/372199/ On Vagrant, while reproducing the conditions above, adding the maxconn setting causes 502s instantly. Without the max connection setting, a given thumbor instance can switch between several concurrent requests just fine via yielding. In essence, a given thumbor process is capable of serving concurrent requests, and therefore concurrent connections.
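To illustrate the yielding behavior, here is a generic Tornado sketch (not Thumbor's actual handler code): a single Tornado-based process keeps accepting and serving other requests while one request is parked on asynchronous I/O.

```python
# Generic Tornado example of a single process interleaving requests by
# yielding at I/O points; illustrative only, not Thumbor's handler code.
import tornado.gen
import tornado.httpclient
import tornado.ioloop
import tornado.web

class ThumbHandler(tornado.web.RequestHandler):
    @tornado.gen.coroutine
    def get(self):
        client = tornado.httpclient.AsyncHTTPClient()
        # While this fetch is pending, the IOLoop is free to start handling
        # other requests in the same process.
        response = yield client.fetch("https://example.org/original.jpg")
        self.write("got %d bytes" % len(response.body))

if __name__ == "__main__":
    tornado.web.Application([(r"/thumb", ThumbHandler)]).listen(8888)
    tornado.ioloop.IOLoop.current().start()
```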
Thank you, lowering priority back to normal
The graph for anons:
Indeed, it seems like this applies to anonymous users as well. It works fine for Sweden, but like you I find it broken for the Sierra Leone article, where the tiles all seem to 404. @TheDJ could this gadget be undeployed until these issues are resolved?
Some relieving news: I've tested a specific lock (per-original) and the event-based async-like behavior of thumbor works perfectly:
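For clarity, the difference in lock granularity being tested is roughly the following; the key formats and helper names are hypothetical, not the actual plugin code.

```python
# Hypothetical sketch of the two locking granularities discussed in this task;
# key formats and helper names are illustrative, not the real Thumbor plugin code.
import hashlib

def per_ip_lock_key(client_ip):
    # One lock bucket per requesting IP: a single client can tie up many
    # expensive renders across unrelated originals.
    return "thumbor-ip-%s" % client_ip

def per_original_lock_key(original_path):
    # One lock bucket per original file: concurrent requests for different
    # sizes of the same original serialize, while other originals proceed.
    return "thumbor-original-%s" % hashlib.sha1(original_path.encode("utf-8")).hexdigest()
```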
OK, so if I'm following, that means people are now advised to use fonts other than these, right? Meaning it's ok if those specific fonts don't render "right" on https://commons.wikimedia.org/wiki/File:MediaWiki_SVG_fonts.svg? And I'm guessing that this reference SVG will have to be updated to use the new "reference" fonts that are replacing those?
As a note, looking just at yesterday's data, nginx returns a 502 about once per minute on average. The much larger older error log files suggest that the rate peaks at times. We really need to record this in a graph.
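Until it's graphed properly, something along these lines could pull a per-minute 502 count out of the nginx logs; this is a rough sketch that assumes the default combined log format and log path, which may differ from the actual setup.

```python
# Rough sketch: count 502 responses per minute from an nginx access log so the
# rate can be graphed. Assumes the default "combined" log format and path.
import collections
import re

PATTERN = re.compile(r'\[(?P<ts>[^\]]+)\] "[^"]*" (?P<status>\d{3}) ')

counts = collections.Counter()
with open("/var/log/nginx/access.log") as log:
    for line in log:
        match = PATTERN.search(line)
        if match and match.group("status") == "502":
            # Timestamp like 18/Aug/2017:12:34:56 +0000, truncated to the minute.
            counts[match.group("ts")[:17]] += 1

for minute, count in sorted(counts.items()):
    print(minute, count)
```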
Seems like it was just never enabled in Puppet.
I've compared deployment-imagescaler02 again and I see rendering differences for the Kochi fonts. Isn't it the same issue as with the ttf ubuntu fonts?
What I need to verify is whether a thumbor process truly blocks while waiting on a poolcounter lock for the configured timeout, unable to process other requests (queueing them, I guess). If that's the case, then the thumbor processes waiting for their turn are hitting the worst-case scenario where they wait more than 8 seconds for that lock, since the processes actually rendering the thumbnails for the given IP-based lock take a while.
I still triggered 502s; that wasn't sufficient.
Tue, Aug 15
I've updated the Thumbor package here: https://github.com/gi11es/thumbor-debian/tree/master/thumbor with the latest master from upstream, that includes my bugfix and the relaxed Pillow version check.
@mmodell what's the process for deploying an update of 3d2png to deployment-imagescaler01?
It's mostly a bandwidth waste, since the download happens after the pageload. I'll let others be the judge of how urgent that makes it.
Right off the bat, the first one with major differences, Century Schoolbook L, comes from the "gsfonts" package, which is found on thumbor1001 and deployment-imagescaler01, but not on deployment-imagescaler02. @fgiunchedi is a role missing from deployment-imagescaler02 or something?
The font config files from fonts.pp that you wrote for Jessie are definitely present on deployment-imagescaler02.
It's probably a minor difference in rsvg rendering. 98.8% is very good similarity. Let's double-check whether the rendering difference is significant.
Installed the new package on deployment-imagescaler01.deployment-prep.eqiad.wmflabs, restarted thumbor, and added a manual config to enable the engine in /etc/thumbor.d/98-3d.conf
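For the record, thumbor's /etc/thumbor.d/*.conf drop-ins are plain Python; the snippet below is only a sketch of what such a file might contain, with the setting names, module path and renderer path as assumptions rather than the real contents of 98-3d.conf.

```python
# Sketch of a /etc/thumbor.d/98-3d.conf drop-in (thumbor configs are Python).
# The setting names, module path, and renderer path below are placeholders,
# not the actual production values.
THREED_ENGINE = 'wikimedia_thumbor.engine.stl'   # hypothetical engine module
THREED_RENDERER = '/srv/3d2png/3d2png.js'        # assumed renderer location
```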
I think that the current per-IP PoolCounter limits are just too generous. A single user can hog up to 32 workers right now. IMHO, what matters for a given user is the size of the queue, but they can afford to wait if there are a lot of thumbnails on the page that need rendering.
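To put numbers on it, the per-IP limit looks roughly like this; the names follow PoolCounter's usual workers/maxqueue/timeout knobs, and everything except the 32 workers mentioned above is hypothetical.

```python
# Current per-IP limit as described above; only the 32 workers figure comes
# from this comment, the other values and key names are hypothetical.
PER_IP_POOLCOUNTER = {
    'workers': 32,   # concurrent renders a single IP can hold -- the "too generous" part
    'maxqueue': 50,  # hypothetical: how many further requests from that IP may queue
    'timeout': 8,    # hypothetical: seconds a queued request waits before giving up
}
```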
Mon, Aug 14
All SVGs affected during this bug's timeframe have been purged.
Thu, Aug 10
I did some load testing yesterday that caused 502s; do you have a list of pages with times?
Wed, Aug 9
Fix reference thumbnails dimensions
The reference thumbnails still have incorrect dimensions (square), which will make the tests fail.
What's the status of this security review? It's my understanding that this extension was briefly deployed to all wikis as a beta feature and should be back soon.
Yes, now that Thumbor is serving all thumbnail traffic, T161719 is the blocker for this task. @MarkTraceur and I got a solution working locally yesterday, so I expect that it'll be working in beta and possibly production in the next couple of weeks.
Tue, Aug 8
The reference thumbnails are square, so it fails because the aspect ratio is different. I think you need to regenerate the reference thumbnails with the right dimensions. The aspect ratio in the MediaWiki extension is definitely 640/480.
This code in 3d2png is getting in the way: https://github.com/wikimedia/3d2png/blob/master/3d2png.js#L68 because the temp file generated by the thumbor plugins is extension-less.
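A possible workaround is to have the plugin write its temp file with the original's suffix, so 3d2png's extension check passes; a sketch under those assumptions (the 3d2png invocation shown is approximate, not copied from production):

```python
# One way around the extension check linked above: write the original to a
# temp file that keeps its .stl suffix instead of an extension-less one.
# Sketch only, not the actual plugin code; the 3d2png command-line arguments
# shown (input, WxH, output) are assumed.
import subprocess
import tempfile

def render_stl(stl_bytes, width, height, output_path):
    with tempfile.NamedTemporaryFile(suffix='.stl') as source:
        source.write(stl_bytes)
        source.flush()
        # 3d2png derives the model format from the file extension, which is
        # exactly what breaks with an extension-less temp file.
        subprocess.check_call([
            'xvfb-run', '3d2png',
            source.name, '%dx%d' % (width, height), output_path,
        ])
```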
You should be able to test the code on deployment-imagescaler01:
Looks right to me at first glance, you can test the hack by adding the same one to our https loader, and adding a test to test_https_loader.py pointing to an STL file hosted on one of our wikis.
Now that the project is finished, I think that the thumbor instances we have are heavily rigged for our thumbnail traffic (saving all results to Swift). It would make more sense to reuse the common infrastructure and deploy thumbor, configured differently for those other use cases, on other servers. This would also reduce side effects between very different uses of Thumbor. I can provide guidance if needed.
I don't think this is a good time investment. I fixed the leak I could find and couldn't find any more last time I checked. Changing the memory limit seems to have considerably lowered OOMs anyway. I doubt that there's any significant leaking anymore.
Ghostscript currently doesn't do a very good job of deleting temporary files if it exits because of an error; you may have to delete them manually from time to time.
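If manual cleanup becomes tedious, a periodic job along these lines could do it; this sketch assumes the scratch files live in /tmp with Ghostscript's usual gs_ prefix, which may need adjusting for the actual setup.

```python
# Simple cleanup sketch for stray Ghostscript scratch files; assumes the
# default /tmp location and "gs_" prefix.
import glob
import os
import time

MAX_AGE = 24 * 3600  # delete scratch files older than a day

for path in glob.glob('/tmp/gs_*'):
    try:
        if time.time() - os.path.getmtime(path) > MAX_AGE:
            os.remove(path)
    except OSError:
        pass  # file vanished or permissions issue; skip it
```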
Mon, Aug 7
If you're talking about chaining thumbnails server-side, that's difficult to do without affecting quality due to the conditional sharpening we apply to JPGs. An already sharpened intermediary size is unusable, as the end result would be over-sharpened. This issue is the reason why the last attempt at chaining thumbnails had to be pulled. To respect sharpening, we'd need to store unsharpened intermediary sizes as extra images, which is expensive storage-wise and rendering-wise.
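To make the over-sharpening point concrete, here is a rough sketch of conditional sharpening on downscale; the threshold and filter parameters are illustrative, not the production values, but it shows why resizing an already-sharpened intermediary applies the filter twice.

```python
# Illustrative sketch of conditional sharpening when downscaling JPGs; the
# trigger ratio and filter settings are hypothetical, not the production ones.
from PIL import Image, ImageFilter

SHARPEN_TRIGGER_RATIO = 0.5  # hypothetical threshold

def resize_jpg(image, width, height):
    ratio = width / float(image.width)
    result = image.resize((width, height), Image.LANCZOS)
    if ratio < SHARPEN_TRIGGER_RATIO:
        # Feeding an already-sharpened intermediary back through this branch
        # would apply the filter twice and over-sharpen the final thumbnail.
        result = result.filter(ImageFilter.UnsharpMask())
    return result
```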
Thumbor has been serving all thumbnail traffic for over a month now. I think this task can be closed as the project is completed and successful. Subsequent bugfixes and improvements are tracked with the Thumbor project.
120 is the 5th most requested size and its average file size is obviously small. Interestingly, it's also the highest on the list of cache misses. So yes, it would make complete sense to add 120 to the pre-rendering list. Those thumbnails are already used a lot and would benefit from being pre-rendered, considering how high their miss rate is. As an already popular size, it's a good choice for mobile to leverage. 220 is also worth considering, as it's the size that gets the most absolute hits and is second for misses after 120.
Jul 20 2017
@EBernhardson sounds good to me; a better chance of combining Varnish hits with another feature is definitely the most important optimization.