In the last blog post I described where Thumbor fits in our media thumbnailing stack. Introducing Thumbor replaces an existing service, and as such it's important that it doesn't perform worse than its predecessor. We came up with a strategy to reach feature parity and ensure a launch that would be invisible to end users.
In Wikimedia production, Thumbor was due to interact with several services: Varnish, Swift, Nginx, Memcached and PoolCounter. In order to iron out those interactions, it was important to reproduce them locally during development, which is why I wrote several roles for the official MediaWiki Vagrant machine, with help from @bd808. Those have already been useful to other developers, with several people reaching out to me about the Varnish and Swift Vagrant roles. While at the time it might have seemed like an unnecessary quest (why not develop straight on a production machine?), it was actually a great learning experience to write the extensive Puppet code required to make it work. While it's a separate codebase, the subsequent work to port it over to production Puppet was minimal.
Reaching feature parity actually represented the bulk of the work: reproducing support for all the media formats and special parameters found in MediaWiki thumbnailing. I dedicated a lot of attention to making sure that the images generated by Thumbor were as good as what MediaWiki was outputting for the same original media. In order to do that, I wrote many integration tests using thumbnails from Wikimedia production, which were used as reference output. Those tests are still part of the Thumbor plugins Debian package and ensure that we avoid regressions. They use a DSSIM algorithm to visually compare images and make sure that what Thumbor outputs doesn't visually diverge from the reference thumbnails. We also compare file size to make sure that the new output isn't significantly heavier than the old.
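To make the idea of that comparison concrete, here is a minimal sketch of a DSSIM-style check combined with a file size check. It is not the code from the Thumbor plugins test suite: it uses a simplified single-window SSIM over flat lists of grayscale pixel values, and the function names and both thresholds (`max_dssim`, `max_size_ratio`) are assumptions picked for illustration.

```python
def ssim_global(a, b, max_value=255):
    """Simplified SSIM computed over the whole image as a single window.

    `a` and `b` are equally sized flat sequences of grayscale values.
    Real implementations slide a window over the image; this does not.
    """
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and the same size")
    n = len(a)
    c1 = (0.01 * max_value) ** 2  # standard SSIM stabilising constants
    c2 = (0.03 * max_value) ** 2
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    numerator = (2 * mean_a * mean_b + c1) * (2 * cov + c2)
    denominator = (mean_a ** 2 + mean_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def dssim(a, b):
    """Structural dissimilarity: 0 for identical images, larger when they differ."""
    return (1 - ssim_global(a, b)) / 2


def thumbnail_matches(ref_pixels, new_pixels, ref_bytes, new_bytes,
                      max_dssim=0.01, max_size_ratio=1.1):
    """Hypothetical acceptance check: visually close and not much heavier."""
    close_enough = dssim(ref_pixels, new_pixels) <= max_dssim
    light_enough = len(new_bytes) <= max_size_ratio * len(ref_bytes)
    return close_enough and light_enough
```

With a check of this shape, a reference thumbnail and a candidate that decode to nearly the same pixels pass, while a candidate that looks different or is much larger than the reference fails.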
The next big phase of the project was to create a Debian package for our Thumbor code. I had never done that before and it wasn't as difficult as some people make it out to be (I imagine the tooling has gotten significantly better than it used to be), at least for Python packages. However, in order to be able to ship our code as a Debian package, Thumbor itself needed to have a Debian package, which wasn't the case at the time. Some people had tried on much older versions of Thumbor but never reached the point where it was put in Debian proper. Since that last attempt, Thumbor had added a lot of new dependencies that weren't packaged either. @fgiunchedi and I worked on packaging it all and successfully did so. With the help of Debian developer Marcelo Jorge Vieira, who pushed most of those packages into Debian for us, we crossed the finish line recently and got Thumbor submitted to Debian unstable.
One advantage of doing this is that it makes deployment of updates really straightforward, with the integration test suite I mentioned earlier running in isolation when the Debian package is built. With those Debian packages done, we were ready to run this on production machines.
But the more important advantage is that by having those Debian packages in Debian itself, other people are using the exact same versions of Thumbor's dependencies and Thumbor itself via Debian, thus greatly expanding the exposure of the software we run in production. This increases the likelihood that security issues we might be exposed to are found and fixed.
Trying to reproduce the production setup locally is always limited. The full complexity of production configuration isn't there, and everything is still running on the same machine. The next step was to convert the Vagrant Puppet code into production Puppet code, which allowed us to run this on the Beta cluster as a first step, where we could reproduce a setup closer to production with several machines. This was actually an opportunity to improve the Beta cluster and give it a proper Varnish and Swift setup, closer to production than it used to have. Just like the Vagrant improvements, those changes quickly paid off by being useful to others who were working on Beta.
Just like packaging, this new step revealed bugs in the Thumbor plugins Python code that we were able to fix before hitting production.
The Beta wikis only have a small selection of media, and as such we still hadn't been exposed to the variety of content found on production wikis. I was worried that we would run into media files in production with special properties that we hadn't encountered during the whole development phase, which is why I came up with a plan to dual-serve all production requests to the new production Thumbor machines and compare output.
This consisted of modifications to the production Swift proxy plugin code we have in place to rewrite Wikimedia URLs. Instead of sending thumbnail requests to just MediaWiki, I modified it to also send the same requests to Thumbor. At first this was done completely blindly: the Swift proxy would send requests to Thumbor and not even wait to see the outcome.
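As an illustration of that fire-and-forget pattern, here is a minimal sketch. It is not the actual Swift proxy plugin code: the `DualServer` class, its method names and the thread pool are assumptions chosen to keep the example self-contained, and `primary` and `shadow` stand in for whatever callables would send a request to MediaWiki and Thumbor respectively.

```python
from concurrent.futures import ThreadPoolExecutor


class DualServer:
    """Serve from the primary backend and mirror each request to a shadow one."""

    def __init__(self, primary, shadow, max_workers=4):
        self._primary = primary  # e.g. a callable that asks MediaWiki
        self._shadow = shadow    # e.g. a callable that asks Thumbor
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def _call_shadow(self, path):
        try:
            self._shadow(path)
        except Exception:
            # Shadow failures are deliberately swallowed here: they must
            # never affect the response that the user receives.
            pass

    def handle(self, path):
        # Queue the mirrored request without waiting for its outcome...
        self._pool.submit(self._call_shadow, path)
        # ...and answer the user from the primary backend only.
        return self._primary(path)

    def close(self):
        """Wait for queued shadow requests to finish, then stop the pool."""
        self._pool.shutdown(wait=True)
```

The property that matters is that `handle` returns whatever the primary backend returns, regardless of whether the shadow backend is slow, erroring or down.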
Then I looked at the Thumbor error logs and found several files that were problematic for Thumbor and not for MediaWiki. This allowed us to fix many bugs that we would have normally found out about during the actual launch. This was also the opportunity to reproduce and iron out the various throttling mechanisms.
To be more thorough, I made the Swift proxy log the HTTP status codes returned by MediaWiki and Thumbor and produced a diff, looking for files that were problematic for one and not the other. This allowed us to find more bugs on the Thumbor side, and a few instances of files that Thumbor could render properly that MediaWiki couldn't!
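A diff of that kind can be sketched in a few lines. The log format is an assumption here: each backend's results are represented as a dictionary mapping a thumbnail URL to the HTTP status code it returned, and any 2xx status counts as a success.

```python
def is_success(status):
    """Treat any 2xx status as a successful render (an assumption)."""
    return 200 <= status < 300


def diff_statuses(mediawiki, thumbor):
    """Return the URLs that exactly one of the two backends rendered.

    Both arguments map URL -> HTTP status code. Only URLs present in
    both mappings are compared.
    """
    only_mediawiki_ok = []
    only_thumbor_ok = []
    for url in sorted(mediawiki.keys() & thumbor.keys()):
        mw_ok = is_success(mediawiki[url])
        th_ok = is_success(thumbor[url])
        if mw_ok and not th_ok:
            only_mediawiki_ok.append(url)  # likely a Thumbor bug
        elif th_ok and not mw_ok:
            only_thumbor_ok.append(url)    # Thumbor handles it, MediaWiki doesn't
    return only_mediawiki_ok, only_thumbor_ok
```

The first list is the one to work through for bugs on the new service; the second list is where the new service already does better than the old one.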
This is also the phase where, under the full production load, our Thumbor configuration started showing significant issues around memory consumption and leaks. We were able to fix all those problems in that fire-and-forget dual-serving setup, with no impact at all on production traffic. This was an extremely valuable strategy, as we were able to iterate quickly in the same traffic conditions as if the service had actually launched, without any consequences for users.
With Thumbor running smoothly on production machines, successfully rendering a superset of the thumbnails MediaWiki was able to render, it was time to launch. The dual-serving logic in the Swift proxy came in very handy: it became a simple toggle between sending thumbnailing traffic to MediaWiki and sending it to Thumbor. And so we did switch. We did that gradually, having more and more wikis' thumbnails rendered by Thumbor over the course of a couple of weeks. The load was handled fine (predictable, since we were handling the same load in the dual-serving mode). The success rate of requests based on HTTP status codes was the same before and after.
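A per-wiki toggle of that sort can be as small as the sketch below. The class, its method names and the wiki identifiers are hypothetical, and this is not how the Swift proxy configuration is actually expressed; it only shows the shape of a gradual, reversible switch.

```python
class ThumbnailRouter:
    """Decide per wiki which backend renders thumbnails."""

    def __init__(self):
        self._thumbor_wikis = set()

    def enable(self, wiki):
        """Start sending this wiki's thumbnail traffic to Thumbor."""
        self._thumbor_wikis.add(wiki)

    def disable(self, wiki):
        """Send this wiki's thumbnail traffic to MediaWiki again."""
        self._thumbor_wikis.discard(wiki)

    def backend_for(self, wiki):
        return "thumbor" if wiki in self._thumbor_wikis else "mediawiki"
```

Enabling wikis in batches over time gives the gradual ramp-up, and because every wiki not yet enabled keeps going to MediaWiki, each step is small and easy to reverse.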
However, after some time we started getting reports of issues around EXIF orientation, a feature we had integration tests for. But the tests only covered 180-degree rotation and not 90-degree rotation (doh!). Because EXIF orientation is quite a prevalent feature in JPGs, we used the Swift proxy switch to quickly send traffic back to MediaWiki. We fixed that one large bug, switched the traffic back to Thumbor and that was it.
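For context on why a 180-degree test says little about the 90-degree cases: the EXIF Orientation tag has eight possible values, each calling for a different correction, and half of them swap width and height. The sketch below models an image as a small grid of pixel values and lists the correction for each tag value. It is an illustration only, not Thumbor's implementation, and the helper names are made up.

```python
def flip_horizontal(grid):
    return [row[::-1] for row in grid]

def flip_vertical(grid):
    return [row[:] for row in grid[::-1]]

def rotate_180(grid):
    return flip_vertical(flip_horizontal(grid))

def transpose(grid):
    return [list(col) for col in zip(*grid)]

def rotate_90_cw(grid):
    return [list(col) for col in zip(*grid[::-1])]

def rotate_90_ccw(grid):
    return [list(col) for col in zip(*grid)][::-1]

def transverse(grid):
    return rotate_180(transpose(grid))

# Correction to apply to the stored pixels for each EXIF Orientation value.
CORRECTIONS = {
    1: lambda grid: [row[:] for row in grid],  # already upright
    2: flip_horizontal,
    3: rotate_180,
    4: flip_vertical,
    5: transpose,
    6: rotate_90_cw,
    7: transverse,
    8: rotate_90_ccw,
}

def uncovered_orientations(tested):
    """List the EXIF Orientation values a test suite does not exercise."""
    return sorted(set(CORRECTIONS) - set(tested))
```

A suite that only exercises values 1 and 3, for instance, leaves `uncovered_orientations({1, 3})` reporting six untested cases, including both 90-degree rotations.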
Some minor bugs surfaced later regarding much less common files with special properties. We were able to fix them very quickly, and to deploy the fixes safely and easily with the Debian package. But we could have avoided all of those bugs too if we had been more thorough in the dual-serving phase. We were only comparing HTTP status codes between MediaWiki and Thumbor. However, rendering a thumbnail successfully doesn't mean that the visual contents are right! The JPG orientation could be wrong, for example. If I had to do it again, I would run DSSIM visual comparisons on the live dual-served production traffic between the MediaWiki and Thumbor outputs. That would have definitely surfaced the handful of bugs that appeared post-launch.
All in all, if you do your homework and are very thorough in testing locally and on production traffic, you can achieve a very smooth launch replacing a core part of infrastructure with completely different software. Despite the handful of avoidable bugs that appeared around the launch, the switch to Thumbor went largely unnoticed by users, which was the original intent, as we were looking for feature parity and ease of swapping the new solution in. Thumbor has been happily serving all Wikimedia production thumbnail traffic since June 2017 in a very stable fashion. This concludes our journey to Thumbor :)
Thanks for sharing, @Gilles, and thanks for all the work you've put into this project. I really enjoyed reading this series and learning a lot about what went into this rather large undertaking. You've documented some valuable lessons and a deployment strategy that we should utilize as much as possible when deploying a new service (or even a new version of an existing service!)