Push thumbnails to both data centers
Closed, Resolved, Public

Description

From an email by @akosiaris

During the switchover scheduling meeting we had the other day, an
interesting thing came up. In the previous switchover, the
image scaling infrastructure was powered by MediaWiki, which pushed
the generated thumbnails to both datacenters. Thumbor, however, does not
do that, at least as far as we can tell. We do have a safety net
that was put in place many years ago, but it's rather old
and crude, and we would like to keep it as a safety net rather than rely
on it so much. That safety net is swiftrepl, a Python program
written years ago by Mark to replicate Swift containers. So, what are
the chances of making Thumbor write the thumbnails to both datacenters
before the switchover?

Event Timeline

From a quick glance, I don't think it would be too bad to do this; it's not work that we had scheduled, though.

Out of curiosity, what's the implication of not doing this? Obviously it means that we wouldn't have pre-generated thumbnails available when we switch over, but I don't know offhand how much that matters.

Not having pre-generated thumbnails at switchover time would have a significant impact: Thumbor in codfw could get overloaded with requests for thumbnails that are missing from both Varnish and Swift. Additionally, it would mean we'd have to keep relying on swiftrepl as more than the safety net it was originally designed to be. Hope that helps!

I see two options for exact replication:

  • On misses, Varnish sends a fire-and-forget identical request to the inactive DC's Swift Proxy
  • Swift proxy sends a fire-and-forget identical request to the other DC's Swift proxy
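
As a rough illustration of what "fire-and-forget identical request" could look like, here is a minimal Python sketch. It is not the actual Swift proxy code: the function names, the host name used in the usage note and the 2-second timeout are all assumptions made for the example. The idea is only that the mirrored request runs in a background thread, its response is ignored, and any failure is logged rather than propagated to the request being served.

```python
import http.client
import logging
import threading
import urllib.request

logger = logging.getLogger("thumb_mirror")  # hypothetical logger name


def build_mirror_url(path, inactive_host):
    """Rebuild the same request path against the other DC's proxy."""
    return "http://" + inactive_host + path


def _send_blind(url, headers, timeout, opener):
    """Issue the request, ignore the response, and never raise."""
    request = urllib.request.Request(url, headers=headers)
    try:
        with opener(request, timeout=timeout):
            pass  # the response body is intentionally ignored
    except (OSError, http.client.HTTPException) as exc:
        # URLError and HTTPError are OSError subclasses, so they land here too.
        logger.warning("blind request to %s failed: %s", url, exc)


def fire_and_forget(path, inactive_host, headers=None, timeout=2.0,
                    opener=urllib.request.urlopen):
    """Start the mirrored request in a daemon thread and return immediately."""
    thread = threading.Thread(
        target=_send_blind,
        args=(build_mirror_url(path, inactive_host), headers or {}, timeout, opener),
        daemon=True,
    )
    thread.start()
    return thread
```

A caller would invoke something like `fire_and_forget(request_path, "swift-proxy.other-dc.example")` (a made-up host) on a thumbnail miss and carry on serving the original request. The `opener` parameter exists only so the behaviour can be exercised without real network traffic.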

That being said, I don't think there's anything stopping us from making thumbnail traffic active/active right now. Not having any exact request replication means that cache differences could develop over time between the two DCs. It's hard to tell, without trying, how different they would grow and whether there would still be significant miss spikes on switchovers. One way to find out would be to make thumbnail traffic active/active and run swiftrepl completely once, then run it again after X months and see how much it replicates in each direction. That would give us an idea of how big a spike we might be looking at during a switchover if we just let each DC develop its own independent cache.
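
To make the "see how much it replicates both ways" measurement concrete, here is a small Python sketch. It assumes we can obtain a listing of thumbnail object names from each DC. The function name and the sample names are hypothetical, and this is not how swiftrepl itself reports its work.

```python
def replication_drift(eqiad_objects, codfw_objects):
    """Summarise how far two thumbnail listings have drifted apart."""
    eqiad = set(eqiad_objects)
    codfw = set(codfw_objects)
    missing_in_codfw = eqiad - codfw  # would need copying eqiad -> codfw
    missing_in_eqiad = codfw - eqiad  # would need copying codfw -> eqiad
    total = len(eqiad | codfw)
    drifted = len(missing_in_codfw) + len(missing_in_eqiad)
    return {
        "missing_in_codfw": len(missing_in_codfw),
        "missing_in_eqiad": len(missing_in_eqiad),
        # Share of all known thumbnails that exist in only one DC.
        "drift_ratio": drifted / total if total else 0.0,
    }


# Made-up object names, for illustration only.
summary = replication_drift(
    ["a.jpg", "b.jpg", "c.jpg"],
    ["b.jpg", "c.jpg", "d.jpg", "e.jpg"],
)
```

With these sample listings, one object is missing in codfw and two are missing in eqiad, out of five distinct objects. The objects present in only one DC are roughly the ones that would be cache misses right after a switchover.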

I suspect that making thumbnail traffic active/active might actually require less effort than the request replication, but we can of course do both as well.

When choosing between these two options I think Swift proxy should do it, to keep varnish simple.

Incidentally, at one of the last switchover meetings the question came up of how to monitor the drift in the number of thumbnails, and I've proposed https://gerrit.wikimedia.org/r/c/operations/puppet/+/455553. The tl;dr is that there was a >10% difference before I ran swiftrepl, all new uploads for sure.

For the purposes of the next switchover, changing Swift proxy is definitely doable, and moving thumbs to active/active is a good idea.
Going forward, though, I think we should be removing complexity from Swift proxy. Of the two, changing Swift proxy seems easier at this stage, but I'm not sure whether changing Thumbor instead would be a comparable effort. Another desirable property of having Thumbor write to both datacenters is that it mirrors what MW FileBackend does.

We can do it at the Thumbor level, but it seems to me like it's the wrong logical layer to take care of this. It's baking more logic into the layer that should be the dumbest and not worry about storage. I've never liked the fact that Thumbor has been made responsible for storing output, as it already returns it as part of the HTTP request.

The Swift proxy logic should be extracted into a standalone service, but that's a separate project, imho. We're reminded of this every time we need to add something to that layer, but it's a project that needs resourcing, not something we can do ad hoc.

This "facilitator service" would absorb most - if not all - the swift proxy code, as well as a sizable chunk of the thumbor plugins code (storing results, possibly request throttling, url translation and probably more).

In the context of getting this ready in the next 2 weeks, I think the path of least resistance is adding this to the Swift proxy, especially since we used to have near-identical code when we were sending requests to both MediaWiki and Thumbor. It's a matter of resurrecting that already tested code and pointing it to a different domain, instead of writing something new in Thumbor with potential mistakes in it.

Gilles renamed this task from "Investigate having thumbor push thumbnails to both data centers, and not just eqiad" to "Push thumbnails to both data centers". Aug 29 2018, 3:27 PM
Gilles triaged this task as High priority.

Change 456167 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Send blind thumbnail requests to inactive DC

https://gerrit.wikimedia.org/r/456167

Change 456167 merged by Filippo Giunchedi:
[operations/puppet@production] Send blind thumbnail requests to inactive DC

https://gerrit.wikimedia.org/r/456167

Mentioned in SAL (#wikimedia-operations) [2018-08-30T08:52:38Z] <godog> roll-restart swift-proxy to send requests to thumbor in eqiad and codfw - T201858

Change 456366 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Catch and log inactive DC HTTP errors in Swift Proxy

https://gerrit.wikimedia.org/r/456366

Change 456366 merged by Filippo Giunchedi:
[operations/puppet@production] Catch and log inactive DC HTTP errors in Swift Proxy

https://gerrit.wikimedia.org/r/456366

This is live as of yesterday; thumbnails are being generated in both datacenters. Early next week we'll switch thumbnails to be active/active as a test.

Change 457365 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] varnish: switch swift_thumbs to active/active

https://gerrit.wikimedia.org/r/457365

Change 457370 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] varnish: switch swift_thumbs to active/active

https://gerrit.wikimedia.org/r/457370

Change 457365 abandoned by Filippo Giunchedi:
varnish: switch swift_thumbs to active/active

Reason:
Replaced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/457370/

https://gerrit.wikimedia.org/r/457365

Change 457370 merged by Filippo Giunchedi:
[operations/puppet@production] varnish: switch swift_thumbs to active/active

https://gerrit.wikimedia.org/r/457370

Mentioned in SAL (#wikimedia-operations) [2018-09-03T14:23:25Z] <godog> switch swift thumbnails active/active - T201858 T199073

Thumbnails are now active/active and no issues have been identified and/or reported!