
upload LB: retry swift 404s cross-cluster
Closed, Declined · Public

Description

(tentatively assigning to BBlack for routing within Traffic)

While I'm not sure this is feasible, here's a suggestion for some defense-in-depth from errors on media upload: when ATS gets a 404 response from one of the core media storage clusters (eqiad/codfw), it should retry the request on the other cluster.
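
As a rough illustration of the suggested behaviour, here is a minimal Python sketch of "on a 404, retry once on the other cluster". It is not ATS configuration and says nothing about how ATS would actually implement this; the function name, the `fetch` callable, and the `(status, body)` response shape are all hypothetical, and only the cluster names come from this task.

```python
from typing import Callable, Tuple

# Cluster names as given in the task description.
CLUSTERS = ("eqiad", "codfw")

# Hypothetical response shape: (HTTP status code, body).
Response = Tuple[int, bytes]


def fetch_with_cross_cluster_retry(
    path: str,
    primary: str,
    fetch: Callable[[str, str], Response],
) -> Response:
    """Fetch `path` from `primary`; on a 404, retry once on the other cluster.

    `fetch(cluster, path)` is a hypothetical stand-in for the real
    origin request.
    """
    status, body = fetch(primary, path)
    if status != 404:
        # Anything other than a 404 (success or another error) is returned as-is.
        return status, body
    # Exactly one retry, against the cluster that was not tried first.
    other = next(c for c in CLUSTERS if c != primary)
    return fetch(other, path)
```

If the second cluster also answers 404, that 404 is what the client gets, so a genuinely missing object costs two origin requests instead of one.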

Event Timeline

CDanis created this task. Aug 23 2019, 6:15 PM
CDanis renamed this task from upload LB: retry 404s cross-cluster to upload LB: retry swift 404s cross-cluster. Aug 23 2019, 8:57 PM
Fano removed a subscriber: Fano. Aug 24 2019, 3:16 AM
ema triaged this task as Normal priority. Aug 27 2019, 10:06 AM
ema moved this task from Triage to Caching on the Traffic board.
CDanis added a subscriber: ema. Sep 9 2019, 11:10 PM

@BBlack @ema can you weigh in at some point soon with how feasible this seems? I'm pretty unfamiliar with the current setup here.

@ema would know better about how difficult such things are with ATS in particular. I tend not to like this idea in general, though. In the case of some failure causing lots of temporary pointless 404s, it might double up traffic, and it seems like a hacky crutch which we'd come to rely on instead of fixing the real underlying issues. If others feel strongly about it and it's feasible and reasonably-temporary, I can be convinced, though!
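
To put a number on the "double up traffic" concern, a small Python sketch under one assumption: every 404 from the first cluster triggers exactly one retry on the other. The function name and the request counts are hypothetical, not measurements from this task.

```python
def origin_amplification(client_requests: int, first_try_404s: int) -> float:
    """Ratio of origin requests to client requests with one retry per 404.

    Assumes each 404 from the first cluster causes exactly one extra
    request to the other cluster.
    """
    if client_requests <= 0:
        raise ValueError("client_requests must be positive")
    if not 0 <= first_try_404s <= client_requests:
        raise ValueError("first_try_404s must be between 0 and client_requests")
    origin_requests = client_requests + first_try_404s
    return origin_requests / client_requests


# Hypothetical failure mode: every request 404s on the first cluster.
print(origin_amplification(1000, 1000))  # 2.0
```

With few 404s the ratio stays close to 1, so the extra load only becomes significant in the failure mode described above.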

Yeah, all fair points. We don't seem to be experiencing too many of these
404s (a handful per day), and other mitigations are available, so it's
probably not worth it.

CDanis closed this task as Declined. Sep 9 2019, 11:19 PM