A few times in recent history, including twice this week, we've suffered outages caused by images hosted on our infrastructure being hotlinked by popular websites or apps. The bottleneck is usually network bandwidth, either over our peering/transit connections or at the individual cp host, or both.
(Note there's been some discussion in the past about whether to permit hotlinking at all; see e.g. T152091. I don't want to get into that policy question here, only the technical issue of keeping the site available for other users when image requests exceed our available bandwidth.)
As currently implemented, requestctl can't be used as a protective measure (T317794), and other proposed solutions (https://gerrit.wikimedia.org/r/768723) won't work as-is for similar reasons. Either approach might be adapted to work, but in the meantime, oncallers need a tool they can use to protect the infrastructure when this happens. Some possible approaches:
- Prioritize the VCL changes needed for T317794
- Prioritize the VCL changes needed for https://gerrit.wikimedia.org/r/768723
- Apply automatic rate limiting at the haproxy layer (T306580 wouldn't apply here, since the per-client concurrency is probably ~1, so we'd need per-URL rate limiting, or better yet bps limiting -- but haproxy is a good place for it, for the reasons described in that task)
- Add a knob for manual rate limiting at the haproxy layer, so oncallers can respond quickly to hotlink-induced outages (enabling requestctl would be preferable, just to keep all the controls in the same place -- but if we can implement this solution more quickly, let's do it)
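As a rough illustration of the haproxy options above, a per-URL limit could be expressed with a stick-table keyed on the request path. This is only a sketch: the frontend/backend names, certificate path, and all thresholds below are invented, and a real deployment would likely tune toward bytes_out_rate rather than request rate, since bandwidth is the actual bottleneck.

```
# Hypothetical sketch only -- names and thresholds are placeholders.
frontend fe_cache
    bind :443 ssl crt /etc/haproxy/example.pem
    # Track per-URL request and outgoing-byte rates over a 10s window.
    stick-table type string len 128 size 1m expire 10s store http_req_rate(10s),bytes_out_rate(10s)
    http-request track-sc0 path
    # Reject a single hot URL exceeding ~500 req / 10s with a 429.
    http-request deny deny_status 429 if { sc_http_req_rate(0) gt 500 }
    # Or limit per-URL outgoing bandwidth (~100 MB / 10s), since bytes
    # are the real constraint during a hotlinking event.
    http-request deny deny_status 429 if { sc_bytes_out_rate(0) gt 100000000 }
    default_backend be_varnish
```

For the manual-knob variant, the `gt 500` threshold (or the whole `http-request deny` rule) could be templated out so oncallers can set or disable it via a config deploy, rather than computing it automatically.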

