
Rate limiting for hotlinked images
Open, High, Public

Description

A few times in recent history, including twice this week, we've suffered outages caused by images hosted on our infrastructure being hotlinked by popular websites or apps. The bottleneck is usually network bandwidth: over our peering/transit connections, at the individual cp host, or both.

(Note there's been some discussion in the past about whether to permit hotlinking at all; see e.g. T152091. I don't want to get into that policy question here, only the technical issue of keeping the site available for other users when image requests exceed our available bandwidth.)

As currently implemented, we can't use requestctl as a protective measure (T317794), and other proposed solutions (https://gerrit.wikimedia.org/r/768723) won't work as-is for similar reasons. Either approach might be adapted to work, but in the meantime we need a tool that oncallers can use to protect the infrastructure when this happens. Some possible approaches:

  • Prioritize the VCL changes needed for T317794
  • Prioritize the VCL changes needed for https://gerrit.wikimedia.org/r/768723
  • Apply automatic rate limiting at the haproxy layer. T306580 wouldn't apply here, because the per-client concurrency is probably ~1, so we'd need per-URL rate limiting or, better yet, bandwidth (bps) limiting. haproxy is still a good place for it, for the reasons described in that task.
  • Add a knob for manual rate limiting at the haproxy layer, so oncallers can respond quickly to hotlink-induced outages (enabling requestctl would be preferable, just to keep all the controls in the same place -- but if we can implement this solution more quickly, let's do it)
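
To make the per-URL bandwidth-limiting idea from the list above concrete, here is a minimal Python sketch of a token bucket keyed by URL and measured in bytes rather than requests. It is not haproxy, VCL, or requestctl configuration; the names `TokenBucket`, `PerUrlByteLimiter`, `rate_bytes_per_s`, and `burst_bytes` are hypothetical and exist only for illustration. The point is that a limit keyed on the URL still bites when every client sends roughly one request, which a per-client concurrency limit would not.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    """Token bucket whose tokens are bytes of response body."""
    rate: float      # bytes refilled per second
    capacity: float  # maximum burst, in bytes
    tokens: float    # bytes currently available
    updated: float   # timestamp of the last refill

    def allow(self, cost: float, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False


class PerUrlByteLimiter:
    """Keeps one bucket per URL, so a single hot image is throttled
    regardless of how many distinct clients request it.

    Illustration only: the bucket map grows without bound, and a response
    larger than burst_bytes is never allowed.
    """

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_bytes_per_s
        self._burst = burst_bytes
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, url: str, response_bytes: int) -> bool:
        now = self._clock()
        bucket = self._buckets.get(url)
        if bucket is None:
            # New URLs start with a full bucket.
            bucket = TokenBucket(self._rate, self._burst, self._burst, now)
            self._buckets[url] = bucket
        return bucket.allow(response_bytes, now)
```

A caller would check `limiter.allow(url, size)` before serving and answer with something like a 429 when it returns `False`. Whether the real limit should reject, delay, or shape traffic is a separate decision that this sketch does not address.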

Event Timeline

RLazarus created this task.

Change 768723 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] C:varnish: Rate limit hotlinking

https://gerrit.wikimedia.org/r/768723

Change 832268 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] C:varnish: Add cluster_fe_hit and cluster_fe_ratelimit_hits subroutines

https://gerrit.wikimedia.org/r/832268

Change 832621 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] C:varnish: Rate limit hotlinking

https://gerrit.wikimedia.org/r/832621

Change 832268 merged by Jbond:

[operations/puppet@production] C:varnish: Add cluster_fe_hit and cluster_fe_ratelimit_hits subroutines

https://gerrit.wikimedia.org/r/832268

Here's my Jupyter notebook with a rough analysis of a very impactful hotlink incident (on 2022-09-13) and our biggest organic traffic surge to date (Queen Elizabeth's passing on 2022-09-08):

F35546836

ayounsi added a subscriber: ayounsi.

[clinic duty] Tagging the teams I think are relevant to this task; please change the tags as needed.