Rate limiting for hotlinked images
Open, High, Public

Description

A few times in recent history, including twice this week, we've suffered outages caused by images hosted on our infrastructure being hotlinked by popular websites or apps. The bottleneck is usually network bandwidth: over our peering/transit connections, at the individual cp host, or both.

(Note there's been some discussion in the past about whether to permit hotlinking at all; see e.g. T152091. I don't want to get into that policy question here, only the technical issue of keeping the site available for other users when image requests exceed our available bandwidth.)

As currently implemented, requestctl can't be used as a protective measure (T317794), and other proposed solutions (https://gerrit.wikimedia.org/r/768723) won't work as-is for similar reasons. Either approach might be adapted to work, but in the meantime we need a tool that oncallers can use to protect the infrastructure when this happens. Some possible approaches:

  • Prioritize the VCL changes needed for T317794
  • Prioritize the VCL changes needed for https://gerrit.wikimedia.org/r/768723
  • Apply automatic rate limiting at the haproxy layer. T306580 wouldn't apply here, because the per-client concurrency is probably ~1, so we'd need per-URL rate limiting or, better yet, per-URL bandwidth (bps) limiting. haproxy is still a good place for it, for the reasons described in that task.
  • Add a knob for manual rate limiting at the haproxy layer, so oncallers can respond quickly to hotlink-induced outages (enabling requestctl would be preferable, just to keep all the controls in the same place -- but if we can implement this solution more quickly, let's do it)
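
To make the per-URL bandwidth limiting idea concrete, here is a minimal sketch of a token bucket keyed by URL path, so that one hotlinked image is throttled without affecting other URLs. This is only an illustration of the technique, not how haproxy or requestctl implement it; the class names, parameters, and limits are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    """Allows `rate` bytes per second on average, with bursts up to `burst` bytes."""
    rate: float
    burst: float
    tokens: float
    last: float

    def allow(self, nbytes: int, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


class PerUrlByteLimiter:
    """Hypothetical per-URL limiter: one independent bucket per URL path."""

    def __init__(self, bytes_per_sec: float, burst_bytes: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = bytes_per_sec
        self._burst = burst_bytes
        self._clock = clock
        # Assumption: unbounded dict for brevity; a real deployment would
        # need expiry or a fixed-size table.
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, url_path: str, nbytes: int) -> bool:
        """Return True if sending `nbytes` for `url_path` fits within its budget."""
        now = self._clock()
        bucket = self._buckets.get(url_path)
        if bucket is None:
            # A new URL starts with a full bucket.
            bucket = TokenBucket(self._rate, self._burst, self._burst, now)
            self._buckets[url_path] = bucket
        return bucket.allow(nbytes, now)
```

Because the key is the URL rather than the client, this shape still works when each client opens only about one connection, which is the case described above. A manual knob for oncallers could amount to enabling such a limiter for a specific path with a chosen rate.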

Event Timeline

RLazarus created this task.

Change 768723 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] C:varnish: Rate limit hotlinking

https://gerrit.wikimedia.org/r/768723

Change 832268 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] C:varnish: Add cluster_fe_hit and cluster_fe_ratelimit_hits subroutines

https://gerrit.wikimedia.org/r/832268

Change 832621 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] C:varnish: Rate limit hotlinking

https://gerrit.wikimedia.org/r/832621

Change 832268 merged by Jbond:

[operations/puppet@production] C:varnish: Add cluster_fe_hit and cluster_fe_ratelimit_hits subroutines

https://gerrit.wikimedia.org/r/832268

Here's my Jupyter notebook with a rough analysis of a very impactful hotlink incident (on 2022-09-13) and our biggest organic traffic surge to date (Queen Elizabeth's passing on 2022-09-08):

F35546836

ayounsi subscribed.

[clinic duty] tagging the teams I think are relevant to this task, please change the tags as needed

akosiaris subscribed.

Removing SRE; this has already been triaged to 2 different SRE subteams

Volans subscribed.

Removing I/F as all the proposed solutions fall into the Traffic realm.

Change 919862 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Switch cp4052 to HAProxy 2.7 branch

https://gerrit.wikimedia.org/r/919862

Change 919862 merged by Vgutierrez:

[operations/puppet@production] hiera: Switch cp4052 to HAProxy 2.7 branch

https://gerrit.wikimedia.org/r/919862

Change 920207 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] cache::haproxy: Set nbthreads on the first global section

https://gerrit.wikimedia.org/r/920207

Change 920207 merged by Vgutierrez:

[operations/puppet@production] cache::haproxy: Set nbthreads on the first global section

https://gerrit.wikimedia.org/r/920207

Change 920212 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] cache::haproxy: Fix missing socket variable

https://gerrit.wikimedia.org/r/920212

Change 920212 merged by Vgutierrez:

[operations/puppet@production] cache::haproxy: Fix missing socket variable

https://gerrit.wikimedia.org/r/920212

Change 920217 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Use HAProxy 2.7.x on cp5032

https://gerrit.wikimedia.org/r/920217

Change 920217 merged by Vgutierrez:

[operations/puppet@production] hiera: Use HAProxy 2.7.x on cp5032

https://gerrit.wikimedia.org/r/920217

Mentioned in SAL (#wikimedia-operations) [2023-05-16T10:33:45Z] <vgutierrez> testing HAProxy 2.7.8 in cp4052 and cp5032 (upload) - T317799

Mentioned in SAL (#wikimedia-operations) [2023-06-08T11:40:35Z] <vgutierrez> depooling cp4052 for some HAProxy tests - T317799

Mentioned in SAL (#wikimedia-operations) [2023-06-08T12:03:43Z] <vgutierrez> restore cp4052 HAProxy configuration - T317799

Change 928541 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] haproxy: Add support for filter bwlim-(in|out)

https://gerrit.wikimedia.org/r/928541

Change 928548 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Test HAProxy bw limits per URL on cp4052

https://gerrit.wikimedia.org/r/928548

Change 961333 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Move HAProxy 2.7 experiments to cp4051

https://gerrit.wikimedia.org/r/961333

Change 961333 merged by Vgutierrez:

[operations/puppet@production] hiera: Move HAProxy 2.7 experiments to cp4051

https://gerrit.wikimedia.org/r/961333

Mentioned in SAL (#wikimedia-operations) [2023-09-27T08:21:25Z] <vgutierrez> update HAProxy to version 2.7.10 in cp4051 - T317799

Change 928541 merged by Vgutierrez:

[operations/puppet@production] haproxy: Add support for filter bwlim-(in|out)

https://gerrit.wikimedia.org/r/928541

Change 928548 merged by Vgutierrez:

[operations/puppet@production] hiera: Test HAProxy bw limits per URL on cp4051

https://gerrit.wikimedia.org/r/928548

Change 961373 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] haproxy: Fix filter bwlim syntax

https://gerrit.wikimedia.org/r/961373

Change 961373 merged by Vgutierrez:

[operations/puppet@production] haproxy: Fix filter bwlim syntax

https://gerrit.wikimedia.org/r/961373

Change 961800 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] hiera: Test HAProxy bw limits per URL on cp5030

https://gerrit.wikimedia.org/r/961800

Mentioned in SAL (#wikimedia-operations) [2023-09-28T14:02:15Z] <cdanis> depooling cp5030 for haproxy upgrade & testing T317799

Change 961800 merged by CDanis:

[operations/puppet@production] hiera: Test HAProxy bw limits per URL on cp5030

https://gerrit.wikimedia.org/r/961800

Mentioned in SAL (#wikimedia-operations) [2023-09-28T14:08:48Z] <cdanis> repooling cp5030 after haproxy upgrade & config deploy T317799

Change 971253 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Disable haproxy limit-by-path experiments on cp4041 and cp5030

https://gerrit.wikimedia.org/r/971253

Change 971253 merged by Vgutierrez:

[operations/puppet@production] hiera: Disable haproxy limit-by-path experiments on cp4041 and cp5030

https://gerrit.wikimedia.org/r/971253

Change 768723 abandoned by Jbond:

[operations/puppet@production] C:varnish: Rate limit hotlinking dry-run

Reason:

this is being handled by requestctl (or will be)

https://gerrit.wikimedia.org/r/768723

Change 832621 abandoned by Jbond:

[operations/puppet@production] C:varnish: Rate limit hotlinking

Reason:

this is being handled by requestctl (or will be)

https://gerrit.wikimedia.org/r/832621