Page MenuHomePhabricator

support the haproxy silent-drop hysteresis gadget from requestctl
Closed, ResolvedPublic

Description

Status quo

Right now, we generate haproxy config language like this to implement silent-drop:

# Pseudo-backends used only for statistics tracking.
backend httpreqrate
    stick-table type ipv6  size 1m expire 300s store http_req_rate(10s),gpc0_rate(300s)
backend httpreqrate_http
    stick-table type ipv6  size 1m expire 300s store http_req_rate(10s),gpc0_rate(300s)

listen tls
    http-request track-sc0 src table httpreqrate 
    # the instantaneous value
    acl too_many_concurrent_queries sc0_trackers(httpreqrate) ge 500
    # a counter incremented the first time instantaneous above a threshold within a given "rate" window (per above, 300s)
    acl too_much_recent_concurrency sc0_gpc0_rate(httpreqrate) gt 0  
    # yes, this next one is mutating to evaluate, and yes, that is haproxy-idiomatic
    acl mark_as_too_much_concurrency sc0_inc_gpc0(httpreqrate) gt 0  

    # "Edge-triggered" logging: log the first time the constraint is violated, then stop
    http-request set-var(req.dummy_silent_drop) src,debug(silent-drop_for_300s,stderr) if too_many_concurrent_queries !too_much_recent_concurrency

    # Enforcement
    http-request silent-drop if too_much_recent_concurrency || too_many_concurrent_queries mark_as_too_much_concurrency

This device is a modification of one often shown in official haproxy configuration examples. It adds short-term memory by using gpc0_rate -- the rate-of-change-over-fixed-time-window of a general-purpose counter. This allows us to log a message only the first time that a given src exceeds its allowed concurrency, which greatly limits the number of log messages produced.

Make it better

We can generalize this device, and allow for multiple different threshold values with the same underlying stick-table and stickycounter (sc0).

We can also allow for a "dry run" mode, so we can test new thresholds or different triggering rules in production without actually affecting traffic.

These are two features we've wanted for a long time; see T329331 and T353910

Implementation proposal
  • change the stick-table ... store ...,gpc0_rate(300s) definitions to instead gpc_rate(N,300s). This allows for an array of N elements. To start with we could choose something like 10?
    • Expanding past 10 in the future is not hard, if we need to
    • This change is backwards-compatible with the current ACLs as written, so it can be rolled out in Puppet any time (well, maybe we need to first double-check that changing a stick-table store definition is allowed at runtime)
  • Keep stick-table definitions and any necessary trackers like track-sc0 defined statically in puppet
  • Add to requestctl a new flavor of haproxy_action that can generate the correct acl definitions and the necessary http-request verbs
    • Dry run can be implemented by moving the stanzas now after the || of the final line into its own set-var
    • This would allow for us to still log to syslog that the concurrency limit would have kicked in, without being spammy.
    • Use one array index in the gpc for hysteresis for a given action.
    • Requestctl can persist the meaning of the array numbers, likely as extra state in etcd.
      • Upon activation, an action receives an array index known to be free, persist this in etcd. Error out if we can't find one available to allocate.
      • Upon de-activation of an action, we note the timestamp of the deactivation, and persist in etcd. Then the index is free to re-use at any time 5 minutes after that timestamp.
Open questions
  • Do we want some way to express haproxy sample fetches like sc0_trackers as requestctl patterns? Or would it be better to have statically-defined or defined-in-config patterns that provide an API surface for concurrency & any other things we are tracking in stick-tables?
    • Or should we do a very minimal option for now by essentially hardcoding the concurrency measurement "pattern" within the new action?

Event Timeline

Before I think of implementations, I'd like to understand better what you want to offer:

  • A way to limit concurrency of any request corresponding to a specific pattern? So allow at most X concurrent request to say url X from user-agent Y
  • A way to limit concurrency from specific IPs, only for requests corresponding to a specific pattern? So limit concurrency for a user over the threshold only if they're hitting url X with user-agent Y?
  • Something else?

To remove even more doubt, can you make an example of what you'd expect in haproxy dsl with the following rule:

expression: ipblock@cloud/aws AND pattern@ua/requests
concurrency_limit: true
concurrency_threshold: 50

It would make it easier to discuss how to implement it in requestctl.

A way to limit concurrency from specific IPs, only for requests corresponding to a specific pattern? So limit concurrency for a user over the threshold only if they're hitting url X with user-agent Y?

Not quite this one, but close.

The only metric tracked would be per-IP concurrency, independent of any properties of the requests or anything. In fact to avoid confusion I think we should probably disallow anything but ipblocks in the expression for these actions.

Let's imagine this hypothetical hypothetical configuration (a hypothetical once-removed).

apiVersion: automation.bettercallvolans.org/v1
kind: Action
metadata:
  name: per-ip-default-concurrency
spec:
  expression: pattern@all
  per_ip_concurrency_limit:
    logging: true
    enforcement: true
    threshold: 500
---
apiVersion: automation.bettercallvolans.org/v1
kind: Action
metadata:
  name: per-ip-sussy-concurrency
spec:
  expression: ipblock@known/foo
  per_ip_concurrency_limit:
    logging: true
    enforcement: true
    threshold: 25
---
apiVersion: automation.bettercallvolans.org/v1
kind: Action
metadata:
  name: per-ip-aws-concurrency
spec:
  expression: ipblock@cloud/aws
  per_ip_concurrency_limit:
    logging: true
    enforcement: false
    threshold: 50

That would materialize into an haproxy configuration like this:

1### All of this from Puppet, as usual
2
3# Pseudo-backends used only for statistics tracking.
4backend httpreqrate
5 stick-table type ipv6 size 1m expire 300s store http_req_rate(10s),gpc_rate(10,300s)
6backend httpreqrate_http
7 stick-table type ipv6 size 1m expire 300s store http_req_rate(10s),gpc_rate(10,300s)
8
9listen tls
10 http-request track-sc0 src table httpreqrate
11### All of the above from Puppet, as usual
12
13### All of the below from requestctl
14 ## per-ip-default-concurrency
15 acl per-ip-default-concurrency_too-high-now sc0_trackers(httpreqrate) ge 500
16 acl per-ip-default-concurrency_too-high-recently sc_gpc_rate(0,0,httpreqrate) gt 0
17 acl per-ip-default-concurrency_mark-as-too-high sc_inc_gpc(0,0,httpreqrate)
18
19 # per-ip-default-concurrency logging enabled
20 http-request set-var(req.dummy) src,debug(silent-drop_for_300s/per-ip-default-concurrency) if per-ip-default-concurrency_too-high-now !per-ip-default-concurrency_too-high-recently
21
22 # per-ip-default-concurrency mark: (logging OR enforcement) enabled
23 http-request set-var(req.dummy) src if per-ip-default-concurrency_too-high-now per-ip-default-concurrency_mark-as-too-high
24
25 # per-ip-default-concurrency enforcement enabled
26 http-request silent-drop if per-ip-default-concurrency_too-high-recently
27
28
29
30 ## per-ip-sussy-concurrency
31 acl per-ip-sussy-concurrency_too-high-now sc0_trackers(httpreqrate) ge 50
32 acl per-ip-sussy-concurrency_too-high-recently sc_gpc_rate(0,1,httpreqrate) gt 0
33 acl per-ip-sussy-concurrency_mark-as-too-high sc_inc_gpc(0,1,httpreqrate)
34
35 # per-ip-sussy-concurrency logging enabled
36 http-request set-var(req.dummy) src,debug(silent-drop_for_300s/per-ip-sussy-concurrency) if ipblock_known_sussy per-ip-sussy-concurrency_too-high-now !per-ip-sussy-concurrency_too-high-recently
37
38 # per-ip-sussy-concurrency mark: (logging OR enforcement) enabled
39 http-request set-var(req.dummy) src if ipblock_known_sussy per-ip-sussy-concurrency_too-high-now per-ip-sussy-concurrency_mark-as-too-high
40
41 # per-ip-sussy-concurrency enforcement enabled
42 http-request silent-drop if ipblock_known_sussy per-ip-sussy-concurrency_too-high-recently
43
44
45
46 ## per-ip-aws-concurrency
47 acl per-ip-aws-concurrency_too-high-now sc0_trackers(httpreqrate) ge 25
48 acl per-ip-aws-concurrency_too-high-recently sc_gpc_rate(0,2,httpreqrate) gt 0
49 acl per-ip-aws-concurrency_mark-as-too-high sc_inc_gpc(0,2,httpreqrate)
50
51 # per-ip-aws-concurrency logging enabled
52 http-request set-var(req.dummy) src,debug(silent-drop_for_300s/per-ip-aws-concurrency) if ipblock_cloud_aws per-ip-aws-concurrency_too-high-now !per-ip-aws-concurrency_too-high-recently
53
54 # per-ip-aws-concurrency mark: (logging OR enforcement) enabled
55 http-request set-var(req.dummy) src if ipblock_cloud_aws per-ip-aws-concurrency_too-high-now per-ip-aws-concurrency_mark-as-too-high
56
57 # per-ip-aws-concurrency enforcement disabled

Thanks for the thorough explanation! I know the traffic folks were a bit worried about controlling stick tables from requestctl but I think this format is ok.

I think the easiest way to do this is to do as follows:

  • Add the concurrency_limit (bool) and concurrency_threshold (int) and priority (int) to haproxy_action
  • In commit, collect all the actions that have concurrency_limit set to true and count them. If they're less or equal the maximum number allowed, proceed, else raise an error. If they're less, order them by priority and then alphabetically to match a specific ID. We can also just save this value in a special state object but I think this is probably our simplest solution
  • It should be reasonably easy to generate the DSL above at this point

I am honestly wary of adding state objects to etcd to save things like the ID mappings for actions, and/or the last time an action was taken. I'd rather build an audit log and be able to extract this information from there, but I'm open to alternatives.

Just a small note on your code:

acl per-ip-aws-concurrency_too-high-now per-ip-aws-concurrency_too-high-now_0 per-ip-aws-concurrency_too-high-now_1

this will be satisfied when either condition is true, which I think isn't what you wanted here?

I think the easiest way to do this is to do as follows:

  • Add the concurrency_limit (bool) and concurrency_threshold (int) and priority (int) to haproxy_action
  • In commit, collect all the actions that have concurrency_limit set to true and count them. If they're less or equal the maximum number allowed, proceed, else raise an error. If they're less, order them by priority and then alphabetically to match a specific ID. We can also just save this value in a special state object but I think this is probably our simplest solution

So we'd rely on users setting priority to preserve the ordering? I think it would be simpler to do manual assignments in haproxy_action, in that case.

I am honestly wary of adding state objects to etcd to save things like the ID mappings for actions, and/or the last time an action was taken. I'd rather build an audit log and be able to extract this information from there, but I'm open to alternatives.

An audit log would be fine I guess, but where would you want to store it?

I guess another option is we could attach the current state to the generated configuration in some way (perhaps a special machine-readable comment line?) That would avoid any extra state objects.

Just a small note on your code:

acl per-ip-aws-concurrency_too-high-now per-ip-aws-concurrency_too-high-now_0 per-ip-aws-concurrency_too-high-now_1

this will be satisfied when either condition is true, which I think isn't what you wanted here?

Yes sorry, I need both AND'd together, so these need to go together in each action.

Change #1075633 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] haproxy: switch to array-type gpc_rate counters

https://gerrit.wikimedia.org/r/1075633

Change #1075633 merged by CDanis:

[operations/puppet@production] eqsin: haproxy: switch to array-type gpc_rate counters

https://gerrit.wikimedia.org/r/1075633

Change #1082236 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] haproxy: gpc_rate arrays to all clusters

https://gerrit.wikimedia.org/r/1082236

cdanis opened https://gitlab.wikimedia.org/repos/sre/conftool/-/merge_requests/39

Draft: feat: Enhance debug output for per-IP concurrency monitoring in HaProxyDSLView

Change #1082236 merged by CDanis:

[operations/puppet@production] haproxy: gpc_rate arrays to all clusters

https://gerrit.wikimedia.org/r/1082236

In the task description, I wrote:
  • Requestctl can persist the meaning of the array numbers, likely as extra state in etcd.
    • Upon activation, an action receives an array index known to be free, persist this in etcd. Error out if we can't find one available to allocate.
    • Upon de-activation of an action, we note the timestamp of the deactivation, and persist in etcd. Then the index is free to re-use at any time 5 minutes after that timestamp.

This was solving a problem that didn't exist. The counters are not shared in any way between config reloads (there's a separate top-level process forked). So no need to track this state in conftool or wait any number of minutes.

oblivian merged https://gitlab.wikimedia.org/repos/sre/conftool/-/merge_requests/39

feat: Enhance debug output for per-IP concurrency monitoring in HaProxyDSLView

Change #1118146 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/software/hiddenparma/deploy@main] new HIDDENPARMA release

https://gerrit.wikimedia.org/r/1118146

Change #1118146 merged by CDanis:

[operations/software/hiddenparma/deploy@main] new HIDDENPARMA release

https://gerrit.wikimedia.org/r/1118146

CDanis claimed this task.

F58372977