For LW, even if we get filtering by UA and similar functionality from the API GW, we should examine the same functionality in the context of our own Istio ingress.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Add basic rate-limit capabilities to ML clusters | operations/deployment-charts | master | +65 -1
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | None | | T272917 Lift Wing proof of concept
Resolved | | klausman | T288789 API Gateway Integration
Resolved | | elukey | T300259 Explore ingress filtering for Lift Wing
Event Timeline
While checking how to limit egress connections, I found https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/local_rate_limit_filter#config-http-filters-local-rate-limit that looks nice.
Istio supports the definition of custom Envoy filters via https://istio.io/latest/docs/reference/config/networking/envoy-filter/
Very interesting explanation: https://learncloudnative.com/blog/2022-09-08-ratelimit-istio
I tested the local rate limit in staging briefly, and it seems to work nicely. The main drawback is that it is a coarse filter (for example, it is difficult to tune the limit only for connections with UA XYZ and header A matching "batman"), but it gives us basic high-level protection (for example, rate limiting if the number of requests is higher than X/s or X/min, etc.).
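That said, the Envoy local rate limit filter does support finer-grained buckets via descriptors, so per-UA limits are possible in principle. A rough, untested sketch of the two Envoy-level pieces involved (the UA value "batman-bot" and the numbers are made up, and wiring this through an Istio EnvoyFilter would need an additional HTTP_ROUTE patch):

```yaml
# Sketch only: route-level action that turns the User-Agent header into a
# rate-limit descriptor ("ua" is an arbitrary descriptor key).
rate_limits:
  - actions:
      - request_headers:
          header_name: user-agent
          descriptor_key: ua
---
# Sketch only: matching descriptor in the LocalRateLimit filter config,
# giving the hypothetical UA "batman-bot" its own 10 rps bucket.
descriptors:
  - entries:
      - key: ua
        value: "batman-bot"
    token_bucket:
      max_tokens: 10
      tokens_per_fill: 10
      fill_interval: 1s
```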
The following has been tested in staging with various thresholds (100 rps, 50 rps, etc.) and it worked fine:
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: filter-local-ratelimit-svc
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      app-wmf: kserve-inference
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
        listener:
          filterChain:
            filter:
              name: "envoy.filters.network.http_connection_manager"
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.local_ratelimit
          typed_config:
            "@type": type.googleapis.com/udpa.type.v1.TypedStruct
            type_url: type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
            value:
              stat_prefix: http_local_rate_limiter
              token_bucket:
                max_tokens: 100
                tokens_per_fill: 100
                fill_interval: 1s
              filter_enabled:
                runtime_key: local_rate_limit_enabled
                default_value:
                  numerator: 100
                  denominator: HUNDRED
              filter_enforced:
                runtime_key: local_rate_limit_enforced
                default_value:
                  numerator: 100
                  denominator: HUNDRED
              response_headers_to_add:
                - append: false
                  header:
                    key: x-local-rate-limit
                    value: 'true'
```
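As a quick sanity check (hypothetical commands; HOST and MODEL are placeholders for a real Lift Wing endpoint), one can push the service past the threshold and count the HTTP 429s, which should also carry the x-local-rate-limit header set above:

```bash
# Hypothetical check: 200 requests, up to 20 in flight, then tally the
# status codes; with the bucket temporarily lowered (say to 10 rps) a
# good chunk of them should come back as 429.
seq 1 200 | xargs -P 20 -I{} curl -s -o /dev/null -w '%{http_code}\n' \
  "https://HOST/v1/models/MODEL:predict" -d '{}' | sort | uniq -c
```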
The other main caveat is that this is a per-pod rate limit, not a global one, so the rps threshold reflects what every single pod can sustain (even better for us, in my opinion). For instance, with the 100 rps bucket above and, say, four replicas of an isvc, the aggregate ceiling is roughly 400 rps, and it grows as replicas are added.
I'd avoid the global rate limit for the moment: it needs extra Redis support, and it doesn't really give us a lot more than the local rate limit.
@klausman lemme know if the above makes sense; it is already applied manually in staging. If so, I'll test metrics (they need to be whitelisted) and add a simple template for it in deployment-charts.
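For reference, the whitelisting mentioned above could look something like the following pod annotation (a sketch based on Istio's statsInclusionPrefixes mechanism; the actual deployment-charts wiring may differ):

```yaml
# Sketch: ask the Istio sidecar to also export Envoy stats whose names
# start with the stat_prefix configured in the filter above.
metadata:
  annotations:
    sidecar.istio.io/statsInclusionPrefixes: "http_local_rate_limiter"
```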
Change 860925 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] Add basic rate-limit capabilities to ML clusters
Also tested the metrics:
```
elukey@ml-staging2002:~$ sudo nsenter -t 1653602 -n curl -s localhost:15000/stats/prometheus | grep rate_limit
# TYPE envoy_http_local_rate_limiter_http_local_rate_limit_enabled counter
envoy_http_local_rate_limiter_http_local_rate_limit_enabled{} 292
# TYPE envoy_http_local_rate_limiter_http_local_rate_limit_enforced counter
envoy_http_local_rate_limiter_http_local_rate_limit_enforced{} 0
# TYPE envoy_http_local_rate_limiter_http_local_rate_limit_ok counter
envoy_http_local_rate_limiter_http_local_rate_limit_ok{} 292
# TYPE envoy_http_local_rate_limiter_http_local_rate_limit_rate_limited counter
envoy_http_local_rate_limiter_http_local_rate_limit_rate_limited{} 0
```
The only sad thing is that we need to kill/respawn pods to get the Envoy config updated :(
Change 860925 merged by Elukey:
[operations/deployment-charts@master] Add basic rate-limit capabilities to ML clusters
Deployed the rate limits and the new Istio configs that expose their metrics. The only downside is that the new metrics require a restart of all the pods, so for the moment let's wait for another broad deployment, since this is not really urgent.
Next steps:
- wait for a complete rolling restart of the pods (a broad deployment of isvcs is fine)
- check new metrics in Thanos and add them to the istio dashboard (a starting-point query is sketched below).
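For the Thanos check, something along these lines could be a starting point (hypothetical commands; THANOS_HOST is a placeholder, and the metric labels depend on how the sidecars are scraped):

```bash
# Sketch: query the Prometheus-compatible API for the rate of requests
# actually rate-limited over the last 5 minutes.
curl -s "https://THANOS_HOST/api/v1/query" \
  --data-urlencode 'query=rate(envoy_http_local_rate_limiter_http_local_rate_limit_rate_limited[5m])'
```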