
api-gateway: improve ratelimit metrics mappings
Closed, Resolved (Public)

Description

The Prometheus mappings we currently collect from the ratelimit services don't contain any metrics.

Status Quo

We are collecting two kinds of metrics: metrics for anonymous visitors, and metrics for authenticated clients.

For anon visitors, the metrics keys have the form ratelimit_service_rate_limit_wikimedia_route_name_default_rate_user_class_anon_fallback_anon_client_ip_{outcome}, where outcome is one of total_hits, over_limit, near_limit, or within_limit (the mapping for the shadow_mode suffix is missing).

For authenticated requests, the metrics keys have the form ratelimit_service_rate_limit_wikimedia_route_name_default_rate_client_id_{client-id}_user_id_{user-id}_{outcome}. This means we have a separate metric for each user! It looks like we are collecting metrics in detailed_metric mode, but it's not clear where that is activated. The expected form of the key, without detailed metrics, would be simply ratelimit_service_rate_limit_wikimedia_route_name_default_rate_client_id_user_id_{outcome}, with the numbers for all users summed together.
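To make the difference between the two key forms concrete, here is a minimal Python sketch that folds the per-user keys back into the expected aggregate key and sums the values. It is only an illustration: the regex assumes that client and user IDs contain no underscores, and the helper names (collapse_key, aggregate) are made up for this example.

```python
import re
from collections import Counter
from typing import Mapping

# Outcome suffixes; total_hits is the hit counter name used by the ratelimit service.
OUTCOMES = ("total_hits", "over_limit", "near_limit", "within_limit", "shadow_mode")

# Assumed shape of a detailed (per-user) key. Client and user IDs are assumed
# to contain no underscores; adjust the character classes if they do.
DETAILED_KEY = re.compile(
    r"^(?P<prefix>ratelimit_service_rate_limit_wikimedia_route_name_\w+?)"
    r"_client_id_(?P<client_id>[^_]+)_user_id_(?P<user_id>[^_]+)"
    r"_(?P<outcome>" + "|".join(OUTCOMES) + r")$"
)


def collapse_key(key: str) -> str:
    """Drop the IDs from a detailed key; return other keys unchanged."""
    m = DETAILED_KEY.match(key)
    if m is None:
        return key
    return f"{m['prefix']}_client_id_user_id_{m['outcome']}"


def aggregate(samples: Mapping[str, int]) -> Counter:
    """Sum sample values per collapsed key, i.e. across all users."""
    totals: Counter = Counter()
    for key, value in samples.items():
        totals[collapse_key(key)] += value
    return totals
```

With two users under the same client, both detailed keys collapse into the single ..._client_id_user_id_{outcome} key and their counts are added, while the anon keys pass through untouched.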

Proposal

Using many separate metrics is undesirable because it makes it hard to write queries and combine metrics across different keys. There is currently no way to get metrics across all authenticated requests. It would be better to use a mapping that produces only one metric per outcome, and uses labels for further distinctions. The metric names would have the form ratelimit_service_api_gateway_{outcome} and would use labels for the route_name (e.g. "default_rate") and the user_class (e.g. "anon" or "jwt-user").
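As a rough sketch of what such a mapping would do, the following Python translates the current keys into the proposed metric name plus labels. The real mapping lives in the exporter configuration, not in Python; the regexes, the Metric type, and the choice of "jwt-user" as the user_class for every authenticated key are assumptions made for illustration. IDs are assumed to contain no underscores.

```python
import re
from typing import NamedTuple, Optional


class Metric(NamedTuple):
    name: str
    labels: dict


OUTCOME = r"(?P<outcome>total_hits|over_limit|near_limit|within_limit|shadow_mode)"
PREFIX = r"ratelimit_service_rate_limit_wikimedia_route_name_(?P<route_name>\w+?)"

# Anonymous keys, following the form described under Status Quo.
ANON = re.compile(rf"^{PREFIX}_user_class_anon_fallback_anon_client_ip_{OUTCOME}$")
# Authenticated keys, with or without the per-client / per-user IDs.
AUTH = re.compile(rf"^{PREFIX}_client_id_(?:[^_]+_)?user_id_(?:[^_]+_)?{OUTCOME}$")


def to_labelled(key: str) -> Optional[Metric]:
    """Map a flat key to one metric per outcome, with labels for the rest."""
    for pattern, user_class in ((ANON, "anon"), (AUTH, "jwt-user")):
        m = pattern.match(key)
        if m:
            return Metric(
                name=f"ratelimit_service_api_gateway_{m['outcome']}",
                labels={"route_name": m["route_name"], "user_class": user_class},
            )
    return None  # unknown key shape: leave it to a fallback mapping
```

Anonymous and authenticated keys for the same outcome then end up in the same metric and differ only in the user_class label, so a query across all authenticated requests becomes a simple label match.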

It would be possible to make the outcome a label as well, but that would only work for over_limit and within_limit, as these are mutually exclusive (shadow_mode overlaps with over_limit, and near_limit overlaps with within_limit). It's not clear what to do with near_limit and shadow_mode in this case - should they be separate metrics?
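The overlap problem can be illustrated with a few made-up numbers. The counts below are hypothetical, and the split into a labelled group and a separate group is just one possible answer to the open question, not a decision.

```python
# Hypothetical counts for one route and user class.
counts = {"within_limit": 90, "near_limit": 15, "over_limit": 10, "shadow_mode": 4}

# Outcomes that never apply to the same request.
EXCLUSIVE = ("over_limit", "within_limit")


def split_outcomes(counts):
    """Separate outcomes that are safe to sum under one `outcome` label
    from those that overlap with them and would be counted twice."""
    labelled = {k: v for k, v in counts.items() if k in EXCLUSIVE}
    separate = {k: v for k, v in counts.items() if k not in EXCLUSIVE}
    return labelled, separate


labelled, separate = split_outcomes(counts)
print(sum(labelled.values()))  # 100: each request counted once
print(sum(counts.values()))    # 119: near_limit and shadow_mode counted twice
```

If all four outcomes shared a single label, a plain sum over that label would overstate the number of requests, which is the argument for keeping near_limit and shadow_mode out of it.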

It may be desirable to have the client_id in a label (but not the user_id, to avoid exploding the cardinality of the metric). This could be done if we turn on detailed_metric mode and then extract the client-id from the key, while ignoring the user-id.
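A minimal Python sketch of that extraction, to show the effect on cardinality. The regex and the client_series helper are illustrative assumptions (IDs are assumed to contain no underscores, and abc123 is a made-up client ID); the real extraction would happen in the metrics mapping.

```python
import re
from typing import Optional

# Assumed shape of a detailed key; the user ID is matched but not captured.
DETAILED_KEY = re.compile(
    r"^ratelimit_service_rate_limit_wikimedia_route_name_(?P<route_name>\w+?)"
    r"_client_id_(?P<client_id>[^_]+)_user_id_[^_]+"
    r"_(?P<outcome>total_hits|over_limit|near_limit|within_limit|shadow_mode)$"
)


def client_series(key: str) -> Optional[tuple]:
    """Return (metric name, label pairs) keeping client_id and dropping user_id."""
    m = DETAILED_KEY.match(key)
    if m is None:
        return None
    name = f"ratelimit_service_api_gateway_{m['outcome']}"
    labels = (("client_id", m["client_id"]), ("route_name", m["route_name"]))
    return name, labels


# 1000 hypothetical users behind a single client.
keys = [
    "ratelimit_service_rate_limit_wikimedia_route_name_default_rate"
    f"_client_id_abc123_user_id_{uid}_over_limit"
    for uid in range(1000)
]
print(len(set(keys)))                         # 1000 series, one per user
print(len({client_series(k) for k in keys}))  # 1 series for the single client
```

The number of series then grows with the number of clients rather than the number of users.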

Event Timeline

grafik.png (625×1 px, 156 KB)

This is what Thanos shows for the ratelimit_service_rate_limit_wikimedia_route_name_default_rate_client_id prefix...

Change #1201599 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/deployment-charts@master] api-gateway: improve metrics mapping

https://gerrit.wikimedia.org/r/1201599

Change #1201599 merged by jenkins-bot:

[operations/deployment-charts@master] api-gateway: improve metrics mapping

https://gerrit.wikimedia.org/r/1201599

Change #1202200 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] api-gateway: Fix regex for api-gateway metrics

https://gerrit.wikimedia.org/r/1202200

Change #1202200 merged by jenkins-bot:

[operations/deployment-charts@master] api-gateway: Fix regex for api-gateway metrics

https://gerrit.wikimedia.org/r/1202200

daniel claimed this task.

Deployed and confirmed