
Create Grafana dashboard for link recommendation service and document it on wikitech
Closed, Resolved · Public

Event Timeline

The version from @akosiaris looks better to me (thank you for making it!), but I'm not sure whether there are other things that need to be added. Also, @Tgr mentioned that the external traffic service was returning 503 errors (T274198#6889990), but the dashboard doesn't show any 500 errors -- maybe that's due to an error farther upstream from the link recommendation service itself.

It does show 500 errors, see https://grafana-rw.wikimedia.org/d/CI6JRnLMz/linkrecommendation-alex?viewPanel=15&orgId=1&from=1614680042127&to=1614765595068.

That being said, gunicorn stats aren't particularly well structured for easy querying; e.g. every status code is part of the metric name, so we have metrics like

  • linkrecommendation_gunicorn_request_status_500
  • linkrecommendation_gunicorn_request_status_405
  • linkrecommendation_gunicorn_request_status_302

etc., which makes it pretty difficult and expensive, Prometheus-wise, to group them by class (all 5xx, all 4xx, etc.), and means that new status codes, especially early on, may not have been added to the dashboard.
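
To illustrate the pain point, here is a minimal sketch (the Prometheus host below is a placeholder, not the production Thanos/Prometheus setup) of what grouping these counters by class ends up looking like: because the code is baked into the metric name, the query has to regex-match on __name__ instead of using a cheap label selector.

```python
# Minimal sketch, not production code: sum the per-status-code gunicorn
# counters by class via the Prometheus HTTP API. PROMETHEUS_URL is a placeholder.
import requests

PROMETHEUS_URL = "http://prometheus.example.org:9090"  # placeholder host


def total_by_class(status_class: str) -> float:
    """Sum every linkrecommendation_gunicorn_request_status_<code> series
    whose code starts with status_class (e.g. "5" for all 5xx)."""
    # The status code lives in the metric *name*, so we need a regex match on
    # __name__ -- this is the awkward and expensive part.
    query = (
        'sum({__name__=~'
        f'"linkrecommendation_gunicorn_request_status_{status_class}.."'
        '})'
    )
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                        params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


if __name__ == "__main__":
    for cls in ("5", "4", "2"):
        print(f"{cls}xx total:", total_by_class(cls))
```

If the exporter instead emitted a single metric with a status label, the same aggregation would be a simple sum over an equality or prefix match on that label, and new status codes would show up without touching the dashboard.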

There are also envoy-specific stats, where you can see some errors too (the same errors, in fact): https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=linkrecommendation&var-destination=All&from=now-7d&to=now

Thanks for looking, @akosiaris, and for the links; I don't see the 503s that @Tgr reported on March 7, though (from a script on the beta cluster POSTing to the external traffic release).

Then chances are that it's some component upstream that is issuing them. For example, the api-gateway [1] does have some 5xx errors on the 7th, but without a more definite timeframe I can't tell whether those are the ones we care about.

[1] https://grafana.wikimedia.org/d/UOH-5IDMz/api-gateway?viewPanel=26&orgId=1&from=1615075200000&to=1615161599000

I generated some more a few seconds ago. Let me know if any other information would be helpful; they are easy to reproduce.

Ah, this time around it is obvious in the envoy graphs too. Requests time out: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&from=1615218405331&to=1615218877013&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=linkrecommendation&var-destination=local_service

How many rps are you sending? It seems like you are hitting capacity limits.

A single thread is sending requests sequentially, without any delay or throttling. Responses are pretty slow though, typically 10+ sec (except for the 503s, which come back fast, somewhere in the 10/sec range). The way it seems to work is that 3-4 requests succeed or fail with 504 after a long time (a timeout enforced in envoy, I imagine), then the next 10-15 requests fail with 503, then a few requests succeed again.

So, to be clear, there are two distinct issues:

  • requests time out (fail with 504) after about 15 seconds (the service can be quite slow, so requests taking longer than that are not uncommon). On the MediaWiki side this is a cron job, so long requests are fine; it would be nice to relax the timeout.
  • there is some sort of throttling, which kills most of the requests with a 503. This seems to be a per-minute thing (the 503s stop roughly at the end of every minute, counting from when the script started), but the trigger does not seem regular: sometimes 3 requests succeed in a row, sometimes 6.
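
Purely as an illustration of the pattern described above (not the actual reproduction steps, which are in T277297; the endpoint and payload below are hypothetical placeholders), a single-threaded loop like this makes both the ~15 second 504 cutoff and the per-minute 503 window visible in its output:

```python
# Illustrative sketch only: sequential POSTs with no throttling, logging the
# status code and latency of each request relative to script start.
import time
import requests

ENDPOINT = "https://example.invalid/linkrecommendation"  # hypothetical placeholder
PAYLOAD = {"placeholder": "payload"}                      # hypothetical placeholder

start = time.monotonic()
for i in range(40):
    t0 = time.monotonic()
    try:
        status = requests.post(ENDPOINT, json=PAYLOAD, timeout=60).status_code
    except requests.RequestException as exc:
        status = type(exc).__name__
    took = time.monotonic() - t0
    # The offset from script start makes the per-minute 503 window and the
    # ~15 s 504 cutoff easy to spot when scanning the output.
    print(f"+{time.monotonic() - start:6.1f}s  req {i:2d}  "
          f"status={status}  {took:5.1f}s")
```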

I created T277297: 504 timeout and 503 errors when accessing linkrecommendation service for this.

Sorry for letting this fall through the cracks. AFAIK yes, there is some rate-limiting in the api-gateway; I'm unsure whether you are hitting it or not (not even sure whether it is configured for all endpoints). Would you be so kind as to post how to reproduce this? Is it just a for loop of curls?

I wrote up some reproduction steps in T277297, for both the 504 and the 503.

Oops, missed that. Thanks! I'll comment there.