
Create Grafana dashboard for link recommendation service and document it on wikitech
Closed, Resolved · Public

Event Timeline

The version from @akosiaris looks better to me (thank you for making it!), but I'm not sure whether there are other things that need to be added. Also, @Tgr mentioned that the external traffic service was returning 503 errors (T274198#6889990), but the dashboard doesn't show any 500 errors -- maybe that's due to an error farther upstream from the link recommendation service itself.

It does show 500 errors, see https://grafana-rw.wikimedia.org/d/CI6JRnLMz/linkrecommendation-alex?viewPanel=15&orgId=1&from=1614680042127&to=1614765595068.

That being said, gunicorn stats aren't particularly well structured for easy querying; e.g. every status code is part of the metric name, so we have metrics like

  • linkrecommendation_gunicorn_request_status_500
  • linkrecommendation_gunicorn_request_status_405
  • linkrecommendation_gunicorn_request_status_302

etc., which makes it pretty difficult and expensive, Prometheus-wise, to group them by class (all 5xx, all 4xx, etc.), and means that new status codes, especially early on, may not have been added to the dashboard.
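
To illustrate the pain point, here is a minimal sketch (the Prometheus host below is a placeholder, not the production Thanos/Prometheus setup) of what grouping these counters by class ends up looking like: because the code is baked into the metric name, the query has to regex-match on __name__ instead of using a cheap label selector.

```python
# Minimal sketch, not production code: sum the per-status-code gunicorn
# counters by class via the Prometheus HTTP API. PROMETHEUS_URL is a placeholder.
import requests

PROMETHEUS_URL = "http://prometheus.example.org:9090"  # placeholder host


def total_by_class(status_class: str) -> float:
    """Sum every linkrecommendation_gunicorn_request_status_<code> series
    whose code starts with status_class (e.g. "5" for all 5xx)."""
    # The status code lives in the metric *name*, so we need a regex match on
    # __name__ -- this is the awkward and expensive part.
    query = (
        'sum({__name__=~'
        f'"linkrecommendation_gunicorn_request_status_{status_class}.."'
        '})'
    )
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                        params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


if __name__ == "__main__":
    for cls in ("5", "4", "2"):
        print(f"{cls}xx total:", total_by_class(cls))
```

If the exporter instead emitted a single metric with a status label, the same aggregation would be a simple sum over an equality or prefix match on that label, and new status codes would show up without touching the dashboard.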

There are also envoy-specific stats, where you can see some errors too (the same errors, in fact): https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=linkrecommendation&var-destination=All&from=now-7d&to=now

Thanks for looking, @akosiaris, and for the links; I don't see the 503s that @Tgr reported on March 7, though (from a script on the beta cluster POSTing to the external traffic release).

Then chances are that it's some component upstream that is issuing them. For example, the api-gateway [1] does have some 5xx errors on the 7th, but without a more definite timeframe I can't tell whether those are the ones we care about.

[1] https://grafana.wikimedia.org/d/UOH-5IDMz/api-gateway?viewPanel=26&orgId=1&from=1615075200000&to=1615161599000

I generated some more a few seconds ago. Let me know if any other information would be helpful; they are easy to reproduce.

Ah, this time around it is obvious in the envoy graphs too. Requests time out: https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&from=1615218405331&to=1615218877013&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=linkrecommendation&var-destination=local_service

How many rps are you sending? It seems like you are hitting capacity limits.

A single thread is sending requests sequentially, without any delay or throttling. Responses are pretty slow though, typically 10+ sec (except for the 503s, which come back fast, somewhere in the 10/sec range). The way it seems to work is that 3-4 requests succeed or fail with 504 after a long time (a timeout enforced in envoy, I imagine), then the next 10-15 requests fail with 503, then a few requests succeed again.

So, to be clear, there are two distinct issues:

  • requests time out (fail with 504) after about 15 seconds (the service can be quite slow, so requests taking longer than that are not uncommon). On the MediaWiki side this is a cron job, so long requests are fine; it would be nice to relax the timeout.
  • there is some sort of throttling, which kills most of the requests with a 503. This seems to be a per-minute thing (the 503s stop roughly at the end of every minute, counting from when the script started), but the trigger does not seem regular: sometimes 3 requests succeed in a row, sometimes 6.
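
Purely as an illustration of the pattern described above (not the actual reproduction steps, which are in T277297; the endpoint and payload below are hypothetical placeholders), a single-threaded loop like this makes both the ~15 second 504 cutoff and the per-minute 503 window visible in its output:

```python
# Illustrative sketch only: sequential POSTs with no throttling, logging the
# status code and latency of each request relative to script start.
import time
import requests

ENDPOINT = "https://example.invalid/linkrecommendation"  # hypothetical placeholder
PAYLOAD = {"placeholder": "payload"}                      # hypothetical placeholder

start = time.monotonic()
for i in range(40):
    t0 = time.monotonic()
    try:
        status = requests.post(ENDPOINT, json=PAYLOAD, timeout=60).status_code
    except requests.RequestException as exc:
        status = type(exc).__name__
    took = time.monotonic() - t0
    # The offset from script start makes the per-minute 503 window and the
    # ~15 s 504 cutoff easy to spot when scanning the output.
    print(f"+{time.monotonic() - start:6.1f}s  req {i:2d}  "
          f"status={status}  {took:5.1f}s")
```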

I created T277297: 504 timeout and 503 errors when accessing linkrecommendation service for this.

Sorry for letting this fall through the cracks. AFAIK yes, there is some rate-limiting in the api-gateway; I'm unsure whether you are hitting it or not (not even sure whether it is configured for all endpoints). Would you be so kind as to post how to reproduce this? Is it just a for loop of curls?

I wrote up some reproduction steps in T277297, for both the 504 and the 503.

Oops, missed that. Thanks! I'll comment there.