
504 timeout and 503 errors when accessing linkrecommendation service
Closed, Resolved · Public

Description

As noted in T276769: Create Grafana dashboard for link recommendation service and document it on wikitech:

So to be clear, there are two distinct issues:

  • requests time out (fail with 504) after about 15 seconds (the service can be quite slow, so requests taking longer than that are not uncommon). On the MediaWiki side this is a cronjob, so long requests are fine; it would be nice to relax the timeout.
  • there is some sort of throttling, which kills most of the requests with a 503. This seems to be a per-minute thing (the 503s stop roughly at the end of every minute, counted from the start of the script) but the trigger does not seem regular: sometimes 3 requests succeed in a row, sometimes 6.

For the 504, this can sometimes be reproduced by issuing a GET to https://api.wikimedia.org/service/linkrecommendation/v0/linkrecommendations/cswiki/Lipsko

The 503 is reproducible on the beta cluster by executing mwscript --wiki=cswiki extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php

Event Timeline


Ah, so a simple

for i in {1..10}
do
    curl -s https://api.wikimedia.org/service/linkrecommendation/v0/linkrecommendations/cswiki/Lipsko
done

should reproduce it, then. I'll try it out.


For the 504s, yes. For the 503s you'll need to do POST with valid data (so the service spends time working on the response), and that's probably most easily done from the beta cluster for now.

kostajh updated the task description.

Ahem..

$ ab -n 100 -c 1 https://api.wikimedia.org/service/linkrecommendation/v0/linkrecommendations/cswiki/Lipsko

Server Software:        envoy
Server Hostname:        api.wikimedia.org
Server Port:            443
SSL/TLS Protocol:       TLSv1.2,ECDHE-ECDSA-AES256-GCM-SHA384,256,256
Server Temp Key:        X25519 253 bits
TLS Server Name:        api.wikimedia.org

Document Path:          /service/linkrecommendation/v0/linkrecommendations/cswiki/Lipsko
Document Length:        57 bytes

Concurrency Level:      1
Time taken for tests:   198.583 seconds
Complete requests:      100
Failed requests:        94
   (Connect: 0, Receive: 0, Length: 94, Exceptions: 0)
Non-2xx responses:      100
Total transferred:      89765 bytes
HTML transferred:       11389 bytes
Requests per second:    0.50 [#/sec] (mean)
Time per request:       1985.827 [ms] (mean)
Time per request:       1985.827 [ms] (mean, across all concurrent requests)
Transfer rate:          0.44 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      192  196   4.4    195     219
Processing:   236 1790 3483.0    909   15823
Waiting:      236 1789 3483.0    909   15823
Total:        431 1986 3483.0   1105   16022

Kind of the opposite of stellar.

Bypassing the api-gateway (and the services proxy) doesn't fix this in any way.

Logs aren't helpful either, they just say:

[2021-03-12 14:44:02 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:1631)
{"written_at": "2021-03-12T14:44:02.996Z", "written_ts": 1615560242996805000, "msg": "Worker exiting (pid: 1631)", "type": "log", "logger": "gunicorn.error", "thread": "MainThread", "level": "INFO", "module": "glogging", "line_no": 273, "correlation_id": "6379c452-8341-11eb-8663-5e584defd1ae"}
[2021-03-12 14:44:03 +0000] [1726] [INFO] Booting worker with pid: 1726

for this request.


I think that might be a different issue than the 504 we see when routed via the gateway.

The service can time out when processing "larger" articles like Lipsko on cswiki. Using ?max_recommendations=1&threshold=0.1 should work without a timeout when bypassing the gateway. The solution for that particular problem is to set a longer worker timeout for gunicorn, but again I think this is separate from what we are seeing in the 504s when going via the api-gateway.

@akosiaris maybe the keepalive value should be set to something higher than 2 seconds?

As an aside, we also have some options for configuring Worker processes, would you mind having a look and letting us know your thoughts on what we should set for --workers and --threads?

For the 504s, yes. For the 503s you'll need to do POST with valid data (so the service spends time working on the response), and that's probably most easily done from the beta cluster for now.

Also those requests use OAuth authorization for the API gateway, maybe that's causing the problem, or failing to prevent it. The header we are using in beta is
{P14859}
(obtained by creating a client with no special permissions on the API portal)

Actually the 503 is reproducible with GET as well:

time curl -s https://api.wikimedia.org/service/linkrecommendation/v0/linkrecommendations/cswiki/Lipsko
{"httpReason":"upstream request timeout","httpCode":504}
curl -s   0.02s user 0.01s system 0% cpu 15.392 total
time curl -s https://api.wikimedia.org/service/linkrecommendation/v0/linkrecommendations/cswiki/Lipsko
{"httpCode":503,"httpReason":"upstream connect error or disconnect/reset before headers. reset reason: local reset"}
curl -s   0.02s user 0.01s system 2% cpu 1.140 total

The 503 occurs if the GET is executed shortly after the 504.

The 504s seem to consistently occur just after 15 seconds have elapsed. @hnowlan is there possibly some config in api-gateway that terminates requests which take longer than 15 seconds to process?

Bypassing the api-gateway (and the services proxy) doesn't fix this in any way.

Logs aren't helpful either, they just say:

[2021-03-12 14:44:02 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:1631)
{"written_at": "2021-03-12T14:44:02.996Z", "written_ts": 1615560242996805000, "msg": "Worker exiting (pid: 1631)", "type": "log", "logger": "gunicorn.error", "thread": "MainThread", "level": "INFO", "module": "glogging", "line_no": 273, "correlation_id": "6379c452-8341-11eb-8663-5e584defd1ae"}
[2021-03-12 14:44:03 +0000] [1726] [INFO] Booting worker with pid: 1726

for this request.

Maybe that is https://docs.gunicorn.org/en/stable/faq.html#why-are-workers-silently-killed :

This particular failure case is usually due to a SIGKILL being received, as it’s not possible to catch this signal silence is usually a common side effect! A common cause of SIGKILL is when OOM killer terminates a process due to low memory condition.

This is increasingly common in container deployments where memory limits are enforced by cgroups, you’ll usually see evidence of this from dmesg:

dmesg | grep gunicorn
Memory cgroup out of memory: Kill process 24534 (gunicorn) score 1506 or sacrifice child
Killed process 24534 (gunicorn) total-vm:1016648kB, anon-rss:550160kB, file-rss:25824kB, shmem-rss:0kB

In these instances adjusting the memory limit is usually your best bet, it’s also possible to configure OOM not to send SIGKILL by default.

Change 673004 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] Move gunicorn configuration to conf file, adjust config

https://gerrit.wikimedia.org/r/673004

Change 673006 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[operations/deployment-charts@master] linkrecommendation: Bump memory limit

https://gerrit.wikimedia.org/r/673006

Bypassing the api-gateway (and the services proxy) doesn't fix this in any way.

Logs aren't helpful either, they just say:

[2021-03-12 14:44:02 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:1631)
{"written_at": "2021-03-12T14:44:02.996Z", "written_ts": 1615560242996805000, "msg": "Worker exiting (pid: 1631)", "type": "log", "logger": "gunicorn.error", "thread": "MainThread", "level": "INFO", "module": "glogging", "line_no": 273, "correlation_id": "6379c452-8341-11eb-8663-5e584defd1ae"}
[2021-03-12 14:44:03 +0000] [1726] [INFO] Booting worker with pid: 1726

for this request.

I think that might be a different issue than the 504 we see when routed via the gateway.

This is related. See the following examples:

akosiaris@deploy1002:~$ time curl -v 'http://10.64.65.145:8000/v0/linkrecommendations/cswiki/Lipsko'
* Expire in 0 ms for 6 (transfer 0x55b9ed340f90)
*   Trying 10.64.65.145...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x55b9ed340f90)
* Connected to 10.64.65.145 (10.64.65.145) port 8000 (#0)
> GET /v0/linkrecommendations/cswiki/Lipsko HTTP/1.1
> Host: 10.64.65.145:8000
> User-Agent: curl/7.64.0
> Accept: */*
> 
* Empty reply from server
* Connection #0 to host 10.64.65.145 left intact
curl: (52) Empty reply from server

real	1m0.128s
user	0m0.014s
sys	0m0.013s

So, no HTTP response whatsoever (that's actually a bug, a response should be returned; you might want to look into graceful_timeout and react to receiving SIGTERM). However, go through the sidecar envoy services proxy (not the api-gateway) and you get a proper HTTP 503 code for the same request. That's expected: envoy knows that it needs to issue back a proper HTTP status code and it does (503 Service Unavailable is the standard error code for these cases).

akosiaris@deploy1002:~$ time curl -kv 'https://10.64.65.145:4006/v0/linkrecommendations/cswiki/Lipsko'
* Expire in 0 ms for 6 (transfer 0x55f7818e7f90)
*   Trying 10.64.65.145...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x55f7818e7f90)
* Connected to 10.64.65.145 (10.64.65.145) port 4006 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: CN=linkrecommendation.discovery.wmnet
*  start date: Dec  3 15:25:16 2020 GMT
*  expire date: Dec  3 15:25:16 2025 GMT
*  issuer: CN=Puppet CA: palladium.eqiad.wmnet
*  SSL certificate verify ok.
> GET /v0/linkrecommendations/cswiki/Lipsko HTTP/1.1
> Host: 10.64.65.145:4006
> User-Agent: curl/7.64.0
> Accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
< HTTP/1.1 503 Service Unavailable
< content-length: 95
< content-type: text/plain
< date: Thu, 18 Mar 2021 13:49:20 GMT
< server: external-tls
< 
* Connection #0 to host 10.64.65.145 left intact
upstream connect error or disconnect/reset before headers. reset reason: connection termination
real	0m30.350s
user	0m0.018s
sys	0m0.011s

Moving to the api-gateway

akosiaris@deploy1002:~$ time curl -v -H "Host: api.wikimedia.org" https://api-gateway.discovery.wmnet:8087/service/linkrecommendation/v0/linkrecommendations/cswiki/Lipsko
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x5579e3ad4f90)
* Connected to api-gateway.discovery.wmnet (10.2.2.55) port 8087 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: CN=api-gateway.discovery.wmnet
*  start date: Aug 10 12:19:53 2020 GMT
*  expire date: Aug 10 12:19:53 2025 GMT
*  subjectAltName: host "api-gateway.discovery.wmnet" matched cert's "api-gateway.discovery.wmnet"
*  issuer: CN=Puppet CA: palladium.eqiad.wmnet
*  SSL certificate verify ok.
> GET /service/linkrecommendation/v0/linkrecommendations/cswiki/Lipsko HTTP/1.1
> Host: api.wikimedia.org
> User-Agent: curl/7.64.0
> Accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
< HTTP/1.1 504 Gateway Timeout
< content-length: 57
< content-type: application/json
< date: Thu, 18 Mar 2021 14:07:59 GMT
< server: envoy
< 
{"httpCode":504,"httpReason":"upstream request timeout"}
* Connection #0 to host api-gateway.discovery.wmnet left intact

real	0m15.034s
user	0m0.018s
sys	0m0.010s

Note the 15 seconds timeout of the api-gateway vs the 30 seconds timeout of the sidecar proxy.

So essentially, the request takes a pretty long time to be responded to.

The service can time out when processing "larger" articles like Lipsko on cswiki. Using ?max_recommendations=1&threshold=0.1 should work without a timeout when bypassing the gateway.

It works generally in all 3 cases. See

akosiaris@deploy1002:~$ time curl -H "Host: api.wikimedia.org" 'https://api-gateway.discovery.wmnet:8087/service/linkrecommendation/v0/linkrecommendations/cswiki/Lipsko?max_recommendations=1&threshold=0.1'
{"links":[{"context_after":", které js","context_before":" v ","link_index":0,"link_target":"Povrchový důl","link_text":"povrchových dolech","match_index":0,"score":0.2428925484418869,"wikitext_offset":2445}],"links_count":1,"page_title":"Lipsko","pageid":23151,"revid":19602785}

real	0m2.200s
user	0m0.024s
sys	0m0.005s

akosiaris@deploy1002:~$ time curl 'http://10.64.65.145:8000/v0/linkrecommendations/cswiki/Lipsko?max_recommendations=1&threshold=0.1'
{"links":[{"context_after":", které js","context_before":" v ","link_index":0,"link_target":"Povrchový důl","link_text":"povrchových dolech","match_index":0,"score":0.2428925484418869,"wikitext_offset":2445}],"links_count":1,"page_title":"Lipsko","pageid":23151,"revid":19602785}

real	0m2.039s
user	0m0.012s
sys	0m0.011s

akosiaris@deploy1002:~$ time curl -k 'https://10.64.65.145:4006/v0/linkrecommendations/cswiki/Lipsko?max_recommendations=1&threshold=0.1'
{"links":[{"context_after":", které js","context_before":" v ","link_index":0,"link_target":"Povrchový důl","link_text":"povrchových dolech","match_index":0,"score":0.2428925484418869,"wikitext_offset":2445}],"links_count":1,"page_title":"Lipsko","pageid":23151,"revid":19602785}

real	0m1.873s
user	0m0.015s
sys	0m0.010s

All 3 respond pretty fast and OK.

The solution for that particular problem is to set a longer worker timeout for gunicorn

Is it? Do you really want GET requests for the general public that consume connections for >15s or even >30s to complete? I'd say not.

It's fine for asynchronous internal requests that are e.g. part of a job queuing system, which is why I +1ed the change.

@akosiaris maybe the keepalive value should be set to something higher than 2 seconds?

Per the docs, keepalive is ignored for sync workers (which is the default and what is used now). It would not make a difference.

As an aside, we also have some options for configuring Worker processes, would you mind having a look and letting us know your thoughts on what we should set for --workers and --threads?

As a general rule, we should stick with the sync worker class for now: from what I understand the code is rather CPU-bound, so event-driven patterns would not benefit it particularly. Your patch already has a sane value for workers (5); threads set to 1 is fine for now unless we spot low CPU usage during a stress test.
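
For reference, a minimal sketch of what such a gunicorn.conf.py could look like with the values discussed here (sync worker class, 5 workers, 1 thread, timeout read from the environment). This is only an illustration, not the deployed mwaddlink configuration: GUNICORN_TIMEOUT is the only environment variable mentioned in this task, and the GUNICORN_WORKERS / GUNICORN_THREADS names below are assumptions.

# gunicorn.conf.py -- illustrative sketch only, not the actual mwaddlink config
import os

# Sync workers: the recommendation code is CPU-bound, so an event-driven
# worker class would not help much. 5 workers and 1 thread, as discussed above.
worker_class = "sync"
workers = int(os.environ.get("GUNICORN_WORKERS", 5))   # assumed variable name
threads = int(os.environ.get("GUNICORN_THREADS", 1))   # assumed variable name

# Worker timeout; reading it from the environment lets the internal and
# external releases use different values (e.g. 30s vs. 60s).
timeout = int(os.environ.get("GUNICORN_TIMEOUT", 30))

# keepalive is ignored by sync workers, so the default is left alone.
keepalive = 2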

ab -n 100 -c 2 'https://api.wikimedia.org/service/linkrecommendation/v0/linkrecommendations/cswiki/Lipsko?max_recommendations=1&threshold=0.1'
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking api.wikimedia.org (be patient).....done


Server Software:        envoy
Server Hostname:        api.wikimedia.org
Server Port:            443
SSL/TLS Protocol:       TLSv1.2,ECDHE-ECDSA-AES256-GCM-SHA384,256,256
Server Temp Key:        X25519 253 bits
TLS Server Name:        api.wikimedia.org

Document Path:          /service/linkrecommendation/v0/linkrecommendations/cswiki/Lipsko?max_recommendations=1&threshold=0.1
Document Length:        284 bytes

Concurrency Level:      2
Time taken for tests:   159.670 seconds
Complete requests:      100
Failed requests:        33
   (Connect: 0, Receive: 0, Length: 33, Exceptions: 0)
Non-2xx responses:      33
Total transferred:      99131 bytes
HTML transferred:       22896 bytes
Requests per second:    0.63 [#/sec] (mean)
Time per request:       3193.393 [ms] (mean)
Time per request:       1596.696 [ms] (mean, across all concurrent requests)
Transfer rate:          0.61 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      216  219   2.6    219     232
Processing:   541 2943 1648.2   3733    9061
Waiting:      540 2943 1648.2   3733    9061
Total:        759 3163 1648.1   3953    9280

Percentage of the requests served within a certain time (ms)
  50%   3953
  66%   4129
  75%   4190
  80%   4230
  90%   4392
  95%   4709
  98%   9167
  99%   9280
 100%   9280 (longest request)

This definitely looks better with the addition of '?max_recommendations=1&threshold=0.1'.

Do you really want GET requests for the general public that consume connections for >15s or even >30s to complete?

Not the general public necessarily, although it would be useful for QA and product people validating the output of the algorithm; the main use case we have for it though is MediaWiki in the beta cluster. Is there a way to differentiate between that and the general public (by IP, maybe, or OAuth) and use more permissive settings?

Also, I still don't see where the behavior of a few slow requests and then very fast 503s until the end of the minute comes from. Is it that the service OOMs or otherwise stops, gets restarted every minute by systemd, and in the meantime envoy fails to connect and replies with 503s?

There are currently icinga alerts flapping I'm guessing because of this:

11:07:51 <+icinga-wm> PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
11:09:07 <+icinga-wm> PROBLEM - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.23 and port 4006: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems

Change 673004 merged by jenkins-bot:
[research/mwaddlink@main] Move gunicorn configuration to conf file, adjust config

https://gerrit.wikimedia.org/r/673004

Do you really want GET requests for the general public that consume connections for >15s or even >30s to complete?

Not the general public necessarily, although it would be useful for QA and product people validating the output of the algorithm; the main use case we have for it though is MediaWiki in the beta cluster. Is there a way to differentiate between that and the general public (by IP, maybe, or OAuth) and use more permissive settings?

What do you mean by more permissive? Increasing the upstream proxy timeout based on something on the request, like a header or the source IP? I don't think that's possible.

I am a bit unclear on the MediaWiki in the beta cluster thing though. I would expect that to use a setup similar to what is in production. Although I would totally understand it if it's not possible to have a similar setup in beta and instead rely on this, beta would end up exercising a different path than production, wouldn't it?

Also, I still don't see where the behavior of a few slow requests and then very fast 503s until the end of the minute comes from. Is it that the service OOMs or otherwise stops, gets restarted every minute by systemd, and in the meantime envoy fails to connect and replies with 503s?

Not systemd, but gunicorn itself, as evidenced by the logs; and not "every minute" but whenever the worker process is killed by gunicorn. But otherwise your thinking is correct.

https://gerrit.wikimedia.org/r/673004 should alleviate that by allowing more concurrent requests. The other thing that would alleviate it is if we could do some capacity planning for this, as in what kind of request rate do you want this to be able to serve?

All of this is also related to the internal service somewhat. It's not behind the api-gateway so the 15s timeout doesn't apply, but the 30s timeout of gunicorn does apply.

There are currently icinga alerts flapping I'm guessing because of this:

11:07:51 <+icinga-wm> PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
11:09:07 <+icinga-wm> PROBLEM - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.23 and port 4006: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems

Yes, you are right. I think we can schedule some extended downtime for those while figuring this out. I've set a downtime of 2 weeks for this.

Change 673006 merged by jenkins-bot:
[operations/deployment-charts@master] linkrecommendation: Bump requests memory limit and image version

https://gerrit.wikimedia.org/r/673006

What do you mean by more permissive? Increasing the upstream proxy timeout based on something on the request, like a header or the source IP? I don't think that's possible.

Something like that, yes. Wasn't the idea behind the API portal that you can use it with low quotas anonymously, or you register a client (possibly pay for it in other scenarios), authenticate with an OAuth bearer token, and get higher quotas?

I am a bit unclear on the MediaWiki in the beta cluster thing though. I would expect that to use a setup similar to what is in production. Although I would totally understand it if it's not possible to have a similar setup in beta and instead rely on this, beta would end up exercising a different path than production, wouldn't it?

I think we mainly care about the MediaWiki part being similar, as we don't intend to do lots of changes to the service, so having a beta instance of the service is not worth it. And beta cluster appservers can only connect to the external service, so making it possible to use similar quotas on the external service instance as production has on the internal ones would make it more similar.
(But AIUI apart from the 15 sec vs. 30 sec difference the permissions are the same, and the difference we are seeing comes from production, where we are only running on testwiki with the simplewiki language model, which is simpler than the language models of most of the beta wikis. So this point might be moot.)

Also in hindsight assuming that we won't need a beta instance of the service might have been somewhat naive.

https://gerrit.wikimedia.org/r/673004 should alleviate that by allowing more concurrent requests. The other thing that would alleviate it is if we could do some capacity planning for this, as in what kind of request rate do you want this to be able to serve?

The current approach (which might or might not prove viable) is to send a single sequence of requests where the next one starts as soon as the previous one finishes. (Apart from people occasionally using the external service for manually checking out the recommendations for some articles, we'd only use the service from a maintenance script, not from the MediaWiki app or jobs.) So the rate depends on how long the requests take - that seems to be 15-20 sec now so something like 3-4 per minute, but of course that would change if we manage to speed up the service or if the wiki is smaller.

What do you mean by more permissive? Increasing the upstream proxy timeout based on something on the request, like a header or the source IP? I don't think that's possible.

Something like that, yes. Wasn't the idea behind the API portal that you can use it with low quotas anonymously, or you register a client (possibly pay for it in other scenarios), authenticate with an OAuth bearer token, and get higher quotas?

Yes, I think that was the idea, but we aren't talking about quotas here but rather the upstream server (the linkrecommendation endpoint) timeout, which can't be set per user/request.

I am a bit unclear on the MediaWiki in the beta cluster thing though. I would expect that to use a setup similar to what is in production. Although I would totally understand it if it's not possible to have a similar setup in beta and instead rely on this, beta would end up exercising a different path than production, wouldn't it?

I think we mainly care about the MediaWiki part being similar, as we don't intend to do lots of changes to the service, so having a beta instance of the service is not worth it. And beta cluster appservers can only connect to the external service, so making it possible to use similar quotas on the external service instance as production has on the internal ones would make it more similar.
(But AIUI apart from the 15 sec vs. 30 sec difference the permissions are the same, and the difference we are seeing comes from production, where we are only running on testwiki with the simplewiki language model, which is simpler than the language models of most of the beta wikis. So this point might be moot.)

Well, it will be 15 vs 60s IIRC after @kostajh's change is merged, but everything else stands exactly as you said.

Also in hindsight assuming that we won't need a beta instance of the service might have been somewhat naive.

Or not. Beta isn't known to be a low-maintenance job. I guess time will tell.

https://gerrit.wikimedia.org/r/673004 should alleviate that by allowing more concurrent requests. The other thing that would alleviate it is if we could do some capacity planning for this, as in what kind of request rate do you want this to be able to serve?

The current approach (which might or might not prove viable) is to send a single sequence of requests where the next one starts as soon as the previous one finishes. (Apart from people occasionally using the external service for manually checking out the recommendations for some articles, we'd only use the service from a maintenance script, not from the MediaWiki app or jobs.) So the rate depends on how long the requests take - that seems to be 15-20 sec now so something like 3-4 per minute, but of course that would change if we manage to speed up the service or if the wiki is smaller.

So essentially some batches of requests based on a cron-like pattern, right? That's a very spiky pattern (3-4 rpm then 0 for some time), not easy to do capacity planning for. However, it also means that flattening the curve (pun intended) would help a lot with being able to serve that. So, a fast and easy way out is to just back off a bit when receiving the first error and wait a number of seconds before sending the next one.

Now, what probably happens is that one of those requests takes way too long, gunicorn kills the worker servicing it, and an empty (and thus malformed) response is returned. The proxies along the chain can't do anything more than propagate this error upstream. More capacity (e.g. the patch bumping the workers to 5, as well as increasing the number of pods serving the external release) will alleviate this, although it does remain problematic as a pattern.

It would be nicer, and definitely more scalable, if the worker itself capped the maximum execution time and responded with partial results (i.e. have the max_recommendations and threshold parameters be in an allowed range with a sane default), or at the very least issued a 503 on its own without being killed.

Note that most of this doesn't necessarily apply to the internal service. It would definitely benefit from not having gunicorn kill the workers, but there we control the clients, their rate of requests and needs and we don't rely on a public facing component (i.e. the api-gateway) that needs to enforce it's own rules regarding timeouts. The internal component is way easier to reason about.

So essentially some batches of requests based on a cron-like pattern, right? That's a very spiky pattern (3-4 rpm then 0 for some time), not easy to do capacity planning for.

That's right, but it also means there will always only be a single request at a time (on the internal instance; on the external one we can't of course guarantee that). Isn't that basically the lowest possible capacity? Assuming there isn't any memory leak, capacity planning couldn't really benefit from having more pause between the requests, could it?

So, a fast and easy way out is to just back off a bit when receiving the first error and wait a number of seconds before sending the next one.

That makes sense, will add that.
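
For what it's worth, the back-off itself can be quite small. Here is a sketch in Python for illustration only: the real change belongs in the refreshLinkRecommendations.php maintenance script, and the attempt count and wait times below are made-up values, not something agreed on in this task.

import time

import requests  # assumed to be available; any HTTP client would do

def fetch_with_backoff(url, max_attempts=3, base_wait=5):
    """Fetch a link recommendation, backing off after a 503/504 response.

    Illustrative only: attempt count and wait times are invented defaults.
    """
    response = None
    for attempt in range(max_attempts):
        response = requests.get(url, timeout=60)
        if response.status_code not in (503, 504):
            return response
        # Wait a bit longer after each failure before the next request,
        # so a killed or restarting worker has time to come back.
        time.sleep(base_wait * (attempt + 1))
    return response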

It would be nicer, and definitely more scalable, if the worker itself capped the maximum execution time and responded with partial results (i.e. have the max_recommendations and threshold parameters be in an allowed range with a sane default), or at the very least issued a 503 on its own without being killed.

That also makes sense, although then we'd need a min_recommendation parameter as well since we can't use recommendations with fewer than a certain number of links.
(Or some form of continuation, but that might be quite wasteful.)

Note that most of this doesn't necessarily apply to the internal service. It would definitely benefit from not having gunicorn kill the workers, but there we control the clients, their rate of requests and needs and we don't rely on a public facing component (i.e. the api-gateway) that needs to enforce it's own rules regarding timeouts. The internal component is way easier to reason about.

In practice I think they will be much the same: the external service will receive requests from the beta cluster cronjob, the internal service from the production cluster cronjob. The latter will need a lot more requests, but since they are sent in a single line, that doesn't make much difference. The external service will of course have any human users on top of that, but I don't expect that to be substantial.

It might turn out that this is too slow and we need to parallelize requests somehow (as it is now, we won't take much advantage of multiple workers since there is just one cron job running at a time, and it only issues one request at a time), but that's a separate discussion. The immediate issue is just making sure most requests don't fail.

So essentially some batches of requests based on a cron-like pattern, right? That's a very spiky pattern (3-4 rpm then 0 for some time), not easy to do capacity planning for.

That's right, but it also means there will always only be a single request at a time (on the internal instance; on the external one we can't of course guarantee that). Isn't that basically the lowest possible capacity? Assuming there isn't any memory leak, capacity planning couldn't really benefit from having more pause between the requests, could it?

In theory? Yes. In practice, the assumption of not having memory leaks, CPU stalls, deadlocks, timeouts, etc. doesn't hold, as you witnessed. In theory, that can be modeled during capacity planning.

More to the point, in the external case the service is pretty underpowered, as I had no idea how much capacity to allocate, so I allocated the bare minimum. The internal service has more capacity, but if you are having trouble with that as well, either the capacity allocated isn't enough or the deeper problem of gunicorn killing workers on timeout is having a very adverse effect. I am leaning towards the latter right now.

So, a fast and easy way out is to just back off a bit when receiving the first error and wait a number of seconds before sending the next one.

That makes sense, will add that.

Thanks!

It would be nicer, and definitely more scalable, if the worker itself capped the maximum execution time and responded with partial results (i.e. have the max_recommendations and threshold parameters be in an allowed range with a sane default), or at the very least issued a 503 on its own without being killed.

That also makes sense, although then we'd need a min_recommendation parameter as well since we can't use recommendations with fewer than a certain number of links.

Sure, +1.

(Or some form of continuation, but that might be quite wasteful.)

Note that most of this doesn't necessarily apply to the internal service. It would definitely benefit from not having gunicorn kill the workers, but there we control the clients, their rate of requests and needs and we don't rely on a public facing component (i.e. the api-gateway) that needs to enforce it's own rules regarding timeouts. The internal component is way easier to reason about.

In practice I think they will be much the same: the external service will receive requests from the beta cluster cronjob, the internal service from the production cluster cronjob. The latter will need a lot more requests, but since they are sent in a single line, that doesn't make much difference. The external service will of course have any human users on top of that, but I don't expect that to be substantial.

It might turn out that this is too slow and we need to parallelize requests somehow (as it is now, we won't take much advantage of multiple workers since there is just one cron job running at a time, and it only issues one request at a time), but that's a separate discussion. The immediate issue is just making sure most requests don't fail.

OK. Let me know how things fare now that the worker # bump has been deployed.

OK. Let me know how things fare now that the worker # bump has been deployed.

Seems essentially fixed, even without the backoff:

Refreshing link recommendations...
  processing topic biography...
    1 new tasks needed
    fetching 500 tasks...
    checking candidate Trojka_(televizní_kanál)... number of good links too small (3)
    checking candidate Věra_Nerušilová... link recommendation already stored
    checking candidate Josef_Blahož... link recommendation already stored
    checking candidate Martin_Stránský... success, updating index
    task pool filled
  processing topic women...
    20 new tasks needed
    fetching 500 tasks...
    checking candidate Taylor_Swift... fetching recommendation failed
There was a problem during the HTTP request: 504 Gateway Timeout
    checking candidate Maryam_d'Abo... number of good links too small (1)
    checking candidate Olga_Lounová... link recommendation already stored
    checking candidate Tereza_Tobiášová... number of good links too small (1)
    checking candidate Avril_Lavigne... fetching recommendation failed
There was a problem during the HTTP request: 504 Gateway Timeout
    checking candidate Marie_Meklenbursko-Zvěřínská... fetching recommendation failed
There was a problem during the HTTP request: 504 Gateway Timeout
    checking candidate Martina_Hillová... fetching recommendation failed
There was a problem during the HTTP request: 504 Gateway Timeout
    checking candidate Barbora_Mudrová... number of good links too small (0)
    checking candidate Barbara_Maria_Willi... link recommendation already stored
    checking candidate Kelly_Clarkson... number of good links too small (0)
    checking candidate Gabriela_Gunčíková... success, updating index
    checking candidate Stanislava_Jachnická... number of good links too small (3)
    checking candidate Eleanor_Daleyová... link recommendation already stored
    checking candidate The_Bella_Twins... number of good links too small (1)
    checking candidate Sandra_Pogodová... number of good links too small (0)
    checking candidate Sóley... link recommendation already stored
    checking candidate Driulis_Gonzálezová... number of good links too small (1)
    checking candidate Katja_Woywoodová... link recommendation already stored
    checking candidate Květoslava_Vonešová... number of good links too small (3)
    checking candidate Ilona_Svobodová... link recommendation already stored
    checking candidate Alžběta_II.... fetching recommendation failed
There was a problem during the HTTP request: 504 Gateway Timeout
    checking candidate Anna_Gavendová... number of good links too small (1)
    checking candidate Barbara_Sobotta... number of good links too small (1)
    checking candidate Noriko_Annová... number of good links too small (2)
    checking candidate Katy_Perry... fetching recommendation failed
There was a problem during the HTTP request: 504 Gateway Timeout
    checking candidate Uršula_Kluková... link recommendation already stored
    checking candidate Mien_Ruys... fetching recommendation failed
There was a problem during the HTTP request: 504 Gateway Timeout
    checking candidate Carmel_Buckingham... number of good links too small (0)
    checking candidate Maja_Velšicová... number of good links too small (1)
    checking candidate Veronica_Campbellová-Brownová... number of good links too small (0)
    checking candidate Anne-Sophie_Mutter... number of good links too small (3)
    checking candidate Lucie_Leišová... success, updating index

The timeouts are still an issue, but that's beta-specific, and infrequent enough that we can live with it.

OK. Let me know how things fare now that the worker # bump has been deployed.

Seems essentially fixed, even without the backoff

I'm still confused whether this change was actually deployed. In looking at this logstash item I'd expect to see data about in_process_cache_access_count when handling a request per rRMWA2d9e6ab760eb: Lower max ngram length and implement in process cache, but I don't see that in the logs. Looking at the docker image that's deployed, I do see that application code exists and running it locally results in in_process_cache_access_count appearing in the logs.

The timeouts are still an issue, but that's beta-specific, and infrequent enough that we can live with it.

If that's fine by you, fine by me. The service level is up to the service owner anyway.

And now that I mention that, you should come up with SLIs/SLOs[1] for this service, both to communicate to the rest of the movement the expected level of service and to set expectations as to what constitutes an outage for this service, so that SREs know when and how to react. SRE Service Ops will provide information and a walkthrough for this.

[1] https://sre.google/sre-book/service-level-objectives/

The timeouts are still an issue, but that's beta-specific, and infrequent enough that we can live with it.

If that's fine by you, fine by me. The service level is up to the service owner anyway.

And now that I mention that, you should come up with SLIs/SLOs[1] for this service, both to communicate to the rest of the movement the expected level of service and to set expectations as to what constitutes an outage for this service, so that SREs know when and how to react. SRE Service Ops will provide information and a walkthrough for this.

[1] https://sre.google/sre-book/service-level-objectives/

Thanks, filed as T278083

Change 673978 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] Stop processing page before timeout is reached

https://gerrit.wikimedia.org/r/673978

OK. Let me know how things fare now that the worker # bump has been deployed.

Seems essentially fixed, even without the backoff

I'm still confused whether this change was actually deployed. In looking at this logstash item I'd expect to see data about in_process_cache_access_count when handling a request per rRMWA2d9e6ab760eb: Lower max ngram length and implement in process cache, but I don't see that in the logs. Looking at the docker image that's deployed, I do see that application code exists and running it locally results in in_process_cache_access_count appearing in the logs.

Depends on which change you are referring to ;)

The deployed change per e1fab29abbbf1 is the first one, not the second one.

OK. Let me know how things fare now that the worker # bump has been deployed.

Seems essentially fixed, even without the backoff

I'm still confused whether this change was actually deployed. In looking at this logstash item I'd expect to see data about in_process_cache_access_count when handling a request per rRMWA2d9e6ab760eb: Lower max ngram length and implement in process cache, but I don't see that in the logs. Looking at the docker image that's deployed, I do see that application code exists and running it locally results in in_process_cache_access_count appearing in the logs.

Depends on which change you are referring to ;)

The deployed change per e1fab29abbbf1 is the first one, not the second one.

Oh right. That all makes sense now. When I was picking the tag to deploy in operations/deployment-charts, I guess I was looking at a cached page because there was only a single 3-18 tag at the time and I (wrongly) assumed it had the changes I wanted.

What do you mean by more permissive? Increasing the upstream proxy timeout based on something on the request, like a header or the source IP? I don't think that's possible.

Something like that, yes. Wasn't the idea behind the API portal that you can use it with low quotas anonymously, or you register a client (possibly pay for it in other scenarios), authenticate with an OAuth bearer token, and get higher quotas?

Yes, I think that was the idea, but we aren't talking about quotas here but rather the upstream server (the linkrecommendation endpoint) timeout, which can't be set per user/request.

I am a bit unclear on the MediaWiki in the beta cluster thing though. I would expect that to use a setup similar to what is in production. Although I would totally understand it if it's not possible to have a similar setup in beta and instead rely on this, beta would end up exercising a different path than production, wouldn't it?

I think we mainly care about the MediaWiki part being similar, as we don't intend to do lots of changes to the service, so having a beta instance of the service is not worth it. And beta cluster appservers can only connect to the external service, so making it possible to use similar quotas on the external service instance as production has on the internal ones would make it more similar.
(But AIUI apart from the 15 sec vs. 30 sec difference the permissions are the same, and the difference we are seeing comes from production, where we are only running on testwiki with the simplewiki language model, which is simpler than the language models of most of the beta wikis. So this point might be moot.)

Well, it will be 15 vs 60s IIRC after @kostajh's change is merged, but everything else stands exactly as you said.

Back to the 15-second timeout. I think we do want a higher value for the API gateway / external traffic release of the service. I have a patch in progress that will return whatever link recommendations are possible within a defined period of time, so while that should allow us to avoid hitting the timeout errors, it means requesting link recommendations for an article might yield only one or two results. That might not be especially helpful for community members and QA testers of the service. So I'd prefer to have a higher value like 60 seconds, if that is OK from the service infrastructure side of things.
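
As a very rough sketch of that idea (this is not the actual mwaddlink patch; the function names and the candidate/scoring interface below are invented for illustration), the query loop keeps a deadline and returns whatever it has once the budget is spent:

import time

def recommend_links(candidates, score_link, max_recommendations, time_budget_s):
    """Return whatever recommendations fit inside a time budget.

    Illustrative sketch only: in the real service the candidate generation
    and scoring are done by the mwaddlink model code.
    """
    deadline = time.monotonic() + time_budget_s
    recommendations = []
    for candidate in candidates:
        if time.monotonic() >= deadline or len(recommendations) >= max_recommendations:
            # Stop early and return partial results instead of letting
            # gunicorn kill the worker at its hard timeout.
            break
        scored = score_link(candidate)
        if scored is not None:
            recommendations.append(scored)
    return recommendations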

What do you mean by more permissive? Increasing the upstream proxy timeout based on something on the request, like a header or the source IP? I don't think that's possible.

Something like that, yes. Wasn't the idea behind the API portal that you can use it with low quotas anonymously, or you register a client (possibly pay for it in other scenarios), authenticate with an OAuth bearer token, and get higher quotas?

Yes, I think that was the idea, but we aren't talking about quotas here but rather the upstream server (the linkrecommendation endpoint) timeout, which can't be set per user/request.

I am a bit unclear on the MediaWiki in the beta cluster thing though. I would expect that to use a setup similar to what is in production. Although I would totally understand it if it's not possible to have a similar setup in beta and instead rely on this, beta would end up exercising a different path than production, wouldn't it?

I think we mainly care about the MediaWiki part being similar, as we don't intend to do lots of changes to the service, so having a beta instance of the service is not worth it. And beta cluster appservers can only connect to the external service, so making it possible to use similar quotas on the external service instance as production has on the internal ones would make it more similar.
(But AIUI apart from the 15 sec vs. 30 sec difference the permissions are the same, and the difference we are seeing comes from production, where we are only running on testwiki with the simplewiki language model, which is simpler than the language models of most of the beta wikis. So this point might be moot.)

Well, it will be 15 vs 60s IIRC after @kostajh's change is merged, but everything else stands exactly as you said.

Back to the 15-second timeout. I think we do want a higher value for the API gateway / external traffic release of the service. I have a patch in progress that will return whatever link recommendations are possible within a defined period of time, so while that should allow us to avoid hitting the timeout errors, it means requesting link recommendations for an article might yield only one or two results. That might not be especially helpful for community members and QA testers of the service. So I'd prefer to have a higher value like 60 seconds, if that is OK from the service infrastructure side of things.

I'd defer to @hnowlan for this. On the service side, we can set whatever timeout you want, but the api-gateway (which is where the 15s timeout is) is maintained by Platform, not serviceops. I suppose they have the 15s limit for a reason.

Change 674529 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[operations/deployment-charts@master] [WIP] linkrecommendation: Vary gunicorn timeout by environment

https://gerrit.wikimedia.org/r/674529

Change 674562 had a related patch set uploaded (by Hnowlan; owner: Hnowlan):
[operations/deployment-charts@master] api-gateway: make discovery service timeouts configurable per service

https://gerrit.wikimedia.org/r/674562

Change 674562 abandoned by Hnowlan:
[operations/deployment-charts@master] api-gateway: make discovery service timeouts configurable per service

Reason:
Wrong change to make

https://gerrit.wikimedia.org/r/674562

Change 674562 restored by Hnowlan:
[operations/deployment-charts@master] api-gateway: make discovery service timeouts configurable per service

https://gerrit.wikimedia.org/r/674562

For reference the 15s timeout is the Envoy default for upstream responses. I've just filed a CR to allow the timeout to be set on a per-service basis.

Subscribing to follow along. May have implications for image recommendations.

For reference the 15s timeout is the Envoy default for upstream responses. I've just filed a CR to allow the timeout to be set on a per-service basis.

FWIW it's probably fine; my only worry would be upstream connection resource starvation in the api-gateway. That being said, 30s isn't that much greater than 15s with regard to that.

Change 673978 merged by jenkins-bot:
[research/mwaddlink@main] Return available recommendations before timeout is reached

https://gerrit.wikimedia.org/r/673978

Change 674529 merged by jenkins-bot:
[operations/deployment-charts@master] linkrecommendation: Add environment variable for gunicorn timeout

https://gerrit.wikimedia.org/r/674529

I think the 504/503 issue is solved although I now get intermittent 404s 😭

    checking candidate Picture_Post... success, updating index
    checking candidate CSS3... number of good links too small (0)
    checking candidate České_filmové_nebe... link recommendation already stored
    checking candidate Muzikál_ze_základní... success, updating index
    checking candidate Will.i.am... link recommendation already stored
    checking candidate Stahování_hudby... link recommendation already stored
    checking candidate Imagine_Dragons... link recommendation already stored
    checking candidate Zelená_kniha... number of good links too small (0)
    checking candidate HipHop_for_PHP... success, updating index
    checking candidate Baldur's_Gate_2... link recommendation already stored
    checking candidate Nervosa... link recommendation already stored
    checking candidate BoLs/sLoB... fetching recommendation failed
There was a problem during the HTTP request: 404 Not Found

I think the 504/503 issue is solved although I now get intermittent 404s 😭

    checking candidate Picture_Post... success, updating index
    checking candidate CSS3... number of good links too small (0)
    checking candidate České_filmové_nebe... link recommendation already stored
    checking candidate Muzikál_ze_základní... success, updating index
    checking candidate Will.i.am... link recommendation already stored
    checking candidate Stahování_hudby... link recommendation already stored
    checking candidate Imagine_Dragons... link recommendation already stored
    checking candidate Zelená_kniha... number of good links too small (0)
    checking candidate HipHop_for_PHP... success, updating index
    checking candidate Baldur's_Gate_2... link recommendation already stored
    checking candidate Nervosa... link recommendation already stored
    checking candidate BoLs/sLoB... fetching recommendation failed
There was a problem during the HTTP request: 404 Not Found

It's this article it fails for, https://cs.wikipedia.org/wiki/BoLs/sLoB, right? Weird, I have nothing to offer.

But this one

checking candidate Baldur's_Gate_2... link recommendation already stored

made me smile, thanks!

I guess this can be resolved?

We are probably not URL-encoding the / properly.

We are probably not URL-encoding the / properly.

No, this is coming from the service:
https://api.wikimedia.org/service/linkrecommendation/v1/linkrecommendations/wikipedia/cs/BoLs%2FsLoB?threshold=0.5&max_recommendations=15
(It definitely has to do with the / in the title though, normal 404s look different: https://api.wikimedia.org/service/linkrecommendation/v1/linkrecommendations/wikipedia/cs/BoLs?threshold=0.5&max_recommendations=15 )
Is it possible that the API portal proxy is decoding the slash?
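
For reference, the difference between the two request paths above comes down to whether the slash in the title is percent-encoded. The snippet below is just generic Python urllib behaviour, shown to illustrate the mechanics, not what MediaWiki or the gateway actually does:

from urllib.parse import quote

title = "BoLs/sLoB"

# By default quote() treats "/" as safe, so the slash stays in the path and
# gets interpreted as a path separator by whatever routes the request.
print(quote(title))            # BoLs/sLoB

# With safe="" the slash becomes %2F, matching the failing request above;
# a proxy decoding %2F back to "/" would change which route is matched.
print(quote(title, safe=""))   # BoLs%2FsLoB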

I guess this can be resolved?

I'm waiting for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/674562 to be merged + deployed, then I need to update the GUNICORN_TIMEOUT value for the external traffic release. We could untag serviceops though. Thank you for your help with this @akosiaris!

Change 674562 merged by jenkins-bot:
[operations/deployment-charts@master] api-gateway: make discovery service timeouts configurable per service

https://gerrit.wikimedia.org/r/674562

Change 675900 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] linkrecommendation: Bump version, adjust timeout, disable cron in staging

https://gerrit.wikimedia.org/r/675900

Change 675900 merged by jenkins-bot:

[operations/deployment-charts@master] linkrecommendation: Bump version, adjust timeout, disable cron in staging

https://gerrit.wikimedia.org/r/675900

Change 674562 merged by jenkins-bot:
[operations/deployment-charts@master] api-gateway: make discovery service timeouts configurable per service

https://gerrit.wikimedia.org/r/674562

I bumped GUNICORN_TIMEOUT to 30 seconds in https://gerrit.wikimedia.org/r/675900, but it looks like the API gateway is still terminating requests at 15s:

 time curl -X GET "https://api.wikimedia.org/service/linkrecommendation/v1/linkrecommendations/wikipedia/cs/Lipsko?threshold=0.5&max_recommendations=15" -H  "accept: application/json"
 {"httpReason":"upstream request timeout","httpCode":504}
curl -X GET  -H "accept: application/json"  0.02s user 0.01s system 0% cpu 15.290 total

Change 675903 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] linkrecommendation: Set timeout to 15s

https://gerrit.wikimedia.org/r/675903

Change 675903 merged by jenkins-bot:

[operations/deployment-charts@master] linkrecommendation: Set timeout to 15s

https://gerrit.wikimedia.org/r/675903

kostajh closed this task as Resolved. (Edited Apr 1 2021, 8:14 AM)

Change 674562 merged by jenkins-bot:
[operations/deployment-charts@master] api-gateway: make discovery service timeouts configurable per service

https://gerrit.wikimedia.org/r/674562

I bumped GUNICORN_TIMEOUT to 30 seconds in https://gerrit.wikimedia.org/r/675900, but it looks like the API gateway is still terminating requests at 15s:

 time curl -X GET "https://api.wikimedia.org/service/linkrecommendation/v1/linkrecommendations/wikipedia/cs/Lipsko?threshold=0.5&max_recommendations=15" -H  "accept: application/json"
 {"httpReason":"upstream request timeout","httpCode":504}
curl -X GET  -H "accept: application/json"  0.02s user 0.01s system 0% cpu 15.290 total

The api-gateway patch wasn't fully deployed. Now that it is, and the GUNICORN_TIMEOUT value is set to 30, we can process requests that take up to 30s.

(base) ➜  ~ time curl -X GET "https://api.wikimedia.org/service/linkrecommendation/v1/linkrecommendations/wikipedia/cs/Lipsko?threshold=0.5&max_recommendations=20" -H  "accept: application/json"
{"links":[{"context_after":" 1943 (bri","context_before":"provedeny ","link_index":0,"link_target":"4. prosinec","link_text":"4. prosince","match_index":0,"score":0.7247642874717712,"wikitext_offset":5328},{"context_after":" 1944 (","context_before":") a ","link_index":1,"link_target":"7. červenec","link_text":"7. července","match_index":0,"score":0.8392090797424316,"wikitext_offset":5377},{"context_after":" 1945. Pod","context_before":"ila město ","link_index":2,"link_target":"18. duben","link_text":"18. dubna","match_index":0,"score":0.8711312413215637,"wikitext_offset":5517},{"context_after":" 1945 před","context_before":" ho ","link_index":3,"link_target":"2. červenec","link_text":"2. července","match_index":0,"score":0.7780224680900574,"wikitext_offset":5597},{"context_after":" \"","context_before":"větoznámý ","link_index":4,"link_target":"Pěvecký sbor","link_text":"pěvecký sbor","match_index":0,"score":0.5680157542228699,"wikitext_offset":9510},{"context_after":". Denně jí","context_before":" v ","link_index":5,"link_target":"Evropa","link_text":"Evropě","match_index":0,"score":0.643818736076355,"wikitext_offset":10764},{"context_after":" (1916), p","context_before":"v průběhu ","link_index":6,"link_target":"První světová válka","link_text":"1. světové války","match_index":0,"score":0.6768380999565125,"wikitext_offset":11237},{"context_after":" (Naturkun","context_before":" Muzeum ","link_index":7,"link_target":"Přírodní vědy","link_text":"přírodních věd","match_index":0,"score":0.7242420315742493,"wikitext_offset":12651},{"context_after":")\n\n","context_before":"nologie a ","link_index":8,"link_target":"Hudební nástroj","link_text":"hudebních nástrojů","match_index":0,"score":0.5047877430915833,"wikitext_offset":12779},{"context_after":" postupně ","context_before":"ace se od ","link_index":9,"link_target":"1990–1999","link_text":"90. let","match_index":0,"score":0.6058660745620728,"wikitext_offset":14219},{"context_after":" (Hamburg–","context_before":"asy vlaků ","link_index":10,"link_target":"Intercity-Express","link_text":"Intercity-Express","match_index":0,"score":0.5736081600189209,"wikitext_offset":15713},{"context_after":" a házená.","context_before":"orty jsou ","link_index":11,"link_target":"Fotbal","link_text":"fotbal","match_index":0,"score":0.6937174797058105,"wikitext_offset":16691},{"context_after":" a matemat","context_before":"k, vědec, ","link_index":12,"link_target":"Diplomat","link_text":"diplomat","match_index":0,"score":0.5368956923484802,"wikitext_offset":17312}],"links_count":13,"page_title":"Lipsko","pageid":23151,"revid":19602785}
curl -X GET  -H "accept: application/json"  0.02s user 0.01s system 0% cpu 17.928 total

Thanks @hnowlan and @akosiaris!