
Some public PAWS files give 504 errors
Closed, Resolved · Public

Description

See my notes on this here and here (the latter with some musings on whether file size may play a role - perhaps indirectly).

Examples

They are from the same folder, with identical permissions.

Event Timeline

File size could definitely be part of it since there are multiple proxies. I get the same results. I'll take a look at the layers here and see if I can find it.

Some notes: I've tracked this error down to the paws public pods.
The nbserve pod reports:

2021/01/06 18:19:36 [error] 6#6: *97205 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.33.131, server: , request: "GET /12410844/100days/100days-Day-088.ipynb HTTP/1.1", upstream: "uwsgi://10.96.182.172:8000", host: "public.paws.wmcloud.org"
192.168.33.131 - - [06/Jan/2021:18:19:36 +0000] "GET /12410844/100days/100days-Day-088.ipynb HTTP/1.1" 504 575 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"

The renderer pod, where the error actually happens, has:

Wed Jan  6 18:15:38 2021 - SIGPIPE: writing to a closed pipe/socket/fd (probably the client disconnected) on request /12410844/100days/100days-Day-088.ipynb (ip 192.168.33.131) !!!
Wed Jan  6 18:15:38 2021 - uwsgi_response_write_body_do(): Broken pipe [core/writer.c line 341] during GET /12410844/100days/100days-Day-088.ipynb (192.168.33.131)
OSError: write error
[pid: 10|app: 0|req: 11439/32846] 192.168.33.131 () {64 vars in 1244 bytes} [Wed Jan  6 18:13:07 2021] GET /12410844/100days/100days-Day-088.ipynb => generated 76256 bytes in 151124 msecs (HTTP/1.1 200) 2 headers in 84 bytes (1 switches on core 0)
Wed Jan  6 18:21:05 2021 - SIGPIPE: writing to a closed pipe/socket/fd (probably the client disconnected) on request /12410844/100days/100days-Day-088.ipynb (ip 192.168.33.131) !!!
Wed Jan  6 18:21:05 2021 - uwsgi_response_write_body_do(): Broken pipe [core/writer.c line 341] during GET /12410844/100days/100days-Day-088.ipynb (192.168.33.131)
OSError: write error
[pid: 9|app: 0|req: 8713/32847] 192.168.33.131 () {66 vars in 1269 bytes} [Wed Jan  6 18:18:36 2021] GET /12410844/100days/100days-Day-088.ipynb => generated 76256 bytes in 149434 msecs (HTTP/1.1 200) 2 headers in 84 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 8715/32849] 192.168.33.131 () {50 vars in 1011 bytes} [Wed Jan  6 18:21:20 2021] GET /12410844/100days/100days-Day-080.ipynb => generated 729139 bytes in 322 msecs (HTTP/1.1 200) 2 headers in 83 bytes (10 switches on core 0)

I think the renderer is just not performant enough for whatever is going on in that notebook, so the nbserve pod (which is nginx) times it out. Given what PAWS does, increasing the timeout seems reasonable.
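
For context, a minimal sketch of the kind of nginx setup involved, assuming nbserve proxies to the renderer via uwsgi_pass (which the uwsgi:// upstream in the log suggests); the upstream address is copied from the log above, and the real PAWS config is certainly structured differently:

# Hypothetical nbserve-style server block; layout and values are illustrative only.
server {
    listen 8000;

    location / {
        include uwsgi_params;
        uwsgi_pass 10.96.182.172:8000;  # the renderer pod from the log above

        # nginx waits 60s by default for the upstream's response headers,
        # so a notebook that takes ~150s to render comes back as a 504.
        uwsgi_read_timeout 60s;
    }
}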

No dice so far after that.

2021/01/06 23:33:26 [error] 13#13: *1 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.33.131, server: , request: "GET /12410844/100days/100days-Day-088.ipynb HTTP/1.1", upstream: "uwsgi://10.96.182.172:8000", host: "public.paws.wmcloud.org"
192.168.33.131 - - [06/Jan/2021:23:33:26 +0000] "GET /12410844/100days/100days-Day-088.ipynb HTTP/1.1" 504 575 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"

Or progress, depending on how you look at it. This changed the error slightly.

On the renderer side, it's the same, though.

Wed Jan  6 23:34:57 2021 - SIGPIPE: writing to a closed pipe/socket/fd (probably the client disconnected) on request /12410844/100days/100days-Day-088.ipynb (ip 192.168.33.131) !!!
Wed Jan  6 23:34:57 2021 - uwsgi_response_write_body_do(): Broken pipe [core/writer.c line 341] during GET /12410844/100days/100days-Day-088.ipynb (192.168.33.131)
OSError: write error
[pid: 10|app: 0|req: 11484/33020] 192.168.33.131 () {64 vars in 1238 bytes} [Wed Jan  6 23:32:26 2021] GET /12410844/100days/100days-Day-088.ipynb => generated 76256 bytes in 151354 msecs (HTTP/1.1 200) 2 headers in 84 bytes (1 switches on core 0)

So, it generated 76 KB in 151 seconds. Our read timeout is now at 180 seconds, so that would seem to be in line, but the connect timeout is still at the default of 60s. I'll expand another timeout and see how it does.
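
For reference, these are the standard nginx timeouts that apply when proxying with uwsgi_pass, with their documented defaults; which of them the config actually sets, and where, is what's being narrowed down here:

# Documented nginx defaults for the uwsgi module's timeouts.
uwsgi_connect_timeout 60s;  # establishing the connection to the upstream
uwsgi_send_timeout    60s;  # writing the request to the upstream
uwsgi_read_timeout    60s;  # waiting for the upstream's response

The "timed out while reading response header from upstream" message in the nbserve log is the read side timing out rather than the connect side.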

That didn't work either. I still saw the whole stack time out at 60s. The key is to find which timeout did that, and in which proxy.

2021/01/07 18:17:15 [error] 6#6: *10 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.122.131, server: , request: "GET /12410844/100days/100days-Day-088.ipynb HTTP/1.1", upstream: "uwsgi://10.96.182.172:8000", host: "public.paws.wmcloud.org"
192.168.122.131 - - [07/Jan/2021:18:17:15 +0000] "GET /12410844/100days/100days-Day-088.ipynb HTTP/1.1" 504 575 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"

Since it is clearly still using a default (60s), I think the one I haven't messed with is uwsgi_read_timeout. Plus, I should probably promote these directives out of a location block.
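
A rough sketch of what that change could look like (assuming the directives live in nbserve's nginx config; the 180s value matches the one mentioned below, but the exact file and surrounding structure are guesses):

# Set the timeouts at server level so they apply to every location block,
# instead of being buried inside one location {} while other locations
# still fall back to the 60s defaults.
server {
    uwsgi_read_timeout 180s;
    uwsgi_send_timeout 180s;

    location / {
        include uwsgi_params;
        uwsgi_pass 10.96.182.172:8000;
    }
}

Directives like these are inherited downward in nginx, so setting them once at server (or http) level covers all locations beneath it.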

That did it: https://public.paws.wmcloud.org/12410844/100days/100days-Day-088.ipynb

If you concoct something that takes even longer to render, it will still fail. I've expanded the timeout to 180 seconds (3x the default). Updating the renderer to something faster will need to happen one of these days, but there are limits to everything.

Bstorm claimed this task.