
Some public PAWS files give 504 errors
Closed, Resolved · Public

Description

See my notes on this here and here (the latter with some musings on whether file size may play a role - perhaps indirectly).

Examples

They are from the same folder, with identical permissions.

Event Timeline

File size could definitely be part of it since there are multiple proxies. I get the same results. I'll take a look at the layers here and see if I can find it.

Some notes: I've tracked this error down to the paws public pods.
The nbserve pod reports:

2021/01/06 18:19:36 [error] 6#6: *97205 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.33.131, server: , request: "GET /12410844/100days/100days-Day-088.ipynb HTTP/1.1", upstream: "uwsgi://10.96.182.172:8000", host: "public.paws.wmcloud.org"
192.168.33.131 - - [06/Jan/2021:18:19:36 +0000] "GET /12410844/100days/100days-Day-088.ipynb HTTP/1.1" 504 575 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"

The renderer pod, where the error actually happens, has:

Wed Jan  6 18:15:38 2021 - SIGPIPE: writing to a closed pipe/socket/fd (probably the client disconnected) on request /12410844/100days/100days-Day-088.ipynb (ip 192.168.33.131) !!!
Wed Jan  6 18:15:38 2021 - uwsgi_response_write_body_do(): Broken pipe [core/writer.c line 341] during GET /12410844/100days/100days-Day-088.ipynb (192.168.33.131)
OSError: write error
[pid: 10|app: 0|req: 11439/32846] 192.168.33.131 () {64 vars in 1244 bytes} [Wed Jan  6 18:13:07 2021] GET /12410844/100days/100days-Day-088.ipynb => generated 76256 bytes in 151124 msecs (HTTP/1.1 200) 2 headers in 84 bytes (1 switches on core 0)
Wed Jan  6 18:21:05 2021 - SIGPIPE: writing to a closed pipe/socket/fd (probably the client disconnected) on request /12410844/100days/100days-Day-088.ipynb (ip 192.168.33.131) !!!
Wed Jan  6 18:21:05 2021 - uwsgi_response_write_body_do(): Broken pipe [core/writer.c line 341] during GET /12410844/100days/100days-Day-088.ipynb (192.168.33.131)
OSError: write error
[pid: 9|app: 0|req: 8713/32847] 192.168.33.131 () {66 vars in 1269 bytes} [Wed Jan  6 18:18:36 2021] GET /12410844/100days/100days-Day-088.ipynb => generated 76256 bytes in 149434 msecs (HTTP/1.1 200) 2 headers in 84 bytes (1 switches on core 0)
[pid: 9|app: 0|req: 8715/32849] 192.168.33.131 () {50 vars in 1011 bytes} [Wed Jan  6 18:21:20 2021] GET /12410844/100days/100days-Day-080.ipynb => generated 729139 bytes in 322 msecs (HTTP/1.1 200) 2 headers in 83 bytes (10 switches on core 0)

I think the renderer is just not performant enough for whatever is going on in that notebook, so the nbserve pod (which is nginx) times it out. Given what PAWS does, increasing the timeout seems reasonable.
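
For context, a minimal sketch of the kind of nginx setup involved, assuming nbserve proxies to the renderer via uwsgi_pass (which the uwsgi:// upstream in the log suggests); the upstream address is copied from the log above, and the real PAWS config is certainly structured differently:

# Hypothetical nbserve-style server block; layout and values are illustrative only.
server {
    listen 8000;

    location / {
        include uwsgi_params;
        uwsgi_pass 10.96.182.172:8000;  # the renderer pod from the log above

        # nginx waits 60s by default for the upstream's response headers,
        # so a notebook that takes ~150s to render comes back as a 504.
        uwsgi_read_timeout 60s;
    }
}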

No dice so far after that.

2021/01/06 23:33:26 [error] 13#13: *1 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.33.131, server: , request: "GET /12410844/100days/100days-Day-088.ipynb HTTP/1.1", upstream: "uwsgi://10.96.182.172:8000", host: "public.paws.wmcloud.org"
192.168.33.131 - - [06/Jan/2021:23:33:26 +0000] "GET /12410844/100days/100days-Day-088.ipynb HTTP/1.1" 504 575 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"

Or progress, depending on how you look at it. This changed the error slightly.

On the renderer side, it's the same, though.

Wed Jan  6 23:34:57 2021 - SIGPIPE: writing to a closed pipe/socket/fd (probably the client disconnected) on request /12410844/100days/100days-Day-088.ipynb (ip 192.168.33.131) !!!
Wed Jan  6 23:34:57 2021 - uwsgi_response_write_body_do(): Broken pipe [core/writer.c line 341] during GET /12410844/100days/100days-Day-088.ipynb (192.168.33.131)
OSError: write error
[pid: 10|app: 0|req: 11484/33020] 192.168.33.131 () {64 vars in 1238 bytes} [Wed Jan  6 23:32:26 2021] GET /12410844/100days/100days-Day-088.ipynb => generated 76256 bytes in 151354 msecs (HTTP/1.1 200) 2 headers in 84 bytes (1 switches on core 0)

So, it generated 76 KB in 151 seconds. Our read timeout is now at 180 seconds, so that would seem to be in line, but the connect timeout is still at the default of 60s. I'll expand another timeout and see how it does.
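
For reference, these are the standard nginx timeouts that apply when proxying with uwsgi_pass, with their documented defaults; which of them the config actually sets, and where, is what's being narrowed down here:

# Documented nginx defaults for the uwsgi module's timeouts.
uwsgi_connect_timeout 60s;  # establishing the connection to the upstream
uwsgi_send_timeout    60s;  # writing the request to the upstream
uwsgi_read_timeout    60s;  # waiting for the upstream's response

The "timed out while reading response header from upstream" message in the nbserve log is the read side timing out rather than the connect side.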

That didn't work either. I still saw the whole stack time out at 60s. The key is to find which timeout did that, and in which proxy.

2021/01/07 18:17:15 [error] 6#6: *10 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.122.131, server: , request: "GET /12410844/100days/100days-Day-088.ipynb HTTP/1.1", upstream: "uwsgi://10.96.182.172:8000", host: "public.paws.wmcloud.org"
192.168.122.131 - - [07/Jan/2021:18:17:15 +0000] "GET /12410844/100days/100days-Day-088.ipynb HTTP/1.1" 504 575 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"

Since it is clearly still using a default (60s), I think the one I haven't messed with is uwsgi_read_timeout. Plus, I should probably promote these directives out of a location block.
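
A rough sketch of what that change could look like (assuming the directives live in nbserve's nginx config; the 180s value matches the one mentioned below, but the exact file and surrounding structure are guesses):

# Set the timeouts at server level so they apply to every location block,
# instead of being buried inside one location {} while other locations
# still fall back to the 60s defaults.
server {
    uwsgi_read_timeout 180s;
    uwsgi_send_timeout 180s;

    location / {
        include uwsgi_params;
        uwsgi_pass 10.96.182.172:8000;
    }
}

Directives like these are inherited downward in nginx, so setting them once at server (or http) level covers all locations beneath it.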

That did it: https://public.paws.wmcloud.org/12410844/100days/100days-Day-088.ipynb

If you concoct something that takes even longer to render, it will still fail. I've expanded the timeout to 180 seconds (3x the default). Updating the renderer to something faster will need to happen one of these days, but there are limits to everything.

Bstorm claimed this task.