Harbor uploads sometimes fail due to tmpfs space on project-proxy
Closed, Resolved (Public)

Description

I just got this error message when building an updated pywikibot-buildservice image:

[step-export] 2023-12-30T16:02:44.228444405Z ERROR: failed to export: failed to write image to the following tags: [tools-harbor.wmcloud.org/tool-pywikibot/pywikibot-scripts-stable:latest: PATCH https://tools-harbor.wmcloud.org/v2/tool-pywikibot/pywikibot-scripts-stable/blobs/uploads/e07b5429-fcb2-4009-9eb3-d0dc5a492e84?_state=REDACTED: unexpected status code 500 Internal Server Error: <html>
[step-export] 2023-12-30T16:02:44.228501698Z <head><title>500 Internal Server Error</title></head>
[step-export] 2023-12-30T16:02:44.228513968Z <body>
[step-export] 2023-12-30T16:02:44.228522933Z <center><h1>500 Internal Server Error</h1></center>
[step-export] 2023-12-30T16:02:44.228532921Z <hr><center>nginx/1.18.0</center>
[step-export] 2023-12-30T16:02:44.228562400Z </body>
[step-export] 2023-12-30T16:02:44.228570800Z </html>
[step-export] 2023-12-30T16:02:44.228578157Z ]
[step-results] 2023-12-30T16:02:45.199470257Z 2023/12/30 16:02:45 Skipping step because a previous step failed

and in proxy-03 logs I see this:

2023/12/30 16:02:36 [crit] 3404031#3404031: *58485319 pwrite() "/var/lib/nginx/body/0000463514" failed (28: No space left on device), client: 172.16.2.172, server: , request: "PATCH /v2/tool-pywikibot/pywikibot-scripts-stable/blobs/uploads/ef3e3395-9cf2-4b78-a802-be688e9fdfeb?_state=REDACTED HTTP/2.0", host: "tools-harbor.wmcloud.org"

/var/lib/nginx is a 1G tmpfs device:

root@proxy-03:~# df -h /var/lib/nginx
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           1.0G     0  1.0G   0% /var/lib/nginx

The image was published successfully after a retry.

Event Timeline

Looks like it :/, will have to experiment a bit more.

The nginx documentation notes that request buffering cannot be turned off when the client uses HTTP/1.1 chunked transfer encoding, unless HTTP/1.1 is also enabled for proxying:
"""
When HTTP/1.1 chunked transfer encoding is used to send the original request body, the request body will be buffered regardless of the directive value unless HTTP/1.1 is enabled for proxying.
"""
https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_request_buffering

But it seems the client is using HTTP/2, and I have not yet found any mention of how HTTP/2 is handled.

It'd be interesting to play with this a bit. I was using this repo, https://github.com/NdibeRaymond/5GB, to reproduce the error as initially reported. This is likely caused by a different case.

@dcaro do you have any idea on how to reproduce this issue?

The only idea I have is to try to upload several things in parallel.

Another option to try the same would be to set up nginx locally with a very small tmp folder and play with that; you might be able to trigger it more easily.

Otherwise, I would close this for now, and continue the effort if/when we get more reports of failures.
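As a rough starting point for the "upload several things in parallel" idea, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the `/upload/N` paths, and the small local sink server, which only stands in for the real proxy and registry. It also speaks HTTP/1.1, not the HTTP/2 seen in the proxy log, because the Python standard library has no HTTP/2 client.

```python
import http.client
import threading
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class SinkHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in backend: reads the PATCH body and discards it."""

    def do_PATCH(self):
        remaining = int(self.headers.get("Content-Length", 0))
        while remaining > 0:
            chunk = self.rfile.read(min(remaining, 65536))
            if not chunk:
                break
            remaining -= len(chunk)
        self.send_response(202)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the output quiet


def upload_blob(host, port, path, size):
    """Send one PATCH request carrying `size` zero bytes; return the status."""
    conn = http.client.HTTPConnection(host, port, timeout=30)
    try:
        conn.request("PATCH", path, body=b"\0" * size,
                     headers={"Content-Type": "application/octet-stream"})
        resp = conn.getresponse()
        resp.read()
        return resp.status
    finally:
        conn.close()


def parallel_uploads(host, port, count, size):
    """Run `count` uploads concurrently and return their status codes."""
    with ThreadPoolExecutor(max_workers=count) as pool:
        futures = [pool.submit(upload_blob, host, port, f"/upload/{i}", size)
                   for i in range(count)]
        return [f.result() for f in futures]


def run_local_demo(count=4, size=1024 * 1024):
    """Start the sink server on a free local port and upload to it."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), SinkHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return parallel_uploads("127.0.0.1", port, count, size)
    finally:
        server.shutdown()
        server.server_close()
```

To exercise a local nginx with a small temp directory instead, `parallel_uploads` could be pointed at the host and port nginx listens on, with the sink server configured as its backend.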

Change 1012728 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] dynamicproxy: use http 1.1 for backend connections

https://gerrit.wikimedia.org/r/1012728

Change #1012728 merged by Majavah:

[operations/puppet@production] dynamicproxy: use http 1.1 for backend connections

https://gerrit.wikimedia.org/r/1012728

Waiting for user input to see if this happens again (might go and check the logs too)

I'm thinking that it might be another cache that fills up. We might want to try setting:

https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_path

That allows setting a min_free parameter to manage cache cleanup.

Today it got full again:

root@proxy-03:~# df -h /var/lib/nginx
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           1.0G  1.0G     0 100% /var/lib/nginx

and a du -hs shows 0 bytes of usage:

root@proxy-03:/var/lib/nginx# du -hs *
0       body
0       fastcgi
0       proxy
0       scgi
0       uwsgi

A recursive ls shows only directories (mostly under proxy), but no files:

root@proxy-03:/var/lib/nginx# ls -R . > /root/ls_recursive.txt
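One way to put a number on the mismatch between df and du is to compare what the filesystem reports as used with the total size of the files that can actually be reached by walking the tree. The sketch below is only an illustration; the function names are made up, and it sums apparent file sizes (similar to `du --apparent-size`), not allocated blocks. The difference is only meaningful when the path is the root of a dedicated mount such as this tmpfs.

```python
import os


def filesystem_used_bytes(path):
    """Bytes in use on the filesystem containing `path` (what df reports)."""
    st = os.statvfs(path)
    return (st.f_blocks - st.f_bfree) * st.f_frsize


def visible_bytes(path):
    """Sum of apparent sizes of all files reachable by walking `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except FileNotFoundError:
                pass  # file vanished between listing and stat
    return total


def unaccounted_bytes(path):
    """Space the filesystem counts as used but that no visible file explains."""
    return filesystem_used_bytes(path) - visible_bytes(path)
```

On a dedicated mount, a large value from `unaccounted_bytes("/var/lib/nginx")` would mean the space is held by something that no longer has a name in the directory tree.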

dcaro changed the task status from Open to In Progress. Apr 16 2024, 9:39 AM
dcaro claimed this task.
dcaro moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 08) board.

Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1020767 to disable file buffering for responses, let's see how it goes.

So far so good:

root@proxy-03:/etc/nginx# lsof -n | grep var/lib/nginx
root@proxy-03:/etc/nginx#

Btw, an lsof showed that the files had already been deleted but were still held open by nginx:

root@proxy-03:/etc/nginx# lsof -n | grep var/lib/nginx
nginx     2268187                           www-data  511u      REG               0,38    126544    9281617 /var/lib/nginx/proxy/4/25/0000002254 (deleted)
nginx     2268187                           www-data  535u      REG               0,38   8727816    9281618 /var/lib/nginx/proxy/5/25/0000002255 (deleted)
nginx     2268187                           www-data  797u      REG               0,38 103800832    9280705 /var/lib/nginx/proxy/2/34/0000001342 (deleted)
nginx     2268187                           www-data  850u      REG               0,38    638976    9281616 /var/lib/nginx/proxy/3/25/0000002253 (deleted)

That's why du would not show any usage.
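For anyone who wants to see this behaviour in isolation, the following Python sketch reproduces it on any POSIX system. It is not what nginx does internally, just a minimal illustration of an unlinked file that is still open; the function name and the file name are made up.

```python
import os
import tempfile


def demo_deleted_but_open(size=1024 * 1024):
    """Write `size` bytes to a file, unlink it while it is still open, and
    report what the directory shows versus what the open descriptor holds."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "0000000001")
    try:
        with open(path, "wb") as f:
            f.write(b"\0" * size)
            f.flush()
            os.unlink(path)  # the name is gone, the inode is not
            listing = os.listdir(workdir)  # what ls and du can see
            st = os.fstat(f.fileno())  # what the process still holds
            return {
                "listing": listing,
                "open_size": st.st_size,
                "links": st.st_nlink,
            }
        # leaving the with block closes the file; only then is the space freed
    finally:
        os.rmdir(workdir)
```

While the file is open, the directory listing is empty and the link count is 0, yet the descriptor still refers to the full amount of data, so the filesystem keeps counting it as used until the process closes the file.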

I'm declaring this a win, it has not used any space after the change:

root@proxy-03:/etc/nginx# lsof -n | grep var/lib/nginx
root@proxy-03:/etc/nginx# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           394M  516K  393M   1% /run
/dev/sda1        20G  7.5G   12G  40% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sdb        9.8G   24M  9.3G   1% /srv/backup
/dev/sda15      124M   11M  114M   9% /boot/efi
tmpfs           394M     0  394M   0% /run/user/0
tmpfs           1.0G     0  1.0G   0% /var/lib/nginx
tmpfs           394M     0  394M   0% /run/user/25603

And I've rebuilt my whole lima-kilo setup twice (downloading big container images).
Will open a new task if other issues arise.

dcaro moved this task from In Progress to Done on the Toolforge (Toolforge iteration 09) board.