
OpenStack browser homepage gives 502 error
Closed, ResolvedPublicBUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • visit https://openstack-browser.toolforge.org/ (the tool's landing page)

What happens?:

  • long load time
  • eventually, 502 Bad Gateway

[Attached screenshot: image.png, 253×950 px, 8 KB]

What should have happened instead?:

  • load a homepage explaining what OpenStack Browser is and does, and perhaps listing some other things to click on

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

Details

Title                                                | Reference                           | Author | Source Branch            | Dest Branch
Quick hack to work around /etc/novaobserver.yaml bug | toolforge-repos/openstack-browser!8 | bd808  | work/bd808/bad-yaml-hack | master
Procfile: increase gunicorn timeout to 5 minutes     | toolforge-repos/openstack-browser!7 | bd808  | work/bd808/timeout       | master

Event Timeline

Saw this in passing and did a bare minimum webservice restart. That seems to have improved some pages, but the reported bug is still occurring. webservice logs doesn't seem overly helpful, but the relevant (?) lines are below:

2023-05-22T22:14:58+00:00 [openstack-browser-6956c6b998-crwr9] [2023-05-22 22:14:58 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:20)
2023-05-22T22:14:58+00:00 [openstack-browser-6956c6b998-crwr9] [2023-05-22 22:14:58 +0000] [20] [INFO] Worker exiting (pid: 20)
2023-05-22T22:14:59+00:00 [openstack-browser-6956c6b998-crwr9] [2023-05-22 22:14:59 +0000] [22] [INFO] Booting worker with pid: 22
2023-05-22T22:15:25+00:00 [openstack-browser-6956c6b998-crwr9] [2023-05-22 22:15:25 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:19)
2023-05-22T22:15:25+00:00 [openstack-browser-6956c6b998-crwr9] [2023-05-22 22:15:25 +0000] [19] [INFO] Worker exiting (pid: 19)
2023-05-22T22:15:26+00:00 [openstack-browser-6956c6b998-crwr9] [2023-05-22 22:15:26 +0000] [23] [INFO] Booting worker with pid: 23
2023-05-22T22:15:45+00:00 [openstack-browser-6956c6b998-crwr9] [2023-05-22 22:15:45 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:18)
2023-05-22T22:15:45+00:00 [openstack-browser-6956c6b998-crwr9] [2023-05-22 22:15:45 +0000] [18] [INFO] Worker exiting (pid: 18)
2023-05-22T22:15:45+00:00 [openstack-browser-6956c6b998-crwr9] [2023-05-22 22:15:45 +0000] [24] [INFO] Booting worker with pid: 24
2023-05-22T22:17:47+00:00 [openstack-browser-6956c6b998-crwr9] 192.168.247.64 - - [22/May/2023:22:17:47 +0000] "GET /project/ HTTP/1.1" 200 21927 "https://openstack-browser.toolforge.org/" "Mozilla/5.0 (Macintosh; Intel Mac OS X [...]"

https://openstack-browser.toolforge.org/project/ will almost always return quickly.

The 502 for the landing page is from the tool's own ingress. The general problem is that the set of OpenStack API calls I dreamed up to compute how many instances are alive and how many resources they consume is just too slow these days. Those calls were never particularly fast, but things used to work because a daily cron job would hit the page with parameters telling it to refill the cache. That job still exists, but because it is implemented as a curl call to the app, I think it now times out more often than it succeeds.
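The refill job itself isn't attached to this task, but since it is described as a plain curl call against the app, a minimal sketch of such a daily cron entry could look like the line below. The schedule, the --max-time value, and discarding the output are assumptions for illustration; only the ?purge parameter is confirmed, by the timing test further down.

# hypothetical daily cache-refill entry; give curl plenty of time so a slow refill can finish
0 4 * * * curl -sS --max-time 600 'https://openstack-browser.toolforge.org/?purge' > /dev/null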

The simple "fix" is probably to remove the expensive aggregate stats from the main page. The more involved fix would be to look for ways to either make fewer API calls or to speed up the API calls that are already in use.

Mentioned in SAL (#wikimedia-cloud) [2023-06-23T22:28:19Z] <wm-bot> <bd808> Updated to 2aecd1b2 (T337265)

This restart has made everything worse due to an unrelated bug in a shared config file. https://gerrit.wikimedia.org/r/c/operations/puppet/+/932516 should fix that problem once I can get a root to merge it; then I can restart the app again after the file is fixed across the Kubernetes worker fleet.
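(For illustration only: the real fixes are the linked Gerrit change and the "Quick hack to work around /etc/novaobserver.yaml bug" merge request above. Assuming the bug is malformed YAML in that shared file, as the work/bd808/bad-yaml-hack branch name suggests, a one-line parse check like the following would flag any worker where the file is still broken.)

# quick sanity check on a worker; fails with a traceback if the shared file is not valid YAML
python3 -c 'import yaml; yaml.safe_load(open("/etc/novaobserver.yaml")); print("novaobserver.yaml parses OK")'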

$ time curl -sS 'https://openstack-browser.toolforge.org/?purge'
[...snip...]
real    3m9.844s
user    0m0.028s
sys     0m0.017s

So yeah, a 60s timeout was probably just not enough time to do the work.
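The "Procfile: increase gunicorn timeout to 5 minutes" merge request above is the change being referred to. A sketch of what that Procfile line looks like is below; the app:app module path is a placeholder, not necessarily the tool's actual WSGI entry point.

web: gunicorn --timeout 300 app:app

With --timeout 300 (5 minutes), a worker serving the roughly 3m10s ?purge request is no longer killed mid-request.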

bd808 claimed this task.

Giving gunicorn a long timeout seems to have given the cache filling job enough time to run. Let's call this "fixed" for now.