
OpenStack browser homepage gives 502 error
Closed, ResolvedPublicBUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • visit https://openstack-browser.toolforge.org/ (the tool's landing page)

What happens?:

  • long load time
  • eventually, 502 Bad Gateway

[Attached screenshot: image.png, 253×950 px, 8 KB]

What should have happened instead?:

  • load a homepage explaining what OpenStack Browser is and does, and perhaps listing some other things to click on

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

Details

Title                                                | Reference                           | Author | Source Branch            | Dest Branch
Quick hack to work around /etc/novaobserver.yaml bug | toolforge-repos/openstack-browser!8 | bd808  | work/bd808/bad-yaml-hack | master
Procfile: increase gunicorn timeout to 5 minutes     | toolforge-repos/openstack-browser!7 | bd808  | work/bd808/timeout       | master

Event Timeline

Saw this in passing and did a bare minimum webservice restart. That seems to have improved some pages, but the reported bug is still occurring. webservice logs doesn't seem overly helpful, but the relevant (?) lines are below:

2023-05-22T22:14:58+00:00 [openstack-browser-6956c6b998-crwr9] [2023-05-22 22:14:58 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:20)
2023-05-22T22:14:58+00:00 [openstack-browser-6956c6b998-crwr9] [2023-05-22 22:14:58 +0000] [20] [INFO] Worker exiting (pid: 20)
2023-05-22T22:14:59+00:00 [openstack-browser-6956c6b998-crwr9] [2023-05-22 22:14:59 +0000] [22] [INFO] Booting worker with pid: 22
2023-05-22T22:15:25+00:00 [openstack-browser-6956c6b998-crwr9] [2023-05-22 22:15:25 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:19)
2023-05-22T22:15:25+00:00 [openstack-browser-6956c6b998-crwr9] [2023-05-22 22:15:25 +0000] [19] [INFO] Worker exiting (pid: 19)
2023-05-22T22:15:26+00:00 [openstack-browser-6956c6b998-crwr9] [2023-05-22 22:15:26 +0000] [23] [INFO] Booting worker with pid: 23
2023-05-22T22:15:45+00:00 [openstack-browser-6956c6b998-crwr9] [2023-05-22 22:15:45 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:18)
2023-05-22T22:15:45+00:00 [openstack-browser-6956c6b998-crwr9] [2023-05-22 22:15:45 +0000] [18] [INFO] Worker exiting (pid: 18)
2023-05-22T22:15:45+00:00 [openstack-browser-6956c6b998-crwr9] [2023-05-22 22:15:45 +0000] [24] [INFO] Booting worker with pid: 24
2023-05-22T22:17:47+00:00 [openstack-browser-6956c6b998-crwr9] 192.168.247.64 - - [22/May/2023:22:17:47 +0000] "GET /project/ HTTP/1.1" 200 21927 "https://openstack-browser.toolforge.org/" "Mozilla/5.0 (Macintosh; Intel Mac OS X [...]"

https://openstack-browser.toolforge.org/project/ will almost always return quickly.

The 502 for the landing page is from the tool's own ingress. The general problem is that the set of OpenStack API calls I dreamed up to compute how many instances are alive and how many resources they consume is just too slow these days. Those calls were never particularly fast, but things used to work because a daily cron job would hit the page with parameters telling it to refill the cache. That job still exists, but because it is implemented as a curl call to the app, I think it now times out more often than it succeeds.
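The refill job itself isn't attached to this task, but since it is described as a plain curl call against the app, a minimal sketch of such a daily cron entry could look like the line below. The schedule, the --max-time value, and discarding the output are assumptions for illustration; only the ?purge parameter is confirmed, by the timing test further down.

# hypothetical daily cache-refill entry; give curl plenty of time so a slow refill can finish
0 4 * * * curl -sS --max-time 600 'https://openstack-browser.toolforge.org/?purge' > /dev/null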

The simple "fix" is probably to remove the expensive aggregate stats from the main page. The more involved fix would be to look for ways to either make fewer API calls or to speed up the API calls that are already in use.

Mentioned in SAL (#wikimedia-cloud) [2023-06-23T22:28:19Z] <wm-bot> <bd808> Updated to 2aecd1b2 (T337265)

This restart has made everything worse due to an unrelated bug in a shared config file. https://gerrit.wikimedia.org/r/c/operations/puppet/+/932516 should fix that problem once I can get a root to merge it; then I can restart the app again after the file is fixed across the Kubernetes worker fleet.
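(For illustration only: the real fixes are the linked Gerrit change and the "Quick hack to work around /etc/novaobserver.yaml bug" merge request above. Assuming the bug is malformed YAML in that shared file, as the work/bd808/bad-yaml-hack branch name suggests, a one-line parse check like the following would flag any worker where the file is still broken.)

# quick sanity check on a worker; fails with a traceback if the shared file is not valid YAML
python3 -c 'import yaml; yaml.safe_load(open("/etc/novaobserver.yaml")); print("novaobserver.yaml parses OK")'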

$ time curl -sS 'https://openstack-browser.toolforge.org/?purge'
[...snip...]
real    3m9.844s
user    0m0.028s
sys     0m0.017s

So yeah, a 60s timeout was probably just not enough time to do the work.
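The "Procfile: increase gunicorn timeout to 5 minutes" merge request above is the change being referred to. A sketch of what that Procfile line looks like is below; the app:app module path is a placeholder, not necessarily the tool's actual WSGI entry point.

web: gunicorn --timeout 300 app:app

With --timeout 300 (5 minutes), a worker serving the roughly 3m10s ?purge request is no longer killed mid-request.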

bd808 claimed this task.

Giving gunicorn a long timeout seems to have given the cache filling job enough time to run. Let's call this "fixed" for now.