
Unable to open some applications in administrator interface
Closed, Resolved | Public

Description

When navigating to applications in the administrator interface, the pages take a very long time to load, and some return an Internal Server Error.

https://wikipedialibrary.wmflabs.org/admin/applications/application/83/ loaded, but very slowly.
https://wikipedialibrary.wmflabs.org/admin/applications/application/24/ returned an internal server error the first time, but loaded after a refresh.
https://wikipedialibrary.wmflabs.org/admin/applications/application/435/ returned an internal server error the first time, but loaded after a refresh.

Event Timeline

So, I've been doing some testing on this. When accessing these items, CPU usage for gunicorn spikes and the worker times out.
I adjusted the worker count for green unicorn to reduce the likelihood of a worker timeout, but I'm beginning to think that the system is undersized for everything it is running: an app server, a db server, and a web server.
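
For anyone following along, the kind of adjustment described above might look roughly like the sketch below. This is not the project's actual configuration: the file name, the timeout value, and the helper `suggested_workers` are assumptions for illustration. The `(2 x cores) + 1` figure is the starting point suggested in the gunicorn documentation, and `workers` and `timeout` are standard gunicorn settings.

```python
# gunicorn.conf.py (hypothetical file name and values, not the deployed config)
import multiprocessing


def suggested_workers(cores):
    """Return gunicorn's documented starting point: (2 x cores) + 1."""
    return 2 * cores + 1


# Number of worker processes. More workers lowers the chance that one slow
# admin page occupies the only free worker, at the cost of more memory and CPU.
workers = suggested_workers(multiprocessing.cpu_count())

# Seconds a worker may stay silent before the master kills and restarts it.
# Gunicorn's default is 30; 120 is an assumed value chosen for illustration.
timeout = 120
```

On a box that also hosts the database and the web server, a higher worker count competes for the same cores, so raising `timeout` only trades errors for slow responses rather than fixing the underlying load.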

I've been avoiding mucking with the server configuration too much at this point, since there's still so much dev work to do, but these kinds of things are only going to get worse until we do some shuffling around and redeployment.

Basically, I think we should:
  - split out the database to a separate system
  - drop the green unicorn application service in favor of running the app via uwsgi directly on nginx
  - deploy a fresh production machine, ideally with an additional core

There's nothing I can do to make it go faster until we change the model of where the computation is happening and when. The best I can do with it for now is change some values to allow it to take its time instead of throwing errors.

As a side note, I'm not able to reproduce this in my local development environment, which is similarly specced, but I think that's just because there's no live user load on that system.

I poked around the various OpenStack interfaces that we have available to us, hoping to find an option to resize our current VMs. Unfortunately, that's not an option, so minting new servers is the only way forward there.

> Basically, I think we should:
> - split out the database to a separate system
> - drop the green unicorn application service in favor of running the app via uwsgi directly on nginx
> - deploy a fresh production machine, ideally with an additional core

How long do you estimate this would take?

Splitting out the db should be straightforward and should take a day or so. Minimal effort for very modest gains.

Rejiggering the way that we run the app itself to use uwsgi will have implications for the way code updates get delivered, so it will need some research first. That one might be a week on its own. I think I might want to kick that can down the road.

I've been doing research on deploying a fresh production machine, as I'd like to be able to use the puppet module that I've written. I'm currently testing several deployment scenarios to try to use it as a locally installed puppet module. If I can get this working as I'd like, I'll just provision a new stack-in-a-box like we currently have, but with enough horsepower to do it all.

One way or the other, I'll get us back to good performance by the end of the upcoming weekend.

Okay, we've got a brand-new production instance provisioned from my puppet module. There was a brief outage during the cutover, but the new system has many more resources. There's still tuning that needs to happen there, but I now have much greater confidence in our ability to recover from major problems.

So, for my own record, the steps were:

  1. new instance on horizon, with appropriate host-specific hiera config to allow access to the project share
  2. dump db on current site
  3. do a local puppet run:
       puppet module install jsnshrmn/twlight --version x.x.x
       puppet apply some-manifest.pp
  4. delete the proxy pointing to the old site
  5. create a proxy pointing to the new site
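
The local puppet run in the steps above could be wrapped in a small helper so it can be reviewed before it touches a fresh instance. This is only a sketch: the function names `puppet_commands` and `run` are hypothetical, the module version and manifest name are the same placeholders used in the notes, and the other steps (horizon, db dump, proxies) are left out because the thread does not give commands for them.

```python
import shlex
import subprocess


def puppet_commands(module_version, manifest):
    """Build the two puppet invocations from the notes as argv lists."""
    return [
        ["puppet", "module", "install", "jsnshrmn/twlight",
         "--version", module_version],
        ["puppet", "apply", manifest],
    ]


def run(commands, dry_run=True):
    """Return each command as a shell-style string; execute only if dry_run is False."""
    lines = []
    for argv in commands:
        lines.append(shlex.join(argv))
        if not dry_run:
            # check=True stops at the first failing command, so a failed
            # module install is never followed by an apply.
            subprocess.run(argv, check=True)
    return lines


if __name__ == "__main__":
    # "x.x.x" and "some-manifest.pp" are placeholders carried over from the notes.
    for line in run(puppet_commands("x.x.x", "some-manifest.pp")):
        print(line)
```

Defaulting to a dry run means the script only prints what it would do until `dry_run=False` is passed explicitly.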

Looking at the load on the new system, I believe we are good to go for the foreseeable future. I captured my deployment notes and added them to the puppet module.

We talked about it a while ago, but I just got around to splitting out the two suggestions that weren't actioned here.