
Setup ORES service in beta cluster
Closed, Resolved · Public

Description

We are working to move the ores service to prod. It would be great to have the ores service in beta too. @hashar elaborated on how we should do this, but I think we could use more information.

Event Timeline

It's now live at https://ores-beta.wmflabs.org/

  • The web node is: deployment-ores-web.deployment-prep.eqiad.wmflabs
  • The worker is set up within SCA: deployment-sca01.deployment-prep.eqiad.wmflabs
  • The Redis node is: deployment-ores-redis.deployment-prep.eqiad.wmflabs

Here are my notes regarding this deployment.

SSH

I used keyholder to do the deployment, using user "deploy-service" in group "deploy-service", which is created by the scap::target class. The public key is in the private repo but accessible via the puppetmaster. Strangely, a public key for that user also lives in the puppet repo (its length is 399) and it overwrites my public key (the one keyholder uses, length 739). My guess is that the 399 key is actually for the "deploy-service" user in prod. So basically, before any kind of deployment you need to arm the keyholder and make sure it is working correctly:

  sudo SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh-add -l
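For reference, my rough pre-deploy ritual looked like this (a sketch: it assumes the keyholder helper script on the deployment host provides an "arm" subcommand, and it uses the beta web node as an example target):

  # Arm keyholder (prompts for the key passphrase), then verify as above.
  sudo keyholder arm
  sudo SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh-add -l

  # Make sure a target is actually reachable as deploy-service before deploying.
  SSH_AUTH_SOCK=/run/keyholder/proxy.sock \
    ssh deploy-service@deployment-ores-web.deployment-prep.eqiad.wmflabs hostname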

Puppet

A lot of puppet configuration is needed. I made this patch and manually cherry-picked it on the puppet master. I added an ores::scapdeploy class and a role with the same name (role::labs::ores::scapdeploy). Why not put it in ores::base directly? I did that initially but changed it for two reasons. The first is to keep compatibility with our setup in labs (the ores project). The second is more complicated: per T131392, "/usr/bin/deploy-local --repo ores/deploy -D log_json:False" on targets runs all checks regardless of group. When puppet applies scap::target, it tries to deploy the initial stage and mistakenly runs every check in every group, which means commands we designed for web nodes get run on worker nodes and vice versa, and the puppet run fails. That is tolerable on its own, but if the resource lived in ores::base, everything that depends on ores::base, such as ores::web, would fail without even trying to run.
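Roughly, the shape of the split looks like this (a hand-drawn sketch, not the actual patch; the deploy_user parameter name is an assumption):

  # Sketch: keep scap::target out of ores::base so a failed initial deploy
  # only breaks this class, not ores::web / ores::worker.
  class ores::scapdeploy {
      scap::target { 'ores/deploy':
          deploy_user => 'deploy-service',  # assumed parameter name
      }
  }

  class role::labs::ores::scapdeploy {
      include ::ores::scapdeploy
  }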

Then came the ownership issue, which took about a day of my time. In ores::base we define /srv/ores and give ownership to user "www-data" in a group of the same name, with mode 775, so "deploy-service" cannot add anything to it. I tried defining subdirectories and granting permission on those, but scap also creates directories such as /srv/ores/deploy.dddd (dddd being a date). Then I tried parameterizing ores::base and overriding the parameter in hiera (setting ores::base::user to 'deploy-service'), but that causes a dependency cycle:

  Error: Could not apply complete catalog: Found 1 dependency cycle: (File[/srv/ores] => Class[Ores::Base] => Class[Ores::Scapdeploy] => Scap::Target[ores/deploy] => Group[deploy-service] => File[/srv/ores])

So I moved the ownership part into scapdeploy and used the proper user there. To keep compatibility with the labs setup we need to use hiera and set 'www-data' there, and we still need to figure out a way to create the /srv/ores directory for the labs setup.
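Concretely, the labs (ores project) side of that would be a one-line hiera override along these lines (a sketch; the exact key spelling depends on how the parameter lands in the patch):

  # hieradata for the labs "ores" project hosts (beta keeps deploy-service)
  ores::base::user: 'www-data'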

Anyway: We need to merge this

Redis

Our redis setup is in good shape in puppet, and a simple change in hiera can configure it properly. But our python setup is hard-coded in several places, so I had to make a branch called "prod" and put my settings there. See this PR for the changes needed.
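The hard-coded bits are things like the Redis host and port in the ores service config; in the prod branch they end up looking roughly like this (the key names here are illustrative, not copied from the repo):

  score_caches:
    ores_redis:
      host: deployment-ores-redis.deployment-prep.eqiad.wmflabs
      port: 6379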

ores/config v. ores/deploy

Our setup in labs uses /srv/ores/config, but as a standard we are switching to /srv/ores/deploy. These paths are scattered all over puppet and ores-config; see my PR and my patch.

Checks timeout

In order to restart services or set up our virtualenv, we need to run scap checks, but unfortunately scap is built so that if a check takes more than 30 seconds it kills the check, and there is no straightforward way to raise that limit. The release engineering team is working on it (T131391, letting scap checks have configurable timeouts), but it might take a while for that patch to get merged, packaged and released. Two of our checks need it:

1. setup_virtualenv: building the virtualenv, uninstalling the dependencies and reinstalling them takes a really long time. For me it takes about 20 seconds when I run it directly on a target, and I can see why it takes more than 30 seconds when we run it from tin.

2. restart_web: our uwsgi is big, and restarting it (lots of workers per node, 28 I guess) takes a really long time. On every puppet run it stops uwsgi, but the check gives up before the service can be brought back online. My guess is that this bug is why our web nodes go offline so quickly :(

Once the patch is merged we can simply use the "timeout" option in checks.yaml, as sketched below.
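Something like this is what I have in mind for checks.yaml once the option exists (a sketch based on how I remember the scap3 checks format; the command and the 120-second value are placeholders, not our actual check definition):

  checks:
    restart_web:
      type: command
      stage: promote
      command: sudo service uwsgi-ores restart   # placeholder command
      timeout: 120                               # needs T131391; current hard limit is 30s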

One quick thought re. uwsgi restarts -- this is a problem we have been struggling with for a while. I just added T131572 so that we'll actually investigate this.

Edit: also note that this slow restart is a problem with Wikilabels too, and Wikilabels doesn't have nearly as high a process count or memory usage.

Turns out that T118495 is a much older task that we created for this slow uwsgi issue. I just did some more investigation into where the slowness happens. It looks like the service stop command is where most of the slowness is and the restart takes the same amount of time for Wikilabels as for ORES -- exactly 1 minute and 30 seconds.

Change 280403 had a related patch set uploaded (by Ladsgroup):
ores: Scap3 deployment configurations

https://gerrit.wikimedia.org/r/280403

Everything is okay now. We only need these patches merged before we can go to prod.