We are working to move the ORES service to production, and it would be great to have the ORES service in beta as well. @hashar elaborated on how we should do this, but I think we could use more information.
Description
Status | Assigned | Task
---|---|---
Resolved | None | T130369 [Epic] Structured deployment of ORES
Resolved | Ladsgroup | T130404 Setup ORES service in beta cluster
Resolved | Ladsgroup | T118495 uwsgi takes a long time to restart (Debian Jessie in labs)
Event Timeline
It's now live at https://ores-beta.wmflabs.org/
- The web node is: deployment-ores-web.deployment-prep.eqiad.wmflabs
- The worker is set up within SCA: deployment-sca01.deployment-prep.eqiad.wmflabs
- The redis is: deployment-ores-redis.deployment-prep.eqiad.wmflabs
Here are my notes regarding this deployment.
SSH
I used Keyholder to do the deployment, with the user "deploy-service" in the group "deploy-service", which is created by the scap::target class. The public key is in the private repo but accessible via the puppetmaster. Strangely, a public key for that user is also in the puppet repo (the length of that public key is 399), and it overwrites my public key (the one the keyholder uses, length: 739). My guess is that the 399 key is actually for the user "deploy-service" in prod. So before any kind of deployment you need to arm the keyholder and make sure it's working correctly:
sudo SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh-add -l
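A pre-deploy guard along these lines can make that check scriptable. This is only a sketch: the function names and the LIST_KEYS_CMD override are hypothetical, introduced here so the check can be exercised without sudo or a real agent, and the socket path is the one quoted above. It relies on ssh-add -l exiting non-zero when the agent holds no identities or cannot be reached.

```shell
#!/usr/bin/env bash
# Socket path taken from the command quoted above.
KEYHOLDER_SOCK="${KEYHOLDER_SOCK:-/run/keyholder/proxy.sock}"

# List the keys held by the keyholder proxy.
# LIST_KEYS_CMD is a hypothetical override for dry runs and tests.
list_keys() {
    if [ -n "${LIST_KEYS_CMD:-}" ]; then
        $LIST_KEYS_CMD
    else
        sudo SSH_AUTH_SOCK="$KEYHOLDER_SOCK" ssh-add -l
    fi
}

# Return 0 if at least one key is listed, 1 otherwise.
keyholder_armed() {
    local out
    if out=$(list_keys 2>&1); then
        printf 'keyholder armed:\n%s\n' "$out"
        return 0
    fi
    printf 'keyholder not usable: %s\n' "$out" >&2
    return 1
}
```

Putting keyholder_armed || exit 1 at the top of a deploy wrapper would stop early when the keyholder is not armed.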
Puppet
Lots and lots of puppet settings need to be done. I made this patch and manually cherry-picked it on the puppet master. I added an ores::scapdeploy class and a role with the same name (role::labs::ores::scapdeploy). Why not in ores::base directly? I did it that way initially, but I changed it for two reasons. The first is to keep compatibility with our setup in labs (the ores project). The second reason is more complicated: T131392: "/usr/bin/deploy-local --repo ores/deploy -D log_json:False" in targets run all checks regardless of group. When you add scap::target, puppet tries to deploy the initial stage and mistakenly runs all checks in all groups, which means commands we designed for web nodes are run on worker nodes and vice versa, and the puppet run fails. That's okay, but if we used it in ores::base, everything that depends on ores::base, such as ores::web, would fail without even trying to run.
But then came the ownership issue, which took about a day of my time. In ores::base we define /srv/ores and give ownership to the user "www-data", in a group with the same name, with mode 775, so "deploy-service" cannot add anything to it. So I tried to define subdirectories and give the permission on those, but scap also makes directories such as /srv/ores/deploy.dddd (where dddd is a date). Then I tried using parameters in ores::base and changing them in hiera (I changed ores::base::user to 'deploy-service'), but that causes a cyclic dependency:
Error: Could not apply complete catalog: Found 1 dependency cycle: (File[/srv/ores] => Class[Ores::Base] => Class[Ores::Scapdeploy] => Scap::Target[ores/deploy] => Group[deploy-service] => File[/srv/ores])
So I moved the ownership part to scapdeploy and used the proper user there. In order to keep compatibility with the labs setup, we need to use hiera and 'www-data' there. We also need to figure out a way to create the /srv/ores directory for the labs setup.
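The cycle in that error can be reproduced outside Puppet as a plain ordering problem. As a rough illustration (not how Puppet itself validates a catalog), the sketch below feeds the five edges from the message to coreutils tsort, which refuses to order a graph that contains a loop. Both function names are made up for this example.

```shell
#!/usr/bin/env bash
# Edges copied from the Puppet error above, one "before after" pair per line.
ores_edges() {
    cat <<'EOF'
File[/srv/ores] Class[Ores::Base]
Class[Ores::Base] Class[Ores::Scapdeploy]
Class[Ores::Scapdeploy] Scap::Target[ores/deploy]
Scap::Target[ores/deploy] Group[deploy-service]
Group[deploy-service] File[/srv/ores]
EOF
}

# Succeeds only if the pairs on stdin can be ordered, i.e. contain no loop.
deps_are_acyclic() {
    tsort >/dev/null 2>&1
}
```

Here ores_edges | deps_are_acyclic fails, while ores_edges | sed '1d' | deps_are_acyclic succeeds: removing any single edge breaks the loop, which is presumably why taking the /srv/ores ownership out of ores::base made the catalog apply again.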
Anyway: We need to merge this
Redis
Our redis setup is good in puppet, and a simple change in hiera can configure it properly. But our Python setup is hard-coded in several places, so I had to make a branch called prod and put my settings there. See this PR for the changes needed.
ores/config v. ores/deploy
Our setup in labs uses /srv/ores/config, but as a standard we are switching to /srv/ores/deploy. The problem is that these settings are scattered all over puppet and ores-config. See my PR and my patch.
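Since the old path is scattered across several repositories, a recursive search is one way to find what is left to migrate. A minimal sketch: the function name is invented here, and the directory to search is whichever checkout gets passed in.

```shell
#!/usr/bin/env bash
# List every file and line under a checkout that still mentions the old path.
# Returns non-zero when nothing is found (grep's usual convention).
find_old_ores_paths() {
    local root="${1:-.}"
    grep -rnF -- '/srv/ores/config' "$root"
}
```

Running it once against the puppet checkout and once against ores-config gives a list of lines to move over to /srv/ores/deploy.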
Checks timeout
In order to restart services or set up our virtualenv, we need to run scap checks. Unfortunately, scap is built in such a way that if a check takes more than 30 seconds, it stops the check, and there is no straightforward way to increase that time. The release engineering team is working on it (T131391: Let scap checks have their timeouts ability to change), but it might take a while to get that patch merged, packaged and released. Two of our checks need that:
- setup_virtualenv: Building the virtualenv, uninstalling the deps and reinstalling them takes a really long time. When I run it on a target it takes about 20 seconds, but I can imagine why it takes more than 30 seconds when we run it from tin.
- restart_web: Our uwsgi is big (lots of workers per node, 28 I guess) and restarting it takes a really long time. Every puppet run stops uwsgi, but the check is cut off before the service can be brought back online. So my guess is that, due to this bug, our web nodes go offline very soon :(
Once the patch is merged we can simply use the "timeout" option in checks.yaml.
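Until that option exists, it can help to know which checks would actually hit the limit. The sketch below is not scap's implementation; it only mimics a hard time limit with coreutils timeout, which exits with status 124 when it has to kill the command. run_check is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Run a command under a time limit given in seconds, in the spirit of the
# 30 second cutoff described above. Returns the command's own status, or 124
# if coreutils timeout had to kill it.
run_check() {
    local limit="$1"
    shift
    local rc=0
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "check exceeded ${limit}s and was stopped" >&2
    fi
    return "$rc"
}
```

For example, run_check 30 followed by the command behind setup_virtualenv shows whether that step fits in the current limit on a given host.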
One quick thought re. uwsgi restarts -- this is a problem we have been struggling with for a while. I just added T131572 so that we'll actually investigate this.
Edit: Also note that this slow restart is a problem with Wikilabels too and wikilabels doesn't have nearly as high a process count or memory usage.
Turns out that T118495 is a much older task that we created for this slow uwsgi issue. I just did some more investigation into where the slowness happens. It looks like the service stop command is where most of the slowness is and the restart takes the same amount of time for Wikilabels as for ORES -- exactly 1 minute and 30 seconds.
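To compare the stop and start phases on other hosts, a small timing wrapper is enough. This is a sketch using bash's built-in SECONDS counter, so it only has whole-second resolution; elapsed is a hypothetical helper, and the service name in the note below is an assumption.

```shell
#!/usr/bin/env bash
# Print how many whole seconds a command took, ignoring its output and status.
elapsed() {
    local start=$SECONDS
    "$@" >/dev/null 2>&1 || true
    echo $(( SECONDS - start ))
}
```

Something like elapsed sudo service uwsgi-ores stop followed by elapsed sudo service uwsgi-ores start would show how the 1 minute and 30 seconds splits between the two phases, assuming the service is called uwsgi-ores.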
Change 280403 had a related patch set uploaded (by Ladsgroup):
ores: Scap3 deployment configurations