Cache ORES virtualenv within versioned source
Closed, Resolved (Public)

Description

This lets us roll back much faster, since we won't need to rebuild the python environment.

Initial deployment needs to be done carefully, see notes in the operations/puppet patch. We need the puppet patch to be safe for both the new and old service scripts, and new and old path hierarchy.
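The rollback benefit can be sketched roughly as follows. This is only an illustration under assumptions: the revs/ directories, the current symlink, and the VERSION marker file are hypothetical names, and the real layout is whatever scap and the puppet patch define. The point is that when each deployed revision carries its own virtualenv, rolling back amounts to pointing at an older directory rather than rebuilding anything.

```shell
#!/bin/bash
# Hypothetical layout under a temp dir; names are illustrative only.
root=$(mktemp -d)

# Each deployed revision keeps its own virtualenv next to its source.
for rev in rev-old rev-new; do
    mkdir -p "$root/revs/$rev/venv/bin"
    echo "$rev" > "$root/revs/$rev/venv/VERSION"
done

# Deploy: point "current" at the new revision.
ln -sfn "$root/revs/rev-new" "$root/current"
deployed=$(cat "$root/current/venv/VERSION")

# Roll back: repoint the symlink; the old virtualenv is still in place.
ln -sfn "$root/revs/rev-old" "$root/current"
rolled_back=$(cat "$root/current/venv/VERSION")

echo "deployed=$deployed rolled_back=$rolled_back"
rm -rf "$root"
```

With a single shared virtualenv outside the revision directories, the same rollback would require reinstalling the older dependencies first.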

Current migration plan:

Event Timeline

Do not merge patches, currently blocked on scap 3.8 deployment to production.

awight renamed this task from Cache ORES virtualenv within versioned source to [Blocked] Cache ORES virtualenv within versioned source. Feb 26 2018, 11:34 PM
awight changed the task status from Open to Stalled.

@awight: Scap 3.7.7 is deployed now, are you sure you need to wait for 3.8? Which change, specifically, are you waiting for?

Changelog for 3.7.7:

  • Added git-lfs support
  • Added script check type and environment variables
  • Added quick environment variable to disable scap plugins
  • Removed scap l10n-update and scap hhvm-graceful
  • Updated documentation for checks

Cool! Yes it was 3.7.7 I was waiting for, with the script check type.

awight renamed this task from [Blocked] Cache ORES virtualenv within versioned source to Cache ORES virtualenv within versioned source. Mar 21 2018, 9:10 PM
awight changed the task status from Stalled to Open.
awight removed a project: Patch-For-Review.

I probably _should_ have called it 3.8 ;)

Change 392682 merged by Ladsgroup:
[mediawiki/services/ores/deploy@master] Build venv into deployed source dir

https://gerrit.wikimedia.org/r/392682

This has been reverted in https://gerrit.wikimedia.org/r/#/c/421316/, since I can't deploy today and it's a breaking change that requires special deployment steps.

Change 423740 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] Build venv into deployed source dir (take 2)

https://gerrit.wikimedia.org/r/423740

Change 423740 merged by Awight:
[mediawiki/services/ores/deploy@master] Build venv into deployed source dir (take 2)

https://gerrit.wikimedia.org/r/423740

Mentioned in SAL (#wikimedia-operations) [2018-04-03T17:25:22Z] <awight@tin> Started deploy [ores/deploy@7701cee]: ORES versioned virtualenv, T181071

Mentioned in SAL (#wikimedia-operations) [2018-04-03T17:25:46Z] <awight@tin> Finished deploy [ores/deploy@7701cee]: ORES versioned virtualenv, T181071 (duration: 00m 27s)

Mentioned in SAL (#wikimedia-operations) [2018-04-03T17:27:28Z] <awight@tin> Started deploy [ores/deploy@7701cee]: ORES versioned virtualenv, T181071

Mentioned in SAL (#wikimedia-operations) [2018-04-03T17:28:55Z] <awight@tin> Finished deploy [ores/deploy@7701cee]: ORES versioned virtualenv, T181071 (duration: 01m 27s)

I discovered in beta cluster testing that we'll have to take the ORES service down to do the migration. The old virtualenv is not compatible with code in a new directory. Will change the task description to reflect the new plan.

As mentioned in my previous comment, the safe migration failed in practice for an arcane reason. I'm trying to decide whether to add complexity and support the redundant virtualenvs during a safe transition phase, or roll out as a breaking change, coordinating the puppet and ORES pieces. @akosiaris If that sounds decent, I'd like to try deploying as a breaking change during 20:00-21:00 UTC today, or another time as you suggest.

Change 425114 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] Build venv into deployed source dir (take 3)

https://gerrit.wikimedia.org/r/425114

Change 425115 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] [DNM] Remove transitional virtualenv copy

https://gerrit.wikimedia.org/r/425115

Change 425114 merged by Awight:
[mediawiki/services/ores/deploy@master] Build venv into deployed source dir (take 3)

https://gerrit.wikimedia.org/r/425114

Mentioned in SAL (#wikimedia-operations) [2018-04-09T20:12:22Z] <awight@tin> Started deploy [ores/deploy@b61c338]: Transitional virtualenv for ORES, T181071

Mentioned in SAL (#wikimedia-operations) [2018-04-09T20:12:39Z] <awight@tin> Finished deploy [ores/deploy@b61c338]: Transitional virtualenv for ORES, T181071 (duration: 00m 19s)

Mentioned in SAL (#wikimedia-operations) [2018-04-09T20:14:38Z] <awight@tin> Started deploy [ores/deploy@be69c1d]: Transitional virtualenv for ORES, T181071

Mentioned in SAL (#wikimedia-operations) [2018-04-09T20:38:52Z] <awight@tin> Finished deploy [ores/deploy@be69c1d]: Transitional virtualenv for ORES, T181071 (duration: 24m 14s)

Update: I deployed as a safe migration, and the new virtualenvs are not populated correctly. The service is fine, I'm investigating and will probably try a fix tomorrow.

I can't tell whether the fetch check script is failing, since the output doesn't show up in scap deploy-log. The virtualenv in the old location is correctly populated. Running the virtualenv command manually on ores1002 as my normal user fails in a surprising way, with a network timeout to pypi.python.org. Virtualenv initialization isn't supposed to make network connections. dpkg -L python3-virtualenv on an ORES server shows that the virtualenv package lacks setuptools, so it makes sense that a download is attempted.

How does this work in the normal case? Has this been the behavior all along?

Do we have to install the python3-setuptools package? Release-Engineering-Team help is invited, I don't have the privileges to play with the deploy-service user, for example.

I ran a test on deployment-ores01.deployment-prep.eqiad.wmflabs,

rm -rf ~/.cache/pip
virtualenv -v -p python3 --system-site-packages test-venv
....
Installing setuptools, pkg_resources, pip, wheel...
  Collecting setuptools
    Downloading setuptools-39.0.1-py2.py3-none-any.whl (569kB)
....

There's no .cache/pip in /var/lib/deploy-service however, so I'm not sure where to check to understand what normally happens when we install the virtualenv.

I can't tell whether the fetch check script is failing, since the output doesn't show up in scap deploy-log.

@awight: Did you try scap deploy-log -v?

There's no .cache/pip in /var/lib/deploy-service however, so I'm not sure where to check to understand what normally happens when we install the virtualenv.

We probably need to get setuptools installed via puppet, but I'm not really sure about it.
I haven't ever tried to deploy a virtualenv. @thcipriani, do you have any advice? Also, maybe @Legoktm or @mobrovac have some experience with scap + virtualenv?

The weird part about this is just that we've been deploying virtualenv in this way for years now, but I can't explain why it ever worked. Looking forward to messing with this tomorrow :)

I can't tell whether the fetch check script is failing, since the output doesn't show up in scap deploy-log.

@awight: Did you try scap deploy-log -v?

I just took a look in scap deploy-log -v, unfortunately that still doesn't show script output. I'll try adding -x to my script's shebang.

@mmodell: We're running the fetch check with "bash -x" now, so it should have generated a ton of output. Are stdout and stderr being swallowed by scap? Reading checks.py, we poll stdout but only print it if the job fails. Stderr is copied to stdout.
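For readers following along, here is a rough shell sketch of the behavior described above: output is captured, stderr is merged into stdout, and nothing is printed unless the command fails. The run_check helper is hypothetical and is not scap's code (scap's checks live in checks.py); it only mimics the observable effect.

```shell
#!/bin/bash
# Hypothetical helper, not scap's actual code: run a command, merge stderr
# into stdout, and print the captured output only when the command fails.
run_check() {
    local output status=0
    output=$("$@" 2>&1) || status=$?
    if [ "$status" -ne 0 ]; then
        printf '%s\n' "$output"
    fi
    return "$status"
}

# A passing check: its output is captured and then discarded.
shown_on_success=$(run_check bash -c 'echo step one; echo warning >&2')

# A failing check: stdout and stderr are both printed.
failure_status=0
shown_on_failure=$(run_check bash -c 'echo step one; echo error >&2; exit 3') || failure_status=$?

echo "on success: '${shown_on_success}'"
echo "on failure (status ${failure_status}): '${shown_on_failure}'"
```

Under this model, even verbose "bash -x" tracing is lost whenever the script as a whole exits 0.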

My issue is probably that a single failed command isn't causing the script to fail, and the successful exit code causes the output to be thrown away. I'll work around by returning a non-zero code from the script.
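A minimal sketch of the problem and one possible workaround, assuming the check script is plain bash. The two script bodies are hypothetical stand-ins, with "false" playing the role of a virtualenv command that fails. Without "set -e", the exit status is that of the last command, so an earlier failure is masked.

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Hypothetical check script: the first command fails, but the script's exit
# status is that of its last command, so the caller sees success.
cat > "$workdir/lenient.sh" <<'EOF'
#!/bin/bash
false
echo "finished"
EOF

# Same steps with "set -e": the script aborts at the first failing command
# and exits non-zero.
cat > "$workdir/strict.sh" <<'EOF'
#!/bin/bash
set -e
false
echo "finished"
EOF

lenient_status=0
bash "$workdir/lenient.sh" >/dev/null || lenient_status=$?
strict_status=0
bash "$workdir/strict.sh" >/dev/null || strict_status=$?

echo "lenient=$lenient_status strict=$strict_status"
rm -rf "$workdir"
```

Note that "set -e" has well-known exceptions (for example, commands inside an if condition or on the left of && and ||), so explicit error handling around the critical steps may still be worthwhile.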

There's a wealth of surprising results on tin, /srv/deployment/ores/deploy/scap/log/scap-sync-2018-04-10-0001-1-gd35a1e6.log

Some highlights are:

  • As suspected, virtualenv is timing out trying to download setuptools. Suggested solution is to install the python3-setuptools package via puppet. We probably want to install python3-pip and python3-wheel as well, since they're required by this bootstrap phase of virtualenv.
  • We fail to update python in the old virtualenv, because OSError: [Errno 26] Text file busy: '/srv/deployment/ores/venv/bin/python3'

It would be great if scap could log the check script output on success as well as failure, IMO. Otherwise, I could add a lot of error handling in the check script and fail fast whenever a subcommand fails.

It would be great if scap could log the check script output on success as well as failure, IMO. Otherwise, I could add a lot of error handling in the check script and fail fast whenever a subcommand fails.

Ok, I've added logging in {D1026}

mmodell added a revision: Restricted Differential Revision. Apr 10 2018, 7:11 PM

Change 425442 had a related patch set uploaded (by Awight; owner: Awight):
[operations/puppet@production] Include packages needed by virtualenv

https://gerrit.wikimedia.org/r/425442

Change 425442 abandoned by Awight:
Include packages needed by virtualenv

Reason:
Tried this fix on beta and it doesn't work.

https://gerrit.wikimedia.org/r/425442

Back to the drawing board. Including the packaged Python modules doesn't seem to help: virtualenv seems to expect wheels somewhere on the filesystem and won't build the wheels from system site-packages.

@awight: why not build the virtualenv on a developer machine and commit it to the repo instead of building it on tin?

@mmodell: That would be wonderful, but virtualenvs are specific not only to a machine, but tied to a specific filesystem path. There's a --relocatable flag, but nobody trusts that.
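One way to see the path dependence, sketched with Python's built-in venv module as a stand-in for the virtualenv tool used in this task (an assumption made so the example needs no network access or extra packages). The generated activate script records the absolute directory in which the environment was created. The demo-venv name is hypothetical.

```shell
#!/bin/bash
# Uses "python3 -m venv" as a stand-in for the virtualenv command.
embedded="skipped"
if command -v python3 >/dev/null 2>&1; then
    workdir=$(mktemp -d)
    # --without-pip avoids bootstrapping pip, so nothing is downloaded.
    if python3 -m venv --without-pip "$workdir/demo-venv" >/dev/null 2>&1; then
        # Look for the creation path inside the generated activate script.
        if grep -qF "$workdir/demo-venv" "$workdir/demo-venv/bin/activate"; then
            embedded="yes"
        else
            embedded="no"
        fi
    fi
    rm -rf "$workdir"
fi
echo "absolute path embedded in activate script: $embedded"
```

Scripts installed into an environment's bin directory typically carry a similar absolute path in their shebang line, which is why copying or committing a built environment to a different location tends to break it.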

So why even use virtualenv if they are based on site packages and have to be built on the target machine?

That's right, we do use the --system-site-packages flag, but all that does is reuse a few packages so that we don't have to bundle built-ins that we should be able to rely on.

In our case, virtualenv is mostly useful for isolating us from the Debian packaging lifecycle, separating privileges so we don't need root to deploy, and for isolating between application versions so we don't have to worry about incomplete cleanup of older packages.

Still finding strangeness... Reading the virtualenv source and packaging, it depends on python-pip-whl, which includes all the wheels we need. However, virtualenv doesn't search this path by default. Even when I add the path (virtualenv -v -p python3 --extra-search-dir=/usr/share/python-wheels test-venv/), I still see virtualenv try to download the same wheels... I'll debug virtualenv now :-(

This sounds surprising and strange. Please ping me on IRC to give me an overview today. You should be able to safely switch between virtualenv folders. That's the whole point!

Change 425580 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] Prevent virtualenv from being silly

https://gerrit.wikimedia.org/r/425580

Change 425580 merged by Awight:
[mediawiki/services/ores/deploy@master] Prevent virtualenv from being silly

https://gerrit.wikimedia.org/r/425580

Mentioned in SAL (#wikimedia-operations) [2018-04-11T20:02:09Z] <awight@tin> Started deploy [ores/deploy@b6deb5d]: Transitional virtualenv for ORES (take 2), T181071

Mentioned in SAL (#wikimedia-operations) [2018-04-11T20:20:42Z] <awight@tin> Finished deploy [ores/deploy@b6deb5d]: Transitional virtualenv for ORES (take 2), T181071 (duration: 18m 34s)

@akosiaris We're finally ready to deploy the puppet change, https://gerrit.wikimedia.org/r/#/c/392683/. Both ores services should be restarted on all machines once this new configuration is pushed.

There's a slight urgency, since we have to freeze all of our dependencies until we're migrated over to the new virtualenv path.

Once the puppet change is merged and services restarted, we should remove the /srv/deployment/ores/venv unversioned virtualenv from all ORES servers.

Change 392683 merged by Dzahn:
[operations/puppet@production] Update ORES venv path to use versioned cache

https://gerrit.wikimedia.org/r/392683

Mentioned in SAL (#wikimedia-operations) [2018-04-11T22:28:05Z] <mutante> ores - running puppet on all instances to apply venv path change for T181071

Mentioned in SAL (#wikimedia-operations) [2018-04-11T22:41:40Z] <mutante> ores1002-1009 - deleting old venv dir - rm -f /srv/deployment/ores/venv (T181071)

Mentioned in SAL (#wikimedia-operations) [2018-04-11T22:47:38Z] <mutante> ores2* - puppet ran to change venv config, then 'rm -rf /srv/deployment/ores/venv/' via cumin to clean-up (T181071)

Change 425115 merged by Awight:
[mediawiki/services/ores/deploy@master] Remove transitional virtualenv copy

https://gerrit.wikimedia.org/r/425115

\o/ just did a rollback of one machine in 24 seconds, this seems to be working nicely.