Cache ORES virtualenv within versioned source
Closed, ResolvedPublic

Description

This lets us roll back much faster, since we won't need to rebuild the python environment.

Initial deployment needs to be done carefully; see the notes in the operations/puppet patch. The puppet patch needs to be safe for both the new and old service scripts, and for both the new and old path hierarchy.
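To illustrate why a virtualenv cached inside each versioned source directory makes rollback fast, here is a minimal sketch. The directory layout (a revs/ folder and a current symlink) and the function names are hypothetical, chosen for illustration only; they are not the actual scap or ORES layout. The point is that rolling back only repoints a symlink at a revision whose venv already exists, so nothing is rebuilt.

```python
import os
import tempfile


def activate_revision(deploy_root: str, rev: str) -> str:
    """Point the 'current' symlink at an already-built revision directory.

    Hypothetical layout: <deploy_root>/revs/<rev>/venv holds the cached venv.
    """
    target = os.path.join(deploy_root, "revs", rev)
    if not os.path.isdir(os.path.join(target, "venv")):
        raise FileNotFoundError(f"no cached venv for revision {rev}")
    link = os.path.join(deploy_root, "current")
    tmp_link = link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    # Renaming the temporary symlink over the old one switches revisions
    # in a single step, without touching either venv.
    os.replace(tmp_link, link)
    return os.path.join(link, "venv")


def demo() -> tuple:
    """Deploy a new revision, then roll back to the previous one."""
    with tempfile.TemporaryDirectory() as root:
        root = os.path.realpath(root)
        for rev in ("abc123", "def456"):
            os.makedirs(os.path.join(root, "revs", rev, "venv"))
        activate_revision(root, "def456")
        deployed = os.path.realpath(os.path.join(root, "current"))
        activate_revision(root, "abc123")  # rollback: no rebuild needed
        rolled_back = os.path.realpath(os.path.join(root, "current"))
        return os.path.basename(deployed), os.path.basename(rolled_back)
```

With a single shared venv, by contrast, a rollback would have to reinstall the older dependency set into that one location.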

Current migration plan:

~~Deploy new ORES code with both venvs, https://gerrit.wikimedia.org/r/#/c/392682/~~
~~Fix new virtualenvs.~~
Deploy service scripts to use the versioned venv: https://gerrit.wikimedia.org/r/#/c/392683/
Reenable service and job.
Smoke test
Remove the shared venv from deployment targets.

Details

Subject	Repo	Branch	Lines +/-
Remove transitional virtualenv copy	mediawiki/services/ores/deploy	master	+0 -7
Update ORES venv path to use versioned cache	operations/puppet	production	+1 -1
Prevent virtualenv from being silly	mediawiki/services/ores/deploy	master	+2 -2
Include packages needed by virtualenv	operations/puppet	production	+9 -2
Build venv into deployed source dir (take 3)	mediawiki/services/ores/deploy	master	+36 -33
Build venv into deployed source dir (take 2)	mediawiki/services/ores/deploy	master	+22 -26
Build venv into deployed source dir	mediawiki/services/ores/deploy	master	+22 -26
Remove unused scripts	mediawiki/services/ores/deploy	master	+0 -12


Revisions and Commits

Restricted Differential Revision

Related Objects

Status	Assigned	Task
Resolved	awight	T181010 [Spike] Write reports about why Ext:ORES is helping cause server 500s and write tasks to fix
Resolved	awight	T181071 Cache ORES virtualenv within versioned source
Resolved	fgiunchedi	T189306 SCAP: Upload debian package version 3.7.7-1
Resolved	awight	T180628 Install git-lfs client (at least on scap targets & masters)
Resolved	Dzahn	T193422 Clean up deprecated shared virtualenv directories

Event Timeline


awight mentioned this in rORESDEPLOY0e7208d9a68f: [DNM] Build venv into deployed source dir.Feb 26 2018, 10:57 PM

awight mentioned this in rORESDEPLOYede2fab0fab0: [DNM] Build venv into deployed source dir.Feb 26 2018, 11:10 PM

awight mentioned this in rORESDEPLOYc8d8a05a7d11: [DNM] Build venv into deployed source dir.Feb 26 2018, 11:22 PM

Do not merge patches, currently blocked on scap 3.8 deployment to production.

awight renamed this task from Cache ORES virtualenv within versioned source to [Blocked] Cache ORES virtualenv within versioned source.Feb 26 2018, 11:34 PM

awight changed the task status from Open to Stalled.

awight added a subtask: T189306: SCAP: Upload debian package version 3.7.7-1.Mar 16 2018, 3:45 AM

awight removed a subtask: T154612: Add DEPLOY_DIR env var to scap command checks.

fgiunchedi closed subtask T189306: SCAP: Upload debian package version 3.7.7-1 as Resolved.Mar 20 2018, 2:57 PM

@awight: Scap 3.7.7 is deployed now, are you sure you need to wait for 3.8? Which change, specifically, are you waiting for?

Changelog for 3.7.7:

Added git-lfs support
Added script check type and environment variables
Added quick environment variable to disable scap plugins
Removed scap l10n-update and scap hhvm-graceful
Updated documentation for checks

Cool! Yes it was 3.7.7 I was waiting for, with the script check type.

awight renamed this task from [Blocked] Cache ORES virtualenv within versioned source to Cache ORES virtualenv within versioned source.Mar 21 2018, 9:10 PM

awight changed the task status from Stalled to Open.

awight removed a project: Patch-For-Review.

I probably _should_ have called it 3.8 ;)

awight updated the task description. Mar 21 2018, 9:42 PM

awight mentioned this in rORESDEPLOY7ab18f824b6e: Build venv into deployed source dir.Mar 21 2018, 9:53 PM

awight mentioned this in rORESDEPLOYc25df37ca22c: [DNM] Build venv into deployed source dir.

Change 392682 merged by Ladsgroup:
[mediawiki/services/ores/deploy@master] Build venv into deployed source dir

https://gerrit.wikimedia.org/r/392682

This has been reverted in https://gerrit.wikimedia.org/r/#/c/421316/, since I can't deploy today and it's a breaking change that requires special deployment steps.

awight moved this task from Review to Pending deployment on the Machine-Learning-Team (Active Tasks) board.Apr 2 2018, 4:47 PM

Change 423740 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] Build venv into deployed source dir (take 2)

https://gerrit.wikimedia.org/r/423740

gerritbot added a project: Patch-For-Review.Apr 3 2018, 5:09 PM

Change 423740 merged by Awight:
[mediawiki/services/ores/deploy@master] Build venv into deployed source dir (take 2)

https://gerrit.wikimedia.org/r/423740

Mentioned in SAL (#wikimedia-operations) [2018-04-03T17:25:22Z] <awight@tin> Started deploy [ores/deploy@7701cee]: ORES versioned virtualenv, T181071

Mentioned in SAL (#wikimedia-operations) [2018-04-03T17:25:46Z] <awight@tin> Finished deploy [ores/deploy@7701cee]: ORES versioned virtualenv, T181071 (duration: 00m 27s)

Mentioned in SAL (#wikimedia-operations) [2018-04-03T17:27:28Z] <awight@tin> Started deploy [ores/deploy@7701cee]: ORES versioned virtualenv, T181071

awight mentioned this in rORESDEPLOY7701cee9f6de: Build venv into deployed source dir (take 2).Apr 3 2018, 5:28 PM

Mentioned in SAL (#wikimedia-operations) [2018-04-03T17:28:55Z] <awight@tin> Finished deploy [ores/deploy@7701cee]: ORES versioned virtualenv, T181071 (duration: 01m 27s)

I discovered in beta cluster testing that we'll have to take the ORES service down to do the migration. The old virtualenv is not compatible with code in a new directory. Will change the task description to reflect the new plan.

awight updated the task description. Apr 4 2018, 8:48 PM

As mentioned in my previous comment, the safe migration failed in practice for an arcane reason. I'm trying to decide whether to add complexity and support the redundant virtualenvs during a safe transition phase, or roll out as a breaking change, coordinating the puppet and ORES pieces. @akosiaris If that sounds decent, I'd like to try deploying as a breaking change during 20:00-21:00 UTC today, or another time as you suggest.

Change 425114 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] Build venv into deployed source dir (take 3)

https://gerrit.wikimedia.org/r/425114

Change 425115 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] [DNM] Remove transitional virtualenv copy

https://gerrit.wikimedia.org/r/425115

awight mentioned this in rORESDEPLOY9b6f31abf752: Build venv into deployed source dir (take 3).Apr 9 2018, 7:15 PM

awight mentioned this in rORESDEPLOY129a96a212f9: [DNM] Remove transitional virtualenv copy.

Change 425114 merged by Awight:
[mediawiki/services/ores/deploy@master] Build venv into deployed source dir (take 3)

https://gerrit.wikimedia.org/r/425114

Mentioned in SAL (#wikimedia-operations) [2018-04-09T20:12:22Z] <awight@tin> Started deploy [ores/deploy@b61c338]: Transitional virtualenv for ORES, T181071

Mentioned in SAL (#wikimedia-operations) [2018-04-09T20:12:39Z] <awight@tin> Finished deploy [ores/deploy@b61c338]: Transitional virtualenv for ORES, T181071 (duration: 00m 19s)

Mentioned in SAL (#wikimedia-operations) [2018-04-09T20:14:38Z] <awight@tin> Started deploy [ores/deploy@be69c1d]: Transitional virtualenv for ORES, T181071

Mentioned in SAL (#wikimedia-operations) [2018-04-09T20:38:52Z] <awight@tin> Finished deploy [ores/deploy@be69c1d]: Transitional virtualenv for ORES, T181071 (duration: 24m 14s)

Update: I deployed as a safe migration, but the new virtualenvs are not populated correctly. The service is fine; I'm investigating and will probably try a fix tomorrow.

I can't tell whether the fetch check script is failing, since the output doesn't show up in scap deploy-log. The virtualenv in the old location is correctly populated. Running the virtualenv command manually on ores1002 as my normal user fails in a surprising way, with a network timeout to pypi.python.org. Virtualenv initialization isn't supposed to make network connections. However, dpkg -L python3-virtualenv on an ORES server shows that the virtualenv package lacks setuptools, so it makes sense that a download is attempted.

How does this work in the normal case? Has this been the behavior all along?

awight moved this task from Pending deployment to Parked on the Machine-Learning-Team (Active Tasks) board.Apr 9 2018, 9:55 PM

Do we have to install the python3-setuptools package? Help from Release-Engineering-Team is welcome; I don't have the privileges to play with the deploy-service user, for example.

I ran a test on deployment-ores01.deployment-prep.eqiad.wmflabs:

rm -rf ~/.cache/pip
virtualenv -v -p python3 --system-site-packages test-venv
....
Installing setuptools, pkg_resources, pip, wheel...
  Collecting setuptools
    Downloading setuptools-39.0.1-py2.py3-none-any.whl (569kB)
....

There's no .cache/pip in /var/lib/deploy-service however, so I'm not sure where to check to understand what normally happens when we install the virtualenv.

awight updated the task description. Apr 10 2018, 1:47 AM

In T181071#4118685, @awight wrote:

I can't tell whether the fetch check script is failing, since the output doesn't show up in scap deploy-log.

@awight: Did you try scap deploy-log -v?

In T181071#4119091, @awight wrote:

There's no .cache/pip in /var/lib/deploy-service however, so I'm not sure where to check to understand what normally happens when we install the virtualenv.

We probably need to get setuptools installed via puppet; however, I'm not really sure about it.
I haven't ever tried to deploy a virtualenv. @thcipriani, do you have any advice? Also, maybe @Legoktm or @mobrovac have some experience with scap + virtualenv?

• mmodell added a subscriber: thcipriani.Apr 10 2018, 2:52 AM

The weird part about this is just that we've been deploying virtualenv in this way for years now, but I can't explain why it ever worked. Looking forward to messing with this tomorrow :)

In T181071#4119135, @mmodell wrote:

In T181071#4118685, @awight wrote:

I can't tell whether the fetch check script is failing, since the output doesn't show up in scap deploy-log.

@awight: Did you try scap deploy-log -v?

I just took a look in scap deploy-log -v, unfortunately that still doesn't show script output. I'll try adding -x to my script's shebang.

@mmodell: We're running the fetch check with "bash -x" now, so it should have generated a ton of output. Are stdout and stderr being swallowed by scap? Reading checks.py, we poll stdout but only print it if the job fails. Stderr is copied to stdout.

My issue is probably that a single failed command isn't causing the script to fail, and the successful exit code causes the output to be thrown away. I'll work around by returning a non-zero code from the script.
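A minimal sketch of the behaviour described above, to make the failure mode concrete. This is not scap's checks.py; run_check is a hypothetical stand-in that merges stderr into stdout and reports the captured output only when the exit code is non-zero. The two shell snippets show how a script without fail-fast handling exits 0 even though one command failed, so its output is discarded.

```python
import subprocess


def run_check(script: str) -> tuple:
    """Run a shell snippet and return (exit_code, reported_output).

    Models the described behaviour only: stderr is merged into stdout, and
    the output is reported only if the script exits with a non-zero code.
    """
    proc = subprocess.run(
        ["sh", "-c", script],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    reported = proc.stdout if proc.returncode != 0 else ""
    return proc.returncode, reported


# The exit status is that of the last command, so the failing 'false'
# in the middle goes unnoticed and the output is thrown away.
LENIENT = "echo building; false; echo done"

# With 'set -e' the shell stops at the first failing command and the
# script exits non-zero, so the output survives.
STRICT = "set -e; echo building; false; echo done"
```

Calling run_check(LENIENT) yields exit code 0 and no reported output, while run_check(STRICT) yields a non-zero code along with everything printed before the failure.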

There's a wealth of surprising results on tin, in /srv/deployment/ores/deploy/scap/log/scap-sync-2018-04-10-0001-1-gd35a1e6.log

Some highlights are:

As suspected, virtualenv is timing out trying to download setuptools. Suggested solution is to install the python3-setuptools package via puppet. We probably want to install python3-pip and python3-wheel as well, since they're required by this bootstrap phase of virtualenv.
We fail to update python in the old virtualenv, because OSError: [Errno 26] Text file busy: '/srv/deployment/ores/venv/bin/python3'
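The "Text file busy" error (errno 26, ETXTBSY) typically appears on Linux when a process tries to write into an executable that is currently running. One common way around it, sketched below, is to write the new file next to the old one and rename it into place, which replaces the directory entry and leaves the running copy untouched. The helper name install_atomically is hypothetical, and this is an illustration of the general pattern, not the fix that was applied in the ORES deploy scripts.

```python
import errno
import os
import tempfile

# Symbolic name for the error seen in the log (26 on Linux).
TEXT_FILE_BUSY = errno.ETXTBSY


def install_atomically(dest: str, data: bytes, mode: int = 0o755) -> None:
    """Write data to a temporary file beside dest, then rename it over dest."""
    directory = os.path.dirname(dest) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.chmod(tmp, mode)
        # The rename swaps the directory entry; a process still executing
        # the old file keeps its own inode and is not written to.
        os.replace(tmp, dest)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def demo() -> bytes:
    """Install a file twice and return the final contents."""
    with tempfile.TemporaryDirectory() as root:
        dest = os.path.join(root, "python3")
        install_atomically(dest, b"old")
        install_atomically(dest, b"new")
        with open(dest, "rb") as handle:
            return handle.read()
```

Building each virtualenv in its own versioned directory sidesteps the problem in a similar way, since the interpreter of the running service is never overwritten.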

It would be great if scap could log the check script output on success as well as failure, IMO. Otherwise, I could add a lot of error handling in the check script and fail fast whenever a subcommand fails.
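The fail-fast alternative mentioned above could look roughly like the following sketch: run each step in order and stop at the first non-zero exit code, so the overall script fails and its output is kept. run_steps is a hypothetical helper, not part of scap or the ORES deploy repository.

```python
import subprocess
import sys


def run_steps(steps: list) -> int:
    """Run each command (an argv list) in order.

    Stops at the first failing step and returns its exit code, or returns 0
    if every step succeeds.
    """
    for argv in steps:
        print("+ " + " ".join(argv), flush=True)
        code = subprocess.run(argv).returncode
        if code != 0:
            print(f"step failed with exit code {code}", file=sys.stderr)
            return code
    return 0
```

A check script built this way would end with sys.exit(run_steps([...])), so that any failed subcommand turns into a failed check.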

In T181071#4120939, @awight wrote:

It would be great if scap could log the check script output on success as well as failure, IMO. Otherwise, I could add a lot of error handling in the check script and fail fast whenever a subcommand fails.

Ok, I've added logging in {D1026}

• mmodell added a revision: Restricted Differential Revision.Apr 10 2018, 7:11 PM

Change 425442 had a related patch set uploaded (by Awight; owner: Awight):
[operations/puppet@production] Include packages needed by virtualenv

https://gerrit.wikimedia.org/r/425442

Change 425442 abandoned by Awight:
Include packages needed by virtualenv

Reason:
Tried this fix on beta and it doesn't work.

https://gerrit.wikimedia.org/r/425442

Back to the drawing board. Including the packaged Python modules doesn't seem to help: virtualenv seems to expect wheels somewhere on the filesystem and won't build the wheels from system site-packages.

@awight: why not build the virtualenv on a developer machine and commit it to the repo instead of building it on tin?

@mmodell: That would be wonderful, but virtualenvs are specific not only to a machine, but tied to a specific filesystem path. There's a --relocatable flag, but nobody trusts that.
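A small sketch of why an environment is tied to its filesystem path. It uses Python's standard-library venv module rather than the virtualenv tool discussed in this thread (an assumption made so the example is self-contained), and simply checks that the generated activate script contains the absolute path of the environment.

```python
import os
import tempfile
import venv


def venv_embeds_path() -> bool:
    """Create a throwaway environment and report whether its activate
    script contains the environment's absolute path."""
    with tempfile.TemporaryDirectory() as root:
        env_dir = os.path.join(os.path.realpath(root), "env")
        # with_pip=False keeps this fast and avoids any network access.
        venv.create(env_dir, with_pip=False, symlinks=(os.name != "nt"))
        bin_dir = os.path.join(env_dir, "Scripts" if os.name == "nt" else "bin")
        with open(os.path.join(bin_dir, "activate"), encoding="utf-8") as handle:
            return env_dir in handle.read()
```

Because the path is baked into generated files like this one, moving or copying the directory elsewhere leaves stale references behind.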

So why even use virtualenv if they are based on site packages and have to be built on the target machine?

That's right, we do use the --system-site-packages flag, but all that does is reuse a few packages so that we don't have to bundle built-ins that we should be able to rely on.

In our case, virtualenv is mostly useful for isolating us from the Debian packaging lifecycle, separating privileges so we don't need root to deploy, and for isolating between application versions so we don't have to worry about incomplete cleanup of older packages.
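For reference, the effect of the system-site-packages option can be inspected in the environment's pyvenv.cfg file. The sketch below again uses the standard-library venv module as a stand-in for the virtualenv command used in this thread; read_pyvenv_cfg is a hypothetical helper name.

```python
import os
import tempfile
import venv


def read_pyvenv_cfg(system_site_packages: bool) -> dict:
    """Create a throwaway environment and return its pyvenv.cfg settings."""
    with tempfile.TemporaryDirectory() as root:
        env_dir = os.path.join(root, "env")
        venv.create(
            env_dir,
            system_site_packages=system_site_packages,
            with_pip=False,  # no bootstrap, so no network access
            symlinks=(os.name != "nt"),
        )
        settings = {}
        with open(os.path.join(env_dir, "pyvenv.cfg"), encoding="utf-8") as handle:
            for line in handle:
                key, sep, value = line.partition("=")
                if sep:
                    settings[key.strip()] = value.strip()
        return settings
```

The include-system-site-packages key records whether the environment falls back to packages installed system-wide; everything installed into the environment itself stays isolated either way.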

Still finding strangeness... Reading the virtualenv source and packaging, it depends on python-pip-whl, which includes all the wheels we need. However, virtualenv doesn't search this path by default. Even after adding the path with virtualenv -v -p python3 --extra-search-dir=/usr/share/python-wheels test-venv/, I still see virtualenv try to download the same wheels... I'll debug virtualenv now :-(
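When debugging this kind of problem, it can help to check which wheels a search directory actually contains. The sketch below lists the distributions found in a directory such as /usr/share/python-wheels by parsing wheel filenames (name-version[-build]-python-abi-platform.whl) and reports which bootstrap packages are missing. Both helper names are hypothetical, and the default list of needed packages is taken from the setuptools, pip and wheel mentioned earlier in this thread.

```python
import os


def bundled_distributions(wheel_dir: str) -> dict:
    """Map distribution name to version for every .whl file in wheel_dir."""
    found = {}
    for name in sorted(os.listdir(wheel_dir)):
        if not name.endswith(".whl"):
            continue
        parts = name[: -len(".whl")].split("-")
        # A wheel filename has at least: name, version, python, abi, platform.
        if len(parts) >= 5:
            found[parts[0].lower()] = parts[1]
    return found


def missing_for_bootstrap(wheel_dir: str, needed=("setuptools", "pip", "wheel")) -> list:
    """Return the needed distributions that have no wheel in wheel_dir."""
    have = bundled_distributions(wheel_dir)
    return [dist for dist in needed if dist not in have]
```

An empty result from missing_for_bootstrap would suggest the wheels are present and the problem lies in how the search directory is being used, not in what it contains.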

This sounds surprising and strange. Please ping me on IRC to give me an overview today. You should be able to safely switch between virtualenv folders. That's the whole point!

Change 425580 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/services/ores/deploy@master] Prevent virtualenv from being silly

https://gerrit.wikimedia.org/r/425580

Change 425580 merged by Awight:
[mediawiki/services/ores/deploy@master] Prevent virtualenv from being silly

https://gerrit.wikimedia.org/r/425580

awight mentioned this in rORESDEPLOYb6deb5d5126b: Prevent virtualenv from being silly.Apr 11 2018, 6:16 PM

Mentioned in SAL (#wikimedia-operations) [2018-04-11T20:02:09Z] <awight@tin> Started deploy [ores/deploy@b6deb5d]: Transitional virtualenv for ORES (take 2), T181071

Mentioned in SAL (#wikimedia-operations) [2018-04-11T20:20:42Z] <awight@tin> Finished deploy [ores/deploy@b6deb5d]: Transitional virtualenv for ORES (take 2), T181071 (duration: 18m 34s)

awight updated the task description. Apr 11 2018, 8:23 PM

@akosiaris We're finally ready to deploy the puppet change, https://gerrit.wikimedia.org/r/#/c/392683/. Both ores services should be restarted on all machines once this new configuration is pushed.

There's a slight urgency, since we have to freeze all of our dependencies until we're migrated over to the new virtualenv path.

Once the puppet change is merged and services restarted, we should remove the /srv/deployment/ores/venv unversioned virtualenv from all ORES servers.

awight triaged this task as High priority.Apr 11 2018, 8:26 PM

awight moved this task from Parked to Pending deployment on the Machine-Learning-Team (Active Tasks) board.

Change 392683 merged by Dzahn:
[operations/puppet@production] Update ORES venv path to use versioned cache

https://gerrit.wikimedia.org/r/392683

Mentioned in SAL (#wikimedia-operations) [2018-04-11T22:28:05Z] <mutante> ores - running puppet on all instances to apply venv path change for T181071

awight mentioned this in rORESDEPLOYadc5c0641729: Remove transitional virtualenv copy.Apr 11 2018, 10:40 PM

Mentioned in SAL (#wikimedia-operations) [2018-04-11T22:41:40Z] <mutante> ores1002-1009 - deleting old venv dir - rm -f /srv/deployment/ores/venv (T181071)

Mentioned in SAL (#wikimedia-operations) [2018-04-11T22:47:38Z] <mutante> ores2* - puppet ran to change venv config, then 'rm -rf /srv/deployment/ores/venv/' via cumin to clean-up (T181071)

awight mentioned this in rORESDEPLOY543901a78df3: Remove transitional virtualenv copy.Apr 11 2018, 11:00 PM

awight moved this task from Pending deployment to Completed on the Machine-Learning-Team (Active Tasks) board.Apr 11 2018, 11:20 PM

Change 425115 merged by Awight:
[mediawiki/services/ores/deploy@master] Remove transitional virtualenv copy

https://gerrit.wikimedia.org/r/425115

awight mentioned this in T192478: Investigate PEP 503 repo for production deployment of python wheels.Apr 18 2018, 6:31 PM

• mmodell mentioned this in rMSCA31b27fd44c6b: Log check output when check succeeds..Apr 18 2018, 8:26 PM

• mmodell mentioned this in rMSCA0531edc1e064: Log check output when check succeeds..Apr 18 2018, 8:28 PM