
Session issues on labweb horizon ('newhorizon.wikimedia.org')
Closed, Resolved · Public

Description

Every now and then I log into newhorizon and immediately get bounced back to another login prompt. It doesn't happen consistently, but it does happen.

Event Timeline

Andrew triaged this task as Medium priority. Mar 9 2018, 3:37 AM
Andrew created this task.

I assumed that this was some sort of memcached split-brain, but now I'm not so sure. I just did this test:

  • Logged in to newhorizon, verified I had an active session by reloading the page.
  • Turned off apache on labweb1001, verified I still had an active session by reloading newhorizon (served by labweb1002)
  • Turned off apache on labweb1002, verified that everything was breaking
  • Turned apache back on on labweb1001, verified I still had an active session (served by labweb1001)

So the session info is getting properly shared between the two hosts, as I'd expect, having double- and triple-checked the nutcracker config.
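(A quicker way to double-check the sharing, without touching apache at all, is a probe like this on each host. This is only a sketch; the local nutcracker listen address is a guess, not the production value.)

import memcache

# talk to the local nutcracker proxy, which routes to the memcached pool
mc = memcache.Client(['127.0.0.1:11212'])
mc.set('t189278-probe', 'hello')    # run this on labweb1001
print(mc.get('t189278-probe'))      # run this on labweb1002; should print 'hello'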

So... I don't know what this is. All I can think is that there's a race between when the session info is written and when the next page load happens, like this:

  • I log in to labweb1001
  • Session data is written to memcached
  • next page load happens on labweb1002
  • Nutcracker syncs between the two memcaches, but too late for labweb1002 to know about it!

But... I assumed that nutcracker was a write-through cache so that there's no such race :)
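For reference, the wiring in play is roughly this (a sketch of the shape of our local_settings.py, not a copy of it; the nutcracker listen address and the SESSION_ENGINE value are assumptions):

SESSION_ENGINE = 'django.contrib.sessions.backends.cache'

CACHES = {
    'default': {
        'BACKEND' : 'django.core.cache.backends.memcached.MemcachedCache',
        # a single local nutcracker endpoint, which proxies to the
        # memcached instances on labweb1001 and labweb1002
        'LOCATION' : '127.0.0.1:11212',
    }
}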

I can cut nutcracker out of the loop entirely, and then this issue goes away:

CACHES = {
    'default': {
        'BACKEND' : 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION' : '208.80.154.160:11000',
        'LOCATION' : '208.80.155.109:11000',
    }
}

But that's wrong too: with two 'LOCATION' keys in the dict, the second silently overrides the first, so Django isn't pooling the two memcaches at all. If I turn off only .160 it works fine, but if I turn off only .109 it just breaks. So it's really just using the single, second memcached -- not really an improvement.

My mistake; the proper syntax is:

'LOCATION' : ['208.80.154.160:11000', '208.80.155.109:11000'],

That produces other weird behavior and is definitely not an improvement -- it still flakes out in various ways if I kill memcached on one of the servers.
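For completeness, the list form in context (a sketch), which also hints at why one dead server still hurts if I understand the python-memcached client right: it shards keys across the listed servers rather than replicating them, so any session that hashes to the dead server is simply gone:

CACHES = {
    'default': {
        'BACKEND' : 'django.core.cache.backends.memcached.MemcachedCache',
        # keys are hashed across these servers (sharding, not replication),
        # so losing one server loses the sessions stored on it
        'LOCATION' : ['208.80.154.160:11000', '208.80.155.109:11000'],
    }
}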

Change 417574 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[openstack/horizon/deploy@ocata] Try harder to ensure that our various local_settings.py are symlinks

https://gerrit.wikimedia.org/r/417574

Change 417574 merged by Andrew Bogott:
[openstack/horizon/deploy@ocata] Try harder to ensure that our various local_settings.py are symlinks

https://gerrit.wikimedia.org/r/417574

Mentioned in SAL (#wikimedia-operations) [2018-03-09T04:57:44Z] <andrew@tin> Started deploy [horizon/deploy@930009e]: rebuilding venvs to avoid rogue configs, as was causing T189278

Mentioned in SAL (#wikimedia-operations) [2018-03-09T05:00:42Z] <andrew@tin> Finished deploy [horizon/deploy@930009e]: rebuilding venvs to avoid rogue configs, as was causing T189278 (duration: 02m 59s)

This turns out to be a problem with a dirty deployment. On labweb1001:

ls -ltra /srv/deployment/horizon/venv/lib/python3.5/site-packages/openstack_dashboard/local/local_settings.py
lrwxrwxrwx 1 deploy-service deploy-service 42 Feb 20 20:16 /srv/deployment/horizon/venv/lib/python3.5/site-packages/openstack_dashboard/local/local_settings.py -> /etc/openstack-dashboard/local_settings.py

On labweb1002:

ls -ltra /srv/deployment/horizon/venv/lib/python3.5/site-packages/openstack_dashboard/local/local_settings.py
-rw-r--r-- 1 deploy-service deploy-service 18236 Feb 8 18:30 /srv/deployment/horizon/venv/lib/python3.5/site-packages/openstack_dashboard/local/local_settings.py

So, there was a rogue local_settings.py on labweb1002.

I wiped out and rebuilt the venvs on both systems, and the above patch should produce an explicit error during deployment if the same situation arises again.
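(The shape of that check is roughly this -- a sketch, not the actual contents of change 417574:)

#!/usr/bin/env python3
# Refuse to continue if local_settings.py in the venv is not the expected symlink.
import os
import sys

EXPECTED_TARGET = '/etc/openstack-dashboard/local_settings.py'
PATH = ('/srv/deployment/horizon/venv/lib/python3.5/site-packages/'
        'openstack_dashboard/local/local_settings.py')

if not os.path.islink(PATH) or os.readlink(PATH) != EXPECTED_TARGET:
    sys.exit('%s is not a symlink to %s; refusing to continue' % (PATH, EXPECTED_TARGET))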

Managing symlinks inside a scap3 deployment is likely to always be fragile. Is there a way we can deploy a file in the git repo that loads the settings that you want managed via puppet instead? Striker does this by reading an ini file and then using the ini-provided values in the config.
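Roughly like this, in settings.py (the file path and key names here are made up for illustration, not copied from Striker):

import configparser

ini = configparser.ConfigParser()
# the ini file itself is owned and templated by puppet
ini.read('/etc/openstack-dashboard/horizon.ini')

SECRET_KEY = ini.get('django', 'secret_key')
OPENSTACK_KEYSTONE_URL = ini.get('openstack', 'keystone_url')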

> Is there a way we can deploy a file in the git repo that loads the settings that you want managed via puppet instead?

There are a few problems here, the greatest of which is that different steps in the horizon setup (running vs. string generation vs. gathering static content) all look in different places for the same config file. The symlinking is because I want to be 100% sure that all those processes are getting the same config.

That 'looks in different places' issue comes from django weirdness, and resolving it is quite a rabbithole which I don't look forward to going down again. It might be fixable by manually modifying pythonpath before each command is run...
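(Something along these lines for each management command, purely as an illustration of that idea:)

import os
import subprocess

env = dict(os.environ)
# make every step resolve its settings from the same place first
env['PYTHONPATH'] = '/etc/openstack-dashboard:' + env.get('PYTHONPATH', '')
subprocess.check_call(['python3', 'manage.py', 'collectstatic', '--noinput'], env=env)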

My patch, above, is intended to notice if things are amiss and just error out during deployment in that case. Do you see specific scenarios where that would be foiled?

> Do you see specific scenarios where that would be foiled?

The check script runs late enough that DEPLOY_DIR=/srv/deployment/horizon/deploy is probably correct, but I think we should get a double check from @thcipriani. The issue here is that on the scap3 target servers this directory is actually a symlink to the active clone of the upstream. I remember in the past that some deployments had trouble with trying to modify files in the symlinked target because of the timing between the modifications being made and scap3 swapping the old symlink for the new one. I saw some commits from Tyler recently that added new environment variables to the script execution to make getting all of this correct easier.

We can save the "why is django looking for the same thing in different places" mess for another time. :) Untangling the rat's nest of someone else's custom Django config bootstrapping system is never fun.