Page MenuHomePhabricator

"Loss of session data" on Beta Cluster
Open, HighPublic

Description

The last few days, people have frequently been getting logged out on Beta Cluster with "Sorry! We could not process your edit due to a loss of session data." messages when an action is taken (e.g. after edit, or saving preferences).

This is similar to T151770: Frequent loss of session data in Firefox (since around 2016-11-28), but the current task reportedly happens in Safari and Chrome (as well as Firefox).

Someone could try posting their headers (see https://phabricator.wikimedia.org/T151770#2875829 ) from a case where they hit the bug. If you do so, be sure to log out after capturing the headers and before posting them.

Event Timeline

Fixing the fatal @MarcoAurelio linked to. Don't know if that will fix this issue, though.

Change 370290 had a related patch set uploaded (by Mattflaschen; owner: Mattflaschen):
[mediawiki/core@master] WANCacheReapUpdate: Handle special pages without exception

https://gerrit.wikimedia.org/r/370290

Change 370290 merged by jenkins-bot:
[mediawiki/core@master] WANCacheReapUpdate: Handle special pages without exception

https://gerrit.wikimedia.org/r/370290

Presently in beta, I cannot use Twinkle or 'Mark this page as patrolled' feature.

Using Twinkle to tag a page produces:

Tagging page: Assertion that the user is logged in failed.

AV3L4KyohvX_CkfKdDP1 Loading CentralAuthUser for user MarcoAurelio from cache object

600+ of such logs in the last hour. Each time I switch a page, I'm logged out. I reload, I'm logged-in again but it seems as if everything was new for the system: unread notifications symbol, new recent changes filter invitation, etc...

I'm on Chrome, last version without cookie blocks for Wikimedia domains.

Link to logstash data.

greg added subscribers: Anomie, hoo, Legoktm, greg.

@Anomie Can you take a look at this or point to someone else who can?

Also adding @hoo and @Legoktm because of mw:Dev/Maint listing them for CentralAuth. Same question for you two :)

greg triaged this task as High priority.Aug 10 2017, 4:12 PM

This isn't breaking in production but it's making Beta Cluster not a very useful testing environment. Priority -> High.

@Anomie Can you take a look at this or point to someone else who can?

I can't manage to reproduce this. I tried both with and without the "remember me" checkbox checked, and on both Beta enwiki after logging in there and Beta dewiki after logging in on Beta enwiki.

Captured request and response headers would be useful.

AV3L4KyohvX_CkfKdDP1 Loading CentralAuthUser for user MarcoAurelio from cache object

600+ of such logs in the last hour.

It's expected that the CentralAuthUser would be loaded from cache on every request.

The issue is somewhat much milder today - not 100% reproducible.

  1. Log in to betalabs

2 Go to Special:Notifications page

  1. The page starts verifying your session data numerous times (up yo 9 times ) - checkLoggedIn, createSession, validateSession. And many setCookies?type=1x1&from=enwiki&proto=https

deployment.wikimedia.beta.wmflabs.org/wiki/Special:CentralAutoLogin

  1. The badges for notifications will be displayed differently. Click on 'Alerts' or 'on Notices' - the Login page will be briefly displayed, and then a user will be logged in again automatically.

Screen Shot 2017-08-14 at 10.27.38 AM.png (454×1 px, 111 KB)

I am not sure which request/response headers will be of interest - they seem to be OK, the cycle fro validating if a user is logged in repeats from the beginning (even by redirecting to the Login page) many times.

Request URL:https://login.wikimedia.beta.wmflabs.org/wiki/Special:CentralAutoLogin/checkLoggedIn?wikiid=enwikinews&proto=https&type=1x1&from=enwiki
Request Method:GET
Status Code:302 
Remote Address:208.80.155.135:443
Referrer Policy:origin-when-cross-origin
Response Headers
age:0
backend-timing:D=147699 t=1502731625840553
cache-control:private, s-maxage=0, max-age=0, must-revalidate
content-encoding:gzip
content-length:20
content-type:text/html; charset=utf-8
date:Mon, 14 Aug 2017 17:27:05 GMT
expires:Thu, 01 Jan 1970 00:00:00 GMT
location:https://en.wikinews.beta.wmflabs.org/wiki/Special:CentralAutoLogin/createSession?token=9737b868cd1cdf95c7ad250458231f2c&type=1x1&from=enwiki&proto=https
p3p:CP="This is not a P3P policy! See https://login.wikimedia.beta.wmflabs.org/wiki/Special:CentralAutoLogin/P3P for more info."
pragma:no-cache
server:deployment-mediawiki04.deployment-prep.eqiad.wmflabs
status:302
vary:Accept-Encoding,X-Forwarded-Proto,Cookie,Authorization
via:1.1 varnish-v4, 1.1 varnish-v4
x-analytics:WMF-Last-Access=14-Aug-2017;WMF-Last-Access-Global=14-Aug-2017;https=1
x-cache:deployment-cache-text04 pass, deployment-cache-text04 pass
x-cache-status:pass
x-client-ip:198.73.209.1
x-content-type-options:nosniff
x-powered-by:HHVM/3.18.2
x-varnish:100615178, 42143361
Request Headers
:authority:login.wikimedia.beta.wmflabs.org
:method:GET
:path:/wiki/Special:CentralAutoLogin/checkLoggedIn?wikiid=enwikinews&proto=https&type=1x1&from=enwiki
:scheme:https
accept:image/webp,image/apng,image/*,*/*;q=0.8
accept-encoding:gzip, deflate, br
accept-language:en-US,en;q=0.8,ru;q=0.6
cookie:GeoIP=US:CA:San_Francisco:37.79:-122.39:v4; CP=H2; WMF-Last-Access=14-Aug-2017; WMF-Last-Access-Global=14-Aug-2017; loginwikiSession=l6brrls4d5jd8i6unt9ufdofc5cnjk3j; loginwikiUserID=10031; loginwikiUserName=ET1; forceHTTPS=true; centralauth_User=ET1; centralauth_Session=d15467b315b787c2a2e77777a0b6df52
referer:https://en.wikipedia.beta.wmflabs.org/
user-agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36

I am not sure which request/response headers will be of interest - they seem to be OK

It's probably best to capture all of them for the whole process from logging in to seeing the bug. Browsers will typically save the capture as a "har" file. Then make sure your password isn't included (or change your password after) and upload the whole har file.

the cycle fro validating if a user is logged in repeats from the beginning (even by redirecting to the Login page) many times.

What's happening there is that the page is including 1x1 pixel images to set cookies for you on all the different domains (meta.wikimedia.beta.wmflabs.org, (en).wikipedia.beta.wmflabs.org, (en).wiktionary.beta.wmflabs.org, and so on). So it has to do one set of back-and-forth requests for each domain.

The attached har is for the case when an edit (in wikitext mode) cannot be saved due to the loss of session data:

Screen Shot 2017-08-24 at 12.53.49 PM.png (571×1 px, 59 KB)

Is there any chance it might be caused by redis misbehaviour? I vaguely recall sessions being stored in object cache, in which case this might be another symptom of the problem in T173646 ?

Is there any chance it might be caused by redis misbehaviour? I vaguely recall sessions being stored in object cache, in which case this might be another symptom of the problem in T173646 ?

Good call.

I don't know if that is the cause, but they are indeed stored in Redis at least on Beta (and as I recall production too): P5918

Is there any chance it might be caused by redis misbehaviour? I vaguely recall sessions being stored in object cache, in which case this might be another symptom of the problem in T173646 ?

If redis is misbehaving, then that is very likely the cause.

The attached har is for the case when an edit (in wikitext mode) cannot be saved due to the loss of session data:

I see in that paste one attempt to save a page. The attempt does not contain any login or session cookies. It then does the CentralAuth auto-login thing and successfully logs the user back in.

That is consistent with some earlier request running into a redis error and serving the user a response that clears the auth cookies since the session it claimed couldn't be authenticated.

Is there any chance it might be caused by redis misbehaviour? I vaguely recall sessions being stored in object cache, in which case this might be another symptom of the problem in T173646 ?

Good call.

I don't know if that is the cause, but they are indeed stored in Redis at least on Beta (and as I recall production too): P5918

Is there any chance it might be caused by redis misbehaviour? I vaguely recall sessions being stored in object cache, in which case this might be another symptom of the problem in T173646 ?

If redis is misbehaving, then that is very likely the cause.

So I guess the question is whether redis (or nutcracker I guess) is indeed misbehaving. I already did a bit of digging in that ticket so if someone else wants to take a look and confirm that it shouldn't be doing that, that'd be great.

Can someone please address this ticket? I am completely unable to use the beta site. Like others, I keep getting logged out whenever I try to do anything like save a change or add something to Watchlist. E.g.,

  1. In google chrome on a Mac, log in to the beta site.
  2. Go to a random article.
  3. Click to follow that article.
  4. Click Random Article again.

Expected result: you'll stay logged in and go to the new article.
Actual result: I go to the new article but am logged out, with the user message below. It doesn't happen every time, but often enough to make beta pretty frustrating. Also, it's pretty confusing the message in the screenshot informs me that i am "centrally logged in" when I'm visibly logged out? What does that mean?

I just had a thought and disabled Privacy Badger for beta. Could that be an issue?

Screen Shot 2017-08-28 at 12.13.15 PM.png (558×1 px, 181 KB)

Also, it's pretty confusing the message in the screenshot informs me that i am "centrally logged in" when I'm visibly logged out? What does that mean?

It means what it says: it detected that you're centrally logged in already, updated your cookies (or tried to, anyway) to make you be locally logged in, and wants you to reload the page so it can use the correct skin and show you as logged in.

I just had a thought and disabled Privacy Badger for beta. Could that be an issue?

If it's blocking the cookies, that could certainly be an issue. Although I'd expect it to be happening every time, unless Privacy Badger is flapping between "block cookies" and "don't block cookies" states for some reason.

This seems to be happening too consistently to some people for it to be a redis issue. (Nutcracker is using consistent hashing based on key hashes, and session IDs are completely random, so everyone should end up on all Redis boxes with equal probability.) Are we using IP-based balancing on beta? If so, maybe it's a nutcracker problem and some people consistently end up on the appserver with the misbehaving nutcracker instance while others consistently don't.

I just had a thought and disabled Privacy Badger for beta. Could that be an issue?

Well, did it fix the problem? Privacy Badger does break CentralAuth, yes (T108505). The production sites are on a whitelist now but Beta probably isn't.
Other potential client-side issues with CentralAuth: Firefox limiting the number of cookies (T151770 - I doubt that people would have that many cookies on wmflabs.org, though), Webkit's new Intelligent Tracking Prevention feature (no one here seems to be using Webkit though).

The thing about Privacy Badger is, I've always had it. And I've always used the beta site. So it doesn't seem right that Privacy Badger would be the culprit, because what would have changed?

I am also seeing this issue consistently every day. After I cleared out all my cookies and re-logged in, I can view different pages without any issue and stay logged in, but whenever I click Edit on any page, I immediately get logged out.
If I then go to another page I remain logged out, but it seems from the change in the personal toolbar that the central login system logs me back in, but if I click Edit again, I will get the same message about not being logged in and basically I cannot make an edit without exposing my IP.

From logstash-beta (for @Ryasmeen, with her consent). Might help with debugging.

=> Permalink: Logstash query for this error message

level=WARNING channel=session http_method=GET url=/w/api.php?action=visualeditor& ... ...
message =
Session "... CentralAuthSessionProvider ... RYasmeen (WMF) ...": Metadata merge failed: exception 'MediaWiki\Session\MetadataMergeException' with message 'Key "CentralAuthSource" changed' in /srv/mediawiki/php-master/includes/session/SessionProvider.php:205
Stack trace:
#0 /srv/mediawiki/php-master/includes/session/SessionManager.php(637): MediaWiki\Session\SessionProvider->mergeMetadata()
#1 /srv/mediawiki/php-master/includes/session/SessionManager.php(506): MediaWiki\Session\SessionManager->loadSessionInfoFromStore()
#2 /srv/mediawiki/php-master/includes/session/SessionManager.php(190): MediaWiki\Session\SessionManager->getSessionInfoForRequest()
#3 /srv/mediawiki/php-master/includes/WebRequest.php(735): MediaWiki\Session\SessionManager->getSessionForRequest()
#4 /srv/mediawiki/php-master/includes/session/SessionManager.php(129): WebRequest->getSession()
#5 /srv/mediawiki/php-master/includes/Setup.php(762): MediaWiki\Session\SessionManager::getGlobalSession()
#6 /srv/mediawiki/php-master/includes/WebStart.php(114): include()
#7 /srv/mediawiki/php-master/api.php(38): include()
#8 /srv/mediawiki/w/api.php(3): include()
#9 {main}

normalized_message = Session "{session}": Metadata merge failed: {exception}
old_value = CentralAuth
new_value = Local
level=WARNING channel=session http_method=POST url=/w/api.php
message =
Session "... CentralAuthSessionProvider ... RYasmeen (WMF) ..": Metadata merge failed: exception 'MediaWiki\Session\MetadataMergeException' with message 'Key "CentralAuthSource" changed' in /srv/mediawiki/php-master/includes/session/SessionProvider.php:205
Stack trace:
#0 /srv/mediawiki/php-master/includes/session/SessionManager.php(637): MediaWiki\Session\SessionProvider->mergeMetadata()
#1 /srv/mediawiki/php-master/includes/session/SessionManager.php(506): MediaWiki\Session\SessionManager->loadSessionInfoFromStore()
#2 /srv/mediawiki/php-master/includes/session/SessionManager.php(190): MediaWiki\Session\SessionManager->getSessionInfoForRequest()
#3 /srv/mediawiki/php-master/includes/WebRequest.php(735): MediaWiki\Session\SessionManager->getSessionForRequest()
#4 /srv/mediawiki/php-master/includes/session/SessionManager.php(129): WebRequest->getSession()
#5 /srv/mediawiki/php-master/includes/Setup.php(762): MediaWiki\Session\SessionManager::getGlobalSession()
#6 /srv/mediawiki/php-master/includes/WebStart.php(114): include()
#7 /srv/mediawiki/php-master/api.php(38): include()
#8 /srv/mediawiki/w/api.php(3): include()
#9 {main}

normalized_message = Session "{session}": Metadata merge failed: {exception}
old_value = CentralAuth
new_value = Local

Note the CentralAuthSource merge failure doesn't mean much. It just means that the session ID was valid but CentralAuth didn't validate it as a valid session for the central user. Which could be due to a redis issue as mentioned earlier.

@Ryasmeen, you describe perfectly what I was seeing. Do you have Privacy Badger? I disabled it for beta, and I think it may have fixed the problem, though it's too soon to tell for sure. But I have been able to save changes the last two times I went to beta since. Worth a try...

The issue is intermittent and gets triggered not only by attempts to save edits but every time when a user credentials should get verified - clicking on Notifications badges, going to RC page, user preferences etc.

Today I had to reload like 20 times in order to get a single page saved. Any idea what might be the cause of this?

It just happened to me again, but with a different error message. I went to an article and attempted to put it on my watchlist. I got the following error.

Screen Shot 2017-09-19 at 8.17.45 AM.png (628×1 px, 145 KB)

I tried Watchlisting subsequent articles, and got the same results.

It just happened to me again, but with a different error message. I went to an article and attempted to put it on my watchlist. I got the following error.

You were probably considered logged out for the attempt to put it on your watchlist too. The error that occurred was probably "Please log in to view or edit items on your watchlist."

@Anomie In my case, there was first a "lost of session data", then I was logged out, then logged-in again; and cycle restarts (loss data, logged out, etc.).

This issue is really inconvenient for me to do any testing on Beta cluster since it happens every time I try to save something from VE. I do not have any extension installed on my browser, And I can consistently reproduce this for both Chrome and Firefox, but not on Safari.

@Anomie You weren't able to reproduce this? I get it sporadically with edit saves or looking at notifications.

A simple example I just did was:

  • started out not logged in
  • logged in
  • random article
  • edit
  • save -> warned me session data was lost and my IP will be recorded.
  • am now logged out
  • click login and it just logs me in (no password prompt)

I can't get it to happen every time/edit, though.

Same as @greg above, but sadly more frequently.

I get the issue far more often in Chrome than in Firefox-esr, for the record.

Yep, me too. I'm using Chrome (gave up with Firefox some time ago).

If you can reproduce it, please capture the relevant HTTP requests as mentioned at https://www.mediawiki.org/wiki/Manual:How_to_debug/Login_problems.

If you can reproduce it, please capture the relevant HTTP requests as mentioned at https://www.mediawiki.org/wiki/Manual:How_to_debug/Login_problems.

P6346 (private to only Brad and I).

Steps taken (in Chrome):

  • login
  • random article
  • edit (with VE)
  • save -> got the "lost session data" warning
  • (edit saved)
  • random article (UI shows logged out, got notification that I was centrally logged in)
  • clicked login, auto logged in (per above)

All I'm seeing in P6346 is the "clicked login, auto logged in" step. But, in the middle of that step, it looks like you might have hit the bug again. I don't see anything wrong with the cookies your browser is sending there. Unfortunately there's not too much logging (e.g. the session channel seems to be completely missing), but what I do see is consistent with redis being screwy as shown in T173646.

Unfortunately there's not too much logging (e.g. the session channel seems to be completely missing), but what I do see is consistent with redis being screwy as shown in T173646.

How can we move this forward?

One thing to try would be to restart redis and nutcracker on the relevant Beta Cluster hosts and see if the problem goes away. I don't know if there's a less manual way to do it than sshing everywhere.

The test cases in T173646#3546939 would be a good quick test to see if that may have actually worked.

It was happening in the morning (PDT) on Aug 21 again - but got self-fixed in two hours.