Page MenuHomePhabricator

Cannot login to beta cluster: "There seems to be a problem with your login session..."
Closed, ResolvedPublicBUG REPORT

Description

What is the problem?

If I try to login to beta cluster, I see the error message:

There seems to be a problem with your login session; this action has been canceled as a precaution against session hijacking. Please resubmit the form.

I have tried clearing my cookies, using different browsers, even using Browserstack emulator.

This has occurred so far on https://en.wikipedia.beta.wmflabs.org, https://en.wiktionary.beta.wmflabs.org, https://es.wikibooks.beta.wmflabs.org.

I cannot reproduce this on my local environment, which is the same MediaWiki version as beta.

Steps to reproduce problem
  1. Go to https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:UserLogin
  2. Attempt to login
Environment

Browser: Firefox, Chromium, Safari
Wiki(s): MediaWiki 1.36.0-alpha (a01f759) 06:56, 23 September 2020

Event Timeline

Logging in fails even via the API. From selenium-daily-beta-MediaWiki/782/console:

[0-1] [07:40:51] [E] [MWBOT] Login failed: Selenium user@https://en.wikipedia.beta.wmflabs.org/w

mwbot is an npm package we use to automate the API.

I just got the same problem, and noticed it because the AC/DC browser tests failed (Travis build). But the tests run each day and only started failing today, so it’s a bit confusing to me that this problem was already reported back in September…

Screenshot_2020-10-18 Log in - Wikipedia, the free encyclopedia.png (390×296 px, 18 KB)

Two additional problems which I assume have the same underlying problem as this one:

  • creating an account on CS beta wikipedia fails with "There seems to be a problem with your login session; this action has been canceled as a precaution against session hijacking. Please resubmit the form."
  • with an account logged-in to English beta wikipedia, calling new mw.Api().saveOptions( {"some-pref": "some-val" } ) fails with "Invalid CSRF token."

Tried to restart cassandra as noted in T243123#6431183, but it still doesn't work.

I can log in in the browser now; the AC/DC browser tests fail (Travis build) with a different message:

Cannot log in when using MediaWiki\Session\BotPasswordSessionProvider sessions.

I can log in in the browser now; the AC/DC browser tests fail (Travis build) with a different message:

Cannot log in when using MediaWiki\Session\BotPasswordSessionProvider sessions.

That seems unrelated. You can't log in with a bot password via the normal interface; it is for API only. If, in the same browser, you are logged in with a bot password, and trying to use the login page, you get this error, since you have an active session that cannot interact with the login page.

Well, I’m getting that error from the API (action=login). It seems to happen when the session was already logged in (i. e., the first browser test that tries to login succeeds, the second one fails). Is this a new restriction? These browser tests run each day and I haven’t touched the login part in months.

I don't think the login code has changed since 2015. Maybe it's due to the browser failing to log out of the bot session? Logout problems were mentioned above. (In which case, either the Selenium code does not check for logout success status, or the API gives it incorrectly; either way it's a bug, although one only triggered in unusual circumstances.)

There are some big spikes of sessionstore-caused errors like Failed to delete enwikisource:MWSession:abcdef1234567890abcdef1234567890 : (500):

beta sessionstore errors.png (281×837 px, 38 KB)

No appname:sessionstore errors though. So there seems to be a whole string of problems here:

  • beta sessionstore seems unstable
  • beta sessionstore does not seem to report to logstash-beta
  • the template for that error message is Failed to delete <key> : ({code}) {error}; sessionstore is apparently returning empty 500 error pages with no message, or MediaWiki fails to parse the error message.
  • <key> in the above message should be a parameter so error counting works properly.

Also interesting that there are tens of thousands of errors with the same session ID, suggesting that some bots went crazy. That could be a consequence of the sessionstore outage, or maybe even the cause.

I don't think the login code has changed since 2015. Maybe it's due to the browser failing to log out of the bot session? Logout problems were mentioned above. (In which case, either the Selenium code does not check for logout success status, or the API gives it incorrectly; either way it's a bug, although one only triggered in unusual circumstances.)

But my tests never even try to log out… maybe something also changed in the test setups that causes the session cookie to survive when it used to be thrown away, or something. Anyways, fixed by checking if we’re already logged in.

  • beta sessionstore does not seem to report to logstash-beta

Possibly T233134? (Though you seem to have some data in that screenshot.)

MediaWiki seems to log normally, it's the beta sessionstore service that has zero logs, despite apparently having had an outage.

But my tests never even try to log out… maybe something also changed in the test setups that causes the session cookie to survive when it used to be thrown away, or something. Anyways, fixed by checking if we’re already logged in.

I'm a bit confused about how those tests are supposed to work. Bot passwords only work for API requests, and presumably the browser test involves loading web pages? Are there two different kinds of logins? For the API, you can just use assert=user and catch assertion errors if you want to ensure you are logged in.

In any case, the error message you mention comes from action=clientlogin, not action=login (which is where bot passwords can be used), I think. So it seems like the browser is trying to do a normal login without terminating the bot login. I have no idea what has changed to trigger that.

This started happening again intermittently on Friday. I have just tried now and it is still happening.

I think it also caused the latest Selenium failures (e.g. https://integration.wikimedia.org/ci/view/Selenium/job/selenium-daily-beta-MediaWiki/979/)

Mentioned in SAL (#wikimedia-releng) [2021-03-29T15:16:26Z] <Majavah> manually run puppet on deployment-sessionstore03, starting Cassandra (which was stopped) T263617

deployment-sessionstore03 is running sessionstore in Docker, the containers logs are full of something like that:

{"msg":"Error reading from storage (gocql: no hosts available in the pool)","appname":"sessions","time":"2021-03-29T15:08:48Z","level":"ERROR","request_id":"00000000-0000-0000-0000-000000000000"}

Cassandra is not running:

taavi@deployment-sessionstore03:~$ systemctl status cassandra.service
● cassandra.service - distributed storage system for structured data
   Loaded: loaded (/lib/systemd/system/cassandra.service; static; vendor preset: enabled)
   Active: failed (Result: signal) since Mon 2021-03-29 14:41:45 UTC; 27min ago
  Process: 10702 ExecStart=/usr/sbin/cassandra -p /var/run/cassandra/cassandra.pid (code=killed, signal=KILL)
 Main PID: 10702 (code=killed, signal=KILL)
      CPU: 4.957s

I manually ran puppet which started Cassandra. The instance stopped responding shortly afterwards.

Mentioned in SAL (#wikimedia-releng) [2021-03-29T15:37:01Z] <Majavah> hard reboot deployment-sessionstore03 T263617

Session storage itself appears to be working. However now the VM does not allow me to use sudo at all :/ works now, just logged out and back in

taavi claimed this task.
Ryasmeen added a subscriber: Ryasmeen.

Re-opening this task since I am having this issue again on Beta cluster. Seeing the same error as @dom_walden mentioned in the description.

taavi@deployment-sessionstore03:~$ systemctl status cassandra.service
● cassandra.service - distributed storage system for structured data
   Loaded: loaded (/lib/systemd/system/cassandra.service; static; vendor preset: enabled)
   Active: failed (Result: signal) since Tue 2021-04-13 17:42:19 UTC; 25min ago
  Process: 1029 ExecStart=/usr/sbin/cassandra -p /var/run/cassandra/cassandra.pid (code=killed, signal=KILL)
 Main PID: 1029 (code=killed, signal=KILL)
      CPU: 17min 43.228s

Cassandra had crashed yet again, I restarted it again since I don't have time to properly investigate at this exact moment.

In T263617#6995736, @Majavah wrote:
taavi@deployment-sessionstore03:~$ systemctl status cassandra.service
● cassandra.service - distributed storage system for structured data
   Loaded: loaded (/lib/systemd/system/cassandra.service; static; vendor preset: enabled)
   Active: failed (Result: signal) since Tue 2021-04-13 17:42:19 UTC; 25min ago
  Process: 1029 ExecStart=/usr/sbin/cassandra -p /var/run/cassandra/cassandra.pid (code=killed, signal=KILL)
 Main PID: 1029 (code=killed, signal=KILL)
      CPU: 17min 43.228s

Cassandra had crashed yet again, I restarted it again since I don't have time to properly investigate at this exact moment.

Thanks! I think that worked :)

So this is happening again, and looks like Cassandra is running out of memory:

root@deployment-sessionstore03:/var/lib/cassandra# ls -la hs_err_pid*
-rw-r--r-- 1 cassandra cassandra 21252 Apr 27 05:17 hs_err_pid11768.log
-rw-r--r-- 1 cassandra cassandra 21248 Apr 27 05:28 hs_err_pid12013.log
-rw-r--r-- 1 cassandra cassandra 21248 Apr 27 05:29 hs_err_pid12093.log

root@deployment-sessionstore03:/var/lib/cassandra# cat hs_err_pid12093.log
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (mmap) failed to map 1048576000 bytes for committing reserved memory.
Possible reasons:
  The system is out of physical RAM or swap space
[...]

/etc/cassandra/jvm.options doesn't override the allocated memory from default, which is:

# Heap size is automatically calculated by cassandra-env based on this
# formula: max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB))
# That is:
# - calculate 1/2 ram and cap to 1024MB
# - calculate 1/4 ram and cap to 8192MB
# - pick the max

The box has 2G ram total:

root@deployment-sessionstore03:/etc/cassandra# free -h
              total        used        free      shared  buff/cache   available
Mem:           1.9G        893M        496M         22M        604M        930M
Swap:            0B          0B          0B

Given that's a Stretch server that will need replacing soon anyways, I'll just make a new instance with more resources allocated to it.

Mentioned in SAL (#wikimedia-releng) [2021-04-27T06:40:34Z] <Majavah> installing deployment-sessionstore04 T263617

Change 682888 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/mediawiki-config@master] beta: Switchover to deployment-sessionstore04

https://gerrit.wikimedia.org/r/682888

@Majavah This seems to be happening again. Looking on deployment-sessionstore04, however, Cassandra appears to be running:

dwalden@deployment-sessionstore04:~$ systemctl status cassandra.service
● cassandra.service - distributed storage system for structured data
   Loaded: loaded (/lib/systemd/system/cassandra.service; static; vendor preset: enabled)
   Active: active (running) since Tue 2021-04-27 06:42:42 UTC; 1h 48min ago
    Tasks: 61 (limit: 4718)
   Memory: 1.2G
   CGroup: /system.slice/cassandra.service
...

@Majavah This seems to be happening again. Looking on deployment-sessionstore04, however, Cassandra appears to be running:

dwalden@deployment-sessionstore04:~$ systemctl status cassandra.service
● cassandra.service - distributed storage system for structured data
   Loaded: loaded (/lib/systemd/system/cassandra.service; static; vendor preset: enabled)
   Active: active (running) since Tue 2021-04-27 06:42:42 UTC; 1h 48min ago
    Tasks: 61 (limit: 4718)
   Memory: 1.2G
   CGroup: /system.slice/cassandra.service
...

The mediawiki-config patch to switch over to sessionstore04 has not been merged or deployed yet, since I have not yet found anyone who can do that.

In T263617#7037139, @Majavah wrote:

The mediawiki-config patch to switch over to sessionstore04 has not been merged or deployed yet, since I have not yet found anyone who can do that.

Ah, of course, silly me. I will ask around to see if I can find someone to review it.

Change 682888 merged by jenkins-bot:

[operations/mediawiki-config@master] beta: Switchover to deployment-sessionstore04

https://gerrit.wikimedia.org/r/682888

Mentioned in SAL (#wikimedia-releng) [2021-04-27T17:52:03Z] <Majavah> delete deployment-sessionstore03 T263617 T278641

Appears to work again, thanks Majavah!

Hopefully fixed for good this time.

We are seeing this problem again.

cassandra does not appear to be running on deployment-sessionstore04

root@deployment-sessionstore04:/var/log/cassandra# systemctl status cassandra.service
● cassandra.service - distributed storage system for structured data
   Loaded: loaded (/lib/systemd/system/cassandra.service; static; vendor preset: enabled)
   Active: inactive (dead)

Jun 07 14:29:39 deployment-sessionstore04 systemd[1]: Condition check resulted in distributed storage system for structured data being skipped.
Jun 07 14:29:42 deployment-sessionstore04 systemd[1]: /lib/systemd/system/cassandra.service:7: PIDFile= references path below legacy directory /var/run/, updating /var/run/cassandra/cassandra.pid → /run/cassandr
Jun 07 14:30:08 deployment-sessionstore04 systemd[1]: /lib/systemd/system/cassandra.service:7: PIDFile= references path below legacy directory /var/run/, updating /var/run/cassandra/cassandra.pid → /run/cassandr
Jun 07 14:30:08 deployment-sessionstore04 systemd[1]: Condition check resulted in distributed storage system for structured data being skipped.
Jun 07 14:30:10 deployment-sessionstore04 systemd[1]: /lib/systemd/system/cassandra.service:7: PIDFile= references path below legacy directory /var/run/, updating /var/run/cassandra/cassandra.pid → /run/cassandr
Jun 07 14:32:51 deployment-sessionstore04 systemd[1]: /lib/systemd/system/cassandra.service:7: PIDFile= references path below legacy directory /var/run/, updating /var/run/cassandra/cassandra.pid → /run/cassandr
Jun 07 14:32:58 deployment-sessionstore04 systemd[1]: /lib/systemd/system/cassandra.service:7: PIDFile= references path below legacy directory /var/run/, updating /var/run/cassandra/cassandra.pid → /run/cassandr
Jun 07 14:32:58 deployment-sessionstore04 systemd[1]: Condition check resulted in distributed storage system for structured data being skipped.
Jun 07 14:32:59 deployment-sessionstore04 systemd[1]: /lib/systemd/system/cassandra.service:7: PIDFile= references path below legacy directory /var/run/, updating /var/run/cassandra/cassandra.pid → /run/cassandr
Jun 07 14:36:09 deployment-sessionstore04 systemd[1]: /lib/systemd/system/cassandra.service:7: PIDFile= references path below legacy directory /var/run/, updating /var/run/cassandra/cassandra.pid → /run/cassandr

I have tried starting/restarting it, but to no avail.

Mentioned in SAL (#wikimedia-releng) [2021-06-07T22:02:35Z] <urbanecm> urbanecm@deployment-sessionstore04:~$ sudo touch /etc/cassandra/service-enabled #T263617

Mentioned in SAL (#wikimedia-releng) [2021-06-07T22:02:53Z] <urbanecm> urbanecm@deployment-sessionstore04:~$ sudo service cassandra start # T263617

@dom_walden I temporarily "fixed" this by manually creating /etc/cassandra/service-enabled, listed as ConditionPathExists in systemd file for cassandra. Then, starting cassandra manually worked as intended. Not sure a) is this file supposed to exist at beta b) if it is, did it exist before c) if it did, why it disappeared.

@majavah Maybe you can offer your opinion about the questions above?

@dom_walden I temporarily "fixed" this by manually creating /etc/cassandra/service-enabled, listed as ConditionPathExists in systemd file for cassandra. Then, starting cassandra manually worked as intended. Not sure a) is this file supposed to exist at beta b) if it is, did it exist before c) if it did, why it disappeared.

Thank you. That is greatly appreciated.