
Cannot log in or perform any actions on Beta Cluster wikis
Closed, Resolved · Public · Bug Report

Description

Steps to reproduce:

  1. Go to https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:UserLogin and log in.
  2. The page shows "There seems to be a problem with your login session; this action has been canceled as a precaution against session hijacking. Please resubmit the form."

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Production Error". Sun, Jan 12, 6:17 AM
Restricted Application added a subscriber: Aklapper.
Peachey88 renamed this task from "Cannot log in or perform any actions on Beta Cluster" to "Cannot log in or perform any actions on Beta Cluster wikis". Sun, Jan 12, 7:39 AM

I tried editing anonymously (i.e. with a temporary account). It worked, although I saw this error: "No active login attempt is in progress for your session."

Bugreporter triaged this task as Unbreak Now! priority. Sun, Jan 12, 8:41 AM
taavi changed the subtype of this task from "Production Error" to "Bug Report". Sun, Jan 12, 9:07 AM
taavi removed a project: Wikimedia-production-error.

User creation is broken too.
Last success: https://login.wikimedia.beta.wmflabs.org/w/index.php?title=Special:Log&logid=221335

Something must have been deployed between that time and the creation of this task.

Please look at sessionstorage.

Beta cluster Logstash data says that object cache has been unhappy since 11 January.

https://beta-logs.wmcloud.org/goto/66424e93da0bb0ff7e8b3b3aebca3441


The log is full of errors like: Failed to store enwiki:MWSession:<hash> : (500)

Log for an example request: https://beta-logs.wmcloud.org/goto/a404432dceca139889a469eb0434efc5

They also include: Error reading from storage (gocql: no hosts available in the pool)

Specifically, the object cache errors in Logstash started at 9:05 AM on January 11. Nothing merged around that time seems relevant, and there is nothing in SAL either.

systemctl says

Jan 11 08:09:58 deployment-sessionstore06 systemd[1]: cassandra.service: Main process exited, code=killed, status=9/KILL
Jan 11 08:10:07 deployment-sessionstore06 nodetool[922509]: nodetool: Found unexpected parameters: [disablethrift]
Jan 11 08:10:07 deployment-sessionstore06 nodetool[922509]: See 'nodetool help' or 'nodetool help <command>'.
Jan 11 08:10:09 deployment-sessionstore06 nodetool[923332]: nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused (Connection refused)'.
Jan 11 08:10:10 deployment-sessionstore06 nodetool[923396]: nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused (Connection refused)'.
Jan 11 08:10:12 deployment-sessionstore06 nodetool[923458]: nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused (Connection refused)'.
Jan 11 08:10:13 deployment-sessionstore06 nodetool[923520]: nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused (Connection refused)'.
Jan 11 08:10:13 deployment-sessionstore06 systemd[1]: cassandra.service: Control process exited, code=exited, status=1/FAILURE
Jan 11 08:10:13 deployment-sessionstore06 systemd[1]: cassandra.service: Failed with result 'oom-kill'.
Jan 11 08:10:13 deployment-sessionstore06 systemd[1]: cassandra.service: Consumed 3min 8.148s CPU time.
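
The excerpt above can also be scanned mechanically for the unit's final result. This is a minimal sketch, assuming the journal lines were saved as plain text (for example from journalctl -u cassandra.service). The function name find_failed_results is hypothetical, and the regex only covers the "Failed with result" wording seen in this excerpt:

```python
import re

# Matches systemd's "<unit>: Failed with result '<result>'." line,
# in the short journal format shown in the excerpt above.
RESULT_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ [\d:]{8}) (?P<host>\S+) systemd\[\d+\]: "
    r"(?P<unit>\S+): Failed with result '(?P<result>[^']+)'"
)

def find_failed_results(lines):
    """Return (timestamp, unit, result) for every 'Failed with result' line."""
    hits = []
    for line in lines:
        m = RESULT_RE.match(line.strip())
        if m:
            hits.append((m.group("ts"), m.group("unit"), m.group("result")))
    return hits

# Two lines copied from the excerpt; only the second one should match.
sample = [
    "Jan 11 08:10:13 deployment-sessionstore06 systemd[1]: cassandra.service: Control process exited, code=exited, status=1/FAILURE",
    "Jan 11 08:10:13 deployment-sessionstore06 systemd[1]: cassandra.service: Failed with result 'oom-kill'.",
]
print(find_failed_results(sample))
```

On these two lines the only hit is the oom-kill result for cassandra.service, which is consistent with the status=9/KILL at 08:09:58.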

free reports almost 1.5G available, which seems decent. A restart seems to work, with some warnings about free space (which seem to be about disk rather than memory):

Jan 13 19:57:36 deployment-sessionstore06 cassandra[1055000]: WARN  [main] 2025-01-13 19:57:36,983 DatabaseDescriptor.java:1034 - Small commitlog volume detected at '/var/lib/cassandra/commitlog'; setting commitlog_total_space to 4997.  You can override this in cassandra.yaml
Jan 13 19:57:36 deployment-sessionstore06 cassandra[1055000]: WARN  [main] 2025-01-13 19:57:36,987 DatabaseDescriptor.java:650 - Only 13.541GiB free across all data volumes. Consider adding more capacity to your cluster or removing obsolete snapshots
Jan 13 19:57:39 deployment-sessionstore06 cassandra[1055000]: WARN  [main] 2025-01-13 19:57:39,176 StartupChecks.java:257 - JMX is not enabled to receive remote connections. Please see cassandra-env.sh for more info.
Jan 13 19:57:39 deployment-sessionstore06 cassandra[1055000]: WARN  [main] 2025-01-13 19:57:39,211 SigarLibrary.java:172 - Cassandra server running in degraded mode. Is swap disabled? : true,  Address space adequate? : true,  nofile limit adequate? : true, nproc limit adequate? : false

No idea if that's bad.
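
To see at a glance which startup checks are flagged in the "degraded mode" warning, the line can be split into its question/answer pairs. This is a rough sketch, assuming that true means the check passed, as the wording suggests. The helper name failed_checks is hypothetical, and the parsing is based only on the single line quoted above:

```python
import re

# Each check in the warning looks like "<question>? : true|false".
CHECK_RE = re.compile(r"([A-Za-z ]+?)\? : (true|false)")

def failed_checks(line):
    """Return the names of the checks reported as false in a 'degraded mode' line."""
    return [name.strip() for name, value in CHECK_RE.findall(line) if value == "false"]

warning = (
    "WARN  [main] 2025-01-13 19:57:39,211 SigarLibrary.java:172 - "
    "Cassandra server running in degraded mode. Is swap disabled? : true,  "
    "Address space adequate? : true,  nofile limit adequate? : true, "
    "nproc limit adequate? : false"
)
print(failed_checks(warning))
```

Under that assumption, the only check reported as failing here is the nproc limit. The swap, address space and nofile checks all report true.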

Tgr claimed this task.

Optimistically closing, maybe Cassandra just needs a reboot every couple months or something. We'll see whether it repeats.