
Cassandra killed by oom-killer and prometheus scrapes failing intermittently on deployment-sessionstore06
Closed, Resolved · Public

Description

Common information

  • summary: Project deployment-prep instance deployment-sessionstore06 is down
  • alertname: InstanceDown
  • instance: deployment-sessionstore06
  • job: node
  • project: deployment-prep
  • severity: warning

Firing alerts


  • summary: Project deployment-prep instance deployment-sessionstore06 is down
  • alertname: InstanceDown
  • instance: deployment-sessionstore06
  • job: node
  • project: deployment-prep
  • severity: warning
  • Source

Event Timeline

bd808 triaged this task as Medium priority.Jan 20 2026, 7:17 PM
bd808 moved this task from To Triage to Backlog on the Beta-Cluster-Infrastructure board.
bd808 added subscribers: Eevans, bd808.

Looks to be a duplicate of the behavior from T412774: Project deployment-prep instance deployment-sessionstore06 is down. The instance is up, but something spiked its load to the point where Prometheus scrapes failed, causing a downtime alert.

https://grafana.wmcloud.org/goto/K3zanJIvR?orgId=1

Screenshot 2026-01-20 at 12.15.49.png (1×2 px, 204 KB)

@Eevans Any new ideas what to look for as a possible cause for the load spike?

sudo journalctl --since "2026-01-20 10:05:00" --until "2026-01-20 10:15:00" turned up the kernel oom-killer going after Cassandra in the time range where we had the data collection gap.

Jan 20 10:14:17 deployment-sessionstore06 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=init.scope,mems_allowed=0,global_oom,task_memcg=/system.slice/cassandra.service,task=java,pid=4036288,uid=115
Jan 20 10:14:17 deployment-sessionstore06 kernel: Out of memory: Killed process 4036288 (java) total-vm:3150900kB, anon-rss:1353600kB, file-rss:30016kB, shmem-rss:0kB, UID:115 pgtables:3384kB oom_score_adj:0
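
For the record, a quick way to audit every oom-killer event in the journal, assuming a journalctl new enough to support --grep, is something like:

sudo journalctl -k --grep 'Out of memory' --no-pager
# or, on systems without --grep support:
sudo journalctl -k --no-pager | grep -i 'out of memory'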

This feels like some sort of misconfiguration, but I'm not sure if the problem is a failure to shield Cassandra from the oom-killer, the instance being undersized, or Cassandra being over-provisioned somehow. Maybe @Eevans will have some ideas when he is back from the SRE Summit in a week or so.
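
For what it's worth, if it turns out to be the first of those, a minimal sketch of shielding Cassandra from the oom-killer would be a systemd drop-in setting OOMScoreAdjust. The -500 value here is illustrative, not a tested recommendation, and this only redirects memory pressure at other processes rather than fixing it:

sudo systemctl edit cassandra.service
# add in the editor:
#   [Service]
#   OOMScoreAdjust=-500
sudo systemctl restart cassandra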

bd808 renamed this task from Project deployment-prep instance deployment-sessionstore06 is down to Caassandra killed by oom-killer and prometheus scrapes failing intermittently on deployment-sessionstore06.Jan 26 2026, 11:35 PM

Mentioned in SAL (#wikimedia-releng) [2026-03-12T21:23:38Z] <bd808> Hard reboot deployment-sessionstore06 (T415021)

One strange-looking thing in the sudo journalctl --since "2026-03-12 15:35:00" --until "2026-03-12 21:23:00" output is sssd_nss flapping a lot.

bd808@deployment-sessionstore06:~$ sudo journalctl --since "2026-03-12 15:35:00" --until "2026-03-12 21:23:00" | grep sssd | grep "Starting up" | wc -l
36
bd808@deployment-sessionstore06:~$ sudo journalctl --since "2026-03-12 15:35:00" --until "2026-03-12 21:23:00" | grep sssd | grep "Starting up"
Mar 12 15:35:01 deployment-sessionstore06 sssd_nss[2827723]: Starting up
Mar 12 15:44:09 deployment-sessionstore06 sssd_be[2827963]: Starting up
Mar 12 15:44:09 deployment-sessionstore06 sssd_nss[2827968]: Starting up
Mar 12 15:44:15 deployment-sessionstore06 sssd_nss[2828060]: Starting up
Mar 12 15:55:01 deployment-sessionstore06 sssd_nss[2828563]: Starting up
Mar 12 16:04:53 deployment-sessionstore06 sssd_nss[2829173]: Starting up
Mar 12 16:15:01 deployment-sessionstore06 sssd_nss[2829453]: Starting up
Mar 12 16:25:02 deployment-sessionstore06 sssd_nss[2829512]: Starting up
Mar 12 16:35:01 deployment-sessionstore06 sssd_nss[2830046]: Starting up
Mar 12 16:55:01 deployment-sessionstore06 sssd_nss[2830469]: Starting up
Mar 12 17:04:45 deployment-sessionstore06 sssd_nss[2831081]: Starting up
Mar 12 17:13:58 deployment-sessionstore06 sssd_nss[2831344]: Starting up
Mar 12 17:25:01 deployment-sessionstore06 sssd_nss[2831414]: Starting up
Mar 12 17:34:29 deployment-sessionstore06 sssd_nss[2832024]: Starting up
Mar 12 17:45:02 deployment-sessionstore06 sssd_nss[2832302]: Starting up
Mar 12 17:55:01 deployment-sessionstore06 sssd_nss[2832367]: Starting up
Mar 12 18:04:31 deployment-sessionstore06 sssd_nss[2832993]: Starting up
Mar 12 18:15:01 deployment-sessionstore06 sssd_nss[2833271]: Starting up
Mar 12 18:25:01 deployment-sessionstore06 sssd_nss[2833336]: Starting up
Mar 12 18:35:01 deployment-sessionstore06 sssd_nss[2833853]: Starting up
Mar 12 18:44:38 deployment-sessionstore06 sssd_nss[2834206]: Starting up
Mar 12 18:55:01 deployment-sessionstore06 sssd_nss[2834320]: Starting up
Mar 12 19:05:01 deployment-sessionstore06 sssd_nss[2834388]: Starting up
Mar 12 19:14:28 deployment-sessionstore06 sssd_nss[2835194]: Starting up
Mar 12 19:25:01 deployment-sessionstore06 sssd_nss[2835266]: Starting up
Mar 12 19:45:02 deployment-sessionstore06 sssd_nss[2836140]: Starting up
Mar 12 19:55:01 deployment-sessionstore06 sssd_nss[2836199]: Starting up
Mar 12 20:04:49 deployment-sessionstore06 sssd_nss[2836809]: Starting up
Mar 12 20:15:01 deployment-sessionstore06 sssd_nss[2837066]: Starting up
Mar 12 20:25:01 deployment-sessionstore06 sssd_nss[2837137]: Starting up
Mar 12 20:34:30 deployment-sessionstore06 sssd_nss[2837741]: Starting up
Mar 12 20:43:59 deployment-sessionstore06 sssd_nss[2838006]: Starting up
Mar 12 20:55:01 deployment-sessionstore06 sssd_nss[2838086]: Starting up
Mar 12 21:04:46 deployment-sessionstore06 sssd_nss[2838147]: Starting up
Mar 12 21:04:48 deployment-sessionstore06 sssd_pam[2838151]: Starting up
Mar 12 21:13:57 deployment-sessionstore06 sssd_nss[2842329]: Starting up
bd808@deployment-sessionstore06:~$ w
 20:18:51 up 22:55,  2 users,  load average: 32.13, 31.59, 19.11
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
root     ttyS0    -                Thu21   22:55m  0.02s  0.00s -bash
bd808    pts/0    172.16.17.143    20:18    1.00s  0.02s  0.00s w
root@deployment-sessionstore06:~# [71610.740180] Out of memory: Killed process 777565 (java) total-vm:3136180kB, anon-rss:1364528kB, file-rss:177032kB, shmem-rss:0kB, UID:115 pgtables:3264kB oom_score_adj:0
[73201.664631] Out of memory: Killed process 782365 (java) total-vm:3119024kB, anon-rss:1333632kB, file-rss:177512kB, shmem-rss:0kB, UID:115 pgtables:3208kB oom_score_adj:0
[74774.116212] Out of memory: Killed process 784511 (java) total-vm:3122044kB, anon-rss:1343836kB, file-rss:177284kB, shmem-rss:0kB, UID:115 pgtables:3188kB oom_score_adj:0
[76560.202796] Out of memory: Killed process 786259 (java) total-vm:3147136kB, anon-rss:1334368kB, file-rss:177592kB, shmem-rss:0kB, UID:115 pgtables:3200kB oom_score_adj:0
[78132.891912] Out of memory: Killed process 788192 (java) total-vm:3156520kB, anon-rss:1342624kB, file-rss:178096kB, shmem-rss:0kB, UID:115 pgtables:3200kB oom_score_adj:0
[80016.356936] Out of memory: Killed process 789717 (java) total-vm:3156900kB, anon-rss:1336640kB, file-rss:177236kB, shmem-rss:0kB, UID:115 pgtables:3236kB oom_score_adj:0

https://grafana.wmcloud.org/goto/dffxbhb1qcav4b?orgId=1

Screenshot 2026-03-13 at 14.17.52.png (1×2 px, 292 KB)

Aklapper renamed this task from Caassandra killed by oom-killer and prometheus scrapes failing intermittently on deployment-sessionstore06 to Cassandra killed by oom-killer and prometheus scrapes failing intermittently on deployment-sessionstore06.Mar 14 2026, 10:27 AM

It lines up with the puppet agent running every 30 minutes on the host, possibly due to a fact being collected. If I run it with puppet agent -tv --debug, the last lines look like:

Debug: Loading external facts from /var/lib/puppet/facts.d
Info: Loading facts
...
Debug: Facter: Loading all internal facts
...
Debug: Facter: Loading external facts
Debug: Facter: Searching fact: networking in all custom facts
Debug: Facter: Loading custom facts
Debug: Facter: Executing command: /bin/uname -r
Debug: Facter: Executing command: /bin/uname -v
<hang until java gets oom-killed>

The custom facts are:

/var/lib/puppet/lib/facter
blockdevices.rb       lldp.rb		       puppet_settings.rb
cas_version.rb	      logical_volumes.rb       puppetdb.rb
ceph_disks.rb	      lshw.rb		       python.rb
cephadm.rb	      lvm_support.rb	       raid.rb
cpu_details.rb	      net_driver.rb	       replication.rb
disk_type.rb	      nfscommon_version.rb     root_home.rb
efi.rb		      numa.rb		       service_provider.rb
etcd_version.rb       openstack_project_id.rb  spark_version.rb
firmware.rb	      package_provider.rb      ssh_ca_host_certificate.rb
ganeti.rb	      pe_version.rb	       swift_disks.rb
interface_primary.rb  physical_volumes.rb      uniqueid.rb
ipmi.rb		      physicalcorecount.rb     util
java_version.rb       puppet_config.rb	       volume_groups.rb
kernel_details.rb     puppet_config_dir.rb     wmflib.rb

The issue can then be reproduced with:

$ facter -d -l trace --custom-dir /var/lib/puppet/lib/facter
<hang>

(there are no debug/trace logs showing up, bah)
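
If it happens again, a rough way to bisect which custom fact stalls might be to resolve them one at a time under a timeout. This sketch assumes the fact file names above mostly match the fact names they define, which is not guaranteed:

for f in /var/lib/puppet/lib/facter/*.rb; do
  name=$(basename "$f" .rb)
  echo "== ${name}"
  timeout 30 facter --custom-dir /var/lib/puppet/lib/facter "${name}"
done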

I am going to leave T420227: Project deployment-prep instance deployment-sessionstore06 is down open rather than merging it here, because apparently the alerting system will just keep making more and more of these tasks. I have 19 email threads, the most recent with 54 messages in it, about this instance going up/down/up/down as the monitoring flaps due to system overload.

$ facter -d -l trace --custom-dir /var/lib/puppet/lib/facter
<hang>

That is a red herring. I ran this again this morning and it works just fine, so I guess that last time I ran it the puppet agent happened to be running at the same time. If I run puppet agent -tv --noop, that brings the instance to a halt, presumably because it is enough to overload the instance's memory.
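
One way to test that theory without knocking the whole instance over might be to run the agent inside a memory-capped scope, so the agent rather than Cassandra gets killed if it blows past the cap (a sketch, assuming a systemd with the cgroup memory controller enabled; the 1G cap is arbitrary):

sudo systemd-run --scope -p MemoryMax=1G -p MemorySwapMax=0 puppet agent -tv --noop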

The instance only has 2G of RAM, so maybe that is simply not enough for Cassandra. deployment-echostore02.deployment-prep.eqiad1.wikimedia.cloud is a 2G instance as well; it also runs Cassandra, and it got OOM-killed a few months ago.
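
It may also be worth comparing the heap the JVM was actually started with against the instance RAM. The paths below assume the Debian packaging, where older Cassandra releases size the heap in cassandra-env.sh and newer ones in jvm.options:

grep -E 'MAX_HEAP_SIZE|HEAP_NEWSIZE' /etc/cassandra/cassandra-env.sh
grep -E '^-Xm[sx]' /etc/cassandra/jvm*.options 2>/dev/null
# resident and virtual memory of the running JVM:
ps -C java -o pid,rss,vsz,args --no-headers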

Mentioned in SAL (#wikimedia-releng) [2026-03-17T20:20:03Z] <bd808> Resize deployment-sessionstore06 from g4.cores1.ram2.disk20 to g4.cores2.ram4.disk20 (T415021)

Before:

bd808@deployment-sessionstore06:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           1.9Gi       1.6Gi        72Mi       0.0Ki       267Mi        46Mi
Swap:             0B          0B          0B

After:

bd808@deployment-sessionstore06:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           3.8Gi       271Mi       3.0Gi       0.0Ki       601Mi       3.4Gi
Swap:             0B          0B          0B

Let's see if throwing hardware at the problem provides any relief.

bd808 claimed this task.

So here is my sense-making: before the Cassandra and Java upgrades, things were just barely fitting into RAM most of the time; usually it was fine, but occasionally it died. After the upgrades, things didn't fit at all and died all of the time. Adding RAM fixed it.

Screenshot 2026-03-18 at 10.02.27.png (1×2 px, 488 KB)

@bd808 thank you so much for having resolved that issue! :-]