Cassandra OOMs on restbase1009-a, restbase1011-c and restbase1013-c
Closed, Resolved · Public

Description

Around 11:45 UTC on 2016-12-18, three Cassandra instances OOMed: restbase1009-a, restbase1011-c and restbase1013-c. This caused RESTBase to start returning 5xx responses for a portion of requests, since local quorum could no longer be achieved. We need to investigate the reason for the OOMs.

$ for i in 1009 1013 1011; do echo "$i: "; ssh restbase$i.eqiad.wmnet -- "sudo find /srv/cassandra-* -maxdepth 1 -name '*.hprof' -exec ls -lh {} \;"; done
1009: 
-rw------- 1 cassandra cassandra 9.4G Dec 18 16:52 /srv/cassandra-a/java_pid93714.hprof
-rw------- 1 cassandra cassandra 10G Dec 18 11:19 /srv/cassandra-a/java_pid132816.hprof
-rw------- 1 cassandra cassandra 9.1G Dec 18 08:57 /srv/cassandra-a/java_pid67908.hprof
-rw------- 1 cassandra cassandra 11G Dec 18 10:03 /srv/cassandra-a/java_pid79638.hprof
-rw------- 1 cassandra cassandra 8.8G Dec 18 10:59 /srv/cassandra-a/java_pid119356.hprof
-rw------- 1 cassandra cassandra 11G Dec 18 15:35 /srv/cassandra-a/java_pid17899.hprof
-rw------- 1 cassandra cassandra 11G Dec 18 10:19 /srv/cassandra-a/java_pid106595.hprof
-rw------- 1 cassandra cassandra 11G Dec 18 12:07 /srv/cassandra-a/java_pid146258.hprof
-rw------- 1 cassandra cassandra 11G Dec 18 08:24 /srv/cassandra-a/java_pid50562.hprof
-rw------- 1 cassandra cassandra 11G Dec 18 08:44 /srv/cassandra-a/java_pid56179.hprof
1013: 
-rw------- 1 cassandra cassandra 11G Dec 18 12:08 /srv/cassandra-c/java_pid1824.hprof
-rw------- 1 cassandra cassandra 9.9G Dec 18 10:11 /srv/cassandra-c/java_pid2732.hprof
-rw------- 1 cassandra cassandra 11G Dec 18 10:04 /srv/cassandra-c/java_pid6892.hprof
-rw------- 1 cassandra cassandra 7.9G Dec 18 01:29 /srv/cassandra-c/java_pid13269.hprof
-rw------- 1 cassandra cassandra 11G Dec 18 10:59 /srv/cassandra-c/java_pid18131.hprof
-rw------- 1 cassandra cassandra 11G Dec 18 08:38 /srv/cassandra-c/java_pid17470.hprof
-rw------- 1 cassandra cassandra 11G Dec 18 11:44 /srv/cassandra-c/java_pid18013.hprof
-rw------- 1 cassandra cassandra 8.7G Dec 18 08:58 /srv/cassandra-c/java_pid29353.hprof
-rw------- 1 cassandra cassandra 11G Dec 18 08:25 /srv/cassandra-c/java_pid7954.hprof
-rw------- 1 cassandra cassandra 11G Dec 18 11:20 /srv/cassandra-c/java_pid1880.hprof
1011: 
-rw------- 1 cassandra cassandra 9.9G Dec 18 15:35 /srv/cassandra-c/java_pid3363.hprof
-rw------- 1 cassandra cassandra 7.3G Dec 18 01:28 /srv/cassandra-c/java_pid6380.hprof
-rw------- 1 cassandra cassandra 11G Dec 18 11:43 /srv/cassandra-c/java_pid23170.hprof
-rw------- 1 cassandra cassandra 11G Dec 18 16:58 /srv/cassandra-c/java_pid6608.hprof
-rw------- 1 cassandra cassandra 11G Dec 18 08:38 /srv/cassandra-c/java_pid21659.hprof
-rw------- 1 cassandra cassandra 11G Dec 18 10:19 /srv/cassandra-c/java_pid7690.hprof
-rw------- 1 cassandra cassandra 8.4G Dec 18 08:44 /srv/cassandra-c/java_pid30320.hprof
-rw------- 1 cassandra cassandra 9.1G Dec 18 16:52 /srv/cassandra-c/java_pid1220.hprof
ACTION: Clean up these heap dumps once it is determined that they are no longer needed for analysis, but before closing this issue.
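
A minimal sketch of that cleanup, assuming the same hosts and paths as in the listing above; run it only once the dumps are confirmed to no longer be needed:

$ # Delete all remaining heap dumps on the three affected hosts
$ for i in 1009 1011 1013; do echo "$i: "; ssh restbase$i.eqiad.wmnet -- "sudo find /srv/cassandra-* -maxdepth 1 -name '*.hprof' -delete"; done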

Event Timeline

I strongly suspect the culprit is wide rows, as in T148516 and T140946.
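
One way to check for suspiciously wide partitions is via nodetool; a sketch, with an illustrative keyspace/table name rather than the actual schema:

$ # A "Compacted partition maximum bytes" in the multi-GB range points at a wide row
$ nodetool cfstats my_keyspace.my_table | grep -E 'Compacted partition (maximum|mean) bytes'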

FYI, I have disabled puppet on those nodes and will keep it that way for a while (at least until we are sure the same will not happen again in the immediate future).
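
For the record, presumably something along these lines (the exact invocation was not captured in this task); passing a reason keeps the disabled state self-explanatory:

$ # Disable the agent with a reason so anyone inspecting the host knows why
$ for i in 1009 1011 1013; do ssh restbase$i.eqiad.wmnet -- "sudo puppet agent --disable 'T153588: Cassandra OOM investigation'"; done
$ # Re-enable later with: sudo puppet agent --enable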

Mentioned in SAL (#wikimedia-operations) [2016-12-18T16:45:41Z] <elukey> starting cassandra instances on restbase1009, restbase1011 and restbase1013 (one at the time) - T153588

I've opened T153711: Revert increased quota for services-test labs project, to obtain a VM with enough memory to perform a heap analysis.
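
For reference, Eclipse MAT can pre-parse a dump headlessly on such a VM; a sketch assuming MAT is installed there (indexing a ~10G .hprof needs a comparable amount of heap for MAT itself, hence the larger VM):

$ # Index the dump and generate the leak-suspects and top-components reports
$ ./ParseHeapDump.sh /srv/java_pid79638.hprof org.eclipse.mat.api:suspects org.eclipse.mat.api:top_components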

Eevans lowered the priority of this task from High to Medium. Dec 19 2016, 9:08 PM

As a quick follow-up, we have blacklisted the following pages in RESTBase:

  • for commons.wm: User:J_budissin/Uploads/BiH/2016_December_11-20, User:OgreBot/Uploads_by_new_users and User:Oxyman/Buildings_in_London/2014_October_11-20
  • for en.wp: User:Cyberpower678/RfX_Report

We have also made Change Prop respect RESTBase's blacklist: if a page is blacklisted, CP will not look for transclusions of that page.
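
A quick spot-check that the blacklist is in effect for one of these titles (the exact error status RESTBase returns here is an assumption; the point is that the request should no longer succeed with a 200):

$ # Slashes in titles must be percent-encoded in rest_v1 paths
$ curl -s -o /dev/null -w '%{http_code}\n' 'https://commons.wikimedia.org/api/rest_v1/page/html/User:OgreBot%2FUploads_by_new_users'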

Having a look at java_pid79638.hprof (restbase1009-a.eqiad.wmnet):

cap01.png (776×1 px, 222 KB)

The top 12 items by retention constitute almost the entire heap.

cap04-1.png (776×1 px, 195 KB)

Each of these shared worker threads is involved in a read, and in each case almost all of the retention can be attributed to an instance of o.a.c.db.ArrayBackedSortedColumns. These are collections of cells, specifically cells named value, that contain the contents of https://commons.wikimedia.org/wiki/User:J_budissin/Uploads/BiH/2016_December_11-20 (beware: this page is enormous; opening it may make your browser unresponsive).

cap04-2.png (776×1 px, 197 KB)

Here it is again. Same value, https://commons.wikimedia.org/wiki/User:J_budissin/Uploads/BiH/2016_December_11-20, blacklisted by @mobrovac on Dec. 18 (the day of these OOMs).

TL;DR: This is partition data (same partition, multiple reads) materialized on heap, and there is simply more of it than will fit.

I'm going to check at least one more of these heaps (from a different machine); stay tuned...

@Eevans, do we know why that much partition *data* is materialized on heap? Is there a lot of tombstoned content that is still materialized in full, or are there queries that request a lot of values?

It's the former; most (but not all) of these are o.a.c.db.BufferExpiringCell, a.k.a. tombstoned data.
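
Tombstone pressure of this kind is also visible in table-level metrics; a hedged check (table name illustrative, as before):

$ # High "tombstones per slice" means reads materialize lots of expired data before filtering it out
$ nodetool cfstats my_keyspace.my_table | grep -i 'tombstones per slice'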

java_pid23170.hprof from restbase1011-c and java_pid1824.hprof from restbase1013-c both look the same (i.e., slices of the same document).

I'm not sure what else there is to learn from these dumps, but I will leave them in place until Friday in case any new questions come up that would require taking a second look.

Eevans claimed this task.
Eevans updated the task description. (Show Details)

Files removed; Resolving.