
Cassandra OOMs on restbase200{4,7}.codfw.wmnet
Closed, Resolved, Public

Description

At approximately 17:27 UTC on Jan 23, 2017, Cassandra on restbase200{4,7}.codfw.wmnet exited with OOM exceptions.

$ for i in 2004 2007; do echo "$i: "; ssh restbase$i.codfw.wmnet -- "sudo find /srv/cassandra-* -maxdepth 1 -name '*.hprof' -exec ls -lh {} \;"; done
2004: 
-rw------- 1 cassandra cassandra 8.3G Jan 23 17:27 /srv/cassandra-a/java_pid6608.hprof
2007: 
-rw-r--r-- 1 cassandra cassandra 9.0G Jan 23 17:26 /srv/cassandra-b/java_pid22594.hprof
ACTION: Remove these heap dumps before closing this issue.
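
The cleanup called for above could be sketched roughly as follows. This is only an illustration, not the command that was actually run on these hosts: the function name `clean_heap_dumps` and its `--delete` flag are hypothetical, and it simply reuses the `find` expression from the listing above. By default it only prints the matching files; nothing is removed unless `--delete` is passed.

```shell
#!/bin/sh
# Hypothetical helper (not from the original task): list, or with --delete
# remove, the *.hprof files found directly inside each directory given.
# Usage: clean_heap_dumps [--delete] DIR...
clean_heap_dumps() {
    delete=0
    if [ "$1" = "--delete" ]; then
        delete=1
        shift
    fi
    for dir in "$@"; do
        if [ "$delete" -eq 1 ]; then
            # Print each heap dump, then remove it.
            find "$dir" -maxdepth 1 -type f -name '*.hprof' -print -exec rm -f -- {} \;
        else
            # Dry run: only print what would be removed.
            find "$dir" -maxdepth 1 -type f -name '*.hprof' -print
        fi
    done
}
```

On the affected hosts this would presumably be run with sufficient privileges against `/srv/cassandra-*`, first without `--delete` to confirm that only the two dumps listed above match.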

Event Timeline

The heap dumps from these machines are quite small (<= 9.0G), which usually indicates that the retained numbers won't make sense. However, the dump from 2007 suggests that the conditions mirror those found in T153588: Cassandra OOMs on restbase1009-a, restbase1011-c and restbase1013-c; that is, reads generated by the retention policy are materializing a large number of tombstoned values onto the heap.

The page in question here is: https://commons.wikimedia.org/wiki/Commons:Open_Access_File_of_the_Day/recent_uploads/2016_October_27-31.

WARNING: That page is 26M in size and may cause your browser to become unresponsive if opened.

Screenshot from 2017-01-24 11-34-31.png (912×1 px, 464 KB)

Eevans triaged this task as Medium priority. Jan 24 2017, 5:43 PM
Eevans updated the task description.

I am way past due to return the resources allocated to the VM I used in this analysis (T153711: Revert increased quota for services-test labs project), and I plan to decommission it by COB today. I will, however, leave these heaps in place for a few more days in case any questions arise that would require more exploration to answer (we could get another VM if we had to).

Eevans updated the task description.

Files removed; Closing.