A number of OOM exceptions have occurred in the dev environment in the last day. With all of the legacy tables in place, it is entirely possible that this is unrelated to the refactor, but it should be thoroughly investigated nonetheless.
|Resolved||Eevans||T169936 Services 2017/18 Q1 goal: Start gradual roll-out of Cassandra 3 & new schema to resolve storage scaling issues and OOM errors.|
|Resolved||Eevans||T172384 OOM exceptions in dev environment|
Thus far, I have not been able to glean anything useful from the heap dumps produced. I'm not certain whether this is because the source of memory utilization is less obvious than in past cases, or because there is something wrong with the heap dumps themselves. It has become apparent, however, that Cassandra is racing the JVM to create a heap dump of its own, so corruption of the dumps definitely seems possible.
I am going to put together a patched build that disables Cassandra's generation of heap dumps, and try again.
At this point I'm fairly confident that this is a memory leak in Cassandra introduced by CASSANDRA-13034. I patched the build to revert the corresponding change yesterday, and things have looked quite good ever since (more than enough time for an OOM to have recurred, based on past experience).
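For anyone wanting something more concrete than "looked quite good", here is a minimal sketch of one way to check whether heap usage is trending upward over time. It is not what was actually used in this investigation. The class name `HeapTrend` and its methods are hypothetical, and it samples the JVM it runs in; watching a Cassandra node would mean reading the same `java.lang:type=Memory` values over a remote JMX connection instead. Raw used-heap numbers also swing with GC cycles, so a real check would want many samples over hours, not seconds.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Hypothetical helper: samples used heap and fits a straight line to the samples.
public class HeapTrend {

    // Least-squares slope of the samples against their index (bytes per sample).
    // A persistently positive slope across long windows hints at a leak.
    static double slope(long[] samples) {
        int n = samples.length;
        if (n < 2) {
            return 0.0; // not enough points to fit a line
        }
        double meanX = (n - 1) / 2.0;
        double meanY = 0.0;
        for (long s : samples) {
            meanY += s;
        }
        meanY /= n;
        double num = 0.0;
        double den = 0.0;
        for (int i = 0; i < n; i++) {
            num += (i - meanX) * (samples[i] - meanY);
            den += (i - meanX) * (i - meanX);
        }
        return num / den;
    }

    // Used heap of the current JVM, in bytes, via the standard MemoryMXBean.
    static long usedHeapBytes() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        return mem.getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) throws InterruptedException {
        int count = 5;          // assumption: tiny window, for demonstration only
        long intervalMs = 200L; // assumption: use minutes in a real check
        long[] samples = new long[count];
        for (int i = 0; i < count; i++) {
            samples[i] = usedHeapBytes();
            Thread.sleep(intervalMs);
        }
        System.out.printf("heap trend: %.1f bytes per sample%n", slope(samples));
    }
}
```

A slope that stays near zero over a window longer than the previous time-to-OOM would be consistent with the revert having fixed the leak, though it would not prove it.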
I will follow up with an upstream issue, and put together a proper fork for patched Cassandra builds.