Description
A number of OOM exceptions have occurred in the dev environment in the last day. With all of the legacy tables in place, it is entirely possible that this is unrelated to the refactor, but it should be thoroughly investigated nonetheless.
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Eevans | T169936 Services 2017/18 Q1 goal: Start gradual roll-out of Cassandra 3 & new schema to resolve storage scaling issues and OOM errors.
Resolved | | Eevans | T172384 OOM exceptions in dev environment
Event Timeline
Mentioned in SAL (#wikimedia-operations) [2017-08-03T15:43:58Z] <urandom> T172384: lower tombstone failure threshold in RESTBase dev to 1000
Mentioned in SAL (#wikimedia-operations) [2017-08-04T21:24:48Z] <urandom> T172384: Disabling Puppet in dev environment to prevent unattended Cassandra restarts
Thus far, I have not been able to glean anything useful from the heap dumps produced. I'm not certain whether this is because the source of memory utilization is less obvious than in past cases, or because something is wrong with the heap dumps themselves, but it has become apparent that Cassandra is racing the JVM to create a heap dump of its own, so corruption of them definitely seems possible.
I am going to put together a patched build that disables Cassandra's generation of heap dumps, and try again.
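As a rough way to tell whether a given dump was cut short, the top-level records of the file can be walked and their declared lengths compared against the file size. The following is a minimal sketch, assuming the usual HPROF binary layout (a NUL-terminated `JAVA PROFILE ...` version string, a 4-byte identifier size, an 8-byte timestamp, then records made of a 1-byte tag, a 4-byte time offset, and a 4-byte body length, all big-endian). The function name `scan_hprof` is hypothetical, and a clean walk only rules out truncation, not every kind of corruption.

```python
import io
import struct


def scan_hprof(stream):
    """Walk top-level HPROF records; return (record_count, truncated).

    `stream` must be a seekable binary file object. The layout handled here
    is an assumption based on the common HPROF binary format.
    """
    size = stream.seek(0, io.SEEK_END)
    stream.seek(0)

    # Header: NUL-terminated version string, e.g. b"JAVA PROFILE 1.0.2".
    version = bytearray()
    while True:
        byte = stream.read(1)
        if not byte or len(version) > 64:
            raise ValueError("missing or unterminated header string")
        if byte == b"\x00":
            break
        version += byte
    if not version.startswith(b"JAVA PROFILE "):
        raise ValueError("not an HPROF file")

    # u4 identifier size followed by a u8 timestamp.
    if len(stream.read(12)) < 12:
        raise ValueError("file ends inside the header")

    records = 0
    while True:
        head = stream.read(9)  # u1 tag, u4 time offset, u4 body length
        if not head:
            return records, False  # ended exactly on a record boundary
        if len(head) < 9:
            return records, True  # file ends inside a record header
        _tag, _offset, length = struct.unpack(">BII", head)
        if stream.tell() + length > size:
            return records, True  # declared body runs past end of file
        stream.seek(length, io.SEEK_CUR)
        records += 1
```

It would be called on a dump opened in binary mode, for example `scan_hprof(open(path, "rb"))`, where `path` is whatever location the JVM was configured to write dumps to. A `truncated` result of `True` suggests the writer was interrupted or the file was overwritten partway through.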
Mentioned in SAL (#wikimedia-operations) [2017-08-07T21:28:59Z] <urandom> T172384: Upgrading Cassandra to 3.11.0-wmf1 in dev environment (build patched to disable in-built heap dumping)
Mentioned in SAL (#wikimedia-operations) [2017-08-09T16:29:49Z] <urandom> T172384: Upgrading Cassandra in RESTBase dev to 3.11.0-wmf2 (patched to disable use of FastThreadLocal)
At this point I'm fairly confident that this is a memory leak in Cassandra introduced by CASSANDRA-13034. I patched the build to revert the corresponding change yesterday, and things have looked quite good ever since (more than enough time, based on past experience, for an OOM to have recurred).
I will follow up with an upstream issue, and put together a proper fork for patched Cassandra builds.
Mentioned in SAL (#wikimedia-operations) [2017-09-13T19:11:11Z] <urandom> T172384: Upgrading restbase-dev1004 to Cassandra 3.11.0-wmf4 (canary)
Mentioned in SAL (#wikimedia-operations) [2017-09-13T19:21:26Z] <urandom> T172384: Upgrading restbase-dev100[5-6] to Cassandra 3.11.0-wmf4
Change 377829 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Upgrade dev hosts to wmf4 release
Change 377830 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Introduce a new target_version for dev-only builds
Change 377831 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Use target_version: dev in dev environment
Change 377829 abandoned by Eevans:
Upgrade dev hosts to wmf4 release
Reason:
Should have squashed with another commit.
Change 377830 merged by Dzahn:
[operations/puppet@production] Introduce a new target_version for dev-only builds
Change 377831 merged by Dzahn:
[operations/puppet@production] Use target_version: dev in dev environment
CASSANDRA-13034 and CASSANDRA-13754 have been closed upstream, and the corresponding changes merged into our build. We have not seen a recurrence of these OOMs; closing this as done.