OOM exceptions in dev environment
Closed, Resolved (Public)

Description

A number of OOM exceptions have occurred in the dev environment in the last day. With all of the legacy tables in place, it is entirely possible that this is unrelated to the refactor, but it should be thoroughly investigated nonetheless.

Eevans created this task. Aug 3 2017, 1:40 PM

Mentioned in SAL (#wikimedia-operations) [2017-08-03T15:43:58Z] <urandom> T172384: lower tombstone failure threshold in RESTBase dev to 1000

Mentioned in SAL (#wikimedia-operations) [2017-08-04T21:24:48Z] <urandom> T172384: Disabling Puppet in dev environment to prevent unattended Cassandra restarts

Eevans updated the task description. Aug 7 2017, 7:45 PM
Eevans added a comment. Aug 7 2017, 7:53 PM

Thus far, I have not been able to glean anything useful from the heap dumps produced. I'm not certain whether this is because the source of memory utilization is less obvious than in past cases, or because there is something wrong with the heap dumps themselves. It has become apparent, however, that Cassandra is racing the JVM to create a heap dump of its own, so corruption of them definitely seems possible.

I am going to put together a patched build that disables Cassandra's generation of heap dumps, and try again.

Mentioned in SAL (#wikimedia-operations) [2017-08-07T21:28:59Z] <urandom> T172384: Upgrading Cassandra to 3.11.0-wmf1 in dev environment (build patched to disable in-built heap dumping)

GWicke raised the priority of this task from Normal to High. Aug 8 2017, 5:49 PM

Mentioned in SAL (#wikimedia-operations) [2017-08-09T16:29:49Z] <urandom> T172384: Upgrading Cassandra in RESTBase dev to 3.11.0-wmf2 (patched to disable use of FastThreadLocal)

At this point I'm fairly confident that this is a memory leak in Cassandra introduced by CASSANDRA-13034. I patched the build to revert the corresponding change yesterday, and things have looked quite good ever since (more than enough time for an OOM to have recurred, based on past experience).

I will follow up with an upstream issue, and put together a proper fork for patched Cassandra builds.

Eevans updated the task description. Aug 10 2017, 4:18 PM

Mentioned in SAL (#wikimedia-operations) [2017-09-13T19:11:11Z] <urandom> T172384: Upgrading restbase-dev1004 to Cassandra 3.11.0-wmf4 (canary)

Mentioned in SAL (#wikimedia-operations) [2017-09-13T19:21:26Z] <urandom> T172384: Upgrading restbase-dev100[5-6] to Cassandra 3.11.0-wmf4

Change 377829 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Upgrade dev hosts to wmf4 release

https://gerrit.wikimedia.org/r/377829

Change 377830 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Introduce a new target_version for dev-only builds

https://gerrit.wikimedia.org/r/377830

Change 377831 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Use target_version: dev in dev environment

https://gerrit.wikimedia.org/r/377831

Change 377829 abandoned by Eevans:
Upgrade dev hosts to wmf4 release

Reason:
Should have squashed with another commit.

https://gerrit.wikimedia.org/r/377829

Change 377830 merged by Dzahn:
[operations/puppet@production] Introduce a new target_version for dev-only builds

https://gerrit.wikimedia.org/r/377830

Change 377831 merged by Dzahn:
[operations/puppet@production] Use target_version: dev in dev environment

https://gerrit.wikimedia.org/r/377831

Eevans closed this task as Resolved. Oct 2 2017, 3:50 PM

CASSANDRA-13034 and CASSANDRA-13754 have been closed upstream, and the corresponding changes merged into our build. We have not seen a recurrence of these OOMs; closing this as done.