Re-establish RESTBase dev environment with Cassandra 3.11.2
Closed, Resolved (Public)

Description

During the transition from the legacy storage system to the new current revision strategy, the restbase-dev environment fell out of sync with current production.

I propose we reset the Cassandra cluster, upgrade RESTBase, and re-enable sampled changeprop.

  • Baseline Cassandra using version 3.11.0-wmf5
  • (Re)deploy RESTBase
  • Re-establish sampled changeprop
  • Upgrade to Cassandra 3.11.2
NOTE: restbase-dev1006 has a hardware failure that needs addressing (a reimage may be necessary as a result)

Event Timeline

Eevans created this task. Feb 7 2018, 8:56 PM
Restricted Application added a subscriber: Aklapper. Feb 7 2018, 8:56 PM
Eevans triaged this task as Normal priority. Feb 7 2018, 8:57 PM
Eevans added projects: Services, User-Eevans.
Eevans updated the task description.
RobH changed the status of subtask T185494: Degraded RAID on restbase-dev1006 from Open to Stalled. Feb 14 2018, 8:02 PM
Eevans updated the task description. Mar 14 2018, 3:46 PM

Mentioned in SAL (#wikimedia-operations) [2018-03-19T18:33:12Z] <eevans@tin> Started deploy [restbase/deploy@8dbc93c] (dev-cluster): bring dev environment current w/ production (T186751)

Mentioned in SAL (#wikimedia-operations) [2018-03-19T18:43:28Z] <eevans@tin> Finished deploy [restbase/deploy@8dbc93c] (dev-cluster): bring dev environment current w/ production (T186751) (duration: 10m 16s)

Mentioned in SAL (#wikimedia-operations) [2018-03-19T19:05:48Z] <eevans@tin> Started deploy [restbase/deploy@8dbc93c] (dev-cluster): update dev environment to current production (T186751)

Mentioned in SAL (#wikimedia-operations) [2018-03-19T19:14:17Z] <eevans@tin> Finished deploy [restbase/deploy@8dbc93c] (dev-cluster): update dev environment to current production (T186751) (duration: 08m 30s)

Mentioned in SAL (#wikimedia-operations) [2018-03-19T19:31:52Z] <eevans@tin> Started deploy [restbase/deploy@8dbc93c] (dev-cluster): update dev environment to current production (T186751)

Mentioned in SAL (#wikimedia-operations) [2018-03-19T19:42:20Z] <eevans@tin> Finished deploy [restbase/deploy@8dbc93c] (dev-cluster): update dev environment to current production (T186751) (duration: 10m 28s)

Change 420416 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/puppet@production] RESTBase: Add the correct seeds for the dev environment

https://gerrit.wikimedia.org/r/420416

Change 420416 merged by Dzahn:
[operations/puppet@production] RESTBase: Add the correct seeds for the dev environment

https://gerrit.wikimedia.org/r/420416

Mentioned in SAL (#wikimedia-operations) [2018-03-20T18:18:42Z] <eevans@tin> Started deploy [restbase/deploy@8dbc93c] (dev-cluster): Update Dev environment to current production (T186751)

Mentioned in SAL (#wikimedia-operations) [2018-03-20T18:24:40Z] <eevans@tin> Finished deploy [restbase/deploy@8dbc93c] (dev-cluster): Update Dev environment to current production (T186751) (duration: 05m 58s)

Mentioned in SAL (#wikimedia-operations) [2018-03-20T18:27:53Z] <eevans@tin> Started deploy [restbase/deploy@8dbc93c] (dev-cluster): Update Dev environment to current production (T186751)

Mentioned in SAL (#wikimedia-operations) [2018-03-20T18:30:35Z] <eevans@tin> Finished deploy [restbase/deploy@8dbc93c] (dev-cluster): Update Dev environment to current production (T186751) (duration: 02m 43s)

Eevans updated the task description. Mar 20 2018, 6:34 PM
Eevans updated the task description. Mar 22 2018, 2:48 PM

Mentioned in SAL (#wikimedia-operations) [2018-04-03T16:10:23Z] <urandom> rebooting restbase-dev1004 - T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-03T16:33:21Z] <urandom> rebooting restbase-dev1006 - T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-03T18:13:26Z] <urandom> upgrading restbase-dev1004-a to cassandra 3.11.2 - T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-03T18:15:40Z] <urandom> upgrading restbase-dev1004-b to cassandra 3.11.2 - T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-03T18:18:39Z] <urandom> upgrading restbase-dev1005-a to cassandra 3.11.2 - T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-03T18:20:28Z] <urandom> upgrading restbase-dev1005-b to cassandra 3.11.2 - T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-03T18:23:41Z] <urandom> upgrading restbase-dev1006-a to cassandra 3.11.2 - T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-03T18:25:37Z] <urandom> upgrading restbase-dev1006-b to cassandra 3.11.2 - T186751

Eevans added a comment. Apr 3 2018, 6:32 PM

The cluster is now running 3.11.2 (release).

$ cdsh -c restbase-dev -- c-foreach-nt version
restbase-dev1005.eqiad.wmnet: a: ReleaseVersion: 3.11.2
restbase-dev1005.eqiad.wmnet: b: ReleaseVersion: 3.11.2
restbase-dev1006.eqiad.wmnet: a: ReleaseVersion: 3.11.2
restbase-dev1006.eqiad.wmnet: b: ReleaseVersion: 3.11.2
restbase-dev1004.eqiad.wmnet: a: ReleaseVersion: 3.11.2
restbase-dev1004.eqiad.wmnet: b: ReleaseVersion: 3.11.2
$
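
A check like the one above can also be automated. The following is a minimal sketch, assuming the output keeps the `host: instance: ReleaseVersion: version` shape shown here; the helper names `parse_versions` and `stragglers` are hypothetical and not part of any existing tooling.

```python
import re

# Sample text copied from the command output above.
OUTPUT = """\
restbase-dev1005.eqiad.wmnet: a: ReleaseVersion: 3.11.2
restbase-dev1005.eqiad.wmnet: b: ReleaseVersion: 3.11.2
restbase-dev1006.eqiad.wmnet: a: ReleaseVersion: 3.11.2
restbase-dev1006.eqiad.wmnet: b: ReleaseVersion: 3.11.2
restbase-dev1004.eqiad.wmnet: a: ReleaseVersion: 3.11.2
restbase-dev1004.eqiad.wmnet: b: ReleaseVersion: 3.11.2
"""

# Assumed line shape: "<host>: <instance>: ReleaseVersion: <version>"
LINE_RE = re.compile(
    r"^(?P<host>\S+): (?P<instance>\w+): ReleaseVersion: (?P<version>\S+)$"
)


def parse_versions(text):
    """Return {(host, instance): version} for every line that matches."""
    versions = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            versions[(match["host"], match["instance"])] = match["version"]
    return versions


def stragglers(versions, expected):
    """Return the (host, instance) pairs not yet running `expected`."""
    return sorted(key for key, ver in versions.items() if ver != expected)


versions = parse_versions(OUTPUT)
print(len(versions), stragglers(versions, "3.11.2"))  # prints: 6 []
```

An empty straggler list for all six instances matches the statement that the whole cluster is on 3.11.2.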

Mentioned in SAL (#wikimedia-operations) [2018-04-03T19:46:33Z] <urandom> restarting restbase-dev1004-{a,b} to enable patched cassandra 3.11.2 build - T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-03T21:13:49Z] <urandom> (re)starting restbase-dev1004-{a,b} (ooms), and enabling alternately patched cassandra 3.11.2 build - T186751

Eevans added a comment. (Edited) Apr 4 2018, 4:58 PM

Status update:

As detailed in the description, the intended sequence here was to install 3.11.0-wmf5 (what currently runs in production), reestablish our simulated load, and then test an upgrade to 3.11.2. However, after getting the cluster back up and sending traffic to it, we started to experience regular OOM exceptions, caused by what appears to be a leak of thread-local map entries.

What makes these exceptions interesting is that we have the identical software running in production, where it does not exhibit this behavior. In fact, this cluster once ran without issue on the same version of Cassandra (though using an older development snapshot of RESTBase). This is almost certainly a bug in Cassandra, but one that seems somehow bound to this environment.

To make matters more interesting, for a brief period I also encountered a spate of shutdowns at the hands of the kernel's out-of-memory killer (suggesting a leak of native memory). These seem to have gone away after rebooting the hosts, though.

Since at this point we're in bug-hunting mode, the cluster has been upgraded to 3.11.2 (to eliminate the complexities and uncertainties of reporting bugs and submitting patches against an older, patched build).

Since we have previously experienced a thread-local memory leak associated with Netty's FastThreadLocalThreads (à la CASSANDRA-13754), I patched out their use as a test, but the memory leak persists (it just leaks the standard ThreadLocal instead of FastThreadLocal).

I have opened CASSANDRA-14355 upstream, and will begin moving some of my findings there.

More to follow....

Mentioned in SAL (#wikimedia-operations) [2018-04-10T20:13:29Z] <urandom> increasing sample change-prop sample rate to 20% (from 10) in dev environment -- T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-11T16:11:31Z] <urandom> restarting cassandra, dev environment (testing default GC settings) -- T186751

Dzahn added a subscriber: Dzahn. Apr 11 2018, 5:33 PM

fyi: T189050 / T189050#4124163 means you should not have to worry anymore about scheduling downtimes for these services when on the dev environment.

Mentioned in SAL (#wikimedia-operations) [2018-04-11T18:47:26Z] <urandom> restarting cassandra, dev environment (set -XX:+PerfDisableSharedMem) -- T186751

fyi: T189050 / T189050#4124163 means you should not have to worry anymore about scheduling downtimes for these services when on the dev environment.

Oh, that's awesome; thanks @Dzahn!

Mentioned in SAL (#wikimedia-operations) [2018-04-11T20:36:52Z] <urandom> increase change-prop sample rate in dev env to 40% (from 20) -- T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-12T14:03:20Z] <urandom> increase change-prop sample rate in dev env to 60% (from 40) -- T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-12T16:59:25Z] <urandom> increase change-prop sample rate in dev env to 80% (from 60) -- T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-12T20:37:57Z] <urandom> increase change-prop sample rate in dev env to 100% (from 80) -- T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-13T13:18:59Z] <urandom> increasing heap size to 16G -- T186751

At this point I'm fairly certain that this isn't a memory leak in the conventional sense. A bug in change-propagation had prevented sampling from working, and the dev cluster (with < 20% of the capacity of production) was seeing throughput levels in excess of 5x what we have in production. The excessive heap utilization would seem to be the result of additional per-thread state associated with this higher throughput. This is still worth pursuing upstream, since this is not how an application should degrade in the face of high load, but since it affects 3.11.0 as well, I think we can remove this as a blocker to a 3.11.2 upgrade.
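
To put those two figures together: a rough sketch of the load per unit of capacity, relative to production. It assumes the boundary values quoted above (throughput ratio of 5, capacity ratio of 0.20), so the result is only a lower bound; `relative_load` is a hypothetical helper, not something taken from the cluster tooling.

```python
def relative_load(throughput_ratio, capacity_ratio):
    """Load per unit of capacity relative to production (1.0 means parity).

    Both arguments are ratios against production, e.g. 5.0 for "5x the
    throughput" and 0.20 for "20% of the capacity".
    """
    if capacity_ratio <= 0:
        raise ValueError("capacity_ratio must be positive")
    return throughput_ratio / capacity_ratio


# Boundary figures from the comment: throughput > 5x, capacity < 20%.
lower_bound = relative_load(5.0, 0.20)
print(f"dev load per unit of capacity: more than {lower_bound:.0f}x production")
```

Under these assumptions each unit of dev capacity was handling more than 25 times the load of its production counterpart, which fits heap pressure that shows up in dev but not in production.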

Eevans updated the task description. Apr 13 2018, 1:30 PM
Eevans renamed this task from Reset RESTBase dev environment to Re-establish RESTBase dev environment with Cassandra 3.11.2. Apr 13 2018, 1:39 PM
Eevans closed this task as Resolved. Jun 20 2018, 2:52 PM
Eevans edited projects, added Services (done); removed Services (doing).