Restablish RESTBase dev environment with Cassandra 3.11.2
Closed, Resolved, Public

Assigned To

Authored By

	Eevans
	Feb 7 2018, 8:56 PM

Description

During the transition from the legacy storage system to the new current-revision strategy, the restbase-dev environment fell out of sync with production.

I propose we reset the Cassandra cluster, upgrade RESTBase, and re-enable sampled changeprop.

Baseline Cassandra using version 3.11.0-wmf5
(Re)deploy RESTBase
Re-establish sampled changeprop
Upgrade to Cassandra 3.11.2

NOTE: ~~restbase-dev1006 has a hardware failure that needs addressing (a reimage may be necessary as a result)~~

Details

	Subject	Repo	Branch	Lines +/-
	RESTBase: Add the correct seeds for the dev environment	operations/puppet	production	+1 -1

Related Objects

Status	Assigned	Task
Stalled	None	T324931 Clean up open RESTBase related tickets
In Progress	None	T262315 <CORE TECHNOLOGY> API Migration & RESTBase Sunset
Resolved	DAlangi_WMF	T324678 Migrate proton (chromium-render) away from restbase
Open	None	T167603 Any Chinese Wiki's projects about "Download as PDF" can not auto change to Simplified Chinese or Traditional Chinese
Resolved	ovasileva	T147553 [EPIC] Page previews broken on many projects
Open	ovasileva	T244262 [Epic] Enable page previews and reference previews as a beta feature on all projects
Open	None	T111231 Page previews for Wikidata
Invalid	None	T148854 Use RESTBase for zhwiki
Resolved	• Pchelolo	T188164 Popups don‘t support language variant conversion and {{lang}} template
Resolved	• mobrovac	T190689 FY17/18 Q4 Program 7 Services Goal: Language variants support
Resolved	Eevans	T186751 Restablish RESTBase dev environment with Cassandra 3.11.2
Resolved	Dzahn	T185494 Degraded RAID on restbase-dev1006
		Unknown Object (Task)
Resolved	Eevans	T224260 restbase-dev1006 has a broken disk
		Unknown Object (Task)

Event Timeline

Eevans created this task. Feb 7 2018, 8:56 PM

Restricted Application added a subscriber: Aklapper. Feb 7 2018, 8:56 PM

Eevans triaged this task as Medium priority. Feb 7 2018, 8:57 PM

Eevans added projects: Services, User-Eevans.

Eevans updated the task description.

Eevans added a subtask: T185494: Degraded RAID on restbase-dev1006. Feb 7 2018, 8:59 PM

RobH changed the status of subtask T185494: Degraded RAID on restbase-dev1006 from Open to Stalled. Feb 14 2018, 8:02 PM

• mobrovac edited projects, added Services (blocked); removed Services. Mar 1 2018, 8:34 PM

Dzahn closed subtask T185494: Degraded RAID on restbase-dev1006 as Resolved. Mar 13 2018, 11:55 PM

Eevans updated the task description. Mar 14 2018, 3:46 PM

Mentioned in SAL (#wikimedia-operations) [2018-03-19T18:33:12Z] <eevans@tin> Started deploy [restbase/deploy@8dbc93c] (dev-cluster): bring dev environment current w/ production (T186751)

Mentioned in SAL (#wikimedia-operations) [2018-03-19T18:43:28Z] <eevans@tin> Finished deploy [restbase/deploy@8dbc93c] (dev-cluster): bring dev environment current w/ production (T186751) (duration: 10m 16s)

Mentioned in SAL (#wikimedia-operations) [2018-03-19T19:05:48Z] <eevans@tin> Started deploy [restbase/deploy@8dbc93c] (dev-cluster): update dev environment to current production (T186751)

Mentioned in SAL (#wikimedia-operations) [2018-03-19T19:14:17Z] <eevans@tin> Finished deploy [restbase/deploy@8dbc93c] (dev-cluster): update dev environment to current production (T186751) (duration: 08m 30s)

Mentioned in SAL (#wikimedia-operations) [2018-03-19T19:31:52Z] <eevans@tin> Started deploy [restbase/deploy@8dbc93c] (dev-cluster): update dev environment to current production (T186751)

Mentioned in SAL (#wikimedia-operations) [2018-03-19T19:42:20Z] <eevans@tin> Finished deploy [restbase/deploy@8dbc93c] (dev-cluster): update dev environment to current production (T186751) (duration: 10m 28s)

Change 420416 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/puppet@production] RESTBase: Add the correct seeds for the dev environment

https://gerrit.wikimedia.org/r/420416

gerritbot added a project: Patch-For-Review. Mar 19 2018, 7:56 PM

Change 420416 merged by Dzahn:
[operations/puppet@production] RESTBase: Add the correct seeds for the dev environment

https://gerrit.wikimedia.org/r/420416

• mobrovac assigned this task to Eevans. Mar 20 2018, 1:32 PM

• mobrovac edited projects, added Services (doing); removed Patch-For-Review, Services (blocked).

Mentioned in SAL (#wikimedia-operations) [2018-03-20T18:18:42Z] <eevans@tin> Started deploy [restbase/deploy@8dbc93c] (dev-cluster): Update Dev environment to current production (T186751)

Mentioned in SAL (#wikimedia-operations) [2018-03-20T18:24:40Z] <eevans@tin> Finished deploy [restbase/deploy@8dbc93c] (dev-cluster): Update Dev environment to current production (T186751) (duration: 05m 58s)

Mentioned in SAL (#wikimedia-operations) [2018-03-20T18:27:53Z] <eevans@tin> Started deploy [restbase/deploy@8dbc93c] (dev-cluster): Update Dev environment to current production (T186751)

Mentioned in SAL (#wikimedia-operations) [2018-03-20T18:30:35Z] <eevans@tin> Finished deploy [restbase/deploy@8dbc93c] (dev-cluster): Update Dev environment to current production (T186751) (duration: 02m 43s)

Eevans updated the task description. Mar 20 2018, 6:34 PM

Eevans updated the task description. Mar 22 2018, 2:48 PM

Mentioned in SAL (#wikimedia-operations) [2018-04-03T16:10:23Z] <urandom> rebooting restbase-dev1004 - T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-03T16:33:21Z] <urandom> rebooting restbase-dev1006 - T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-03T18:13:26Z] <urandom> upgrading restbase-dev1004-a to cassandra 3.11.2 - T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-03T18:15:40Z] <urandom> upgrading restbase-dev1004-b to cassandra 3.11.2 - T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-03T18:18:39Z] <urandom> upgrading restbase-dev1005-a to cassandra 3.11.2 - T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-03T18:20:28Z] <urandom> upgrading restbase-dev1005-b to cassandra 3.11.2 - T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-03T18:23:41Z] <urandom> upgrading restbase-dev1006-a to cassandra 3.11.2 - T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-03T18:25:37Z] <urandom> upgrading restbase-dev1006-b to cassandra 3.11.2 - T186751

The cluster is now running 3.11.2 (release).

$ cdsh -c restbase-dev -- c-foreach-nt version
restbase-dev1005.eqiad.wmnet: a: ReleaseVersion: 3.11.2
restbase-dev1005.eqiad.wmnet: b: ReleaseVersion: 3.11.2
restbase-dev1006.eqiad.wmnet: a: ReleaseVersion: 3.11.2
restbase-dev1006.eqiad.wmnet: b: ReleaseVersion: 3.11.2
restbase-dev1004.eqiad.wmnet: a: ReleaseVersion: 3.11.2
restbase-dev1004.eqiad.wmnet: b: ReleaseVersion: 3.11.2
$
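A check like the one above can also be automated. The following is only a sketch: it assumes the output keeps the `host: instance: ReleaseVersion: X` shape shown here, and the names `parse_versions` and `mismatched` are hypothetical helpers, not part of cdsh or c-foreach-nt.

```python
import re

# Assumed line shape, taken from the output pasted above:
#   <host>: <instance>: ReleaseVersion: <version>
LINE_RE = re.compile(
    r"^(?P<host>\S+): (?P<instance>\w+): ReleaseVersion: (?P<version>\S+)$"
)

SAMPLE = """\
restbase-dev1005.eqiad.wmnet: a: ReleaseVersion: 3.11.2
restbase-dev1005.eqiad.wmnet: b: ReleaseVersion: 3.11.2
restbase-dev1006.eqiad.wmnet: a: ReleaseVersion: 3.11.2
restbase-dev1006.eqiad.wmnet: b: ReleaseVersion: 3.11.2
restbase-dev1004.eqiad.wmnet: a: ReleaseVersion: 3.11.2
restbase-dev1004.eqiad.wmnet: b: ReleaseVersion: 3.11.2
"""


def parse_versions(output):
    """Map (host, instance) to the reported version; non-matching lines are skipped."""
    versions = {}
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            versions[(match["host"], match["instance"])] = match["version"]
    return versions


def mismatched(versions, expected):
    """Return the sorted (host, instance) pairs that are not on the expected version."""
    return sorted(key for key, version in versions.items() if version != expected)


if __name__ == "__main__":
    found = parse_versions(SAMPLE)
    stragglers = mismatched(found, "3.11.2")
    print(f"{len(found)} instances checked, {len(stragglers)} not on 3.11.2")
```

Piping the real command output into `parse_versions` instead of `SAMPLE` would flag any instance that missed the upgrade.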

Mentioned in SAL (#wikimedia-operations) [2018-04-03T19:46:33Z] <urandom> restarting restbase-dev1004-{a,b} to enable patched cassandra 3.11.2 build - T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-03T21:13:49Z] <urandom> (re)starting restbase-dev1004-{a,b} (ooms), and enabling alternately patched cassandra 3.11.2 build - T186751

Status update:

As detailed in the description, the intended sequence here was to install 3.11.0-wmf5 (what currently runs in production), re-establish our simulated load, and then test an upgrade to 3.11.2. However, after getting the cluster back up and sending traffic to it, we started to experience regular OOM exceptions, caused by what appears to be a leak of thread-local map entries.

What makes these exceptions interesting is that we have the identical software running in production, where it does not exhibit this behavior. In fact, this cluster once ran without issue on the same version of Cassandra (though using an older development snapshot of RESTBase). This is almost certainly a bug in Cassandra, but one that seems somehow bound to this environment.

To make matters more ~~bewildering~~ interesting, for a brief period I also encountered a spate of shutdowns at the hands of the kernel's out-of-memory killer (suggesting a leak of native memory). These seem to have gone away after rebooting the hosts though.

Since at this point we're in bug-hunting mode, the cluster has been upgraded to 3.11.2 (to eliminate the complexities and uncertainties associated with reporting bugs and submitting patches against an older, patched build).

Since we have previously experienced a thread-local memory leak associated with Netty's FastThreadLocalThreads (à la CASSANDRA-13754), I patched out their use as a test, but the memory leak persists (it just leaks the standard ThreadLocal instead of FastThreadLocal).
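For readers unfamiliar with this failure mode, here is a small analogy in Python. It is not Cassandra or Netty code, and `handle_request` and `retained_bytes_per_thread` are hypothetical names. It only illustrates the general pattern: long-lived pool threads that park data in thread-local state and never clear it retain memory in proportion to the work they have handled.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Tuple

_state = threading.local()  # each thread sees its own independent attributes


def handle_request(payload_size: int) -> Tuple[int, int]:
    """Simulate a request that parks a buffer in thread-local state and never clears it."""
    buffers = getattr(_state, "buffers", None)
    if buffers is None:
        buffers = _state.buffers = []
    buffers.append(bytearray(payload_size))
    # Report which thread ran the request and how many bytes it now retains.
    return threading.get_ident(), sum(len(b) for b in buffers)


def retained_bytes_per_thread(
    requests: int, workers: int, payload_size: int = 1024
) -> Dict[int, int]:
    """Run the requests on a pool and return the bytes retained by each worker thread."""
    retained: Dict[int, int] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for thread_id, total in pool.map(handle_request, [payload_size] * requests):
            retained[thread_id] = max(retained.get(thread_id, 0), total)
    return retained


if __name__ == "__main__":
    for thread_id, total in retained_bytes_per_thread(100, 4).items():
        print(f"thread {thread_id}: {total} bytes retained")
```

Swapping one thread-local mechanism for another changes where such state lives, not whether it accumulates. That is at least consistent with the leak moving from FastThreadLocal to ThreadLocal.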

I have opened CASSANDRA-14355 upstream, and will begin moving some of my findings there.

More to follow....

Mentioned in SAL (#wikimedia-operations) [2018-04-10T20:13:29Z] <urandom> increasing sample change-prop sample rate to 20% (from 10) in dev environment -- T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-11T16:11:31Z] <urandom> restarting cassandra, dev environment (testing default GC settings) -- T186751

fyi: T189050 / T189050#4124163 means you should not have to worry anymore about scheduling downtimes for these services when on the dev environment.

Mentioned in SAL (#wikimedia-operations) [2018-04-11T18:47:26Z] <urandom> restarting cassandra, dev environment (set -XX:+PerfDisableSharedMem) -- T186751

In T186751#4124167, @Dzahn wrote:

fyi: T189050 / T189050#4124163 means you should not have to worry anymore about scheduling downtimes for these services when on the dev environment.

Oh, that's awesome; Thanks @Dzahn !

Mentioned in SAL (#wikimedia-operations) [2018-04-11T20:36:52Z] <urandom> increase change-prop sample rate in dev env to 40% (from 20) -- T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-12T14:03:20Z] <urandom> increase change-prop sample rate in dev env to 60% (from 40) -- T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-12T16:59:25Z] <urandom> increase change-prop sample rate in dev env to 80% (from 60) -- T186751

Mentioned in SAL (#wikimedia-operations) [2018-04-12T20:37:57Z] <urandom> increase change-prop sample rate in dev env to 100% (from 80) -- T186751

• mobrovac added a parent task: T190689: FY17/18 Q4 Program 7 Services Goal: Language variants support. Apr 12 2018, 9:02 PM

Mentioned in SAL (#wikimedia-operations) [2018-04-13T13:18:59Z] <urandom> increasing heap size to 16G -- T186751

At this point I'm fairly certain that this isn't a memory leak in the conventional sense. A bug in change-propagation had prevented sampling from working, and the dev cluster (with less than 20% of the capacity of production) was seeing throughput levels in excess of 5x what we have in production. The excessive heap utilization would seem to be the result of additional per-thread state associated with this higher throughput. This is still worth pursuing upstream, since this is not how an application should degrade in the face of high load, but since it affects 3.11.0 as well, I think we can remove this as a blocker to a 3.11.2 upgrade.
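To put rough numbers on that, the two bounds quoted above can be combined into a load-per-unit-of-capacity figure. This is back-of-the-envelope arithmetic only: the 5x and 20% values are the bounds from this comment, not measurements, and `relative_load` is a hypothetical helper.

```python
def relative_load(throughput_ratio: float, capacity_ratio: float) -> float:
    """Load per unit of capacity, relative to production (production == 1.0)."""
    if capacity_ratio <= 0:
        raise ValueError("capacity_ratio must be positive")
    return throughput_ratio / capacity_ratio


# Bounds quoted in the comment above: more than 5x the throughput
# on less than 20% of the capacity.
DEV_THROUGHPUT_RATIO = 5.0
DEV_CAPACITY_RATIO = 0.20

if __name__ == "__main__":
    load = relative_load(DEV_THROUGHPUT_RATIO, DEV_CAPACITY_RATIO)
    print(f"dev load per unit of capacity: at least {load:.0f}x production")
```

Because both inputs are bounds in the unfavourable direction, each unit of dev capacity was handling more than 25 times the load of its production counterpart, which makes the heap pressure much less surprising.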

Eevans updated the task description. Apr 13 2018, 1:30 PM

Eevans renamed this task from Reset RESTBase dev environment to Restablish RESTBase dev environment with Cassandra 3.11.2. Apr 13 2018, 1:39 PM

Eevans closed this task as Resolved. Jun 20 2018, 2:52 PM

Eevans edited projects, added Services (done); removed Services (doing).
